konlpy

[NLPY] twitter-korean-text - An open-source Korean text processor made by Twitter

홍반장水_ 2017. 8. 8. 15:20

https://github.com/twitter/twitter-korean-text

As of early 2017, an official fork lives at http://openkoreantext.org. All development after version 4.4 takes place in open-korean-text.

A Scala/Java library for processing Korean text, with a Java wrapper. twitter-korean-text currently provides Korean text normalization, tokenization (morphological analysis), and stemming. It is not limited to short tweets and can process longer text as well. If you would like to take part in development, please join our community at Google Forum. Everyone is welcome, from beginners who want to learn how to use it to those who want to contribute code.

The goal of twitter-korean-text is to extract index terms from big data and similar sources through simple Korean processing. It does not aim to be a complete morphological analyzer.

twitter-korean-text supports four features: normalization, tokenization, stemming, and phrase extraction.

Normalization (입니닼ㅋㅋ -> 입니다 ㅋㅋ, 샤릉해 -> 사랑해)

한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ -> 한국어를 처리하는 예시입니다 ㅋㅋ

Tokenization

한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어Noun, 를Josa, 처리Noun, 하는Verb, 예시Noun, 입Adjective, 니다Eomi ㅋㅋKoreanParticle

Stemming (입니다 -> 이다)

한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어Noun, 를Josa, 처리Noun, 하다Verb, 예시Noun, 이다Adjective, ㅋㅋKoreanParticle

Phrase extraction

한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어, 처리, 예시, 처리하는 예시
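To make the normalization step more concrete, here is a toy sketch in Python. It is not the library's algorithm: the function name `toy_normalize` and its two rules are assumptions made for illustration. It only collapses repeated standalone jamo and moves a stray final ㅋ out of a syllable, using the standard Unicode layout of precomposed Hangul syllables.

```python
import re

HANGUL_BASE = 0xAC00   # code point of the first precomposed syllable, 가
FINAL_COUNT = 28       # final-consonant slots per syllable (slot 0 = no final)
FINAL_KIEUK = 24       # slot of ㅋ when it is used as a final consonant


def toy_normalize(text):
    """Illustrative only: handles repeated jamo and a stray final ㅋ."""
    # 1. Collapse runs of three or more identical standalone jamo (ㅋㅋㅋㅋㅋ -> ㅋㅋ).
    text = re.sub(r"([ㄱ-ㅎㅏ-ㅣ])\1{2,}", r"\1\1", text)

    # 2. If a syllable ends in ㅋ and is followed by ㅋ, drop that final (닼ㅋㅋ -> 다ㅋㅋ).
    def drop_final_kieuk(match):
        syllable = match.group(0)
        if (ord(syllable) - HANGUL_BASE) % FINAL_COUNT == FINAL_KIEUK:
            return chr(ord(syllable) - FINAL_KIEUK)
        return syllable

    return re.sub(r"[가-힣](?=ㅋ)", drop_final_kieuk, text)


print(toy_normalize("한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ"))
```

The real normalizer covers far more cases (for example 샤릉해 -> 사랑해), which a two-rule sketch like this cannot reproduce.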

Introductory Presentation: Google Slides

Try it here

Gunja Agrawal kindly created a test API webpage for this project: http://gunjaagrawal.com/langhack/

Opensourced here: twitter-korean-tokenizer-api

API

scaladoc

mavendoc

Maven

To include this in your Maven-based JVM project, add the following lines to your pom.xml:

  <dependency>
    <groupId>com.twitter.penguin</groupId>
    <artifactId>korean-text</artifactId>
    <version>4.4</version>
  </dependency>

The maven site is available here http://twitter.github.io/twitter-korean-text/ and scaladocs are here http://twitter.github.io/twitter-korean-text/scaladocs/

Support for other languages.

.net

modamoda kindly offered a .net wrapper: https://github.com/modamoda/TwitterKoreanProcessorCS

node.js

Ch0p kindly offered a node.js wrapper: twtkrjs

Youngrok Kim kindly offered a node.js wrapper: node-twitter-korean-text

Python

Baeg-il Kim kindly offered a Python version: https://github.com/cedar101/twitter-korean-py

Jaepil Jeong kindly offered a Python wrapper: https://github.com/jaepil/twkorean

The Python Korean NLP project KoNLPy, which makes the processor easy to use from Python, now includes twitter-korean-text.

Ruby

jun85664396 kindly offered a Ruby wrapper: twitter-korean-text-ruby

This provides access to com.twitter.penguin.korean.TwitterKoreanProcessorJava (Java wrapper).

Jaehyun Shin kindly offered a Ruby wrapper: twitter-korean-text-ruby

This provides access to com.twitter.penguin.korean.TwitterKoreanProcessor (Original Scala Class).

Elastic Search

socurites's Korean analyzer for elasticsearch based on twitter-korean-text: tkt-elasticsearch

Get the source

Clone the git repo and build using maven.

git clone https://github.com/twitter/twitter-korean-text.git
cd twitter-korean-text
mvn compile

Open 'pom.xml' from your favorite IDE.

Usage

You can find these examples in the examples folder.

from Scala

import com.twitter.penguin.korean.TwitterKoreanProcessor
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor.KoreanPhrase
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer.KoreanToken

object ScalaTwitterKoreanTextExample {
  def main(args: Array[String]): Unit = {
    val text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어"

    // Normalize
    val normalized: CharSequence = TwitterKoreanProcessor.normalize(text)
    println(normalized)
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어

    // Tokenize
    val tokens: Seq[KoreanToken] = TwitterKoreanProcessor.tokenize(normalized)
    println(tokens)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Stemming
    val stemmed: Seq[KoreanToken] = TwitterKoreanProcessor.stem(tokens)

    println(stemmed)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Phrase extraction
    val phrases: Seq[KoreanPhrase] = TwitterKoreanProcessor.extractPhrases(tokens, filterSpam = true, enableHashtags = true)
    println(phrases)
    // List(한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4))
  }
}

from Java

import java.util.List;

import scala.collection.Seq;

import com.twitter.penguin.korean.TwitterKoreanProcessor;
import com.twitter.penguin.korean.TwitterKoreanProcessorJava;
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor;
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer;

public class JavaTwitterKoreanTextExample {
  public static void main(String[] args) {
    String text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어";

    // Normalize
    CharSequence normalized = TwitterKoreanProcessorJava.normalize(text);
    System.out.println(normalized);
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어


    // Tokenize
    Seq<KoreanTokenizer.KoreanToken> tokens = TwitterKoreanProcessorJava.tokenize(normalized);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(tokens));
    // [한국어, 를, 처리, 하는, 예시, 입니, 다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(tokens));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4)]


    // Stemming
    Seq<KoreanTokenizer.KoreanToken> stemmed = TwitterKoreanProcessorJava.stem(tokens);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(stemmed));
    // [한국어, 를, 처리, 하다, 예시, 이다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(stemmed));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4)]


    // Phrase extraction
    List<KoreanPhraseExtractor.KoreanPhrase> phrases = TwitterKoreanProcessorJava.extractPhrases(tokens, true, true);
    System.out.println(phrases);
    // [한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4)]

  }
}

Basics

TwitterKoreanProcessor.scala is the central object that provides the interface for all the features.

Running Tests

Run mvn test to execute all unit tests.

Tools

We provide tools for quality assurance and test resources. They can be found under src/main/scala/com/twitter/penguin/korean/qa and src/main/scala/com/twitter/penguin/korean/tools.

Contribution

Refer to the general contribution guide. We will add this project-specific contribution guide later.

Detailed guide to installation and modification

Performance

Tested on an Intel i7 2.3 GHz

Initial loading time: 2~4 sec

Average time to parse a chunk (eojeol): 0.12 ms

Tweets (Avg length ~50 chars)

Tweets	100K	200K	300K	400K	500K	600K	700K	800K	900K	1M
Time in Seconds	57.59	112.09	165.05	218.11	270.54	328.52	381.09	439.71	492.94	542.12
Average per tweet: 0.54212 ms
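The per-tweet average follows directly from the table: total seconds divided by the number of tweets, converted to milliseconds. A small sketch of that arithmetic, using only the figures from the table above (the variable and function names are mine):

```python
# Figures copied from the benchmark table above.
tweet_counts = [100_000 * k for k in range(1, 11)]
elapsed_seconds = [57.59, 112.09, 165.05, 218.11, 270.54,
                   328.52, 381.09, 439.71, 492.94, 542.12]


def ms_per_tweet(seconds, tweets):
    """Average processing time per tweet, in milliseconds."""
    return seconds / tweets * 1000.0


for tweets, seconds in zip(tweet_counts, elapsed_seconds):
    print(f"{tweets:>9,} tweets: {ms_per_tweet(seconds, tweets):.5f} ms per tweet")
```

The per-tweet cost stays between roughly 0.54 and 0.58 ms across all ten rows, so processing time grows about linearly with the number of tweets.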

Benchmark test by KoNLPy

From http://konlpy.org/ko/v0.4.2/morph/

Author(s)

Will Hohyon Ryu (유호현): https://github.com/nlpenguin | https://twitter.com/NLPenguin


[python] Making a word cloud

홍반장水_ 2017. 7. 11. 10:11

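from collections import Counter
from konlpy.tag import Twitter
import pytagcloud

# Read the whole text file (assumed here to be UTF-8 encoded)
with open('blog_data.txt', encoding='utf-8') as f:
    data = f.read()

# Extract nouns with the Twitter tagger
nlp = Twitter()
nouns = nlp.nouns(data)

# Keep the 40 most frequent nouns and render them as a tag cloud
count = Counter(nouns)
tags2 = count.most_common(40)
taglist = pytagcloud.make_tags(tags2, maxsize=80)
# 'korean' must be a font name known to pytagcloud's font configuration
pytagcloud.create_tag_image(taglist, 'wordcloud.jpg', size=(900, 600), fontname='korean', rectangular=False)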


Working with English and Korean text in Python

홍반장水_ 2017. 6. 21. 10:39

- https://www.lucypark.kr/courses/2015-dm/text-mining.html

Terminologies

English	한국어	Description
Document	문서	-
Corpus	말뭉치	A set of documents
Token	토큰	Meaningful elements in a text such as words or phrases or symbols
Morphemes	형태소	Smallest meaningful unit in a language
POS	품사	Part-of-speech (ex: Nouns)

Text analysis process

Preprocessing is further divided into the following steps.

Load text
Tokenize text (ex: stemming, morph analyzing)
Tag tokens (ex: POS, NER)
Token(Feature) selection and/or filter/rank tokens (ex: stopword removal, TF-IDF)
...and so on (ex: calculate word/document similarities, cluster documents)
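The steps above can be sketched end to end with nothing but the standard library. This is a deliberately small illustration, not the tutorial's own code: the stopword list, the sample sentence, and the function name `preprocess` are all made up for the example, and tagging is skipped.

```python
import re
from collections import Counter

# Tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "to"}


def preprocess(text):
    """Load -> tokenize -> filter -> count, in the simplest possible form."""
    tokens = re.findall(r"\w+", text.lower())          # tokenize into lowercase words
    kept = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return Counter(kept)                               # frequency ranking


counts = preprocess("The dog chased the cat, and the cat chased a mouse.")
print(counts.most_common(2))
```

In the rest of the tutorial each of these stages is replaced by a proper tool, such as NLTK or KoNLPy tokenizers and taggers.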

Useful Python Packages for Text Mining and NLP

NLTK: Provides modules for text analysis (mostly language independent)
- Installation
```
pip install nltk
```
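- Main features
  1. Text corpora: this tutorial needs the following two datasets, so download them in advance.
```
import nltk
nltk.download('gutenberg')
nltk.download('maxent_treebank_pos_tagger')   # recent NLTK releases use 'averaged_perceptron_tagger' for nltk.pos_tag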
```
  2. Word POS, NER classification
  3. Document classification
KoNLPy: Provides modules for Korean text analysis
- Installation
```
pip install konlpy
```
- Main features
  1. Text corpora
  2. Word POS classification
    - Hannanum
    - Kkma
    - Mecab
    - Komoran
    - Twitter
Gensim: Provides modules for topic modeling and calculating similarities among documents
- Installation
```
pip install -U gensim
```
- Main features
  1. Topic modeling
  2. Word embedding
    - word2vec

Twython: Provides easy access to Twitter API

Installation
```
pip install twython
```

Usage example: fetching tweets about "Samsung (삼성)"

from twython import Twython
import settings as s    # Create a file named settings.py, and put oauth KEY values inside
twitter = Twython(s.APP_KEY, s.APP_SECRET, s.OAUTH_TOKEN, s.OAUTH_TOKEN_SECRET)
tweets = twitter.search(q='삼성', count=100)
data = [(t['user']['screen_name'], t['text'], t['created_at']) for t in tweets['statuses']]

Text exploration

1. Read document

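This tutorial uses documents that ship with NLTK and KoNLPy.

English: Jane Austen's novel Emma
Korean: Bill No. 1809890 of the National Assembly of the Republic of Korea

If you can, try loading some other text data in place of the documents above.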

English

from nltk.corpus import gutenberg   # Docs from project gutenberg.org
files_en = gutenberg.fileids()      # Get file ids
doc_en = gutenberg.open('austen-emma.txt').read()

Korean

from konlpy.corpus import kobill    # Docs from pokr.kr/bill
files_ko = kobill.fileids()         # Get file ids
doc_ko = kobill.open('1809890.txt').read()

2. Tokenize

There are many ways to split a document into tokens. Here we use nltk.regexp_tokenize for English and konlpy.tag.Twitter.morphs for Korean.

English

from nltk import regexp_tokenize
pattern = r'''(?x) (?:[A-Z]\.)+ | \w+(?:-\w+)* | \$?\d+(?:\.\d+)?%? | \.\.\. | [][.,;"'?():_`-]'''   # non-capturing groups; '-' moved to the end of the class so it is a literal hyphen
tokens_en = regexp_tokenize(doc_en, pattern)
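To see what such a pattern does without loading a corpus, the sketch below applies the same alternatives with the standard library's `re.findall`. The sample sentence and the name `TOKEN_PATTERN` are my own. The groups are non-capturing so that `findall` returns whole tokens, and the hyphen sits at the end of the character class so that it is read as a literal hyphen and not as a range.

```python
import re

TOKEN_PATTERN = re.compile(r'''
      (?:[A-Z]\.)+          # abbreviations such as U.S.A.
    | \w+(?:-\w+)*          # words, with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency amounts and percentages
    | \.\.\.                # ellipsis
    | [][.,;"'?():_`-]      # individual punctuation marks
''', re.VERBOSE)

tokens = TOKEN_PATTERN.findall("That U.S.A. poster-print costs $12.40...")
print(tokens)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

The order of the alternatives matters: the abbreviation rule has to come before the plain-word rule, otherwise "U.S.A." would be split into letters and periods.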

Korean

from konlpy.tag import Twitter; t = Twitter()
tokens_ko = t.morphs(doc_ko)

3. Load tokens with `nltk.Text()`

nltk.Text() provides a variety of features for conveniently exploring a single document.

English
```
import nltk
en = nltk.Text(tokens_en)
```
Korean (on Python 2, pass `name` and every Korean string that follows as a unicode literal such as u'유니코드')
```
import nltk
ko = nltk.Text(tokens_ko, name='대한민국 국회 의안 제 1809890호')   # For Python 2, input `name` as u'유니코드'
```

Let's now go through the features of nltk.Text() one by one. (Reference: class nltk.text.Text API documentation)

Tokens

English

print(len(en.tokens))       # returns number of tokens (document length)
print(len(set(en.tokens)))  # returns number of unique tokens
en.vocab()                  # returns frequency distribution

191061
7927
FreqDist({',': 12018, '.': 8853, 'to': 5127, 'the': 4844, 'and': 4653, 'of': 4278, '"': 4187, 'I': 3177, 'a': 3000, 'was': 2385, ...})

Korean

print(len(ko.tokens))       # returns number of tokens (document length)
print(len(set(ko.tokens)))  # returns number of unique tokens
ko.vocab()                  # returns frequency distribution

1707
476
FreqDist({'.': 61, '의': 46, '육아휴직': 38, '을': 34, '(': 27, ',': 26, '이': 26, ')': 26, '에': 24, '자': 24, ...})
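The three numbers above are plain counting. NLTK's FreqDist builds on `collections.Counter`, so the same quantities can be reproduced with the standard library. The token list below is a short made-up example, not the actual bill text.

```python
from collections import Counter

# Hand-made token list for illustration; real input would be ko.tokens.
tokens = "육아휴직 의 대상 을 확대 하고 육아휴직 의 기간 을 연장".split()

vocab = Counter(tokens)

print(len(tokens))           # number of tokens (document length)
print(len(set(tokens)))      # number of unique tokens
print(vocab.most_common(3))  # most frequent tokens with their counts
```

The ratio of unique tokens to total tokens is a quick, rough indicator of how repetitive a document is.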

Plot frequency distributions

English

en.plot(50)     # Plot sorted frequency of top 50 tokens

Korean

ko.plot(50)     # Plot sorted frequency of top 50 tokens

Tip: To save a plot programmatically rather than through the GUI, override pylab.show with pylab.savefig before drawing the plot (reference):
from matplotlib import pylab
pylab.show = lambda: pylab.savefig('some_filename.png')
Troubleshooting: For those who see rectangles instead of letters in the saved plot file, include the following configurations before drawing the plot:
from matplotlib import font_manager, rc
font_fname = 'c:/windows/fonts/gulim.ttc'     # A font of your choice
font_name = font_manager.FontProperties(fname=font_fname).get_name()
rc('font', family=font_name)
Some example fonts:
Mac OS: /Library/Fonts/AppleGothic.ttf

Count

English

en.count('Emma')        # Counts occurrences

Korean

ko.count('초등학교')   # Counts occurrences

Dispersion plot

English

en.dispersion_plot(['Emma', 'Frank', 'Jane'])

Korean

ko.dispersion_plot(['육아휴직', '초등학교', '공무원'])

Concordance

English

en.concordance('Emma', lines=5)

Displaying 5 of 865 matches:
                                     Emma by Jane Austen 1816 ] VOLUME I CHAPT
                                     Emma Woodhouse , handsome , clever , and
both daughters , but particularly of Emma . Between them it was more the int
 friend very mutually attached , and Emma doing just what she liked ; highly e
r own . The real evils , indeed , of Emma ' s situation were the power of havi

Korean (or, use konlpy.utils.concordance)

ko.concordance('초등학교')

Displaying 6 of 6 matches:
 ․ 김정훈 김학송 의원 ( 10 인 ) 제안 이유 및 주요 내용 초등학교 저학년 의 경우 에도 부모 의 따뜻한 사랑 과 보살핌 이 필요 한
 을 할 수 있는 자녀 의 나이 는 만 6 세 이하 로 되어 있어 초등학교 저학년 인 자녀 를 돌보기 위해서 는 해당 부모님 은 일자리 를
 다 . 제 63 조제 2 항제 4 호 중 “ 만 6 세 이하 의 초등학교 취학 전 자녀 를 ” 을 “ 만 8 세 이하 ( 취학 중인 경우
 전 자녀 를 ” 을 “ 만 8 세 이하 ( 취학 중인 경우 에는 초등학교 2 학년 이하 를 말한 다 ) 의 자녀 를 ” 로 한 다 . 부
 . ∼ 3 . ( 현행 과 같 음 ) 4 . 만 6 세 이하 의 초등학교 취 4 . 만 8 세 이하 ( 취학 중인 경우 학 전 자녀 를 양
세 이하 ( 취학 중인 경우 학 전 자녀 를 양육 하기 위하 에는 초등학교 2 학년 이하 를 여 필요하거 나 여자 공무원 이 말한 다 ) 의
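A concordance is a keyword-in-context listing. The sketch below is a simplified stand-in that uses a fixed window of tokens on each side; NLTK's own display is laid out by character width, so the output format differs. The function name, the `window` parameter, and the token list are assumptions for illustration.

```python
def concordance(tokens, word, window=3):
    """Return each occurrence of `word` with `window` tokens of context on both sides."""
    lines = []
    for i, token in enumerate(tokens):
        if token == word:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(" ".join(left + [token] + right))
    return lines


# Hand-made token list loosely based on the output above.
tokens = "초등학교 저학년 의 경우 에도 부모 의 사랑 이 필요 한 초등학교 2 학년".split()

for line in concordance(tokens, "초등학교", window=2):
    print(line)
```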

Find similar words

English

en.similar('Emma')
en.similar('Frank')

she it he i harriet you her jane him that me and all they them there herself was hartfield be
mr mrs emma harriet you it her she he him hartfield them jane that isabella all herself look i me

Korean

ko.similar('자녀')
ko.similar('육아휴직')

논의
None

Collocations

English

en.collocations()

Frank Churchill; Miss Woodhouse; Miss Bates; Jane Fairfax; Miss
Fairfax; every thing; young man; every body; great deal; dare say;
John Knightley; Maple Grove; Miss Smith; Miss Taylor; Robert Martin;
Colonel Campbell; Box Hill; said Emma; Harriet Smith; William Larkins

Korean

ko.collocations()

초등학교 저학년; 육아휴직 대상
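Collocations are word pairs that occur together unusually often. As a rough approximation, the sketch below simply counts adjacent pairs with `collections.Counter`. This is a swapped-in, simpler technique: NLTK ranks candidate pairs with an association measure and applies filters, which raw counts do not capture. The token list and function name are made up for the example.

```python
from collections import Counter


def top_bigrams(tokens, n=2):
    """Return the n most frequent adjacent token pairs with their counts."""
    pairs = zip(tokens, tokens[1:])
    return Counter(pairs).most_common(n)


# Hand-made token list for illustration.
tokens = "초등학교 저학년 자녀 육아휴직 대상 초등학교 저학년 육아휴직 대상".split()

for (first, second), freq in top_bigrams(tokens):
    print(first, second, freq)
```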

For more information on nltk.Text(), see the source code or API.

Tagging and chunking

Until now, we used delimited text, namely tokens, to explore our sample document. Now let's classify words into given classes, namely part-of-speech tags, and chunk text into larger pieces.

1. POS tagging

There are numerous ways of tagging a text. Among them, POS tagging is arguably the most widely used and the most developed.

Since a whole document is too long for inspecting a parsed structure, let's use one short sentence for each language.

English

tokens = "The little yellow dog barked at the Persian cat".split()
tags_en = nltk.pos_tag(tokens)

[('The', 'DT'),
 ('little', 'JJ'),
 ('yellow', 'NN'),
 ('dog', 'NN'),
 ('barked', 'VBD'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('Persian', 'NNP'),
 ('cat', 'NN')]

It is also possible to use the well-known Stanford POS tagger with NLTK via from nltk.tag.stanford import POSTagger (later NLTK releases name this class StanfordPOSTagger).

Korean

from konlpy.tag import Twitter; t = Twitter()
tags_ko = t.pos("작고 노란 강아지가 페르시안 고양이에게 짖었다")

[('작고', 'Noun'),
 ('노란', 'Adjective'),
 ('강아지', 'Noun'),
 ('가', 'Josa'),
 ('페르시안', 'Noun'),
 ('고양이', 'Noun'),
 ('에게', 'Josa'),
 ('짖었', 'Noun'),
 ('다', 'Josa')]

2. Noun phrase chunking

nltk.RegexpParser() is a great way to start chunking.

English

parser_en = nltk.RegexpParser("NP: {<DT>?<JJ>?<NN.*>*}")
chunks_en = parser_en.parse(tags_en)
chunks_en.draw()

Korean

parser_ko = nltk.RegexpParser("NP: {<Adjective>*<Noun>*}")
chunks_ko = parser_ko.parse(tags_ko)
chunks_ko.draw()
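To show what the Korean chunk rule selects without opening a drawing window, here is a hand-rolled sketch that groups runs of adjectives followed by at least one noun. It is a simplification I wrote for illustration, not how nltk.RegexpParser is implemented: it returns plain lists instead of a tree and drops adjective runs that are not followed by a noun. The tagged input is the Korean tagging result shown earlier.

```python
def chunk_noun_phrases(tagged):
    """Group maximal Adjective* Noun+ runs from a list of (word, tag) pairs."""
    chunks, current, seen_noun = [], [], False
    for word, tag in tagged:
        if tag == "Noun":
            current.append(word)
            seen_noun = True
        elif tag == "Adjective" and not seen_noun:
            current.append(word)
        else:
            if seen_noun:
                chunks.append(current)   # close the chunk in progress
            # An adjective after a noun starts a new candidate chunk.
            current = [word] if tag == "Adjective" else []
            seen_noun = False
    if seen_noun:
        chunks.append(current)
    return chunks


tags_ko = [("작고", "Noun"), ("노란", "Adjective"), ("강아지", "Noun"), ("가", "Josa"),
           ("페르시안", "Noun"), ("고양이", "Noun"), ("에게", "Josa"),
           ("짖었", "Noun"), ("다", "Josa")]

print(chunk_noun_phrases(tags_ko))
```

Because the tagger labels 작고 and 짖었 as nouns in this sentence, they end up as one-word chunks; chunk quality depends directly on tagging quality.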

For more information on chunking, refer to Extracting Information from Text for English, and Chunking for Korean.

Drawing a word cloud

Let's look again at the frequency distribution of Bill No. 1809890.

print(ko.vocab())

FreqDist({'.': 61, '의': 46, '육아휴직': 38, '을': 34, '(': 27, ',': 26, '이': 26, ')': 26, '에': 24, '자': 24, ...})

Let's check the data type and attribute list of this frequency distribution.

type(ko.vocab())

nltk.probability.FreqDist

dir(ko.vocab())

['B',
 'N',
 ...
 'items',
 ...
 'pop',
 'popitem',
 'pprint',
 'r_Nr',
 'setdefault',
 'subtract',
 'tabulate',
 'unicode_repr',
 'update',
 'values']

Calling items() shows every item of the frequency distribution in a set-like form. Store the result in a variable named data and look at its data type.

data = ko.vocab().items()
print(data)
print(type(data))

dict_items([('명', 5), ('예상된', 3), ('하나', 1), ('11', 2), ('팀', 2), ...])
<class 'dict_items'>

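Now save this set to a file named words.csv. Use word,freq as the data header.

import csv
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('words.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['word', 'freq'])   # header row
    writer.writerows(data)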

Next, copy the code below and save it as index.html in the folder that contains words.csv.

<!DOCTYPE html>
<html>
<head>
  <style>
    text:hover {
        stroke: black;
    }
  </style>
  <script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>
  <script src="d3.layout.cloud.js"></script>
</head>
<body>
  <div id="cloud"></div>
  <script type="text/javascript">
    var weight = 3,   // change me
        width = 960,
        height = 500;
    var fill = d3.scale.category20();
    d3.csv("words.csv", function(d) {
        return {
          text: d.word,
          size: +d.freq*weight
        }
      },
      function(data) {
        d3.layout.cloud().size([width, height]).words(data)
          //.rotate(function() { return ~~(Math.random() * 2) * 90; })
          .rotate(0)
          .font("Impact")
          .fontSize(function(d) { return d.size; })
          .on("end", draw)
          .start();
        function draw(words) {
          d3.select("#cloud").append("svg")
              .attr("width", width)
              .attr("height", height)
            .append("g")
              .attr("transform", "translate(" + width/2 + "," + height/2 + ")")
            .selectAll("text")
              .data(words)
            .enter().append("text")
              .style("font-size", function(d) { return d.size + "px"; })
              .style("font-family", "Impact")
              .style("fill", function(d, i) { return fill(i); })
              .attr("text-anchor", "middle")
              .attr("transform", function(d) {
                return "translate(" + [d.x, d.y] + ")rotate(" + d.rotate + ")";
              })
            .text(function(d) { return d.text; });
        }
      });
  </script>
</body>
</html>

Run the following in the same folder.

python -m http.server 8888      # for Python2, `python -m SimpleHTTPServer`

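Finally, enter http://localhost:8888 in the address bar of a modern browser (e.g. Chrome), and our word cloud should be there!
If you want to experiment further:
1. The word cloud above also includes special characters, particles, and so on, which makes it less informative. Can you make the word cloud show only nouns?
2. Can you draw a word cloud from another document of your choice (e.g. my data mining project proposal)? Read the document with Python, extract the words that occur most frequently, and draw them as a word cloud.
3. Can you draw a word cloud for several documents at once? Read multiple documents together with Python, extract the high-frequency words, and draw the word cloud.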

Author: Lucy Park


[python] When Korean text doesn't show up in graphs while using konlpy: it failed yesterday and works today

홍반장水_ 2017. 6. 16. 15:25

https://www.lucypark.kr/courses/2015-dm/text-mining.html

On my MacBook, Korean text kept failing to show up in graphs, but when I tried again today it worked. What's going on? Was it a typo?

Troubleshooting: For those who see rectangles instead of letters in the saved plot file, include the following configurations before drawing the plot:

from matplotlib import font_manager, rc
font_fname = 'c:/windows/fonts/gulim.ttc'     # A font of your choice
font_name = font_manager.FontProperties(fname=font_fname).get_name()
rc('font', family=font_name)

Some example fonts:

Mac OS: /Library/Fonts/AppleGothic.ttf


[python] Working with English and Korean text in Python, document preprocessing

홍반장水_ 2017. 6. 15. 17:38

Working with English and Korean text in Python

https://www.lucypark.kr/courses/2015-dm/text-mining.html

Document preprocessing

https://datascienceschool.net/view-notebook/3e7aadbf88ed4f0d87a76f9ddc925d69/

Every data analysis model takes a fixed-dimensional vector of numbers as its independent variable, so analyzing a document also requires extracting a numeric feature vector from it. This process is called document preprocessing.

BOW (Bag of Words)

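The most basic way to convert a document into a numeric vector is BOW (Bag of Words). BOW builds a fixed vocabulary $\{W_1, W_2, \ldots, W_m\}$ covering the whole document collection $\{D_1, D_2, \ldots, D_n\}$, and then marks, for each individual document $D_i$, whether it contains each word in the vocabulary.

If word $W_j$ appears in document $D_i$, then $x_{ij} = 1$.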
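A minimal sketch of this indicator encoding, assuming that the entry is 0 when the word is absent (the text above only states the case where it is 1). The vocabulary, the documents, and the function name `bow_vector` are made up for the example.

```python
def bow_vector(document_tokens, vocabulary):
    """Return the row x_i: 1 where the vocabulary word occurs in the document, else 0."""
    present = set(document_tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical fixed vocabulary W_1..W_4 and two tokenized documents D_1, D_2.
vocabulary = ["document", "first", "second", "third"]
documents = [
    ["this", "is", "the", "first", "document"],
    ["and", "the", "third", "one"],
]

for tokens in documents:
    print(bow_vector(tokens, vocabulary))
```

Every document becomes a vector of the same length as the vocabulary, which is what makes it usable as input to a fixed-dimensional model.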

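Document preprocessing in Scikit-Learn

The feature_extraction.text subpackage of Scikit-Learn provides the following classes for document preprocessing.

CountVectorizer:

Counts the words in a collection of documents and builds a count matrix.

TfidfVectorizer:

Counts the words in a collection of documents and builds a count matrix whose word weights are adjusted with TF-IDF.

HashingVectorizer:

Builds a count matrix quickly by using the hashing trick.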

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'The last document?',
]
vect = CountVectorizer()
vect.fit(corpus)
vect.vocabulary_
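To make clear what `vocabulary_` holds, the sketch below mimics CountVectorizer's default behavior with the standard library: lowercase the text, keep tokens of two or more word characters, and number the distinct terms in sorted order. This is an emulation for illustration, not scikit-learn code, and the helper names `tokenize` and `count_row` are my own.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'The last document?',
]

# Same shape as CountVectorizer's default token pattern: 2+ word characters.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    return TOKEN.findall(doc.lower())


# Map each distinct term to a column index, in alphabetical order.
terms = sorted({token for doc in corpus for token in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(terms)}


def count_row(doc):
    """One row of the count matrix: how often each vocabulary term occurs in doc."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in terms]


print(vocabulary)
print(count_row(corpus[1]))
```

With scikit-learn itself, `vect.transform(corpus).toarray()` would produce the full count matrix, one such row per document.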


konlpy: a Korean text processing package

홍반장水_ 2017. 6. 15. 14:17

https://datascienceschool.net/view-notebook/70ce46db4ced4a999c6ec349df0f4eb0/

konlpy is a Python package for Korean information processing.

http://konlpy.org/ko/latest/

https://github.com/konlpy/konlpy

konlpy bundles the following morphological analysis and tagging libraries so that they can be used easily from Python.

Kkma

http://kkma.snu.ac.kr/

Hannanum

http://semanticweb.kaist.ac.kr/hannanum/

Twitter

https://github.com/twitter/twitter-korean-text/

Komoran

http://www.shineware.co.kr/?page_id=835

Mecab

https://bitbucket.org/eunjeon/mecab-ko-dic

konlpy provides the following features.

Korean corpora

Korean processing utilities

Morphological analysis and POS tagging

Korean corpora

- Loading sample texts and listing files

from konlpy.corpus import kolaw

kolaw.fileids()                              # list the bundled law texts
c = kolaw.open('constitution.txt').read()
print(c[:100])

from konlpy.corpus import kobill

kobill.fileids()                             # list the bundled bills
d = kobill.open('1809890.txt').read()
print(d[:100])

Korean processing utilities

konlpy provides a pprint utility function that displays Hangul correctly even when unicode Korean strings are nested inside a list or dictionary.

x = [u"한글", {u"한글 키": [u"한글 밸류1", u"한글 밸류2"]}]
print(x)

from konlpy.utils import pprint
pprint(x)

Morphological analysis

konlpy provides five classes for morphological analysis in its tag subpackage.

Kkma

Hannanum

Twitter

Komoran

Mecab

Most of these classes provide the following methods.

morphs: extracts morphemes

nouns: extracts nouns

pos: POS tagging

from konlpy.tag import Hannanum, Kkma, Twitter

hannanum = Hannanum()
kkma = Kkma()
twitter = Twitter()

Noun extraction

To extract only the nouns from a string, use the nouns method.

pprint(hannanum.nouns(c[:65]))
pprint(kkma.nouns(c[:65]))
pprint(twitter.nouns(c[:65]))

Morpheme extraction

To get the morphemes of every part of speech, not just nouns, use the morphs method.

pprint(hannanum.morphs(c[:65]))
pprint(kkma.morphs(c[:65]))
pprint(twitter.morphs(c[:65]))

POS tagging

The pos method performs morphological analysis and returns each morpheme with its part-of-speech (POS) tag attached. Note that the tag definitions and symbols differ from analyzer to analyzer, so consult the documentation of each analyzer.

The following chart compares the POS tag sets of commonly used morphological analyzers.

Korean POS tags comparison chart : https://docs.google.com/spreadsheets/d/1OGAjUvalBuX-oZvZ_-9tEfYD2gQe7hTGsgUpiiBSXI8/edit#gid=0

pprint(hannanum.pos(c[:65]))
pprint(kkma.pos(c[:65]))
pprint(twitter.pos(c[:65]))
