Positive thinking, moderation in eating, regular exercise

nltk

When NLTK raises "Resource punkt not found" 2022.01.26
Installing NLTK - Anaconda, exploring data 2022.01.26
[python] Korean tokenization 2020.12.02
Working with English and Korean text in Python 2017.06.21

When NLTK raises "Resource punkt not found"

홍반장水_ 2022. 1. 26. 12:19

Adding the download call below resolves the error.

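import nltk
nltk.download('punkt')   # with no argument, nltk.download() opens the interactive downloader instead

import nltk

# Resources used by this example. If a LookupError names a different
# resource (newer NLTK releases may), download the name it shows.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk import word_tokenize, pos_tag, ne_chunk
sentence = 'Mike is working at IT Centre'

# Tokenize and POS-tag with pos_tag, then pass the result to ne_chunk for named-entity recognition
sentence = pos_tag(word_tokenize(sentence))
print(sentence)

# Named-entity recognition
sentence = ne_chunk(sentence)
print(sentence)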


Installing NLTK - Anaconda, exploring data

홍반장水_ 2022. 1. 26. 12:05

Installing NLTK

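I use Python in an Anaconda environment, so NLTK was already installed in the root environment. Unlike KoNLPy, it needs no additional setup. If you create a separate virtual environment inside Anaconda, activate that environment and then run

> conda install nltk
> conda update nltk

to install it.

Running the examples, however, can still produce error messages. When a message tells you to run, for example, nltk.download('words') or nltk.download('maxent_ne_chunker'), enter the command exactly as it appears in the message to install the missing resource.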
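The "download whatever the error message asks for" routine can also be done up front. The sketch below is a minimal illustration, not part of NLTK: `ensure_resources` is a hypothetical helper, and the lookup and download functions are passed in as parameters so the logic can be read (and tried) on its own.

```python
def ensure_resources(resources, find, download):
    """Download every resource whose lookup fails.

    resources: dict mapping a lookup path to a download name (hypothetical layout).
    find:      callable that raises LookupError when a path is missing.
    download:  callable that fetches a resource by name.
    Returns the list of names that had to be downloaded.
    """
    fetched = []
    for path, name in resources.items():
        try:
            find(path)          # already installed: nothing to do
        except LookupError:
            download(name)      # missing: fetch it
            fetched.append(name)
    return fetched
```

With NLTK installed, this could be called as ensure_resources({'tokenizers/punkt': 'punkt', 'corpora/words': 'words'}, nltk.data.find, nltk.download). The paths and names here are assumptions based on older NLTK releases; newer releases may use different resource names, so follow the error message if it disagrees.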

○ Trying an example

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = 'Mike is working at IT Centre'

# Tokenize and POS-tag with pos_tag, then pass the result to ne_chunk for named-entity recognition
sentence = pos_tag(word_tokenize(sentence))
print(sentence)

# Named-entity recognition
sentence = ne_chunk(sentence)
print(sentence)

Result

# [('Mike', 'NNP'), ('is', 'VBZ'), ('working', 'VBG'), 
# ('at', 'IN'), ('IT', 'NNP'), ('Centre', 'NNP')]
# (S
#   (GPE Mike/NNP)
#   is/VBZ
#   working/VBG
#   at/IN
#   (ORGANIZATION IT/NNP Centre/NNP))

○ A look at some features

1. nltk.Text()

nltk.Text() provides a variety of functions that make exploring natural-language data convenient.

import nltk
from nltk.corpus import gutenberg
from nltk import regexp_tokenize

f = gutenberg.fileids()
doc_en = gutenberg.open('austen-emma.txt').read()

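# Non-capturing groups (?:...) so that whole tokens are returned rather than tuples of sub-groups
pattern = r'''(?x) (?:[A-Z]\.)+ | \w+(?:-\w+)* | \$?\d+(?:\.\d+)?%? | \.\.\. | [][.,;"'?():-_`]'''
tokens_en = regexp_tokenize(doc_en, pattern)

en = nltk.Text(tokens_en)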

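To test this, we load Jane Austen's novel Emma from the data NLTK provides and then split the document into tokens. For tokenizing, POS tagging, or named-entity tagging Korean text, though, I think KoNLPy is the far better choice.

For example, given the sentence '나는 바보다', NLTK recognizes the token '바보다' as an 'Organization' entity. For Korean text, then, nltk seems to be used mostly as a tool for exploring data.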

2. Drawing plots

en.plot(50)

This plots the 50 most frequent tokens in the en document.

en.dispersion_plot(['Emma','Frank','Jane'])

3. Exploring the document

print(len(en.tokens)) # number of tokens
print(len(set(en.tokens))) # number of unique tokens
en.vocab() # frequency distribution
print(en.count('Emma')) # occurrences of 'Emma'
en.concordance('Emma', lines=5) # print only 5 lines of context containing 'Emma'
en.similar('Emma')
en.similar('Frank') # find words used in similar contexts

> Result

> 191061
> 7927
> FreqDist({',': 12018, '.': 8853, 'to': 5127, 'the': 4844, 'and': 4653, 'of': 4278, '"': 4187, 'I': 3177, 'a': 3000, 'was': 2385, ...})
> 865
> Displaying 5 of 865 matches:
                                     Emma by Jane Austen 1816 ] VOLUME I CHAPT
                                     Emma Woodhouse , handsome , clever , and
both daughters , but particularly of Emma . Between them it was more the int
 friend very mutually attached , and Emma doing just what she liked ; highly e
r own . The real evils , indeed , of Emma ' s situation were the power of havi
> she it he i harriet you her jane him that me and all they them there herself was hartfield be
mr mrs emma harriet you it her she he him hartfield them jane that isabella all herself look i me

Besides drawing plots, it also offers many handy functions for inspecting the structure of the data, such as checking token counts and the frequency distribution.
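The numbers these calls report can be reproduced with the standard library, which helps when checking them by hand. A minimal sketch, assuming a short made-up token list in place of the Emma tokens; collections.Counter stands in for the frequency distribution that en.vocab() returns.

```python
from collections import Counter

# A tiny hand-made token list standing in for en.tokens (assumption).
tokens = ["Emma", "was", "happy", ",", "and", "Emma", "was", "clever", "."]

freq = Counter(tokens)               # token -> number of occurrences

print(len(tokens))                   # total number of tokens
print(len(set(tokens)))              # number of unique tokens
print(freq.most_common(2))           # the two most frequent tokens
print(freq["Emma"])                  # occurrences of 'Emma'
```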

4. Chunking with nltk.RegexpParser()

Chunking is shallow parsing: identifying parts of speech and short phrases (such as noun phrases).

tokens = "The little yellow dog barked at the Persian cat".split()
tags_en = nltk.pos_tag(tokens)

parser_en = nltk.RegexpParser("NP: {<DT>?<JJ>?<NN.*>*}")
chunks_en = parser_en.parse(tags_en)
chunks_en.draw()   # opens a window showing the chunk tree

Result (the tags in tags_en)

[('The', 'DT'),
 ('little', 'JJ'),
 ('yellow', 'NN'),
 ('dog', 'NN'),
 ('barked', 'VBD'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('Persian', 'NNP'),
 ('cat', 'NN')]

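5. Drawing a word cloud

import nltk
# tokens_ko: the morphemes of the Korean sample document, for example the output of a KoNLPy analyzer's morphs()
ko = nltk.Text(tokens_ko, name='대한민국 국회 의안 제 1809890호')

This loads a Korean example.

# Look at the frequency distribution of the example

print(ko.vocab())
type(ko.vocab()) # check the data type
dir(ko.vocab()) # list the attributes of the frequency distribution

data = ko.vocab().items()
print(data)
print(type(data)) # items() exposes every item of the frequency distribution in a set-like form.

> Result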

> FreqDist({'.': 61, '의': 46, '육아휴직': 38, '을': 34, '(': 27, ',': 26, '이': 26, ')': 26, '에': 24, '자': 24, ...})
> nltk.probability.FreqDist
> ['B',
 'N',
 ...
 'items',
 ...
 'pop',
 'popitem',
 'pprint',
 'r_Nr',
 'setdefault',
 'subtract',
 'tabulate',
 'unicode_repr',
 'update',
 'values']
 
 > dict_items([('명', 5), ('예상된', 3), ('하나', 1), ('11', 2), ('팀', 2), ...])
<class 'dict_items'>

Data exploration

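import csv
# newline='' is what the csv module expects; it avoids blank rows on Windows
with open('words.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['word', 'freq'])   # header row
    writer.writerows(data)

We save the explored data set to a file named words.csv and use it to draw the word cloud.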

<!DOCTYPE html>
<html>
<head>
  <style>
    text:hover {
        stroke: black;
    }
  </style>
  <script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>
  <script src="d3.layout.cloud.js"></script>
</head>
<body>
  <div id="cloud"></div>
  <script type="text/javascript">
    var weight = 3,   // change me
        width = 960,
        height = 500;

    var fill = d3.scale.category20();
    d3.csv("words.csv", function(d) {
        return {
          text: d.word,
          size: +d.freq*weight
        }
      },
      function(data) {
        d3.layout.cloud().size([width, height]).words(data)
          //.rotate(function() { return ~~(Math.random() * 2) * 90; })
          .rotate(0)
          .font("Impact")
          .fontSize(function(d) { return d.size; })
          .on("end", draw)
          .start();

        function draw(words) {
          d3.select("#cloud").append("svg")
              .attr("width", width)
              .attr("height", height)
            .append("g")
              .attr("transform", "translate(" + width/2 + "," + height/2 + ")")
            .selectAll("text")
              .data(words)
            .enter().append("text")
              .style("font-size", function(d) { return d.size + "px"; })
              .style("font-family", "Impact")
              .style("fill", function(d, i) { return fill(i); })
              .attr("text-anchor", "middle")
              .attr("transform", function(d) {
                return "translate(" + [d.x, d.y] + ")rotate(" + d.rotate + ")";
              })
            .text(function(d) { return d.text; });
        }
      });
  </script>
</body>
</html>

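Save the code above as index.html in the folder that contains words.csv. The page also loads d3.layout.cloud.js by a relative path, so that script has to sit in the same folder.

python -m http.server 8888

Run this command from that folder, then open http://localhost:8888 in a browser to see the generated word cloud.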

Source: https://ebbnflow.tistory.com/147 [Dev Log : 삶은 확률의 구름]


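[python] Korean tokenization

홍반장水_ 2020. 12. 2. 12:17

Korean is an agglutinative language.

Spacing rules are followed less consistently in Korean than in English.

English and Korean tokenization practice with NLTK and KoNLPy

NLTK supports POS tagging for English corpora. There are several standards for naming and assigning POS tags; NLTK uses the Penn Treebank POS Tags. Let's POS-tag an English text with NLTK.

If importing nltk fails, run pip install nltk from the command prompt (CMD).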

>>> from nltk.tokenize import word_tokenize
Traceback (most recent call last):
  File "<pyshell#19>", line 1, in <module>
    from nltk.tokenize import word_tokenize
ModuleNotFoundError: No module named 'nltk'

>>> from nltk.tokenize import word_tokenize

>>> text="I am actively looking for Ph.D. students. and you are a Ph.D. student."
>>> print(word_tokenize(text))

['I', 'am', 'actively', 'looking', 'for', 'Ph.D.', 'students', '.', 'and', 'you', 'are', 'a', 'Ph.D.', 'student', '.']

>>> from nltk.tag import pos_tag
>>> x=word_tokenize(text)
>>> pos_tag(x)

[('I', 'PRP'), ('am', 'VBP'), ('actively', 'RB'), ('looking', 'VBG'), ('for', 'IN'), ('Ph.D.', 'NNP'), ('students', 'NNS'), ('.', '.'), ('and', 'CC'), ('you', 'PRP'), ('are', 'VBP'), ('a', 'DT'), ('Ph.D.', 'NNP'), ('student', 'NN'), ('.', '.')]

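We tokenized an English sentence and then POS-tagged it. In the Penn Treebank POS Tags, PRP is a personal pronoun, VBP a verb (non-third-person singular present), RB an adverb, VBG a present participle (or gerund), IN a preposition, NNP a proper noun, NNS a plural noun, CC a conjunction, and DT a determiner (articles included).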
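To read tagger output without memorizing the tag set, the tags can be mapped to short descriptions. A minimal sketch: TAG_NAMES is a hand-written, partial table covering only the tags seen in the output above (the descriptions are my own wording), and describe is a hypothetical helper.

```python
# Partial, hand-written lookup for the Penn Treebank tags seen above.
TAG_NAMES = {
    "PRP": "personal pronoun",
    "VBP": "verb, non-3rd person singular present",
    "RB": "adverb",
    "VBG": "verb, gerund or present participle",
    "IN": "preposition or subordinating conjunction",
    "NNP": "proper noun, singular",
    "NNS": "noun, plural",
    "NN": "noun, singular or mass",
    "CC": "coordinating conjunction",
    "DT": "determiner",
}

def describe(tagged):
    """Return (token, tag, description) triples; unknown tags map to 'unlisted'."""
    return [(tok, tag, TAG_NAMES.get(tag, "unlisted")) for tok, tag in tagged]

# A few pairs copied from the tagger output above.
sample = [("I", "PRP"), ("am", "VBP"), ("looking", "VBG"), (".", ".")]
for row in describe(sample):
    print(row)
```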

For Korean natural language processing, you can use the Python package KoNLPy (pronounced "ko-en-el-pie"). The morphological analyzers available through KoNLPy are Okt (Open Korea Text), Mecab, Komoran, Hannanum, and Kkma.

Using a morphological analyzer in Korean NLP means, strictly speaking, performing morpheme tokenization rather than word tokenization: the text is split into morphemes. Here we tokenize with two of them, Okt and Kkma. (Okt used to be called Twitter and was renamed in version 0.5.0, so much material online still refers to it as Twitter; keep that in mind while studying.)

>>> from konlpy.tag import Okt
>>> okt=Okt()

>>> print(okt.morphs("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))

['열심히', '코딩', '한', '당신', ',', '연휴', '에는', '여행', '을', '가봐요']

>>> print(okt.pos("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))

[('열심히', 'Adverb'), ('코딩', 'Noun'), ('한', 'Josa'), ('당신', 'Noun'), (',', 'Punctuation'), ('연휴', 'Noun'), ('에는', 'Josa'), ('여행', 'Noun'), ('을', 'Josa'), ('가봐요', 'Verb')]

>>> print(okt.nouns("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))

['코딩', '당신', '연휴', '여행']

The example above tokenizes with the Okt morphological analyzer.

1) morphs : morpheme extraction
2) pos : part-of-speech tagging
3) nouns : noun extraction

These are what the methods used above do. The KoNLPy analyzers mentioned earlier all provide these methods. The morphs and pos results show that particles (josa) are split off by default, which is why a morphological analyzer is quite useful for preprocessing in Korean NLP.

Next, we tokenize the same sentence with the Kkma morphological analyzer.

>>> from konlpy.tag import Kkma
>>> kkma=Kkma()
>>> print(kkma.morphs("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))

['열심히', '코딩', '하', 'ㄴ', '당신', ',', '연휴', '에', '는', '여행', '을', '가보', '아요']

>>> print(kkma.pos("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))

[('열심히', 'MAG'), ('코딩', 'NNG'), ('하', 'XSV'), ('ㄴ', 'ETD'), ('당신', 'NP'), (',', 'SP'), ('연휴', 'NNG'), ('에', 'JKM'), ('는', 'JX'), ('여행', 'NNG'), ('을', 'JKO'), ('가보', 'VV'), ('아요', 'EFN')]

>>> print(kkma.nouns("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))

['코딩', '당신', '연휴', '여행']

The result differs from that of the Okt analyzer used earlier. Analyzers differ in both performance and output, so choose the one that best suits your purpose. If speed matters most, for example, Mecab is an option.
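One way to see how the two analyzers differ is to glue each token list back together. The sketch below only reuses the outputs printed above (no analyzer is run, and the variable names are my own): Okt's tokens are surface pieces of the sentence, while Kkma restores base forms such as '하' + 'ㄴ', so its tokens no longer concatenate to the original text.

```python
sentence = "열심히 코딩한 당신, 연휴에는 여행을 가봐요"

# Token lists copied from the Okt and Kkma outputs above.
okt_morphs = ['열심히', '코딩', '한', '당신', ',', '연휴', '에는', '여행', '을', '가봐요']
kkma_morphs = ['열심히', '코딩', '하', 'ㄴ', '당신', ',', '연휴', '에', '는', '여행', '을', '가보', '아요']

unspaced = sentence.replace(" ", "")

# True when the tokens are plain slices of the sentence.
print("".join(okt_morphs) == unspaced)
print("".join(kkma_morphs) == unspaced)

# Tokens produced by only one of the two analyzers.
print(sorted(set(okt_morphs) - set(kkma_morphs)))
print(sorted(set(kkma_morphs) - set(okt_morphs)))
```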

Source: wikidocs.net/21698


Working with English and Korean text in Python

홍반장水_ 2017. 6. 21. 10:39

- https://www.lucypark.kr/courses/2015-dm/text-mining.html

Terminologies

English	한국어	Description
Document	문서	-
Corpus	말뭉치	A set of documents
Token	토큰	Meaningful elements in a text such as words or phrases or symbols
Morphemes	형태소	Smallest meaningful unit in a language
POS	품사	Part-of-speech (ex: Nouns)

Text analysis process

Preprocessing breaks down further into the steps below.

Load text
Tokenize text (ex: stemming, morph analyzing)
Tag tokens (ex: POS, NER)
Token(Feature) selection and/or filter/rank tokens (ex: stopword removal, TF-IDF)
...and so on (ex: calculate word/document similarities, cluster documents)
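The steps listed above can be sketched end to end with the standard library. This is a toy illustration under loud assumptions: the text, the regular expression, and the stopword list are all made up, the tagging step is skipped, and a real project would substitute NLTK or KoNLPy for the tokenizing step.

```python
import re
from collections import Counter

# 1. Load text (a made-up string instead of a file).
text = "The dog barked at the cat. The cat ignored the dog."

# 2. Tokenize: lowercase, then keep runs of letters (a deliberately crude rule).
tokens = re.findall(r"[a-z]+", text.lower())

# 3. Filter: drop stopwords (a tiny hand-picked list, not a standard one).
stopwords = {"the", "at"}
content = [t for t in tokens if t not in stopwords]

# 4. Rank: order the remaining tokens by frequency.
ranking = Counter(content).most_common()
print(ranking)
```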

Useful Python Packages for Text Mining and NLP

NLTK: Provides modules for text analysis (mostly language independent)
- Installation
```
pip install nltk
```
- Main features
  1. Text corpora: this tutorial needs the two datasets below, so download them in advance.
```
nltk.download('gutenberg')
nltk.download('maxent_treebank_pos_tagger')
```
  2. Word POS, NER classification
  3. Document classification
KoNLPy: Provides modules for Korean text analysis
- Installation
```
pip install konlpy
```
- Main features
  1. Text corpora
  2. Word POS classification
    - Hannanum
    - Kkma
    - Mecab
    - Komoran
    - Twitter
Gensim: Provides modules for topic modeling and calculating similarities among documents
- Installation
```
pip install -U gensim
```
- Main features
  1. Topic modeling
  2. Word embedding
    - word2vec

Twython: Provides easy access to Twitter API

Installation
```
pip install twython
```

Usage example: fetching tweets about "Samsung (삼성)"

from twython import Twython
import settings as s    # Create a file named settings.py, and put oauth KEY values inside
twitter = Twython(s.APP_KEY, s.APP_SECRET, s.OAUTH_TOKEN, s.OAUTH_TOKEN_SECRET)
tweets = twitter.search(q='삼성', count=100)
data = [(t['user']['screen_name'], t['text'], t['created_at']) for t in tweets['statuses']]

Text exploration

1. Read document

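This tutorial uses documents that ship with NLTK and KoNLPy.

English: Jane Austen's novel Emma
Korean: Bill No. 1809890 of the National Assembly of the Republic of Korea

If you can, try loading some other text data in place of the documents above.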

English

from nltk.corpus import gutenberg   # Docs from project gutenberg.org
files_en = gutenberg.fileids()      # Get file ids
doc_en = gutenberg.open('austen-emma.txt').read()

Korean

from konlpy.corpus import kobill    # Docs from pokr.kr/bill
files_ko = kobill.fileids()         # Get file ids
doc_ko = kobill.open('1809890.txt').read()

2. Tokenize

There are many ways to split a document into tokens. Here we use nltk.regexp_tokenize for English and konlpy.tag.Twitter.morphs for Korean.

English

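from nltk import regexp_tokenize
# Non-capturing groups (?:...) so that whole tokens are returned rather than tuples of sub-groups
pattern = r'''(?x) (?:[A-Z]\.)+ | \w+(?:-\w+)* | \$?\d+(?:\.\d+)?%? | \.\.\. | [][.,;"'?():-_`]'''
tokens_en = regexp_tokenize(doc_en, pattern)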
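A detail worth checking with parenthesized patterns like the one above: Python's re.findall returns the contents of capturing groups instead of the whole match, and tokenizers built on it inherit that behaviour (as far as I recall, NLTK's documentation asks for non-capturing groups for this reason). A minimal sketch with the standard re module and a made-up sentence:

```python
import re

text = "a well-known fact"

# Capturing group: findall returns what the group matched, not the token.
with_groups = re.findall(r"\w+(-\w+)*", text)

# Non-capturing group (?:...): findall returns the whole match.
without_groups = re.findall(r"\w+(?:-\w+)*", text)

print(with_groups)
print(without_groups)
```

If a tokenizer returns empty strings or tuples where words were expected, rewriting each `(` in the pattern as `(?:` is the first thing to try.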

Korean

from konlpy.tag import Twitter; t = Twitter()
tokens_ko = t.morphs(doc_ko)

3. Load tokens with `nltk.Text()`

nltk.Text() provides a variety of functions for conveniently exploring a single document.

English
```
import nltk
en = nltk.Text(tokens_en)
```
Korean (For Python 2, name has to be input as u'유니코드'. If you are using Python 2, use u'유니코드' for input of all following Korean text.)
```
import nltk
ko = nltk.Text(tokens_ko, name='대한민국 국회 의안 제 1809890호')   # For Python 2, input `name` as u'유니코드'
```

Let's now go through the functions nltk.Text() provides, one by one. (Reference: the class nltk.text.Text API documentation)

Tokens

English

print(len(en.tokens))       # returns number of tokens (document length)
print(len(set(en.tokens)))  # returns number of unique tokens
en.vocab()                  # returns frequency distribution

191061
7927
FreqDist({',': 12018, '.': 8853, 'to': 5127, 'the': 4844, 'and': 4653, 'of': 4278, '"': 4187, 'I': 3177, 'a': 3000, 'was': 2385, ...})

Korean

print(len(ko.tokens))       # returns number of tokens (document length)
print(len(set(ko.tokens)))  # returns number of unique tokens
ko.vocab()                  # returns frequency distribution

1707
476
FreqDist({'.': 61, '의': 46, '육아휴직': 38, '을': 34, '(': 27, ',': 26, '이': 26, ')': 26, '에': 24, '자': 24, ...})

Plot frequency distributions

English

en.plot(50)     # Plot sorted frequency of top 50 tokens

Korean

ko.plot(50)     # Plot sorted frequency of top 50 tokens

Tip: To save a plot programmatically rather than through the GUI, overwrite pylab.show with pylab.savefig before drawing the plot (reference):

from matplotlib import pylab
pylab.show = lambda: pylab.savefig('some_filename.png')

Troubleshooting: If the saved plot file shows rectangles instead of letters, include the following configuration before drawing the plot:

from matplotlib import font_manager, rc
font_fname = 'c:/windows/fonts/gulim.ttc'     # A font of your choice
font_name = font_manager.FontProperties(fname=font_fname).get_name()
rc('font', family=font_name)

Some example fonts:
Mac OS: /Library/Fonts/AppleGothic.ttf

Count

English

en.count('Emma')        # Counts occurrences

Korean

ko.count('초등학교')   # Counts occurrences

Dispersion plot

English

en.dispersion_plot(['Emma', 'Frank', 'Jane'])

Korean

ko.dispersion_plot(['육아휴직', '초등학교', '공무원'])

Concordance

English

en.concordance('Emma', lines=5)

Displaying 5 of 865 matches:
                                     Emma by Jane Austen 1816 ] VOLUME I CHAPT
                                     Emma Woodhouse , handsome , clever , and
both daughters , but particularly of Emma . Between them it was more the int
 friend very mutually attached , and Emma doing just what she liked ; highly e
r own . The real evils , indeed , of Emma ' s situation were the power of havi

Korean (or, use konlpy.utils.concordance)

ko.concordance('초등학교')

Displaying 6 of 6 matches:
 ․ 김정훈 김학송 의원 ( 10 인 ) 제안 이유 및 주요 내용 초등학교 저학년 의 경우 에도 부모 의 따뜻한 사랑 과 보살핌 이 필요 한
 을 할 수 있는 자녀 의 나이 는 만 6 세 이하 로 되어 있어 초등학교 저학년 인 자녀 를 돌보기 위해서 는 해당 부모님 은 일자리 를
 다 . 제 63 조제 2 항제 4 호 중 “ 만 6 세 이하 의 초등학교 취학 전 자녀 를 ” 을 “ 만 8 세 이하 ( 취학 중인 경우
 전 자녀 를 ” 을 “ 만 8 세 이하 ( 취학 중인 경우 에는 초등학교 2 학년 이하 를 말한 다 ) 의 자녀 를 ” 로 한 다 . 부
 . ∼ 3 . ( 현행 과 같 음 ) 4 . 만 6 세 이하 의 초등학교 취 4 . 만 8 세 이하 ( 취학 중인 경우 학 전 자녀 를 양
세 이하 ( 취학 중인 경우 학 전 자녀 를 양육 하기 위하 에는 초등학교 2 학년 이하 를 여 필요하거 나 여자 공무원 이 말한 다 ) 의
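Both listings above are keyword-in-context views: every occurrence of a word shown with its neighbours. A plain-Python sketch of the idea follows; kwic is a hypothetical helper that takes a window of tokens on each side, which is simpler than (and not the same as) the fixed-width character padding NLTK prints.

```python
def kwic(tokens, word, width=3):
    """Return one line per occurrence of word, with up to width tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# A few tokens in the style of the English listing above.
sample_tokens = "Emma Woodhouse , handsome , clever , and rich".split()
for line in kwic(sample_tokens, ",", width=1):
    print(line)
```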

Find similar words

English

en.similar('Emma')
en.similar('Frank')

she it he i harriet you her jane him that me and all they them there herself was hartfield be
mr mrs emma harriet you it her she he him hartfield them jane that isabella all herself look i me

Korean

ko.similar('자녀')
ko.similar('육아휴직')

논의
None

Collocations

English

en.collocations()

Frank Churchill; Miss Woodhouse; Miss Bates; Jane Fairfax; Miss
Fairfax; every thing; young man; every body; great deal; dare say;
John Knightley; Maple Grove; Miss Smith; Miss Taylor; Robert Martin;
Colonel Campbell; Box Hill; said Emma; Harriet Smith; William Larkins

Korean

ko.collocations()

초등학교 저학년; 육아휴직 대상
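Collocations are word pairs that occur together unusually often. As a rough intuition only, the sketch below counts adjacent pairs in a made-up token list; NLTK's collocations() scores and filters pairs more carefully, so this is a simplification and not its algorithm.

```python
from collections import Counter

# Made-up tokens for illustration.
tokens = "Miss Bates met Frank Churchill and Miss Bates thanked Frank Churchill".split()

# Pair every token with its right-hand neighbour.
bigrams = list(zip(tokens, tokens[1:]))

# Keep the pairs seen more than once.
counts = Counter(bigrams)
repeated = [pair for pair, n in counts.items() if n > 1]
print(repeated)
```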

For more information on nltk.Text(), see the source code or API.

Tagging and chunking

Until now, we used delimited text, namely tokens, to explore our sample document. Now let's classify words into given classes, namely part-of-speech tags, and chunk text into larger pieces.

1. POS tagging

There are numerous ways of tagging a text. Among them, the most frequently used and most developed is arguably POS tagging.

Since a whole document is too long for inspecting a parsed structure, let's use one short sentence for each language.

English

tokens = "The little yellow dog barked at the Persian cat".split()
tags_en = nltk.pos_tag(tokens)

[('The', 'DT'),
 ('little', 'JJ'),
 ('yellow', 'NN'),
 ('dog', 'NN'),
 ('barked', 'VBD'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('Persian', 'NNP'),
 ('cat', 'NN')]

It is also possible to use the famous Stanford POS tagger with NLTK, with from nltk.tag.stanford import POSTagger (later NLTK releases name the class StanfordPOSTagger).

Korean

from konlpy.tag import Twitter; t = Twitter()
tags_ko = t.pos("작고 노란 강아지가 페르시안 고양이에게 짖었다")

[('작고', 'Noun'),
 ('노란', 'Adjective'),
 ('강아지', 'Noun'),
 ('가', 'Josa'),
 ('페르시안', 'Noun'),
 ('고양이', 'Noun'),
 ('에게', 'Josa'),
 ('짖었', 'Noun'),
 ('다', 'Josa')]

2. Noun phrase chunking

nltk.RegexpParser() is a great way to start chunking.

English

parser_en = nltk.RegexpParser("NP: {<DT>?<JJ>?<NN.*>*}")
chunks_en = parser_en.parse(tags_en)
chunks_en.draw()

Korean

parser_ko = nltk.RegexpParser("NP: {<Adjective>*<Noun>*}")
chunks_ko = parser_ko.parse(tags_ko)
chunks_ko.draw()

For more information on chunking, refer to Extracting Information from Text for English, and Chunking for Korean.
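To see what a tag pattern like the English one above does, the sketch below applies a similar rule with plain regular expressions over the tag sequence. It is a simplified stand-in, not how nltk.RegexpParser is implemented: chunk_noun_phrases is a hypothetical helper, and its rule requires at least one noun (<NN.*>+), which differs from the grammar above.

```python
import re

def chunk_noun_phrases(tagged):
    """Group tokens matching: optional DT, optional JJ, one or more NN* tags."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)?(?:<NN[^>]*>)+")
    chunks = []
    for m in pattern.finditer(tag_string):
        start = tag_string.count("<", 0, m.start())   # index of the first token in the match
        length = m.group().count("<")                 # number of tokens in the match
        chunks.append([word for word, _ in tagged[start:start + length]])
    return chunks

# The tagged sentence from the POS tagging step above.
tags_en = [('The', 'DT'), ('little', 'JJ'), ('yellow', 'NN'), ('dog', 'NN'),
           ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('Persian', 'NNP'),
           ('cat', 'NN')]
print(chunk_noun_phrases(tags_en))
```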

Drawing a word cloud

Let's look again at the frequency distribution of Bill No. 1809890.

print(ko.vocab())

FreqDist({'.': 61, '의': 46, '육아휴직': 38, '을': 34, '(': 27, ',': 26, '이': 26, ')': 26, '에': 24, '자': 24, ...})

Let's check the data type and attribute list of this frequency distribution.

type(ko.vocab())

nltk.probability.FreqDist

dir(ko.vocab())

['B',
 'N',
 ...
 'items',
 ...
 'pop',
 'popitem',
 'pprint',
 'r_Nr',
 'setdefault',
 'subtract',
 'tabulate',
 'unicode_repr',
 'update',
 'values']

items() gives all the items of the frequency distribution in a set-like form. Store it in a variable named data and inspect its data type.

data = ko.vocab().items()
print(data)
print(type(data))

dict_items([('명', 5), ('예상된', 3), ('하나', 1), ('11', 2), ('팀', 2), ...])
<class 'dict_items'>

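Now let's save this set to a file named words.csv. Use word,freq as the header.

import csv
# newline='' is what the csv module expects; it avoids blank rows on Windows
with open('words.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['word', 'freq'])   # header row
    writer.writerows(data)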
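A full frequency distribution can contain far more rows than a word cloud needs. The variation below is a sketch under assumptions: a small collections.Counter stands in for ko.vocab() (NLTK's FreqDist is a Counter subclass, so most_common should behave the same way), and the counts and the output file name are made up for illustration.

```python
import csv
import os
import tempfile
from collections import Counter

# Made-up counts standing in for ko.vocab().
freq = Counter({"육아휴직": 38, "의": 46, "자녀": 12, "공무원": 9})

# Keep only the three most frequent tokens.
top = freq.most_common(3)

# Hypothetical output location; the tutorial writes words.csv next to index.html.
path = os.path.join(tempfile.gettempdir(), "words_top.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "freq"])   # column names read by the d3 page
    writer.writerows(top)

# Read the file back to confirm what was written.
with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```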

Next, copy the code below and save it as index.html in the folder that contains words.csv.

<!DOCTYPE html>
<html>
<head>
  <style>
    text:hover {
        stroke: black;
    }
  </style>
  <script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>
  <script src="d3.layout.cloud.js"></script>
</head>
<body>
  <div id="cloud"></div>
  <script type="text/javascript">
    var weight = 3,   // change me
        width = 960,
        height = 500;
    var fill = d3.scale.category20();
    d3.csv("words.csv", function(d) {
        return {
          text: d.word,
          size: +d.freq*weight
        }
      },
      function(data) {
        d3.layout.cloud().size([width, height]).words(data)
          //.rotate(function() { return ~~(Math.random() * 2) * 90; })
          .rotate(0)
          .font("Impact")
          .fontSize(function(d) { return d.size; })
          .on("end", draw)
          .start();
        function draw(words) {
          d3.select("#cloud").append("svg")
              .attr("width", width)
              .attr("height", height)
            .append("g")
              .attr("transform", "translate(" + width/2 + "," + height/2 + ")")
            .selectAll("text")
              .data(words)
            .enter().append("text")
              .style("font-size", function(d) { return d.size + "px"; })
              .style("font-family", "Impact")
              .style("fill", function(d, i) { return fill(i); })
              .attr("text-anchor", "middle")
              .attr("transform", function(d) {
                return "translate(" + [d.x, d.y] + ")rotate(" + d.rotate + ")";
              })
            .text(function(d) { return d.text; });
        }
      });
  </script>
</body>
</html>


Run the following in that same folder.

python -m http.server 8888      # for Python2, `python -m SimpleHTTPServer`

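Finally, enter http://localhost:8888 in the address bar of a modern browser (ex: Chrome) and our word cloud will be there! (Clicking the image takes you to the interactive page.)
To experiment further:
1. The word cloud above also includes special characters, particles, and the like, which makes it less informative. Can you make the word cloud show nouns only?
2. Can you draw a word cloud for any other document (ex: my data mining project proposal)? Read the document in Python, extract the words that occur with high frequency, and draw them as a word cloud.
3. Can you draw a word cloud for several documents at once? Read multiple documents in Python in one go, extract the high-frequency words, and draw a word cloud.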

Author: Lucy Park

Category: 2015-dm

Tags: text lectures

