[python] gensim.models.Word2Vec.train

홍반장水_ 2017. 7. 12. 17:42

gensim.models.Word2Vec.train

Word2Vec.train(sentences, total_words=None, word_count=0, total_examples=None, queue_factor=2, report_delay=1.0)

Update the model’s neural weights from a sequence of sentences (can be a once-only generator stream). For Word2Vec, each sentence must be a list of unicode strings. (Subclasses may accept other examples.)
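A minimal sketch of what "a list of unicode strings" per sentence and "a once-only generator stream" mean in practice. The helper name tokenized_sentences and the sample lines are made up for illustration; nothing here calls gensim.

```python
def tokenized_sentences(lines):
    """Yield each non-empty line as a list of string tokens (one 'sentence')."""
    for line in lines:
        tokens = line.split()
        if tokens:
            yield tokens


raw_lines = ["my name is jamie", "", "jamie is cute"]

stream = tokenized_sentences(raw_lines)  # a generator: can be consumed only once
first_pass = list(stream)
second_pass = list(stream)               # the generator is already exhausted

print(first_pass)
print(second_pass)
```

Because the second pass comes back empty, a plain generator is only suitable when the corpus is read once. A workflow that reads the corpus twice (vocabulary, then training) needs something restartable, such as a list or a class with __iter__.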

To support linear learning-rate decay from (initial) alpha to min_alpha, either total_examples (count of sentences) or total_words (count of raw words in sentences) should be provided, unless the sentences are the same as those that were used to initially build the vocabulary.
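The reason a total count is needed can be seen from a small sketch of linear decay: the learning rate depends on how far through the planned work the training is, and that fraction cannot be computed without a total. The function below is an illustration of the idea, not gensim's internal code; the default values 0.025 and 0.0001 are gensim's defaults for alpha and min_alpha, and the total of 4 examples is hypothetical.

```python
def linear_alpha(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of the planned work."""
    progress = min(max(progress, 0.0), 1.0)  # clamp so the rate never leaves [min_alpha, alpha]
    return alpha - (alpha - min_alpha) * progress


total_examples = 4  # hypothetical sentence count
for done in range(total_examples + 1):
    # progress = examples processed so far / total examples expected
    print(done, round(linear_alpha(done / total_examples), 6))
```

If the total were underestimated, progress would reach 1.0 early and the rest of the corpus would be trained at min_alpha.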

https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.train.html#gensim.models.Word2Vec.train

gensim, gensim.models.word2vec

*** Korean Word2Vec, train

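import gensim

# Each sentence is a list of string tokens.
sentences = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]

# min_count=1 keeps every word; with the default (5) this tiny corpus leaves an empty vocabulary.
model = gensim.models.Word2Vec(sentences, min_count=1)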

#----------------------------------------------------------------------

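# SentenceReader is the class defined in the next snippet.
# One reader per pass: build_vocab() and train() each iterate over the whole corpus.
sentences_vocab = SentenceReader('corpus.txt')
sentences_train = SentenceReader('corpus.txt')

model = gensim.models.Word2Vec()
model.build_vocab(sentences_vocab)
# This call matches the signature quoted above; newer gensim releases
# require explicit total_examples and epochs arguments here.
model.train(sentences_train)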

#----------------------------------------------------------------------

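import codecs

class SentenceReader:

    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated more than once.
        with codecs.open(self.filepath, encoding='utf-8') as f:
            for line in f:
                # split() with no argument also drops the trailing newline.
                yield line.split()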
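When the sentences passed to train differ from the ones used to build the vocabulary, the documentation quoted earlier asks for total_examples or total_words. A rough sketch of how both counts could be obtained with one extra pass over any iterable of tokenized sentences (the helper name corpus_counts is made up for this post):

```python
def corpus_counts(sentences):
    """Return (number of sentences, number of raw words) for an iterable of token lists."""
    total_examples = 0
    total_words = 0
    for tokens in sentences:
        total_examples += 1
        total_words += len(tokens)
    return total_examples, total_words


sentences = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]
total_examples, total_words = corpus_counts(sentences)
print(total_examples, total_words)
```

Either number could then be passed to train as the total_examples or total_words argument shown in the signature. The counting pass consumes the iterable, so it needs a restartable reader rather than a one-shot generator.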

#----------------------------------------------------------------------

model.save('model')
model = gensim.models.Word2Vec.load('model')

# Analogy query: Seoul (서울) is to Korea (한국) as Tokyo (도쿄) is to ?
model.most_similar(positive=["한국/Noun", "도쿄/Noun"], negative=["서울/Noun"], topn=1)
# [("일본/Noun", 0.6401702165603638)]   # 일본 = Japan
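To see roughly what the positive/negative query does, here is a toy sketch in plain Python: add the unit vectors of the positive words, subtract those of the negative words, then rank the remaining words by cosine similarity to the result. The 2-d vectors are invented for illustration (English keys stand in for the Korean tokens), and the function is a simplification, not gensim's implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def most_similar(vectors, positive, negative, topn=1):
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    exclude = set(positive) | set(negative)  # the query words themselves are not candidates
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors: axis 0 roughly "which country", axis 1 roughly "country vs. city".
toy = {
    "korea": [1.0, 1.0],
    "seoul": [1.0, -1.0],
    "japan": [-1.0, 1.0],
    "tokyo": [-1.0, -1.0],
    "apple": [0.2, -0.1],
}

result = most_similar(toy, positive=["korea", "tokyo"], negative=["seoul"])
print(result)
```

With these hand-picked vectors, "japan" comes out on top, mirroring the Korean query above. Real embeddings are far noisier, which is why the score in the real output is around 0.64 rather than close to 1.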

#----------------------------------------------------------------------

import multiprocessing

config = {
    'min_count': 5,        # ignore words that occur fewer than 5 times
    'size': 300,           # embed into a 300-dimensional vector space
    'sg': 1,               # 0 for CBOW, 1 for skip-gram
    'batch_words': 10000,  # target number of words per batch handed to each worker thread
    'iter': 10,            # number of passes over the corpus, similar to epochs in deep learning
    'workers': multiprocessing.cpu_count(),
}

model = gensim.models.Word2Vec(**config)
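As a rough illustration of what min_count does to the vocabulary, the sketch below counts tokens and keeps only those that reach the threshold. The helper prune_vocab is hypothetical and only mimics the filtering idea; gensim's own vocabulary building does more than this.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Return {word: count} for words that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


sentences = [["my", "name", "is", "jamie"], ["jamie", "is", "cute"]]

kept = prune_vocab(sentences, min_count=2)
kept_with_5 = prune_vocab(sentences, min_count=5)
print(kept)
print(kept_with_5)
```

With min_count=5 nothing from this two-sentence corpus survives, so a threshold that high only makes sense for a reasonably large corpus.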

#----------------------------------------------------------------------

...
