ValueError: You must specify either total_examples or total_words, for proper alpha and progress calculations. The usual value is total_examples=model.corpus_count. #1956
Hello @Keramatfar, can you share your code and data?
Definitely, but my data is in Persian. My code:

```python
import csv
from collections import namedtuple
from nltk import word_tokenize
import gensim
from gensim.models.doc2vec import TaggedDocument

LabeledSentence = gensim.models.doc2vec.LabeledSentence

docs = []
tags = []
with open('all.csv', encoding='UTF-16') as data:
    r = csv.reader(data, delimiter='\t')
    for i in r:
        words = word_tokenize(i[1].lower().replace('-', ' '))
        tags.append(i[0])
        docs.append(words)

class LabeledLineSentence(object):
    def __init__(self, docs, tags):
        self.labels_list = tags
        self.doc_list = docs

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield LabeledSentence(words=doc, tags=[self.labels_list[idx]])

it = LabeledLineSentence(docs, tags)

model = gensim.models.Doc2Vec(size=300, window=10, min_count=5, workers=11,
                              alpha=0.025, min_alpha=0.025)  # use fixed learning rate
model.build_vocab(it)
for epoch in range(10):
    model.train(it, epochs=model.iter, total_examples=model.corpus_count)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
model.train(it)  # this call, with no total_examples/total_words, raises the ValueError

# Most similar docs
t = word_tokenize("مقایسه اثر گاماگلوبولین وریدی با رژیم کتوژنیک در کودکان مبتلا به صرع مقاوم رژیم کتوژنیک صرع - رژیم درمانی گاماگلوبولین مجله علمی دانشگاه علوم پزشکی و خدمات بهداشتی درمانی همدان")
tokens = model.infer_vector(t)
sims = model.docvecs.most_similar([tokens])
print(sims)
```

The data file is a tab-delimited CSV.
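An aside on the `LabeledLineSentence` class above: gensim iterates over the corpus more than once (once for `build_vocab`, then again for each training epoch), so the corpus must be a restartable iterable, not a one-shot generator. A minimal pure-Python sketch of the difference (no gensim required; the names here are illustrative):

```python
# A one-shot generator is exhausted after a single pass.
def one_shot(docs):
    for d in docs:
        yield d

# A class with __iter__ hands back a fresh iterator on every pass,
# which is what multi-epoch training requires.
class RestartableCorpus:
    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        return iter(self.docs)

docs = [["hello", "world"], ["foo", "bar"]]

gen = one_shot(docs)
first_pass = list(gen)
second_pass = list(gen)  # empty: the generator is used up

corpus = RestartableCorpus(docs)
passes = [list(corpus), list(corpus)]  # both passes see all documents

print(len(first_pass), len(second_pass))  # 2 0
print(len(passes[0]), len(passes[1]))     # 2 2
```

This is why `LabeledLineSentence` is written as a class with `__iter__` rather than as a plain generator function.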
@Keramatfar the problem is not that your data is in Persian. I already see several mistakes (this looks like the "old" approach):

```python
model = gensim.models.Doc2Vec(size=300, window=10, min_count=5, workers=11,
                              alpha=0.025, min_alpha=0.025)  # use fixed learning rate
model.build_vocab(it)
for epoch in range(10):
    model.train(it, epochs=model.iter, total_examples=model.corpus_count)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
model.train(it)
```

should be

```python
model = gensim.models.Doc2Vec(size=300, window=10, min_count=5, workers=11,
                              alpha=0.025, min_alpha=0.025, iter=20)
model.build_vocab(it)
model.train(it, epochs=model.iter, total_examples=model.corpus_count)
```
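To see why the manual loop is redundant: a single `train()` call lets gensim decay the learning rate from `alpha` down to `min_alpha` over all epochs, while the manual loop runs each epoch at a fixed rate and steps alpha down by 0.002 between calls, ending at 0.025 − 10 × 0.002 = 0.005. A plain-Python arithmetic sketch (no gensim; the linear schedule below is a simplification of gensim's internal decay, and 0.0001 is used as an illustrative `min_alpha`):

```python
# Manual loop from the original code: 10 train() calls, each at a fixed rate,
# decreasing alpha by 0.002 between calls.
alpha = 0.025
manual_schedule = []
for epoch in range(10):
    manual_schedule.append(round(alpha, 6))
    alpha -= 0.002
final_manual_alpha = round(alpha, 6)

# Single train() call: alpha decays linearly from start to end across epochs
# (a simplification of gensim's internal per-batch schedule).
def linear_schedule(start, end, epochs):
    step = (start - end) / epochs
    return [round(start - i * step, 6) for i in range(epochs)]

auto_schedule = linear_schedule(0.025, 0.0001, 20)

print(manual_schedule[0], manual_schedule[-1], final_manual_alpha)  # 0.025 0.007 0.005
print(auto_schedule[0], auto_schedule[-1])                          # 0.025 0.001345
```

The single-call form gets a smooth decay for free, whereas the manual loop both complicates the code and leaves the final rate wherever the arithmetic happens to land.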
Ping @Keramatfar, we are waiting for your reply.
I have a similar problem. There is no way to reuse a pre-scanned vocabulary. I have explained the problem in the comments of the following code:

```python
vocab = gensim.models.word2vec.Word2VecVocab(min_count=1, sorted_vocab=True)
vocab.scan_vocab(sentences, progress_per=PROGRESS_PER, trim_rule=None)
vocab.save(VOCAB_FILE_NAME)

# LATER
vocab = gensim.models.word2vec.Word2VecVocab.load(VOCAB_FILE_NAME)
# The Word2VecVocab object doesn't keep the total_words and corpus_count info,
# so it is now impossible for me to retrieve these values.
# They are returned from scan_vocab instead of being saved as fields of the object.

# Create an empty model.
model = gensim.models.word2vec.Word2Vec(min_count=MIN_COUNT, max_vocab_size=MAX_VOCAB_SIZE,
                                        seed=SEED, sg=SG, workers=WORKERS, size=SIZE)

# Prepare the vocabulary. This step is also very confusing.
# I think prepare_vocab should belong to the model object, since it actually
# modifies the model's parameters.
vocab.prepare_vocab(model.hs, model.negative, model.wv, min_count=MIN_COUNT, keep_raw_vocab=False)

# This part is even messier. There is no way I could have known what to do next
# without going through the source code.
# model.build_vocab rescans the whole vocabulary, rendering my first phase useless.
# Instead, I have to use model.trainables.prepare_weights as a workaround.
# Again, what does vocabulary=model.vocabulary do?
model.trainables.prepare_weights(model.hs, model.negative, model.wv, update=False,
                                 vocabulary=model.vocabulary)

# Now, most importantly: how do I set the total_examples parameter, and why is it
# named differently? model.corpus_count returns zero, because the only way to set
# it is build_vocab, which rescans the vocabulary.
# Since model.corpus_count is zero, I get this error:
#   File "/home/ahmed/anaconda3/lib/python3.5/site-packages/gensim/models/base_any2vec.py", line 135, in _job_producer
#     epoch_progress = 1.0 * pushed_words / total_words
#   ZeroDivisionError: float division by zero
# ---
# There is no way I can get the total_words info from the vocabulary without
# scanning it again.
model.train(sentences=sentences, total_examples=model.corpus_count,
            total_words=model.corpus_count, epochs=model.iter)
model.save(MODEL_FILE_NAME)
```
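One pragmatic workaround for the lost `corpus_count`/`total_words` is to compute and persist those numbers yourself alongside the saved vocabulary, then feed them back to `train()` after reloading. A hedged pure-Python sketch (no gensim involved; `scan_corpus` and the file name are illustrative stand-ins for whatever scan you already perform):

```python
import json
import os
import tempfile

# Hypothetical stats gathered during your own single pass over the corpus.
def scan_corpus(sentences):
    corpus_count = 0
    total_words = 0
    for sentence in sentences:
        corpus_count += 1
        total_words += len(sentence)
    return {"corpus_count": corpus_count, "total_words": total_words}

sentences = [["a", "b", "c"], ["d", "e"], ["f"]]
stats = scan_corpus(sentences)

# Save the stats next to the vocabulary file so they survive a reload.
stats_path = os.path.join(tempfile.mkdtemp(), "vocab_stats.json")
with open(stats_path, "w") as f:
    json.dump(stats, f)

# LATER: reload and pass the values to
# model.train(..., total_examples=loaded["corpus_count"],
#             total_words=loaded["total_words"], epochs=...)
with open(stats_path) as f:
    loaded = json.load(f)

print(loaded)  # {'corpus_count': 3, 'total_words': 6}
```

This avoids the second full scan: the counts are cheap to record during the pass you already make, and the JSON sidecar keeps them in sync with the saved vocabulary file.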
@ahmedahmedov unfortunately, you can't retrieve it correctly if you didn't call `build_vocab`.
closed (unreproducible)
That sounds messy indeed. If what @ahmedahmedov writes is true, that's just bad API. @menshikh-iv @gojomo, what should @ahmedahmedov's example look like, "officially", according to you? What is he doing wrong?
@piskvorky I also didn't understand why the example I provided was marked as not reproducible. One can easily reproduce it by going through the same steps.
Yes, that part is confusing. There's no "official" way to prepare a vocabulary for a model separately from `build_vocab`.
Yeah, we need some clarity around this. Both in terms of fixing the API (where needed), and definitely clearer ELI5 docs with examples / tutorials. My feeling is that fixing the API needs @gojomo's hand, or it will keep sinking further. The tutorial could be written by anyone knowledgeable with the refactored code base (@manneshiva?), but is also non-trivial.
I am trying to train a doc2vec model using gensim, but I get some confusing errors. I get the above error when I use this line in a for loop to train:

```python
model.train(it, epochs=model.iter, total_examples=model.corpus_count)
```