
build_vocab fails when calling with different trim_rule for same corpus #1187

Closed

prakhar2b opened this issue Mar 7, 2017 · 4 comments

Comments

@prakhar2b
Contributor

model = gensim.models.Word2Vec(sentences,min_count=3,trim_rule=my_rule)

Now, if we try to rebuild the vocabulary for the same model with a different trim_rule:

model.build_vocab(sentences, trim_rule=my_rule2)

we get the error "must sort before initializing vectors/weights".
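The report doesn't show the bodies of my_rule and my_rule2. A minimal sketch of what such rules could look like, following gensim's trim_rule contract (a callable taking (word, count, min_count) and returning one of gensim.utils' RULE_KEEP, RULE_DISCARD, or RULE_DEFAULT); the rule logic below is purely hypothetical:

```python
# Numeric values of gensim.utils.RULE_DEFAULT / RULE_DISCARD / RULE_KEEP,
# inlined here so the sketch runs without gensim installed.
RULE_DEFAULT, RULE_DISCARD, RULE_KEEP = 0, 1, 2

def my_rule(word, count, min_count):
    # Hypothetical rule: always keep short words, defer to min_count otherwise.
    return RULE_KEEP if len(word) <= 4 else RULE_DEFAULT

def my_rule2(word, count, min_count):
    # Hypothetical rule: explicitly discard words rarer than min_count.
    return RULE_DISCARD if count < min_count else RULE_DEFAULT
```

Either callable can be passed as the trim_rule argument to Word2Vec() or build_vocab().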

Here is the full traceback:

RuntimeError                              Traceback (most recent call last)
<ipython-input-13-541bf9e02ddb> in <module>()
----> 1 model.build_vocab(sentences, trim_rule= my_rule2)
      2 #print_vocab(model2)

/home/prakhar/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py in build_vocab(self, sentences, trim_rule, keep_raw_vocab, progress_per, update)
    545         self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
    546         self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update)  # trim by min_count & precalculate downsampling
--> 547         self.finalize_vocab(update=update)  # build tables & arrays
    548 
    549     def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):

/home/prakhar/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py in finalize_vocab(self, update)
    703             self.scale_vocab()
    704         if self.sorted_vocab and not update:
--> 705             self.sort_vocab()
    706         if self.hs:
    707             # add info about each word's Huffman encoding

/home/prakhar/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py in sort_vocab(self)
    726         """Sort the vocabulary so the most frequent words have the lowest indexes."""
    727         if len(self.wv.syn0):
--> 728             raise RuntimeError("must sort before initializing vectors/weights")
    729         self.wv.index2word.sort(key=lambda word: self.wv.vocab[word].count, reverse=True)
    730         for i, word in enumerate(self.wv.index2word):

RuntimeError: must sort before initializing vectors/weights

Isn't this a bug? The vocabulary should be updated according to the newly provided trim_rule.

@gojomo
Collaborator

gojomo commented Mar 7, 2017

I believe you'd get that same error even without trim_rule specified.

In general, triggering build_vocab() more than once, without the (in my opinion experimental/sketchy) update parameter, isn't a supported/well-defined operation. The best it could do (and what I believe it used to do) is completely clobber the existing vocabulary & model state – essentially starting a new model. Now, it appears it will trigger this error, because of the sanity-check on the sort.

The error message is poorly worded, implying taking an extra step (sorting earlier) might fix the issue. Instead, it's the sort-attempt that's failing. So perhaps the message should be: "cannot sort vocabulary after model weights already initialized".

@prakhar2b
Contributor Author

@gojomo Yes, thanks for clarifying. I was trying to solve another issue involving trim_rule when I ran into this one. I'll update the error message and submit a PR.

@tmylk
Contributor

tmylk commented Mar 7, 2017

Fixed in #1190

@tmylk tmylk closed this as completed Mar 7, 2017
@liuyuanyue185

> I believe you'd get that same error even without trim_rule specified. […] So perhaps the message should be: "cannot sort vocabulary after model weights already initialized".

Dear gojomo,

I agree with you. I ran into the same problem when I tried to use GridSearch to find the best parameters for Doc2Vec. Do you know how to clobber the existing vocabulary & model state efficiently?

Thanks
