
build_vocab fails when calling with different trim_rule for same corpus #1187

Closed

prakhar2b opened this issue Mar 7, 2017 · 4 comments

Comments

@prakhar2b
Contributor

model = gensim.models.Word2Vec(sentences,min_count=3,trim_rule=my_rule)

Now, if we try to rebuild the vocabulary for the same model with a different trim_rule:

model.build_vocab(sentences, trim_rule=my_rule2)

we get the error "must sort before initializing vectors/weights".
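The report doesn't show the bodies of my_rule and my_rule2. A minimal sketch of what such rules could look like, following gensim's trim_rule contract (a callable taking (word, count, min_count) and returning one of gensim.utils' RULE_KEEP, RULE_DISCARD, or RULE_DEFAULT); the rule logic below is purely hypothetical:

```python
# Numeric values of gensim.utils.RULE_DEFAULT / RULE_DISCARD / RULE_KEEP,
# inlined here so the sketch runs without gensim installed.
RULE_DEFAULT, RULE_DISCARD, RULE_KEEP = 0, 1, 2

def my_rule(word, count, min_count):
    # Hypothetical rule: always keep short words, defer to min_count otherwise.
    return RULE_KEEP if len(word) <= 4 else RULE_DEFAULT

def my_rule2(word, count, min_count):
    # Hypothetical rule: explicitly discard words rarer than min_count.
    return RULE_DISCARD if count < min_count else RULE_DEFAULT
```

Either callable can be passed as the trim_rule argument to Word2Vec() or build_vocab().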

Here is the full traceback:

RuntimeError                              Traceback (most recent call last)
<ipython-input-13-541bf9e02ddb> in <module>()
----> 1 model.build_vocab(sentences, trim_rule= my_rule2)
      2 #print_vocab(model2)

/home/prakhar/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py in build_vocab(self, sentences, trim_rule, keep_raw_vocab, progress_per, update)
    545         self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
    546         self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update)  # trim by min_count & precalculate downsampling
--> 547         self.finalize_vocab(update=update)  # build tables & arrays
    548 
    549     def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):

/home/prakhar/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py in finalize_vocab(self, update)
    703             self.scale_vocab()
    704         if self.sorted_vocab and not update:
--> 705             self.sort_vocab()
    706         if self.hs:
    707             # add info about each word's Huffman encoding

/home/prakhar/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py in sort_vocab(self)
    726         """Sort the vocabulary so the most frequent words have the lowest indexes."""
    727         if len(self.wv.syn0):
--> 728             raise RuntimeError("must sort before initializing vectors/weights")
    729         self.wv.index2word.sort(key=lambda word: self.wv.vocab[word].count, reverse=True)
    730         for i, word in enumerate(self.wv.index2word):

RuntimeError: must sort before initializing vectors/weights

Isn't this a bug? The vocabulary should be updated according to the newly provided trim_rule.

@gojomo
Collaborator

gojomo commented Mar 7, 2017

I believe you'd get that same error even without trim_rule specified.

In general, triggering build_vocab() more than once, without the (in my opinion experimental/sketchy) update parameter, isn't a supported/well-defined operation. The best it could do (and what I believe it used to do) is completely clobber the existing vocabulary & model state – essentially starting a new model. Now, it appears it will trigger this error, because of the sanity-check on the sort.

The error message is poorly worded, implying taking an extra step (sorting earlier) might fix the issue. Instead, it's the sort-attempt that's failing. So perhaps the message should be: "cannot sort vocabulary after model weights already initialized".

@prakhar2b
Contributor Author

@gojomo Yes, thanks for clarifying. I was trying to solve another issue involving trim_rule when I ran into this one. I'll update the error message and submit a PR.

@tmylk
Contributor

tmylk commented Mar 7, 2017

Fixed in #1190

@tmylk tmylk closed this as completed Mar 7, 2017
@liuyuanyue185

> I believe you'd get that same error even without trim_rule specified. […] So perhaps the message should be: "cannot sort vocabulary after model weights already initialized".

Dear gojomo,

I agree with you. I ran into the same problem when I tried to use GridSearch to find the best parameters for Doc2Vec. Do you know how to clobber the existing vocabulary & model state efficiently?

Thanks
