Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

different Phrases behavior for same code/corpus between 0.13.4.1 and 2.1.0 #1401

Closed
jeremybmerrill opened this issue Jun 8, 2017 · 30 comments · Fixed by #1853
Closed

different Phrases behavior for same code/corpus between 0.13.4.1 and 2.1.0 #1401

jeremybmerrill opened this issue Jun 8, 2017 · 30 comments · Fixed by #1853
Assignees
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills

Comments

@jeremybmerrill
Copy link

Description

Hi friends -- I've been using gensim 0.13.4.1 to generate bigrams and trigrams (Phrase model), then to train a word2vec model on a stream of sentences run thru the trigrams phrase model. Works great.

But I've tried to upgrade to 1.0.0 and 2.1.0 and the results are way way worse -- with the exact same code. I've scoured the release notes but haven't found an explanation. Don't see deprecation warnings in 0.13.4.1 either.

I should clarify -- on gensim 2.1.0, everything "works", i.e. there are no exceptions. It's just that the answers it gives for similarity are incorrect. The classic man : king :: woman : __ analogy gives "chernow" in the 2.1.0-trained model, but "queen" (the expected answer) in the 0.13.4.1-trained model.

I'm not an expert on any of this, so this could be a silly mistake. (I posted about this in the gitter channel and was directed to open an issue.)

Steps/Code/Corpus to Reproduce

Snippet:

  model = gensim.models.Word2Vec(
                                  ngrams_models[ngrams_model](sentences) if ngrams_model else sentences, 
                                  workers=4, 
                                  min_count=min_count, 
                                  size=size, 
                                  sample=downsampling, 
                                  # sg=(1 if use_skipgrams else 0)
                                )

  model.init_sims(replace=True)
  try:
    model_name = "model_%s_%s_%s_min_count_%s_size_%s_downsampling_%s_%s.bin" % (sentences_filename.split("/")[-1].split(".")[0], "stemmed" if stemming else "raw_words", ngrams_model, min_count, size, downsampling, "sg" if use_skipgrams else "cbow")
  except:
    model_name = "model.bin"
  model.save(model_name)

  with open("most_recent_model_filename.txt", "w") as f:
    f.write(model_name)

  print(model.most_similar_cosmul(positive=['woman', 'king'], negative=['man']))

FULL training script (slightly sanitized): https://github.com/jeremybmerrill/ineffable_abracadabra_of_cleaning/blob/master/train_in_vain.py. The only difference it takes to get the different results is changing requirements.txt to specify gensim 2.1.0 and re-running pip install -r requirements.txt

Expected Results

print(model.most_similar_cosmul(positive=['woman', 'king'], negative=['man']))
should have queen as the first answer, e.g.
[(u'queen', 0.8925966620445251), (u'royal', 0.8144874572753906), (u'wales', 0.7882706522941589), (u'kingston_ontario', 0.7752816677093506), (u'helen_mirren', 0.7718634009361267), (u'bismarck', 0.7668762803077698), (u'princess', 0.7609482407569885), (u'queen_elizabeth_ii', 0.76094651222229), (u'prince_philip', 0.7592762112617493), (u'duchess', 0.7587381601333618)]

this is the result from in 0.13.4.1

Actual Results

gives chernow as the first answer and a bunch of other irrelevant stuff, ("chernow",0.8686840534210205),("anxiety_disorders",0.8328080177307129),("marlinga",0.8300415277481079), ...

(This is with a model trained in 2.1.0 and calling model.most_similar_cosmul in 2.1.0. I haven't tested what happens using a model trained in 0.13.4.1, but loading the model and calling model.most_similar_cosmul in 2.1.0.)

Versions

Linux-4.12.0-999-generic-x86_64-with-Ubuntu-17.04-zesty
>>> import sys; print("Python", sys.version)
('Python', '2.7.13 (default, Jan 19 2017, 14:48:08) \n[GCC 6.3.0 20170118]')
>>> import numpy; print("NumPy", numpy.__version__)
('NumPy', '1.12.1')
>>> import scipy; print("SciPy", scipy.__version__)
('SciPy', '0.19.0')
>>> import gensim; print("gensim", gensim.__version__)
('gensim', '2.1.0')
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
('FAST_VERSION', 1)

I appreciate any help y'all can offer! (Great software, love it, so much fun, :D etc. )

@gojomo
Copy link
Collaborator

gojomo commented Jun 9, 2017

Certainly a concern and maybe a worrisome regression.

Without yet looking deeply into specifics of your process, some things worth checking:

  • do you see a similar discrepancy with plain most_similar() (as opposed to most_similar_cosmul())?
  • with logging on (say to INFO level), is there anything different/fishy about the reporting/progress-stats/timing when running the 2.0.1 variant – perhaps suggesting less training is happening, or the same corpus is being interpreted as a different size?
  • since you are able to see the error with a unigram test, what if you eliminate all the Phrases setup – do you see the same drop-off in results-quality?
  • if it's easy to re-try your process changing just the version of gensim, do you see the same problem in intervening versions? (Especially: 1.0.1 and 2.0.0.)
  • it would be interesting to test the variant you parenthetically mention: if a model trained in 0.13.4.1, save()d, but then load()ed in 2.0.1 shows the better or worse results
  • I don't believe there have been any significant changes in defaults between those versions (though there were earlier); still, to be certain you may want to log the actual in-model values used, for each of the things that can be specified in the model-initialization (Word2Vec.__init__() parameters as copied into model properties)

@jeremybmerrill
Copy link
Author

jeremybmerrill commented Jun 9, 2017

Hi @gojomo; thanks.

it would be interesting to test the variant you parenthetically mention: if a model trained in 0.13.4.1, save()d, but then load()ed in 2.0.1 shows the better or worse results

This one is easy to check. loading a model in 2.0.1 that I saved in 0.13.4.1 gives the "right" answers. I.e. "queen" not "chernow".

I will check on the rest of the stuff.

@jeremybmerrill
Copy link
Author

with logging on (say to INFO level), is there anything different/fishy about the reporting/progress-stats/timing when running the 2.0.1 variant – perhaps suggesting less training is happening, or the same corpus is being interpreted as a different size?

Yes, it turns out.

2017-06-09 09:09:36,280 : WARNING : train() called with an empty iterator (if not intended, be sure to provide a corpus that offers restartable iteration = an iterable).

2017-06-09 09:09:36,282 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
2017-06-09 09:09:36,282 : WARNING : supplied example count (0) did not equal expected count (27454375)

Looks like quite the hint. I train the model like this (which presumably calls train() internally):

  model = gensim.models.Word2Vec(
                                  trigrams_model[bigrams_model[sentences]], 
                                  workers=4, 
                                  min_count=min_count, 
                                  size=size, 
                                  sample=downsampling, 
                                  # sg=(1 if use_skipgrams else 0)
                                )

I'm trying right now to train it without the ngram stuff, just the plain sentences iterator.

@jeremybmerrill
Copy link
Author

You are right about the ngrams -- removing that and throwing in the plain sentences iterator on 2.1.0 works fine.

Any idea what might have caused the change?

@gojomo
Copy link
Collaborator

gojomo commented Jun 9, 2017

This is helpful narrowing. It might be a Phrases regression, we'll have to take a look at recent changes there, especially around 34759be @tmylk @ELind77

A practical workaround could be to apply the (two-level) phrase-ification just once, writing the results to a new file – which is then iterated over multiple times as a simple text corpus. (This could offer performance benefits too - the repeated phrase-promotion on each training iteration can be expensive duplicate work.)

@piskvorky
Copy link
Owner

piskvorky commented Jun 10, 2017

Thanks a lot @jeremybmerrill ! And @gojomo .

Whatever the root cause is, it implies our tests are insufficient, if they allow something like this to slip through.

@jeremybmerrill
Copy link
Author

jeremybmerrill commented Jun 18, 2017

Thanks @gojomo for that tip. I load the bigrams/trigrams models, then iterate through my raw corpus, saving the resulting phrase-ified tokens to a new clean file (space-delimited tokens, basically). Then I train w2v on that new clean file.

Now it works as expected (i.e. producing "queen" from "man : king :: woman :_____").

Oddly though, it took about 3 hours to train everything, compared to about an hour before. Not a huge problem, but still odd. Is there a quicker way that you were envisioning to write the output of the phraseification to disk than this?

  sentences = SentencesToPhraseify(raw_sentences_filename) # this runs the sentences through the `trigrams[bigrams[sentence]]` thing
  with open(ngrammed_sentences_filename, 'a') as f:
    for s in sentences:
      f.write(' '.join(s))

@gojomo
Copy link
Collaborator

gojomo commented Jun 18, 2017

While the initial writing of the tokenized/phrase-combined corpus is an extra time-consuming step, I would expect each individual followup Word2Vec pass to be a little faster – because the computationally-expensive phrase-upgrading isn't being repeated each time, in a single bottleneck thread.

Can you track the timing of each individual step, by either watching logging or adding time/duration printouts, to see for sure if anything has slowed? (And, to be sure that in some way the 'faster' path isn't just inadvertently doing less, such as skipping training entirely because a non-restartable iterator only allows the vocab-discovery pass, not multiple training passes?)

That is roughly the way I'd write a tokenized corpus to disk. I would however tend to use the smart_open() utility method as used in gensim code/examples, and a filename that indicates compression is desired (ends in .gz or .bz2) so that smart_open() automatically compresses. For typical spinning-HD speeds, compression often saves more in IO time than it costs in CPU, so this step might also improve later training iteration speed. (It may have less of an impact if already using an SSD.)

@jeremybmerrill
Copy link
Author

Yes, it works now. Just took 16 minutes to train, with the pre-written phraseified corpus. And the results are as expected. Thanks for the suggestion!

I'll try re-doing the writing step with timers.

@gojomo
Copy link
Collaborator

gojomo commented Jun 21, 2017

So the remaining issue here is that something in the Phrases behavior changed between 0.13.4.1 and 2.1.0, in a way that caused your otherwise working (and proper-seeming) code to stop working. Ideally we'll find a tiny test case to exhibit the difference. Leaving this issue open, with an updated title, for finding/fixing that issue.

@gojomo gojomo changed the title different analogy results for same code/corpus between 0.13.4.1 and 2.1.0 different Phrases behavior for same code/corpus between 0.13.4.1 and 2.1.0 Jun 21, 2017
@piskvorky
Copy link
Owner

piskvorky commented Jun 22, 2017

CC @prakhar2b @menshikh-iv @jayantj -- let's make sure we don't transfer this newly introduced bug into the "optimized" Phrases too.

@jayantj
Copy link
Contributor

jayantj commented Jun 22, 2017

@piskvorky okay, we'll make sure.

@ELind77
Copy link
Contributor

ELind77 commented Jul 23, 2017

@gojomo Just do be clear, did I break something? It seems like I didn't but I want to make sure so that I can fix it if I need to.

@gojomo
Copy link
Collaborator

gojomo commented Jul 24, 2017

@ELind77 I'm not sure, but the evidence above makes it seem like something changed in Phrases-related behavior breaking @jeremybmerrill's code – and we should figure out for sure what explains the reported symptoms.

@gojomo
Copy link
Collaborator

gojomo commented Aug 31, 2017

@piskvorky piskvorky added the bug Issue described a bug label Sep 1, 2017
@menshikh-iv menshikh-iv added the difficulty medium Medium issue: required good gensim understanding & python skills label Oct 2, 2017
@sj29-innovate
Copy link
Contributor

@menshikh-iv As per your suggestion , I would like to this up, can you provide some starting points ? Thanks a lot.

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Jan 12, 2018

@sj29-innovate

  1. Carefully read the bug reports (there are all the links here).
  2. Reproduce this bug (compare how it works 0.13.4.1 and 3.2.0).
  3. Look into "candidates for regressions", for example, different Phrases behavior for same code/corpus between 0.13.4.1 and 2.1.0 #1401 (comment).
  4. Found, why this problem happens (and how this happens).
  5. Fix this bug & submit PR, don't forget to add tests for this case (to prevent regression in future).

@sj29-innovate
Copy link
Contributor

@menshikh-iv I read the bug reports and tried to reproduce the error by comparing both the versions but i am getting unexpected results for both the versions , can you please suggest what corpus should i use for training the word2vec model.

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Jan 13, 2018

@sj29-innovate this is bug in Phrases (w2v used only for demonstration, you can ignore it).
Use any corpus (no object).

also, add 2.1.0 to the comparison.

@sj29-innovate
Copy link
Contributor

sj29-innovate commented Jan 13, 2018

@menshikh-iv So basically , first i should find some test case such that the phrasing output is different for different versions , right ? thanks.

@menshikh-iv
Copy link
Contributor

@sj29-innovate problem not with "concrete" dataset, problem with Phrases class

main "hint" for you here - #1401 (comment) ("train called with empty iterator")

Phrases called as trigrams_model[bigrams_model[sentences]]

You can try to reproduce this behavior with w2v and any dataset (as (1) from my list).

@sj29-innovate
Copy link
Contributor

sj29-innovate commented Jan 18, 2018

@menshikh-iv @gojomo I think I have found the glitch , the "_is_single" function caused this behaviour . Returning "obj" instead of iterator to obj "iter(obj)" gives correct results.
Currently it exist as ,

obj_iter = iter(obj) 
    try:  
        peek = next(obj_iter) 
        obj_iter = it.chain([peek], obj_iter)
    except StopIteration:
        # An empty object is a single document
        return True, obj
    if isinstance(peek, string_types):
        # It's a document, return the iterator
        return True, obj_iter
    else:
        # If the first item isn't a string, assume obj is a corpus
        return False, obj_iter

Changing it to ,

obj_iter = iter(obj)
    try:
        peek = next(obj_iter)
        obj_iter = it.chain([peek], obj_iter)
    except StopIteration:
        # An empty object is a single document
        return True, obj
    if isinstance(peek, string_types):
        # It's a document, return the iterator
        return True, obj_iter
    else:
        # If the first item isn't a string, assume obj is a corpus
        return False, list(obj_iter)

gives correct results because _is_single check is followed by "self._apply" later on and therefore returning iterator in case the object is a corpus gives an empty iterator.
I have used eng_news_2015_30K.tar.gz corpus for my purpose , my observation was that after the above change I got similar results if I train word2vec on plain sentences iterator as well as using "trigrams_models[bigrams_model[sentences]]" . Without this change i was getting quite different results for both. Though the results are not very satisfactory may be because of the limited corpus but I was able to generate same results from the two inputs with the above change and also now the results match those of the version 0.13.4.1

@gojomo
Copy link
Collaborator

gojomo commented Jan 20, 2018

@sj29-innovate The lack of line formatting makes the code hard to compare/view; can you fix in your comment? Is there a minimal synthetic corpus that illustrates the issue, and that the behavior after the proposed-change is preferable?

@sj29-innovate
Copy link
Contributor

@menshikh-iv @gojomo Should i submit a PR with appropriate tests for the same ?

@menshikh-iv
Copy link
Contributor

@sj29-innovate of course!

sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 23, 2018
… versions , added tests for checking empty iterator
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 23, 2018
… versions , added tests for checking empty iterator
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 23, 2018
…versions , test added for empty iterator
@gojomo
Copy link
Collaborator

gojomo commented Jan 23, 2018

I don't quite understand how the proposed test applies, but: does this exact test succeed in v. 0.13.4.1, then fail in v. 3.2.0, then succeed after the fix is applied?

@sj29-innovate
Copy link
Contributor

Yes it failed on the current version whereas worked after the fix was applied. What i deduced was that the "_is_single" function returned iterator to the object in case the object is a corpus , but this iterator was further going through "self._apply()" which again made iterator out of it , so therefore such fix worked.

sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 31, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 31, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 31, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 31, 2018
@menshikh-iv
Copy link
Contributor

Very important comment from Gordon: #1853 (comment)

menshikh-iv pushed a commit that referenced this issue Feb 15, 2018
* bm25 scoring function updated

* Fixes #1401 , Phrases behavious now consistent on different versions , test added for empty iterator

* Fixes #1401 , IS_SINGLE Function updated

* Fixes #1401 , IS_SINGLE function updated

* Fixes #1401 , IS_SINGLE function updated

* Fixes #1401 , tests for phrasified sentences added
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
* bm25 scoring function updated

* Fixes piskvorky#1401 , Phrases behavious now consistent on different versions , test added for empty iterator

* Fixes piskvorky#1401 , IS_SINGLE Function updated

* Fixes piskvorky#1401 , IS_SINGLE function updated

* Fixes piskvorky#1401 , IS_SINGLE function updated

* Fixes piskvorky#1401 , tests for phrasified sentences added
@NikkiAdams
Copy link

Sorry to be a dunce, but this conversation means this issue is now fixed, right? I have version 3.2.0 and still see this odd behavior with phrases. But perhaps I need to ask our systems admin to update, to either re-get version 3.2.0, or get a later version? In the meantime, I will work to implement the temporary fix discussed here.

@menshikh-iv
Copy link
Contributor

@NikkiAdams fix of this issue was released in 3.4.0, feel free to update >=3.4.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants