
Word2vec wmdistance method implementation is not compatible with Word Mover's Distance original implementation #1094

Closed
MSardelich opened this issue Jan 16, 2017 · 11 comments


@MSardelich

I am cross-posting this issue here, since it could be relevant to the Word Mover's Distance implementation. The original issue can be found here.

In simple terms, it seems that the original Kusner et al. (2015) Word Mover's Distance implementation produces completely different results compared with gensim's model.wmdistance.

For a full comparison see this link.

Could anybody please replicate my results and check my claims?

@gojomo
Collaborator

gojomo commented Jan 16, 2017

Can you clarify what you mean by "first twitter corpus texts"?

One guess would be that gensim (and specifically your usage pattern) might be using unit-normed vectors when the other does not, or vice-versa. What does the code for your full gensim test look like? Are you using word-vectors loaded from elsewhere?

@tmylk
Contributor

tmylk commented Jan 16, 2017

@MSardelich Do you still get the same list of 10 most similar documents?
The difference is that in Gensim the vectors are normalized prior to computing similarities. @olavurmortensen found that it gives better results.
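
For clarity, "normalized" here means rescaling each word vector to unit (L2) length. A minimal NumPy sketch of the idea (illustrative only, not gensim's internal code):

import numpy as np

# Stand-in for a matrix of pre-trained word vectors (one row per word).
vectors = np.random.rand(5, 300).astype(np.float32)

# Rescale each row to unit L2 length; results computed on these rows
# depend only on vector direction, not magnitude.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / norms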

@MSardelich
Author

@gojomo The original implementation calculates, by default, all the pairwise distances for a corpus found here.

Answering your question, the first two sentences are:
"now all apple has to do is get swype on the iphone and it will be crack iphone that is"
and
"apple will be adding more carrier support to the iphone 4s just announced"

I ran both implementations using the default GoogleNews pre-trained vectors, and I get a distance of 0.99 using gensim's wmdistance implementation and 2.6625 using the original one.

Did you get the same results?

@MSardelich
Author

@tmylk @olavurmortensen I understand your point regarding the X most similar documents: maybe those are the same regardless of the implementation. However, in my case I am trying to use WMD as a proxy for "absolute" similarity, and for that purpose the variance in the distances is apparently really small.

An example will make my point easier to understand.
See below the distances between several texts and the reference text 'The President greets the press in Chicago', computed with gensim's wmdistance:

Distance between 'Obama speaks to the media in Illinois' and 'The President greets the press in Chicago' using WMD: 1.02
Distance between 'Obama speaks in Illinois' and 'The President greets the press in Chicago' using WMD: 1.12
Distance between 'The band gave a concert in Japan' and 'The President greets the press in Chicago' using WMD: 1.27
Distance between 'The cat is on mat' and 'The President greets the press in Chicago' using WMD: 1.35

As you said, if I perform a relative comparison, the least similar would be 'The cat is on mat'.

However, consider that I want to check whether there is any relation at all between the text 'The President greets the press in Chicago' and each of the other texts. The last text is only about 20% farther away than the second one, even though it is clearly completely dissimilar.

Actually, I expect to get a much wider difference/variance between the results if I use the original Kusner implementation. With that wider spread, in the "absolute" relationship case, it would be easier to check whether any relationship exists between a pair of texts, using either a simple threshold or a Gaussian kernel with a fitted or ad hoc gamma parameter (a sketch of that idea is below).
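
To illustrate, a minimal sketch of what I mean by turning a WMD distance into an "absolute" similarity score via a Gaussian kernel; the gamma value is an arbitrary placeholder, not a fitted one:

import numpy as np

def wmd_to_similarity(distance, gamma=1.0):
    # Gaussian (RBF) kernel: maps a WMD distance in [0, inf) to a
    # similarity score in (0, 1]; gamma controls how fast it decays.
    return np.exp(-gamma * distance ** 2)

# With the distances above: 1.02 -> ~0.35 and 1.35 -> ~0.16.
# A simple threshold on this score could then decide "related or not".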

One more question: when you say that gensim normalizes the vectors, do you mean the standard L2 normalization of the pre-trained vectors, or some other normalization within the function call?

@gojomo
Collaborator

gojomo commented Jan 17, 2017

@MSardelich - The only norming gensim does is converting vectors to unit length, for the purposes of most_similar() calculations. These are put in a separate property, syn0norm, so that both raw and unit-normed vectors are available – unless you explicitly call init_sims(replace=True), in which case, to save memory, both syn0 and syn0norm hold the same single copy of unit-normalized vectors.
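
A quick way to see the distinction, assuming the pre-1.0 gensim API used in this thread (where syn0/syn0norm live directly on the model) and a model loaded as in my test below:

import numpy as np

model.init_sims()  # fills syn0norm, leaving the raw syn0 untouched
print(np.linalg.norm(model.syn0[0]))      # raw vector: generally not 1.0
print(np.linalg.norm(model.syn0norm[0]))  # unit-normed copy: 1.0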

@tmylk @olavurmortensen I don't see any evidence that gensim's WMD calculation is done on unit-normed vectors – unless the user forces that, by using init_sims(replace=True). See my test code below.

@MSardelich - Are you sure you haven't previously done such unit-norming on the vectors as seen by gensim's wmdistance? In my (slightly different) test, I get the larger value (more like what you report from the original WMD code) pre-unit-norming, and a smaller value (more like what you report as the 'gensim' value) only after forced norming.

My test, which truncates the number of GoogleNews vectors loaded to 500,000 to save time/memory:

from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format('/Users/scratch/Documents/dev2016/training_practical_ml/notebooks/GoogleNews-vectors-negative300.bin.gz', 
                                      binary=True, limit=500000)
s1 = 'now all apple has to do is get swype on the iphone and it will be crack iphone that is'.split()
s2 = 'apple will be adding more carrier support to the iphone 4s just announced'.split()
distance_prenorming = model.wmdistance(s1, s2)
model.init_sims()  # calc unit-normed vectors alongside original raw vectors
distance_postnorming = model.wmdistance(s1, s2)
model.init_sims(replace=True)  # replace raw vectors in-place with unit-normed ones
distance_norms_only = model.wmdistance(s1, s2)
print((distance_prenorming, distance_postnorming, distance_norms_only))
# shows: (1.8161385991456382, 1.8161385991456382, 0.8207403953201577)

This suggests to me that the gensim WMD code isn't using unit-normed vectors unless you force it to do so, and that unit-norming has a big effect on the result and is probably the cause of the reported discrepancy.

@MSardelich
Author

@gojomo Thank you so much! Matter solved!

It turns out that you were completely right. The only difference is the normalization; by that I mean the L2 normalization step applied on top of the pre-trained word vectors.

If I don't normalize the vectors, I get exactly the same results as in Kusner's original implementation.

Please feel free to close the issue.

@tmylk tmylk closed this as completed Jan 17, 2017
@piskvorky
Owner

piskvorky commented Jan 17, 2017

@tmylk @olavurmortensen what's the status on this?

"The difference is that in Gensim the vectors are normalized prior to computing similarities. @olavurmortensen found that it gives better results."

If it does give better results, why isn't it used (as @gojomo suggests)?

@tmylk
Contributor

tmylk commented Jan 18, 2017

It is used by default in WmdSimilarity.
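
To make that concrete: in the gensim versions of that era, WmdSimilarity took a normalize_w2v_and_replace flag defaulting to True, which calls init_sims(replace=True) before querying. A minimal sketch, assuming model is a Word2Vec model loaded as in the test above:

from gensim.similarities import WmdSimilarity

corpus = [
    'obama speaks to the media in illinois'.split(),
    'the band gave a concert in japan'.split(),
]
# The default normalize_w2v_and_replace=True unit-norms the word
# vectors in place, so queries run WMD over normalized vectors.
index = WmdSimilarity(corpus, model, num_best=2)
sims = index['the president greets the press in chicago'.split()]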

@loretoparisi
Contributor

@tmylk it's not clear to me whether this has been addressed, i.e. whether the L2 normalization is now avoided so that results match the WMD paper's. Thank you!

@rahulgithub

rahulgithub commented Mar 8, 2018

I used the word2vec model from GoogleNews-vectors-negative300.bin.

[Screenshot: pairwise WMD distances between sentences s1, s2, ..., with values ranging from about 1.84 to 2.44.]

@gojomo
Collaborator

gojomo commented Mar 8, 2018

@rahulgithub It's not clear to me from those examples that s1-s2 "should" be closest. And it's not the "complete opposite" in your numbers - s1-s2 is in the middle position of pairwise distances (neither farthest nor closest). Also, the full range of distances (1.84 to 2.44) isn't that great, so there's not a lot of contrast in this example. So I'm not sure there's anything wrong here.

If you have open-ended questions about usage that aren't clearly bugs or feature requests, the project discussion list, https://groups.google.com/forum/#!forum/gensim, is better than this issue-tracker.
