
Word2vec wmdistance method implementation is not compatible with Word Mover's Distance original implementation #1094

Closed
MSardelich opened this issue Jan 16, 2017 · 11 comments


@MSardelich

I am cross-posting this issue here, since it could be relevant to the Word Mover's Distance implementation. The original issue can be found here.

In simple terms, it seems that the original Kusner et al. (2015) Word Mover's Distance implementation produces completely different results compared with gensim's model.wmdistance.

For a full comparison see this link.

Could anybody please replicate my results and check my claims?

@gojomo
Collaborator

gojomo commented Jan 16, 2017

Can you clarify what you mean by "first twitter corpus texts"?

One guess would be that gensim (and specifically your usage pattern) might be using unit-normed vectors when the other does not, or vice-versa. What does the code for your full gensim test look like? Are you using word-vectors loaded from elsewhere?

@tmylk
Contributor

tmylk commented Jan 16, 2017

@MSardelich Do you still get the same list of 10 most similar documents?
The difference is that in Gensim the vectors are normalized prior to computing similarities. @olavurmortensen found that it gives better results.
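
For clarity, "normalized" here means rescaling each word vector to unit (L2) length. A minimal NumPy sketch of the idea (illustrative only, not gensim's internal code):

import numpy as np

# Stand-in for a matrix of pre-trained word vectors (one row per word).
vectors = np.random.rand(5, 300).astype(np.float32)

# Rescale each row to unit L2 length; results computed on these rows
# depend only on vector direction, not magnitude.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / norms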

@MSardelich
Author

@gojomo The original implementation calculates, by default, all the pairwise distances for a corpus found here.

Answering your question, the first two sentences are:
"now all apple has to do is get swype on the iphone and it will be crack iphone that is"
and
"apple will be adding more carrier support to the iphone 4s just announced"

I ran both implementations using the default GoogleNews pre-trained vectors, and I get a distance of 0.99 using gensim's wmdistance implementation and 2.6625 using the original one.

Did you get the same results?

@MSardelich
Author

@tmylk @olavurmortensen I understand your point regarding the X most similar documents: maybe those are the same regardless of the implementation. However, in my case I am trying to use WMD as a proxy for "absolute" similarity, and for that purpose the variance in the distances is apparently really small.

An example will make my point easier to understand.
See below the distances between several texts and the reference text 'The President greets the press in Chicago', computed with gensim's wmdistance:

Distance between 'Obama speaks to the media in Illinois' and 'The President greets the press in Chicago' using WMD: 1.02
Distance between 'Obama speaks in Illinois' and 'The President greets the press in Chicago' using WMD: 1.12
Distance between 'The band gave a concert in Japan' and 'The President greets the press in Chicago' using WMD: 1.27
Distance between 'The cat is on mat' and 'The President greets the press in Chicago' using WMD: 1.35

As you said, if I perform a relative comparison, the least similar would be 'The cat is on mat'.

However, consider that I want to check whether there is any relation at all between the text 'The President greets the press in Chicago' and each of the other texts. The last text is only about 20% farther away than the second one, even though it is clearly completely dissimilar.

Actually, I expect to get a much wider difference/variance between the results if I use the original Kusner implementation. With that wider spread, in the "absolute" relationship case, it would be easier to check whether any relationship exists between a pair of texts, using either a simple threshold or a Gaussian kernel with a fitted or ad hoc gamma parameter (a sketch of that idea is below).
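
To illustrate, a minimal sketch of what I mean by turning a WMD distance into an "absolute" similarity score via a Gaussian kernel; the gamma value is an arbitrary placeholder, not a fitted one:

import numpy as np

def wmd_to_similarity(distance, gamma=1.0):
    # Gaussian (RBF) kernel: maps a WMD distance in [0, inf) to a
    # similarity score in (0, 1]; gamma controls how fast it decays.
    return np.exp(-gamma * distance ** 2)

# With the distances above: 1.02 -> ~0.35 and 1.35 -> ~0.16.
# A simple threshold on this score could then decide "related or not".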

One more question: when you say that gensim normalizes the vectors, do you mean the standard L2 normalization of the pre-trained vectors, or some other normalization within the function call?

@gojomo
Collaborator

gojomo commented Jan 17, 2017

@MSardelich - The only norming gensim does is converting vectors to unit length, for the purposes of most_similar() calculations. These are put in a separate property, syn0norm, so that both raw and unit-normed vectors are available – unless you explicitly call init_sims(replace=True), in which case, to save memory, both syn0 and syn0norm hold the same single copy of unit-normalized vectors.
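
A quick way to see the distinction, assuming the pre-1.0 gensim API used in this thread (where syn0/syn0norm live directly on the model) and a model loaded as in my test below:

import numpy as np

model.init_sims()  # fills syn0norm, leaving the raw syn0 untouched
print(np.linalg.norm(model.syn0[0]))      # raw vector: generally not 1.0
print(np.linalg.norm(model.syn0norm[0]))  # unit-normed copy: 1.0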

@tmylk @olavurmortensen I don't see any evidence that gensim's WMD calculation is done on unit-normed vectors – unless the user forces that, by using init_sims(replace=True). See my test code below.

@MSardelich - Are you sure you haven't previously done such unit-norming on the vectors as seen by gensim's wmdistance? In my (slightly different) test, I get the larger value (more like what you report from the original WMD code) pre-unit-norming, and a smaller value (more like what you report as the 'gensim' value) only after forced norming.

My test, which truncates the number of GoogleNews vectors loaded to 500,000 to save time/memory:

from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format('/Users/scratch/Documents/dev2016/training_practical_ml/notebooks/GoogleNews-vectors-negative300.bin.gz', 
                                      binary=True, limit=500000)
s1 = 'now all apple has to do is get swype on the iphone and it will be crack iphone that is'.split()
s2 = 'apple will be adding more carrier support to the iphone 4s just announced'.split()
distance_prenorming = model.wmdistance(s1, s2)
model.init_sims()  # calc unit-normed vectors alongside original raw vectors
distance_postnorming = model.wmdistance(s1, s2)
model.init_sims(replace=True)  # replace raw vectors in-place with unit-normed ones
distance_norms_only = model.wmdistance(s1, s2)
print((distance_prenorming, distance_postnorming, distance_norms_only))
# shows: (1.8161385991456382, 1.8161385991456382, 0.8207403953201577)

This suggests to me that the gensim WMD code isn't using unit-normed vectors unless you force it to do so, and that unit-norming has a big effect on the result and is probably the cause of the reported discrepancy.

@MSardelich
Author

@gojomo Thank you so much! Matter solved!

It turns out that you were completely right. The only difference is the normalization; by that I mean the L2 normalization step applied on top of the pre-trained word vectors.

If I don't normalize the vectors, I get exactly the same results as in Kusner's original implementation.

Please feel free to close the issue.

@tmylk tmylk closed this as completed Jan 17, 2017
@piskvorky
Owner

piskvorky commented Jan 17, 2017

@tmylk @olavurmortensen what's the status on this?

"The difference is that in Gensim the vectors are normalized prior to computing similarities. @olavurmortensen found that it gives better results."

If it does give better results, why isn't it used (as @gojomo suggests)?

@tmylk
Contributor

tmylk commented Jan 18, 2017

It is used by default in WmdSimilarity.
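
To make that concrete: in the gensim versions of that era, WmdSimilarity took a normalize_w2v_and_replace flag defaulting to True, which calls init_sims(replace=True) before querying. A minimal sketch, assuming model is a Word2Vec model loaded as in the test above:

from gensim.similarities import WmdSimilarity

corpus = [
    'obama speaks to the media in illinois'.split(),
    'the band gave a concert in japan'.split(),
]
# The default normalize_w2v_and_replace=True unit-norms the word
# vectors in place, so queries run WMD over normalized vectors.
index = WmdSimilarity(corpus, model, num_best=2)
sims = index['the president greets the press in chicago'.split()]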

@loretoparisi
Contributor

@tmylk it's not clear to me whether this has been addressed, i.e. whether the L2 normalization is now avoided so that results match the WMD paper's. Thank you!

@rahulgithub

rahulgithub commented Mar 8, 2018

I used the word2vec model from GoogleNews-vectors-negative300.bin.

[Screenshot: pairwise WMD distances between sentences s1, s2, ..., with values ranging from about 1.84 to 2.44.]

@gojomo
Collaborator

gojomo commented Mar 8, 2018

@rahulgithub It's not clear to me from those examples that s1-s2 "should" be closest. And it's not the "complete opposite" in your numbers - s1-s2 is in the middle position of pairwise distances (neither farthest nor closest). Also, the full range of distances (1.84 to 2.44) isn't that great, so there's not a lot of contrast in this example. So I'm not sure there's anything wrong here.

If you have open-ended questions about usage that aren't clearly bugs or feature requests, the project discussion list, https://groups.google.com/forum/#!forum/gensim, is better than this issue-tracker.
