Word2vec wmdistance method implementation is not compatible with Word Mover's Distance original implementation #1094
Comments
Can you clarify what you mean by "first twitter corpus texts"? One guess would be that gensim (and specifically your usage pattern) might be using unit-normed vectors when the other does not, or vice-versa. What does the code for your full gensim test look like? Are you using word-vectors loaded from elsewhere? |
@MSardelich Do you still get the same list of 10 most similar documents? |
@gojomo The original implementation calculates, by default, all the pairwise distances for a corpus found here. Answering your question, the first two sentences are: I ran both implementations using the default GoogleNews pre-trained vectors and I get a distance of 0.99 using the GenSim WMD implementation and 2.6625 using the original one. Did you get the same results? |
@tmylk @olavurmortensen I understand your point regarding the X most similar. Maybe they are the same regardless of the implementation. However, in my case I am trying to use WMD as a proxy for "absolute" similarity. In this case, apparently, the variance is really small. An example will make my statement easier to understand.
As you said, if I perform a relative measure, the less similar would be However, consider that I want to check whether there exists any relation, at all, between the text Actually, I expect to get a much wider difference/variance between the results if I use the original Kusner implementation. That said, in the "absolute" relationship case, it would be easier to check whether any relationship exists between a pair of texts using a simple threshold or a Gaussian kernel with a fitted or ad hoc gamma parameter. One more question: when you say that GenSim normalizes the vectors, do you mean the standard 'l2' normalization of the pre-trained vectors or some other normalization within the function call? |
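For readers unfamiliar with the two "absolute" checks mentioned above, here is a minimal sketch of both. The helper names, the `gamma` value, and the threshold are hypothetical and chosen only for illustration; the two input distances are the ones reported earlier in this thread for the same sentence pair under the two implementations.

```python
import math


def gaussian_kernel_similarity(distance, gamma):
    """Map a distance in [0, inf) to a similarity in (0, 1] via exp(-gamma * d^2)."""
    return math.exp(-gamma * distance ** 2)


def is_related(distance, threshold):
    """Plain threshold test: texts count as related if the distance is small enough."""
    return distance <= threshold


# Hypothetical settings; in practice gamma and the threshold would be fitted.
GAMMA = 0.5
THRESHOLD = 1.5

# Distances reported in this thread for one sentence pair.
gensim_distance = 0.99
original_distance = 2.6625

sim_gensim = gaussian_kernel_similarity(gensim_distance, GAMMA)
sim_original = gaussian_kernel_similarity(original_distance, GAMMA)

print(sim_gensim, is_related(gensim_distance, THRESHOLD))
print(sim_original, is_related(original_distance, THRESHOLD))
```

Either check depends on the scale of the distances, so a threshold or gamma tuned for one implementation would not carry over to the other.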
@MSardelich - The only norming gensim does is converting vectors to unit-lengths, for the purposes of @tmylk @olavurmortensen I don't see any evidence that gensim's WMD calculation is done on unit-normed vectors – unless the user forces that, by using @MSardelich - Are you sure you haven't previously done such unit-norming, on the vectors as seen by gensim My test, which is truncating the number of GoogleNews vectors loaded to 500,000 to save time/memory:
This suggests to me that the gensim WMD code isn't using unit-normed vectors unless you force it to do so, and that unit-norming has a big effect on the result and is probably the cause of the reported discrepancy. |
@gojomo Thank you so much! Matter solved! It turns out that you were completely right. The only difference is the normalization, by which I mean the L2 normalization step applied to the pre-trained word vectors. If I don't normalize the vectors, I get exactly the same results as in Kusner's original implementation. Please feel free to close the issue... |
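The effect of unit-norming on WMD can be seen in a toy setting without loading any model. The sketch below is not gensim's or Kusner's code: it assumes two documents with the same number of distinct, equally weighted words, in which case the optimal transport plan is a one-to-one matching and can be found by brute force over permutations. The vectors are made up for illustration.

```python
import math
from itertools import permutations


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def euclidean(a, b):
    """Euclidean distance, the ground cost used by WMD."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def toy_wmd(doc_a, doc_b):
    """WMD for equal-sized documents with uniform word weights (brute force)."""
    n = len(doc_a)
    assert n == len(doc_b), "this sketch only handles equal-sized documents"
    best_total = min(
        sum(euclidean(doc_a[i], doc_b[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )
    return best_total / n  # each word carries mass 1/n


# Made-up 2-d "word vectors": same directions, different magnitudes.
doc_a = [[3.0, 0.0], [0.0, 4.0]]
doc_b = [[0.0, 2.0], [5.0, 0.0]]

raw = toy_wmd(doc_a, doc_b)
normed = toy_wmd([l2_normalize(v) for v in doc_a],
                 [l2_normalize(v) for v in doc_b])

print("raw vectors:", raw)
print("unit-normed vectors:", normed)
```

Unit-norming discards vector magnitude, and the distance between any two unit vectors is at most 2, so distances computed on normed vectors sit in a narrower range than those computed on raw vectors.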
@tmylk @olavurmortensen what's the status on this:
? If it does give better results, why isn't it used (as @gojomo suggests)? |
It is used by default in wmdsimilarity |
@tmylk It's not clear to me whether this has been addressed, that is, whether the L2 normalization is now avoided so that the results match the paper's WMD results. Thank you! |
@rahulgithub It's not clear to me from those examples that s1-s2 "should" be closest. And, it's not the "complete opposite" in your numbers - s1-s2 is in the middle position of pairwise distances (neither farthest nor closest). And, the full range of distances (1.84 to 2.44) isn't that great, so there's not a lot of contrast in this example. So not sure there's anything wrong here. If you have open-ended questions about usage that aren't clearly bugs or feature requests, the project discussion list, https://groups.google.com/forum/#!forum/gensim, is better than this issue-tracker. |
I am cross posting this issue here, since it could be relevant to the Word Mover's Distance implementation. The original issue can be found here.
In simple terms, it seems that the original Kusner 2015 Word Mover's Distance implementation produces completely different results when compared with GenSim model.wmdistance. For a full comparison see this link.
Could anybody please replicate my results and check my claims?