Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update WMD documentation #3067

Closed
piskvorky opened this issue Mar 9, 2021 · 5 comments
Closed

Update WMD documentation #3067

piskvorky opened this issue Mar 9, 2021 · 5 comments
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix documentation Current issue related to documentation good first issue Issue for new contributors (not required gensim understanding + very simple) impact LOW Low impact on affected users reach LOW Affects only niche use-case users
Milestone

Comments

@piskvorky
Copy link
Owner

piskvorky commented Mar 9, 2021

A user reported a documentation issue on the mailing list: https://groups.google.com/g/gensim/c/8nobtm9tu-g.

The report shows two problems:

  1. Something changed with wmdistance between 3.8 and 4.0 that is not properly captured in the Migration notes.
  1. The WMD tutorial contains instructions that are now outdated in 4.0:

    When using the wmdistance method, it is beneficial to normalize the word2vec vectors first, so they all have equal length. To do this, simply call model.init_sims(replace=True) and Gensim will take care of that for you.

    …And then our own tutorial logs a WARNING : destructive init_sims(replace=True) deprecated & no longer required for space-efficiency, which looks silly.

@piskvorky piskvorky added bug Issue described a bug documentation Current issue related to documentation difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple) impact LOW Low impact on affected users reach LOW Affects only niche use-case users labels Mar 9, 2021
@piskvorky piskvorky added this to the 4.0.0 milestone Mar 9, 2021
@piskvorky
Copy link
Owner Author

piskvorky commented Mar 9, 2021

@gojomo do you remember the motivation for forcing normalized vectors in WMD? I vaguely remember some discussion.

@gojomo
Copy link
Collaborator

gojomo commented Mar 10, 2021

There's discussion in #1094 - but there's no clear case for either way there. (Non-normalized seems closer to the original implementation; some secondhand testimony suggests norming might give better results.) It's hard to think of why you'd want to use euclidean distance on non-normalized vectors, as their magnitudes could then contribute a lot of "difference" that isn't correlated with the more typical cosine-distance-comparison. So I suspect default to unit-normed vectors is better-grounded & better-performing, but don't have an empirical case either way.

The discussion as part of the KeyedVectors big-refactor was left kind of hanging as to what the proper default/configurability should be: #2698 (comment)

@piskvorky
Copy link
Owner Author

Okay, thanks @gojomo. So basically we don't have a strong intuition / evidence one way or the other.

@mattkoehne since you're the user, what makes more sense to you, or gives better results?

I'm thinking we could add a norm parameter to wmdistance and pass it through = leave the decision with the user.
Although that feels kinda weak… principled defaults are better.

@gojomo
Copy link
Collaborator

gojomo commented Mar 10, 2021

Well, if there's any reason to think different users will have good reasons to prefer different norming-or-not, an optional parameter is very easy to add & defer-to, just in case anyone needs it. (Or even just to support: future experiments either way when someone has proper time to research.)

Though, there's still the question of what the default should be. I've seen slightly more testimony for using normed vectors – including via the tutorial which includes a whole section claiming it's a good step & advocating (destructive via replace=True!) normalization as 1st step.

Can we trust that the author of those docs (@olavurmortensen? someone else?) knew what they were talking about, to prefer the norm default (even as we delete the steps/explanations that no longer apply in 4.0, and provide an option to not normalize)?

@piskvorky
Copy link
Owner Author

piskvorky commented Mar 14, 2021

Ha, looks like @Witiko already cleaned up the tutorial, as part of #3003. Just the public website hasn't been updated yet.
Thanks @Witiko !

I just updated the Migration notes too, to match.

@gojomo I added the option to operate over un-normalized vectors in #3073.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix documentation Current issue related to documentation good first issue Issue for new contributors (not required gensim understanding + very simple) impact LOW Low impact on affected users reach LOW Affects only niche use-case users
Projects
None yet
Development

No branches or pull requests

2 participants