Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: pyemd to POT for EMD computation in wmdistance #3327

Merged
merged 5 commits into from
Nov 3, 2022

Conversation

TLouf
Copy link
Contributor

@TLouf TLouf commented Apr 16, 2022

fixes #3312

Why this change?

  • Speed:
import gensim.downloader as api
corpus = api.load('text8')
model = api.load("glove-wiki-gigaword-50")
corpus = [d for i, d in enumerate(corpus) if i < 2]

Timings for the following call (took a subset because I'm on an old laptop right now):

model.wmdistance(corpus[0][:1000], corpus[1][:1000])
pyemd (before) POT (after)
30s 84ms

@piskvorky
Copy link
Owner

piskvorky commented Apr 16, 2022

Thanks. Can you add some benchmarks into the PR description?

Something to point people to, a TL;DR for the release notes.

Also a link to this "POT" dependency, its license etc. Is it a maintained / stable library?

@TLouf
Copy link
Contributor Author

TLouf commented Apr 16, 2022

I've updated my first comment with the info, but maybe you meant I should amend the first commit's message with all this info?

@piskvorky piskvorky requested a review from gojomo April 16, 2022 23:28
@sidravi1
Copy link

Hi - Is this PR still being worked on? Happy to take over if needed. I just swapped out pyemd for POT and it improved performance by around a factor of 3.

@TLouf
Copy link
Contributor Author

TLouf commented Oct 19, 2022

I'm still able to work on the PR, but now I'm simply waiting for a review.

@gojomo
Copy link
Collaborator

gojomo commented Oct 19, 2022

From a read-over on Github (no personal testing), this looks good. The POT package also appears to have far more attention/contributors/recent-development than pyemd, which I consider a good sign. +1 on including though whatever the tiny conflict is should be resolved.

As an aside – not at all a blocker for this issue, and not at all any knock about the work done here, which follows project precedents – it's a bit unfortunate, a "project smell", that the tiny functional change requires changes scattered across 16 files, with many bits of updated comments/boilerplate. The core of the change is small & well-contained: import a different library, replace one function call, in keyedvectors.py. Obviously, some changes to docs & declared-requirements are unavoidable.

But all the conditionals in tests would ideally be oblivious to implementation, perhaps by using a single is-support-available test rather than a dependency-specific check. And all the autogenerated output changes (eg in auto_examples stuff) would ideally happen separately in some way that wouldn't burden evaluation of the functional change, expanding the magnitude of this diff, or even risking potential human error or other unrelated changes sneaking in inadvertently.

But that's all more: thoughts for project going forward than anything requiring immediate changes.

@piskvorky
Copy link
Owner

@TLouf can you fix the conflict please?

After that we should be ready to merge @mpenkov – unless you see something off!

@sidravi1
Copy link

Thanks @TLouf!

One other thing we may wish to do -
there is a function provided by POT to calculate distances. It falls back on scipy's cdist if metric is not euclidean.

Do we want to switch to that?

@gojomo
Copy link
Collaborator

gojomo commented Oct 20, 2022

One other thing we may wish to do - there is a function provided by POT to calculate distances. It falls back on scipy's cdist if metric is not euclidean.

Do we want to switch to that?

Switch to it in which places, in pursuit of what advantage?

@sidravi1
Copy link

Switch to it in which places

This line to ot.dist.

in pursuit of what advantage?

First, since we are already using POT for emd, we might as well be using POT's implementation of distance.
Second, it provide the flexibility to use other backends. AFAIK, Gensim wouldn't be taking advantage of it but provide an option for future extensions.

Neither are strong reasons but thought I'd raise it in case others feel more strongly.

@piskvorky piskvorky requested a review from mpenkov October 21, 2022 08:03
@gojomo
Copy link
Collaborator

gojomo commented Oct 21, 2022

As it, ot.dist replaces cdist on that line? I'd generally prefer to avoid extra changes unless/until there's a tangible benefit, in perf, functionality, or readability.

(Perhaps, making distance-measure configurable improves performance in some cases. If we know that to be true via investigation, then it could be a good change... but even then, better as a focused PR with its own rationale, rather than mixed with this sort of drop-in, functionality-maintaining performance boost.)

@mpenkov
Copy link
Collaborator

mpenkov commented Oct 22, 2022

I got this in CI, can @TLouf please double check?

The gallery cache appears stale.

Rebuild the documentation using the following commands from the gensim root subdirectory:

    pip install -e .[docs]
    make -C docs/src html

and then run `git add docs/src/auto_examples` to update the cache.

Stale files: ['docs/src/auto_examples/tutorials/run_wmd.py', 'docs/src/auto_examples/tutorials/run_fasttext.py']

@TLouf
Copy link
Contributor Author

TLouf commented Oct 23, 2022

@mpenkov just went along with CI's recommendation, however it added a ton of files, some of which I should not git add I think, like all the index.rst and the .pngs. Let me know if that's correct, and if I should exclude more, and I'll revert the commit to only include what's necessary.

@mpenkov
Copy link
Collaborator

mpenkov commented Oct 23, 2022

No, I think you did everything right, with the exception of those two files that the CI thinks you did not add. Can you add them? Or am I misunderstanding something?

@TLouf
Copy link
Contributor Author

TLouf commented Oct 24, 2022

With what I added in my last commit I think CI should be happy, the 2 python files were already added, but I hadn't included some cache files generated by those scripts, from what I understood.

@sidravi1
Copy link

sidravi1 commented Nov 1, 2022

Hi all - Can this be merged? I'd love to use this from the dev branch. Thanks for your work on this.

@mpenkov
Copy link
Collaborator

mpenkov commented Nov 3, 2022

Yes, of course. Thank you for reminding me.

@TLouf Thank you for your work!

@mpenkov mpenkov merged commit fdf40eb into piskvorky:develop Nov 3, 2022
@TLouf TLouf deleted the pyemd-to-pot branch November 3, 2022 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: easy speedup of WMD using POT.emd2
5 participants