Measure performance of gensim 4.0.0 vs previous versions #2887

Closed
mpenkov opened this issue Jul 19, 2020 · 5 comments · Fixed by #2982

@mpenkov
Collaborator

mpenkov commented Jul 19, 2020

Not every 1-line decision; just ones that are in inner loops of hot-spot code.

Definitely a big TODO: compare performance before/after.

Originally posted by @piskvorky in https://github.com/_render_node/MDExOlB1bGxSZXF1ZXN0MzQ5Mjk1NTk1/timeline/more_items

@piskvorky piskvorky added this to the *2vec aftermath milestone Jul 26, 2020
@piskvorky
Owner

piskvorky commented Jul 26, 2020

This link is also broken for me; I get a 400. @mpenkov this way of creating tickets seems more trouble than it's worth, with the context missing.

@mpenkov
Collaborator Author

mpenkov commented Aug 16, 2020

I'll take it up with GitHub support. It's convenient, but only when it works.

@mpenkov
Collaborator Author

mpenkov commented Aug 19, 2020

@piskvorky From GitHub support:

This might be an uncaught edge case on our end. I have raised this up with our engineering team to investigate further.

I'll keep my eyes open for the problem in case it recurs.

@piskvorky
Owner

piskvorky commented Sep 24, 2020

Some Word2vec measurements here: #2939 (comment)

I wonder what the original "Not every 1-line decision; just ones that are in inner loops of hot-spot code." was referring to, though; the link is still broken. Probably some code change deep in the C loops.

@piskvorky piskvorky modified the milestones: *2vec aftermath, 4.0.0 Sep 24, 2020
@piskvorky piskvorky self-assigned this Oct 16, 2020
@piskvorky
Owner

piskvorky commented Oct 18, 2020

Comparing current develop (at ea87470) against 3.8.3. Identical training params (all default except 12 workers + 1 epoch), identical HW, text9 corpus (124,301,826 words), measured with gensim_benchmark.py:

fasttext 3.8.3

training on a 124301826 raw words (88163974 effective words) took 107.3s, 821794 effective words/s
2:53.89 elapsed
4318564k peak RAM
stored model size: 1.9G

fasttext develop

training on a 124301826 raw words (88166519 effective words) took 96.4s, 914282 effective words/s
2:17.43 elapsed
1318592k peak RAM (!! 3x less memory)
stored model size: 939M

word2vec 3.8.3

training on a 124301826 raw words (88162276 effective words) took 52.3s, 1684982 effective words/s
1:41.36 elapsed
373612k peak RAM
stored model size: 181M

word2vec develop

training on a 124301826 raw words (88166114 effective words) took 50.0s, 1762436 effective words/s
1:13.82 elapsed (!! – faster weight init)
348060k peak RAM
stored model size: 176M

phrases 3.8.3

using 17692319 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
2:23.51 elapsed
1611916k peak RAM
stored model size: 699M
186.7s apply frozen[text9]

phrases develop

merged Phrases<17692319 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
1:50.76 elapsed
1886588k peak RAM
stored model size: 429M
81.6s apply frozen[text9]
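
To make the before/after comparison easier to read, the figures above can be reduced to develop/3.8.3 ratios. The following is only a sketch: the numbers are copied from the output above, while the dictionary layout and the helper names (`to_seconds`, `summarize`) are made up for illustration and are not part of gensim or gensim_benchmark.py.

```python
# Figures copied from the benchmark output above; each tuple is (3.8.3, develop).
# The structure and helper names are illustrative assumptions.
RESULTS = {
    "fasttext": {
        "words_per_s": (821794, 914282),
        "elapsed": ("2:53.89", "2:17.43"),
        "peak_ram_k": (4318564, 1318592),
    },
    "word2vec": {
        "words_per_s": (1684982, 1762436),
        "elapsed": ("1:41.36", "1:13.82"),
        "peak_ram_k": (373612, 348060),
    },
    "phrases": {
        "elapsed": ("2:23.51", "1:50.76"),
        "peak_ram_k": (1611916, 1886588),
        "apply_frozen_s": (186.7, 81.6),
    },
}


def to_seconds(elapsed: str) -> float:
    """Convert an 'M:SS.ss' elapsed string to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)


def summarize(results):
    """Return develop/3.8.3 ratios per model and metric, rounded to 2 places."""
    summary = {}
    for model, metrics in results.items():
        row = {}
        for name, (old, new) in metrics.items():
            if name == "elapsed":
                old, new = to_seconds(old), to_seconds(new)
            row[name] = round(new / old, 2)
        summary[model] = row
    return summary


if __name__ == "__main__":
    for model, row in summarize(RESULTS).items():
        print(model, row)
```

A ratio below 1.0 for `elapsed`, `peak_ram_k` or `apply_frozen_s` means develop used less; above 1.0 for `words_per_s` means it trained faster. The fasttext peak RAM ratio comes out at about 0.31, consistent with the "3x less memory" remark, while phrases peak RAM is the one metric that went up (about 1.17).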


CC @gojomo FYI. I also double-checked loading models with mmap='r' and everything seems fine.
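
For anyone wanting to repeat this kind of measurement, wall-clock time and peak memory can be collected with the Python standard library alone. This is a minimal sketch, not taken from gensim_benchmark.py (whose contents are not shown here); the `measure` helper is a hypothetical name, and the `resource` module it relies on is only available on Unix-like systems.

```python
import resource
import sys
import time


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds, peak_rss_kib).

    peak_rss_kib is the high-water mark of the whole process so far,
    not just of this call, so run each benchmark in a fresh process.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start

    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        peak //= 1024  # macOS reports bytes; Linux reports kilobytes
    return result, elapsed, peak


if __name__ == "__main__":
    # Toy workload standing in for a model training call.
    _, elapsed, peak = measure(sorted, range(1_000_000), reverse=True)
    print(f"{elapsed:.2f}s elapsed, {peak}k peak RAM")
```

In practice the callable passed to `measure` would wrap the training or loading step being compared, with the same parameters and corpus under both library versions.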
