Address benchmark inconsistencies in Annoy tutorial #1105 #1113
Conversation
"Gensim: 0.007451029\n", | ||
"Annoy: 0.002149934\n", | ||
"\n", | ||
"Annoy is 3.46570127269 times faster on average over 1000 random queries\n" |
The focus and emphasis on this level of precision are misleading (and unnecessary).
Also, please mention the other factors that affect this number, like index size, so people don't go away thinking "Annoy is ~3.5x faster than gensim", when in reality it can be anything from 1x to infinity.
@piskvorky Should I round to a smaller decimal place or leave the exact figure out completely?
I'd say round to a smaller decimal place, plus include a fat disclaimer that this number is by no means "constant" :)
It's completely incidental to this dataset, BLAS setup, Annoy parameters etc. The algos have fundamentally different complexity characteristics.
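In the notebook, the reported figure could then look something like the rounded, caveated print below (the variable names `gensim_avg`, `annoy_avg` and `num_queries` are illustrative assumptions, not the tutorial's actual names):

```python
# Round the ratio and spell out that it is specific to this run:
# it depends on the corpus, index size, Annoy parameters and BLAS setup.
ratio = gensim_avg / annoy_avg
print("Annoy is roughly %.1f times faster on average over %d random queries "
      "(this ratio is NOT constant -- it varies with dataset, index parameters "
      "and hardware)." % (ratio, num_queries))
```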
"('terrorism,', 0.6300898194313049)\n",
"('creditors', 0.6264415979385376)\n"
"('signature', 0.5921074748039246)\n",
"('\"dangerously', 0.5920691192150116)\n",
This looks like bad preprocessing. Any reason not to simply use utils.simple_preprocess?
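For reference, `gensim.utils.simple_preprocess` lowercases the text and strips punctuation, which would avoid tokens like `"dangerously` and `terrorism,` above. A minimal sketch (the corpus file path is an assumption):

```python
from gensim.utils import simple_preprocess

# simple_preprocess lowercases, tokenises and drops punctuation,
# so artefacts like '"dangerously' or 'terrorism,' disappear.
print(simple_preprocess('He spoke "dangerously," the creditors said.'))
# -> ['he', 'spoke', 'dangerously', 'the', 'creditors', 'said']

# Applying it to a corpus with one document per line (hypothetical path):
with open('lee_background.cor') as fin:
    sentences = [simple_preprocess(line) for line in fin]
```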
This doesn't look right -- I still see… EDIT: disregard, GitHub was showing me only partial changes. Thanks for the fixes 👍
Issue #1105
Uses the average query time over 1000 random queries instead of a single query. Includes a "dry run" before the timed queries. Also fixes a discrepancy where a comment says the vector for "army" is being retrieved when the word is actually "science". Benchmarks were run on a 2.4 GHz 4-core i7 processor.
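A minimal sketch of that timing setup, assuming a trained `model` and gensim's Annoy wrapper (the `AnnoyIndexer` import path is `gensim.similarities.index` in the gensim versions this tutorial targets and `gensim.similarities.annoy` in newer releases; variable names here are illustrative):

```python
import random
import time

from gensim.models import Word2Vec
from gensim.similarities.index import AnnoyIndexer  # gensim.similarities.annoy in newer gensim

# `sentences` is assumed to be the tutorial corpus as an iterable of token lists.
model = Word2Vec(sentences, min_count=1)
annoy_index = AnnoyIndexer(model, num_trees=100)

queries = random.sample(list(model.wv.vocab), 1000)

# Dry run so one-off setup costs don't land on the first timed query.
model.most_similar(queries[0], topn=5)
model.most_similar(queries[0], topn=5, indexer=annoy_index)

def avg_query_time(indexer=None):
    """Average seconds per most_similar() call over the random queries."""
    total = 0.0
    for word in queries:
        start = time.time()
        model.most_similar(word, topn=5, indexer=indexer)
        total += time.time() - start
    return total / len(queries)

gensim_avg = avg_query_time()              # exact, brute-force search
annoy_avg = avg_query_time(annoy_index)    # approximate, Annoy-backed search
print("Gensim: %.4f s/query, Annoy: %.4f s/query" % (gensim_avg, annoy_avg))
```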