Address benchmark inconsistencies in Annoy tutorial #1105 #1113

Merged: 4 commits, Jan 29, 2017

Contributor

droudy commented Jan 29, 2017

Issue #1105

Uses the average query time over 1000 random queries instead of a single query. Includes a "dry run" before the timed queries. Also fixes a discrepancy where a comment says the vector for "army" is being retrieved when the word is actually "science". Benchmarks were run on a 2.4GHz 4-core i7 processor.
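For readers who want to reproduce the approach, here is a minimal sketch of the timing method described above: one untimed dry run, then the mean wall-clock time over many random queries. The helper name `avg_query_time` and the brute-force toy query are assumptions made for illustration and are not taken from the notebook. In the tutorial, the function being timed would presumably be the similarity lookup, once with and once without the Annoy index.

```python
import random
import time


def avg_query_time(query_fn, queries, warmup=1):
    """Return the mean wall-clock seconds per call of query_fn over queries."""
    # Dry run: the first calls can pay one-off costs (lazy initialisation,
    # cache warm-up), so they are executed but not timed.
    for q in queries[:warmup]:
        query_fn(q)
    start = time.perf_counter()
    for q in queries:
        query_fn(q)
    return (time.perf_counter() - start) / len(queries)


# Toy stand-in for a similarity query (hypothetical, not gensim or Annoy):
# brute-force search for the vector with the largest dot product.
random.seed(0)
vectors = [[random.random() for _ in range(16)] for _ in range(200)]


def brute_force_query(idx):
    target = vectors[idx]
    return max(
        range(len(vectors)),
        key=lambda j: sum(a * b for a, b in zip(target, vectors[j])),
    )


# Random query ids, so a single lucky or unlucky query cannot dominate the result.
query_ids = [random.randrange(len(vectors)) for _ in range(100)]
mean_seconds = avg_query_time(brute_force_query, query_ids)
```

Timing both implementations with the same list of query ids keeps the comparison fair, since each one then answers exactly the same questions.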

"Gensim: 0.007451029\n",
"Annoy: 0.002149934\n",
"\n",
"Annoy is 3.46570127269 times faster on average over 1000 random queries\n"
Owner

The focus and emphasis on such a level of precision is misleading (and unnecessary).

Also, please mention the other factors that affect this number, such as index size, so people don't go away thinking "annoy is ~3.5x faster than gensim". In reality it can be anything from 1x to infinity.

Contributor Author

@piskvorky Should I round to a smaller decimal place or leave the exact figure out completely?

Owner

piskvorky commented Jan 29, 2017

I'd say round to a smaller decimal place, plus include a fat disclaimer that this number is by no means "constant" :)

It's completely incidental to this dataset, BLAS setup, Annoy parameters etc. The algos have fundamentally different complexity characteristics.
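One way to act on this suggestion is sketched below: report the ratio with a single decimal place and attach the caveat to the number itself. The function name `describe_speedup` and the wording of the message are hypothetical. The two timings are the ones quoted in the notebook output above.

```python
def describe_speedup(baseline_seconds, candidate_seconds, digits=1):
    """Format a speedup ratio with limited precision and an explicit caveat."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    ratio = baseline_seconds / candidate_seconds
    # Few digits on purpose: extra decimals suggest a precision the benchmark lacks.
    return (
        "about {:.{d}f}x faster on this dataset and setup; "
        "the ratio varies with index size, BLAS setup and Annoy parameters"
    ).format(ratio, d=digits)


# Mean query times in seconds, taken from the notebook output in this PR.
summary = describe_speedup(0.007451029, 0.002149934)
```

Keeping the caveat inside the returned string means the number is never printed without it.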

"('terrorism,', 0.6300898194313049)\n",
"('creditors', 0.6264415979385376)\n"
"('signature', 0.5921074748039246)\n",
"('\"dangerously', 0.5920691192150116)\n",
Owner

piskvorky commented Jan 29, 2017

This looks like bad preprocessing. Any reason not to simply use utils.simple_preprocess?
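To illustrate why tokens such as `terrorism,` and `"dangerously` point to a preprocessing problem, here is a rough standard-library sketch of the kind of cleaning `simple_preprocess` performs: lowercase the text, keep runs of letters, and drop very short or very long tokens. It is an approximation written for this example, not gensim's implementation. The name `rough_preprocess` is hypothetical, and the length limits are assumed to match gensim's defaults. With gensim installed, calling `gensim.utils.simple_preprocess(text)` directly is the simpler option.

```python
import re

# Runs of letters only (Unicode-aware); punctuation, quotes, digits and
# underscores all act as separators, so they never end up inside a token.
_TOKEN = re.compile(r"[^\W\d_]+")


def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate stand-in for a simple_preprocess-style tokenizer."""
    tokens = _TOKEN.findall(text.lower())
    # Discard tokens outside the length limits (assumed defaults).
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = rough_preprocess('Terrorism, creditors and a "dangerously weak signature')
```

Splitting on whitespace alone would have kept the comma and the leading quote attached to the words, which is what the vocabulary in the diff above shows.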

tmylk merged commit 6ece162 into piskvorky:develop on Jan 29, 2017
Owner

piskvorky commented Jan 30, 2017

This doesn't look right -- I still see '"dangerously' (with the leading quote) as a token in the notebook, which should never happen with simple_preprocess.

EDIT: disregard, github was showing me only partial changes. Thanks for the fixes 👍

3 participants