Added save/load functionality to AnnoyIndexer #845
Added basic test cases to test_similarities for both Word2Vec and Doc2Vec
…me+'.d' exists before trying to load index. Added test case for nonexistent index file.
Thanks for the PR! Could you please add a line to CHANGELOG and update the annoy notebook tutorial with the new functionality?
Sure! Will push when done.
…rsisting AnnoyIndexer instances.
Any other suggestions to make this interface more robust?
    def save(self, fname):
        self.index.save(fname)
        d = {'f': self.model.vector_size, 'num_trees': self.num_trees, 'labels': self.labels}
        pickle.dump(d, open(fname+'.d', 'wb'), 2)
Please use smart_open as in https://github.com/RaRe-Technologies/gensim/blob/6a289fefd72f038c8cc14826f63624950f5de1f8/gensim/utils.py#L896
    def load(self, fname):
        if os.path.exists(fname) and os.path.exists(fname+'.d'):
            d = pickle.load(open(fname+'.d', 'rb'))
Please use smart_open as in https://github.com/RaRe-Technologies/gensim/blob/6a289fefd72f038c8cc14826f63624950f5de1f8/gensim/utils.py#L907
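One way to read the two smart_open comments is that the pickle file should be opened through a context manager, so the handle is closed as soon as the dump or load finishes. The sketch below is not the PR's final code: `save_metadata` and `load_metadata` are hypothetical helper names, the dictionary values are made up, and the fallback to the builtin `open` is an assumption added only so the example runs without smart_open installed.

```python
import os
import pickle
import tempfile

try:
    # smart_open is the library the reviewer points to; open() is its entry point.
    from smart_open import open as smart_open
except ImportError:
    # Assumption for this sketch: fall back to the builtin so it runs without the dependency.
    smart_open = open


def save_metadata(fname, meta):
    """Pickle `meta` next to the index file and close the handle right away."""
    with smart_open(fname + '.d', 'wb') as fout:
        pickle.dump(meta, fout, protocol=2)


def load_metadata(fname):
    """Read back the dictionary written by save_metadata."""
    with smart_open(fname + '.d', 'rb') as fin:
        return pickle.load(fin)


# Hypothetical values standing in for model.vector_size, num_trees and labels.
meta = {'f': 100, 'num_trees': 10, 'labels': ['cat', 'dog']}
base = os.path.join(tempfile.mkdtemp(), 'index')
save_metadata(base, meta)
restored = load_metadata(base)
```

Compared with a bare `open(...)` passed straight to `pickle.dump`, the `with` block does not rely on garbage collection to flush and close the file.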
    from gensim.similarities.index import AnnoyIndexer
    self.test_index = AnnoyIndexer()
    self.test_index.load('test-index')
It has to raise IOError
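A minimal sketch of what this request could look like, assuming the check stays based on `os.path.exists` as in the diff above. `check_index_files` is a hypothetical helper, not a gensim function; it only illustrates failing loudly instead of silently doing nothing when either file is absent.

```python
import os
import tempfile


def check_index_files(fname):
    """Raise IOError unless both the index file and its '.d' companion exist."""
    missing = [path for path in (fname, fname + '.d') if not os.path.exists(path)]
    if missing:
        raise IOError("Cannot load index, missing file(s): %s" % ', '.join(missing))


# A path inside a fresh temporary directory, so neither file exists.
missing_base = os.path.join(tempfile.mkdtemp(), 'test-index')
try:
    check_index_files(missing_base)
    raised = False
except IOError:
    raised = True
```

A test for the nonexistent index case can then assert that the exception is raised rather than inspecting the state of the indexer afterwards.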
Thanks for the quick fix. Once the 2.6 tests run, we can merge. It would also be interesting to see a test where two parallel processes load the same model from disk and mmap the same index file.
Great suggestion, let me add this as well!
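The suggested test could be shaped roughly like the sketch below. It does not use Annoy or gensim: a plain binary file stands in for the index, and each child process maps it read-only with the standard `mmap` module and reports a checksum. The file contents, the `CHILD` script and the checksum comparison are all assumptions made for illustration.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for an index file; a real test would write it with the indexer's save().
path = os.path.join(tempfile.mkdtemp(), 'index')
with open(path, 'wb') as fout:
    fout.write(bytes(bytearray(range(256))) * 1024)

# Each child maps the file read-only and prints a checksum of what it sees.
CHILD = (
    "import mmap, sys, zlib\n"
    "with open(sys.argv[1], 'rb') as f:\n"
    "    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)\n"
    "    print(zlib.crc32(m[:]) & 0xffffffff)\n"
    "    m.close()\n"
)

# Start both children before waiting on either, so they run side by side.
procs = [
    subprocess.Popen([sys.executable, '-c', CHILD, path], stdout=subprocess.PIPE)
    for _ in range(2)
]
checksums = [int(p.communicate()[0].decode().strip()) for p in procs]
```

In a real test the children would instead load the saved indexer and run the same query, and the assertion would compare their results.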
Having Annoy integrated directly into gensim is really great, but one feature that I was personally missing is the ability to save/load indexes. I am working with indexes in the tens of GB, and having to recreate them every time I run my code is a waste of time.
So I added a simple save/load interface that is similar to Annoy.
For example, saving an indexer under the name index will create 2 files, index and index.d. Both files must be present when using the load function; otherwise nothing happens.
For this to work, I also added an if case in the constructor to allow for object creation without passing model and num_trees.
I also used a try/except on the import and pickle protocol v2 to stay compatible with both Python 2 and 3.
All comments and suggestions are welcome.
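The compatibility note above can be illustrated with a short sketch. It assumes the try/except refers to the usual `cPickle` fallback, which the description does not spell out, and the dictionary values are hypothetical.

```python
try:
    import cPickle as pickle  # Python 2: C implementation of pickle
except ImportError:
    import pickle  # Python 3: the accelerated version is used automatically

# Hypothetical values standing in for model.vector_size, num_trees and labels.
meta = {'f': 100, 'num_trees': 10, 'labels': ['cat', 'dog']}

# Protocol 2 is the highest protocol that Python 2 can read and write,
# so a file written under Python 3 with protocol=2 stays loadable there.
payload = pickle.dumps(meta, protocol=2)
restored = pickle.loads(payload)
```

Leaving the protocol at the Python 3 default would produce files that Python 2 cannot load, which is why it is pinned explicitly.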