Added save/load functionality to AnnoyIndexer #845

Merged (7 commits), Sep 27, 2016

@fortiema fortiema commented Aug 30, 2016

Having Annoy integrated directly into gensim is really great, but one feature I was missing is the ability to save and load indexes. I am working with indexes in the tens of GB, and recreating them every time I run my code is a waste of time.

So I added a simple save/load interface that is similar to Annoy.

For example this code:

fname = 'index'
if os.path.exists(fname):
    self.index_en = AnnoyIndexer()
    self.index_en.load(fname)
    self.index_en.model = model
else:
    self.index_en = AnnoyIndexer(model, 1)
    self.index_en.save(fname)

This will create two files, index and index.d. Both files must be present when calling load; otherwise nothing happens.

For this to work, I also added an if case in the constructor to allow creating the object without passing model and num_trees.

I used a try/except on the import and pickle protocol 2 to stay compatible with both Python 2 and 3.
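
The two-file layout described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `SidecarIndex` is a hypothetical stand-in class, not the `AnnoyIndexer` code from this PR, and the "index" file here is a plain pickle rather than a real Annoy index.

```python
import os
import pickle


class SidecarIndex(object):
    """Hypothetical stand-in for an indexer that is saved as two files."""

    def __init__(self, vectors=None, num_trees=None):
        # Both arguments are optional, so an empty object can be
        # created first and filled in later by load().
        self.vectors = vectors
        self.num_trees = num_trees
        self.labels = None
        if vectors is not None and num_trees is not None:
            self.labels = sorted(vectors)

    def save(self, fname):
        # Stand-in for the binary index file.
        with open(fname, 'wb') as f:
            pickle.dump(self.vectors, f, 2)
        # Sidecar file with the metadata; protocol 2 is readable
        # from both Python 2 and Python 3.
        meta = {'num_trees': self.num_trees, 'labels': self.labels}
        with open(fname + '.d', 'wb') as f:
            pickle.dump(meta, f, 2)

    def load(self, fname):
        # Mirrors the behaviour described in the comment above:
        # if either file is missing, nothing happens.
        if not (os.path.exists(fname) and os.path.exists(fname + '.d')):
            return
        with open(fname, 'rb') as f:
            self.vectors = pickle.load(f)
        with open(fname + '.d', 'rb') as f:
            meta = pickle.load(f)
        self.num_trees = meta['num_trees']
        self.labels = meta['labels']
```

Keeping the metadata in a separate `.d` file leaves the main index file untouched, so it can still be opened by whatever wrote it.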

All comments and suggestions are welcome.

@fortiema
Author

Added basic test cases to test_similarities for both Word2Vec and Doc2Vec

…me+'.d' exists before trying to load index. Added test case for unexistant index file.

tmylk commented Sep 4, 2016

Thanks for the PR!

Could you please add a line to CHANGELOG and update the annoy notebook tutorial with the new functionality?

fortiema commented Sep 5, 2016

Sure! Will push when done.

fortiema commented Sep 9, 2016

Any other suggestions to make this interface more robust?

def save(self, fname):
    self.index.save(fname)
    d = {'f': self.model.vector_size, 'num_trees': self.num_trees, 'labels': self.labels}
    pickle.dump(d, open(fname+'.d', 'wb'), 2)

def load(self, fname):
    if os.path.exists(fname) and os.path.exists(fname+'.d'):
        d = pickle.load(open(fname+'.d', 'rb'))

from gensim.similarities.index import AnnoyIndexer
self.test_index = AnnoyIndexer()
self.test_index.load('test-index')

It has to raise IOError
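
One way to get that behaviour is to check for both files up front and raise instead of returning silently. The helper below is a minimal sketch; `require_index_files` is a hypothetical name, not a function from gensim or from this PR.

```python
import os


def require_index_files(fname):
    """Raise IOError unless both `fname` and `fname + '.d'` exist."""
    missing = [p for p in (fname, fname + '.d') if not os.path.exists(p)]
    if missing:
        raise IOError("Can't find index files: %s" % ', '.join(missing))
```

A load method could call this first, so a missing index fails loudly and a test can assert on the exception.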

tmylk commented Sep 16, 2016

Thanks for the quick fix. Once the Python 2.6 tests run, we can merge.

Also, it would be interesting to see a test where two parallel processes load the same model from disk and mmap the same index file.
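
The shape of such a test can be sketched with the standard library alone. This is not the test added to gensim: the "index" is a dummy file of random bytes, and `make_dummy_index`, `file_checksum`, and `checksums_from_parallel_readers` are hypothetical helpers. Each child process memory-maps the same file read-only and reports a checksum, which the parent compares.

```python
import os
import subprocess
import sys
import tempfile
import zlib

# Program run by each child: mmap the file read-only, print a CRC32 of its bytes.
CHILD_CODE = """
import mmap, sys, zlib
with open(sys.argv[1], 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        print(zlib.crc32(mm[:]) & 0xffffffff)
    finally:
        mm.close()
"""


def make_dummy_index(n_bytes=1 << 16):
    """Write a non-empty file of random bytes and return its path."""
    fd, path = tempfile.mkstemp(suffix='.idx')
    with os.fdopen(fd, 'wb') as f:
        f.write(os.urandom(n_bytes))
    return path


def file_checksum(path):
    """CRC32 of the file as read normally, for comparison."""
    with open(path, 'rb') as f:
        return zlib.crc32(f.read()) & 0xffffffff


def checksums_from_parallel_readers(path, n_procs=2):
    """Start n_procs children that mmap `path` concurrently; collect their checksums."""
    procs = [
        subprocess.Popen([sys.executable, '-c', CHILD_CODE, path],
                         stdout=subprocess.PIPE)
        for _ in range(n_procs)
    ]
    results = []
    for p in procs:
        out, _ = p.communicate()
        if p.returncode != 0:
            raise RuntimeError('reader process failed')
        results.append(int(out.decode().strip()))
    return results
```

A real test would replace the checksum with a nearest-neighbour query against the loaded index and check that both processes return the same result.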

@fortiema
Author

Great suggestion, let me add this as well!

@tmylk tmylk merged commit 3a546ca into piskvorky:develop Sep 27, 2016
@fortiema fortiema deleted the annoy-saveload branch September 28, 2016 01:16