
Word2Vec/Doc2Vec offer model-minimization method #446

Closed
gojomo opened this issue Sep 7, 2015 · 2 comments
Labels
difficulty easy · feature

Comments

@gojomo
Collaborator

gojomo commented Sep 7, 2015

If you're sure you're done training a model, several of its most memory-hungry parts can be discarded:

  • syn0 (the non-normalized vectors; be sure to save the normalized versions aside first)
  • syn1, syn1neg, syn0_lockf
  • doctag_syn0, doctag_syn0_lockf (in Doc2Vec)

There should be a documented method (such as finished_training) to discard these, plus tests ensuring there are no lingering, unintended dependencies on the discarded attributes.

Semantics: As a tradeoff, finished_training discards as many model attributes as possible while still being able to answer infer_vector and __getitem__ queries on the resulting trimmed model. No further training of that word2vec/doc2vec model is possible, and any attempt to do so raises a clear, understandable exception.

(Note though that a Doc2Vec model used for future infer_vector() ops needs to keep the syn0 & syn1* values.)
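A minimal sketch of what such a method could look like, against the attribute names above. `finished_training` and its `keep_inference` flag are hypothetical here (this is the proposal, not existing gensim API), as is the `training_finished` marker:

```python
import numpy as np

def finished_training(model, keep_inference=True):
    """Hypothetical sketch: discard training-only state from a trained
    Word2Vec/Doc2Vec model. Attribute names follow gensim's internals
    of this era; the method itself does not exist in the library."""
    # Save aside the unit-normalized vectors, which lookups/similarity
    # queries want, before anything else is discarded.
    model.syn0norm = (model.syn0 /
                      np.sqrt((model.syn0 ** 2).sum(axis=1))[:, np.newaxis])

    # The lock factors are consulted only by train(); always safe to drop.
    discard = ['syn0_lockf', 'doctag_syn0_lockf']
    if not keep_inference:
        # Per the caveat above: a Doc2Vec model that must still answer
        # infer_vector() has to keep syn0 and the syn1* layers.
        discard += ['syn0', 'syn1', 'syn1neg']
    for attr in discard:
        if hasattr(model, attr):
            delattr(model, attr)

    # Any later attempt to train should fail with a clear exception.
    model.training_finished = True  # hypothetical flag checked by train()
```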

@piskvorky added the "feature" and "difficulty easy" labels Sep 11, 2015
pum-purum-pum-pum added a commit to pum-purum-pum-pum/gensim that referenced this issue Oct 31, 2016
add finished_training method
@gojomo
Collaborator Author

gojomo commented Nov 2, 2016

Since I initially wrote this, I've seen cases where the non-unit-normalized syn0 is preferable to the unit-normed version. (Sometimes the magnitude of a vector is itself relevant, serving in some sense as an indicator of strong/unambiguous meaning.) Also, some Doc2Vec users only want the model for inference, but others would consider the doctag_syn0 to be what they want to keep around for lookups/similarity-rankings.

So utility functions for this model-slimming need to be very carefully named and documented to set expectations properly - and perhaps factored into separate operations, rather than one big finished_training().
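For example, the slimming could be factored into separate, explicitly named steps, each documented with exactly which capability it gives up. All names below are illustrative, not gensim API:

```python
# Illustrative factoring only; none of these functions exist in gensim.

def discard_lock_factors(model):
    """Drop the *_lockf arrays: ends further training, keeps everything else."""
    for attr in ('syn0_lockf', 'doctag_syn0_lockf'):
        if hasattr(model, attr):
            delattr(model, attr)

def discard_inference_weights(model):
    """Drop syn1/syn1neg: on a Doc2Vec model this also ends infer_vector()."""
    for attr in ('syn1', 'syn1neg'):
        if hasattr(model, attr):
            delattr(model, attr)

def discard_doctag_vectors(model):
    """Drop stored doc-vectors: keeps inference, ends doctag lookups and
    similarity rankings over the training documents."""
    if hasattr(model, 'docvecs'):
        del model.docvecs
```

Factored this way, an inference-only user would call discard_lock_factors() and discard_doctag_vectors(), while a lookup-only user would call discard_lock_factors() and discard_inference_weights() instead.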

tmylk pushed a commit that referenced this issue Nov 13, 2016
* issue #446

add finished_training method

* private _minimize_model, tests

We can't just call the word2vec superclass method explicitly without
adding a flag to save syn0_lockf, which is necessary to keep in
d2v.

* fix_print

* flag finished_training fix

* fix_bug with docvecs, controllability

* rename flag, flag move, init_sims

* renaming the RuntimeError message

* fix, add more tests

* fix, i == j

* fix

* tests_fix

* delete useless code

* numpy fix

* hs, neg in tests; assert parameter existence

* changelog update

* rename replace, description fix
@tmylk
Contributor

tmylk commented Feb 8, 2017

Fixed in #987
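(For later readers: as far as I can reconstruct, the method merged in #987 ended up named delete_temporary_training_data. The sketch below shows its use with the Doc2Vec parameter names of that era; treat the exact signature as approximate, since this API was later reworked and eventually removed.)

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=['human', 'machine', 'interface'], tags=['doc0']),
        TaggedDocument(words=['graph', 'of', 'trees'], tags=['doc1'])]
model = Doc2Vec(docs, size=20, min_count=1, iter=5)  # pre-4.0 parameter names

# Free the training-only arrays; keep doctag vectors for lookups and the
# word/hidden weights that infer_vector() needs.
model.delete_temporary_training_data(keep_doctags_vectors=True,
                                     keep_inference=True)

vec = model.infer_vector(['human', 'trees'])  # still works after trimming
# model.train(docs)  # would now raise RuntimeError: the model was trimmed
```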

@tmylk closed this as completed Feb 8, 2017