Restore/improve/streamline hooks for controlling/reusing build_vocab() steps #2975

gojomo opened this issue Oct 6, 2020 · 2 comments

gojomo commented Oct 6, 2020

Background:

My "2Vec refactor wishlist" (#1623) suggested among other things:

  1. separating vocabulary-management into explicitly different classes/objects, for more control/customization, perhaps including closer integration with new n-gram (phrasing) options

I had made a few tiny steps toward enabling more control by segmenting the build_vocab() step into 3 distinct, sequential steps (a rough usage sketch follows the list below):

  • scan_vocab(): survey the corpus only, with few model-specific influences, so that you'd have a reusable model that could be saved & reused with alternate downstream params
  • prepare_vocab(): start specializing the model by vocabulary, with reporting of the memory effects of different choices
  • finalize_vocab(): actually allocate vectors & support data-structures dependent on a frozen, final vocabulary
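For illustration, a minimal sketch of driving those three steps by hand, using the step names as listed above. (Exact method names, locations & signatures have shifted across gensim versions, e.g. scale_vocab vs. prepare_vocab, and the dry_run memory-reporting flag shown is illustrative; treat this as a sketch of the intended workflow, not a version-exact recipe.)

from gensim.models import Word2Vec

sentences = [["human", "interface", "computer"], ["survey", "user", "computer", "system"]]

model = Word2Vec(min_count=1)       # no corpus given to the constructor: drive the vocab steps manually

model.scan_vocab(sentences)         # 1) survey the corpus only; the raw counts are reusable
model.prepare_vocab(dry_run=True)   # 2) report the memory effects of min_count/sample choices, changing nothing
model.prepare_vocab()               #    ...then actually apply them
model.finalize_vocab()              # 3) allocate vectors & supporting structures for the frozen vocabulary

model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)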

The #1777 refactoring both destroyed that distinct 3-step definition of build_vocab() and moved away from actually-separable/reusable vocabulary-objects (despite claiming to do the opposite), as per some of the problems I highlighted at the time.

Most of the complexifying damage from #1777 has been reversed, but the clean 3-step decomposition hasn't yet been restored. And it had been my hope that the 'KeyedVectors & other undoing of #1777' work in #2698 might also address this vocab-reuse concern, as per my status & wishlist comment of Jan 30:

Functionally I'd also like to:

  • formally ensure the prior decomposition of build_vocab() into distinct scan/scale/finalize steps again works (or something similar/better, for people wanting to do finicky things with their vocabulary)

...and then if I see a path that's not too complex...

  • perhaps add an alternate initialization path from existing KeyedVectors, providing an official path to the oft-repeated "re-use other word-vectors to initialize my X model" request (see the sketch after this list)
  • maybe add a more rational & better-supported way to modify the vocabulary of an existing model between training runs
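To make that KeyedVectors wishlist item concrete, here's a rough sketch of the manual workaround users repeat today, seeding a fresh model's vectors from pretrained KeyedVectors wherever the vocabularies overlap. There's no official API for this yet (that's the point of the item above); the file name & corpus are placeholders, and the attribute names follow the gensim-4.0-style KeyedVectors (key_to_index, vectors).

from gensim.models import Word2Vec, KeyedVectors

pretrained = KeyedVectors.load("pretrained_vectors.kv")   # placeholder path to existing word-vectors
my_corpus = [["human", "interface", "computer"], ["survey", "user", "computer", "system"]]

model = Word2Vec(vector_size=pretrained.vector_size, min_count=1)
model.build_vocab(my_corpus)

# copy pretrained vectors over the fresh random initialization, wherever words overlap
for word, idx in model.wv.key_to_index.items():
    if word in pretrained:
        model.wv.vectors[idx] = pretrained[word]

model.train(my_corpus, total_examples=model.corpus_count, epochs=model.epochs)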

There's been a little improvement here in #2698 and since (including #2944): far fewer methods & lines of code achieving the same things, and some things that used to fail (#2853) & segfault (#1019) no longer doing so. But nothing matching the specific hopes above.

Now:

Still, recently cleaning up FastText & related initialization led me to dust off some earlier false-start code, & I now have an approach I like that's working for Word2Vec & Doc2Vec (& should be working for FastText soon).

The top-level summary is that build_vocab() becomes, in the initial run, abstractly & essentially:

def build_vocab(self, corpus, **etc):
    survey = self.scan_vocab(corpus)  # do only corpus-surveying
    self.allocate_model(survey)  # do only model-initialization

The survey is an instance of a new vocab-and-other-corpus-stats utility class, currently named TokenSurvey, & mainly a frequency dict like the old raw_vocab, wrapped with a few other things. Notably, this can be used outside of the 2Vec classes, essentially to do the (long, costly) corpus-analysis first, once, and save it aside. Then it can be examined/altered in arbitrary ways, and also reused multiple times later.
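A rough sketch of that intended scan-once/reuse-many pattern; TokenSurvey, scan_vocab() returning a survey, and allocate_model() are proposed, not-yet-merged APIs, so every name & signature below is provisional:

from gensim.models import Word2Vec

corpus = [["human", "interface", "computer"], ["survey", "user", "computer", "system"]]

# pay the (long, costly) corpus-scan once, and save the result aside
survey = Word2Vec().scan_vocab(corpus)         # proposed: returns a TokenSurvey
survey.save("my_corpus.survey")

# later, possibly elsewhere: reload, inspect/alter as desired, and reuse for several models
survey = TokenSurvey.load("my_corpus.survey")  # hypothetical class, not yet in gensim
small = Word2Vec(vector_size=50, min_count=2)
large = Word2Vec(vector_size=300, min_count=5)
for model in (small, large):
    model.allocate_model(survey)               # proposed: model-initialization only, no re-scan
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)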

And, while TokenSurvey currently replicates the crude capped-size pruning (#2024) we've historically used to prevent a vocab-scan from overflowing memory, it'd be amenable to a workalike swap-out that instead uses disk scratch space to do a precise vocabulary count, or an approximate counting process (#1962). Similarly, it'd be amenable to users doing whatever other corpus-surveys they want (multithreaded, Hadoop-backed, whatever), knowing that if they can construct a TokenSurvey-workalike from their results, they can hand it to the *2Vec classes to make their model from that.
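And a sketch of that "bring your own survey" idea: construct a TokenSurvey-workalike from counts computed elsewhere (a Hadoop job, an approximate counter, etc.) and hand it straight to a *2Vec class. Again, every class & attribute name here (raw_vocab, total_words, corpus_count, allocate_model) is a placeholder for the proposed interface, not an existing API:

from gensim.models import Word2Vec

# token frequencies computed by some external process (Hadoop word-count, approximate counting, ...)
external_counts = {"human": 12345, "computer": 9876, "interface": 4321}

survey = TokenSurvey()                          # hypothetical: the proposed utility class
survey.raw_vocab = dict(external_counts)        # frequency dict, much like the old .raw_vocab
survey.total_words = sum(external_counts.values())
survey.corpus_count = 1_000_000                 # number of texts the external survey covered

model = Word2Vec(vector_size=100, min_count=5)
model.allocate_model(survey)                    # the model never has to re-scan the corpus itself

corpus_stream = ...                             # the actual corpus iterable, streamed only for training
model.train(corpus_stream, total_examples=survey.corpus_count, epochs=model.epochs)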

The API impacts are:

  • no external effect on people who are only using the corpus-in-constructor, or simple build_vocab()-then-train() patterns
  • people who were doing custom things before/during/after build_vocab(), especially related to the memory-estimations once available from prepare_vocab(), or doing any direct access to the old .raw_vocab, will need to update their code. But I think this is 1-in-a-hundred users, and more advanced users, who'll be happy with the new separable extension/save points.
  • slight back-compat work needed, as old models with a raw_vocab will need some hot-upgrade to survey objects wherever possible

PR forthcoming as soon as I have it working for FastText.

@piskvorky (Owner), quoting the above:

  I think this is 1-in-a-hundred users, and more advanced users, who'll be happy with the new separable extension/save points.

Agreed. This is one point where backward compatibility is no concern.

piskvorky commented Oct 6, 2020

Maybe it's too late at night and I'm hallucinating, but if we manage to squeeze in enough "goodies" into 4.0 (speed, memory, new functionality) that directly benefit the user (as opposed to "refactored to be prettier / more extensible internally / more robust"), we could get away with not supporting any old models at all.

I know it's a dramatic 180° from my previous position, but if the benefits warrant it, 4.0 is the place to do it. We have to carefully weigh the annoyance to us (coding up migrations on our side) vs the annoyance to users (coding up migrations & retraining on their side).
