My "2Vec refactor wishlist" (#1623) suggested among other things:
separating vocabulary-management into explicitly different classes/objects, for more control/customization, perhaps including closer integration with new n-gram (phrasing) options
I had made a few small steps toward enabling more control by segmenting the build_vocab() step into 3 distinct, sequential steps (usage sketched just after this list):
scan_vocab(): survey the corpus only, with few model-specific influences, so that you'd have a reusable model that could be saved & reused with alternate downstream params
prepare_vocab(): start specializing the model by vocabulary, with reporting of the memory effects of different choices
finalize_vocab(): actually allocate vectors & supporting data-structures dependent on a frozen, final vocabulary
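For concreteness, a hedged sketch of how that 3-step flow was used; the exact method signatures shifted across gensim versions (and the middle step was at times named scale_vocab), so treat every argument here as illustrative:

```python
# Hedged sketch of the prior 3-step decomposition; signatures varied
# across gensim versions, so all arguments here are illustrative only.
from gensim.models import Word2Vec

# corpus: any restartable iterable of token-lists (placeholder)
model = Word2Vec(min_count=5)      # no corpus passed, so no vocab work yet
model.scan_vocab(corpus)           # 1: survey the corpus only
model.prepare_vocab(dry_run=True)  # 2: report memory effects, commit nothing
model.prepare_vocab()              # 2: apply trimming/scaling choices
model.finalize_vocab()             # 3: allocate vectors & support structures
```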
The #1777 refactoring both destroyed that distinct 3-step definition of build_vocab(), and moved away from actually-separable/reusable vocabulary-objects (despite claiming to do the opposite) - as per some of the problems I highlighted at the time.
Most of the complexifying damage from #1777 has been reversed, but the clean 3-step decomposition hasn't yet been restored. And, it had been my hope that the 'KeyedVectors & other undoing of #1777' #2698 might also address this vocab-reuse concern, as per my status & wishlist comment Jan 30:
Functionally I'd also like to:
formally ensure the prior decomposition of build_vocab() into distinct scan/scale/finalize steps again works (or something similar/better, for people wanting to do finicky things with their vocabulary)
...and then if I see a path that's not too complex...
perhaps add an alternate initialization path from existing KeyedVectors – providing an official path to the oft-repeated "re-use other word-vectors to initialize my X model" request (the current manual workaround is sketched after this list)
maybe add a more rational & better-supported way to modify the vocabulary of an existing model between training runs
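For context on that KeyedVectors item: absent an official path, users today typically hand-copy pretrained vectors into a freshly-built model. A rough sketch of that workaround, assuming gensim 4.x-style attribute names and placeholder pretrained_kv / corpus objects:

```python
# Manual "seed a new model from existing KeyedVectors" workaround; purely
# illustrative, since no official API for this exists yet (hence the wish).
from gensim.models import Word2Vec

model = Word2Vec(vector_size=pretrained_kv.vector_size, min_count=5)
model.build_vocab(corpus)
for word in model.wv.index_to_key:            # words surviving trimming
    if word in pretrained_kv:
        model.wv[word] = pretrained_kv[word]  # replace random init vector
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```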
There's been a little improvement here in #2698 and since (including #2944) - far fewer methods & lines of code achieving the same things, some things that used to fail (#2853) & segfault (#1019) no longer doing so. But nothing matching the specific hopes above.
Now:
Still, cleaning up the FastText & related initialization recently led me to dust off some earlier false-start code, & I now have an approach I like that's working for Word2Vec & Doc2Vec (& should be working for FastText soon).
The top-level summary is that build_vocab() becomes, in the initial run, abstractly & essentially:
```python
def build_vocab(self, corpus, **etc):
    survey = self.scan_vocab(corpus)  # do only corpus-surveying
    self.allocate_model(survey)       # do only model-initialization
```
The survey is an instance of a new vocab-and-other-corpus-stats utility class – currently named TokenSurvey, & mainly a frequency dict like the old raw_vocab wrapped with a few other things. Notably, this can be used outside of the 2Vec classes, essentially to do the (long, costly) corpus-analysis first, once, and save it aside. Then, possibly examine/alter it in arbitrary ways, and also reuse it multiple times later.
And, while TokenSurvey currently replicates the crude capped-size pruning (#2024) we've historically used to prevent a vocab-scan from overflowing memory, it'd be amenable to a workalike swap-out that just uses disk scratch space to do a precise vocabulary count, or an approximate counting process (#1962). Similarly, it'd be amenable to users doing whatever other corpus-surveys they want - multithreaded, Hadoop-backed, whatever – and just knowing, if they can construct a TokenSurvey-workalike from their results, they can hand it to the *2Vec classes to make their model from that.
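To make the workalike idea concrete, here's a minimal sketch of what such a survey object might boil down to – every name below is an assumption, since TokenSurvey's actual attributes aren't finalized:

```python
from collections import Counter

class MyTokenSurvey:
    """Minimal survey-workalike: a frequency dict plus corpus totals.
    (All names here are illustrative assumptions, not the final API.)"""
    def __init__(self):
        self.raw_vocab = Counter()  # token -> count, like the old raw_vocab
        self.total_words = 0        # total token instances seen
        self.corpus_count = 0       # total texts seen

    def note_text(self, tokens):
        self.raw_vocab.update(tokens)
        self.total_words += len(tokens)
        self.corpus_count += 1

# Survey once - perhaps in some external/distributed job - then save it
# aside, examine/alter it, & reuse it across multiple later models:
survey = MyTokenSurvey()
for tokens in my_corpus:  # my_corpus: placeholder iterable of token-lists
    survey.note_text(tokens)
```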
The API impacts are:
no external effect on people who are only using the corpus-in-constructor, or simple build_vocab() then train(), patterns (both shown in the sketch after this list)
people who were doing custom things before/during/after build_vocab(), especially anything relying on the memory-estimations once available from prepare_vocab(), or any direct access to the old .raw_vocab, will need to update their code. But I think this affects maybe 1-in-a-hundred users, and the more advanced users at that, who'll be happy with the new separable extension/save points.
slight back-compat work needed, as old models with raw_vocab will need some hot-upgrade to survey objects wherever possible
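For clarity, these are the unaffected common patterns, in standard current-gensim usage (nothing new assumed; corpus is a placeholder iterable of token-lists):

```python
from gensim.models import Word2Vec

# Pattern 1: corpus-in-constructor; vocab-build & training happen implicitly.
model = Word2Vec(sentences=corpus)

# Pattern 2: explicit build_vocab() then train().
model = Word2Vec()
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```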
PR forthcoming as soon as I have it working for FastText
Maybe it's too late at night and I'm hallucinating, but if we manage to squeeze in enough "goodies" into 4.0 (speed, memory, new functionality) that directly benefit the user (as opposed to "refactored to be prettier / more extensible internally / more robust"), we could get away with not supporting any old models at all.
I know it's a dramatic 180° from my previous position, but if the benefits warrant it, 4.0 is the place to do it. We have to carefully weigh the annoyance to us (coding up migrations our side) vs annoyance to users (coding up migrations & retraining their side).