Restore/improve/streamline hooks for controlling/reusing build_vocab() steps #2975

gojomo opened this issue Oct 6, 2020 · 2 comments

gojomo commented Oct 6, 2020

Background:

My "2Vec refactor wishlist" (#1623) suggested among other things:

  1. separating vocabulary-management into explicitly different classes/objects, for more control/customization, perhaps including closer integration with new n-gram (phrasing) options

I had made a few tiny steps toward enabling more control by segmenting the build_vocab() step into 3 distinct, sequential steps (a rough usage sketch follows the list below):

  • scan_vocab(): survey the corpus only, with few model-specific influences, so that you'd have a reusable model that could be saved & reused with alternate downstream params
  • prepare_vocab(): start specializing the model by vocabulary, with reporting of the memory effects of different choices
  • finalize_vocab(): actually allocate vectors & support data-structures dependent on a frozen, final vocabulary
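For illustration, a minimal sketch of driving those three steps by hand, using the step names as listed above. (Exact method names, locations & signatures have shifted across gensim versions, e.g. scale_vocab vs. prepare_vocab, and the dry_run memory-reporting flag shown is illustrative; treat this as a sketch of the intended workflow, not a version-exact recipe.)

from gensim.models import Word2Vec

sentences = [["human", "interface", "computer"], ["survey", "user", "computer", "system"]]

model = Word2Vec(min_count=1)       # no corpus given to the constructor: drive the vocab steps manually

model.scan_vocab(sentences)         # 1) survey the corpus only; the raw counts are reusable
model.prepare_vocab(dry_run=True)   # 2) report the memory effects of min_count/sample choices, changing nothing
model.prepare_vocab()               #    ...then actually apply them
model.finalize_vocab()              # 3) allocate vectors & supporting structures for the frozen vocabulary

model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)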

The #1777 refactoring both destroyed that distinct 3-step definition of build_vocab() and moved away from actually-separable/reusable vocabulary-objects (despite claiming to do the opposite), as per some of the problems I highlighted at the time.

Most of the complexifying damage from #1777 has been reversed, but the clean 3-step decomposition hasn't yet been restored. And it had been my hope that the 'KeyedVectors & other undoing of #1777' work in #2698 might also address this vocab-reuse concern, as per my status & wishlist comment of Jan 30:

Functionally I'd also like to:

  • formally ensure the prior decomposition of build_vocab() into distinct scan/scale/finalize steps again works (or something similar/better, for people wanting to do finicky things with their vocabulary)

...and then if I see a path that's not too complex...

  • perhaps add an alternate initialization path from existing KeyedVectors, providing an official path to the oft-repeated "re-use other word-vectors to initialize my X model" request (see the sketch after this list)
  • maybe add a more rational & better-supported way to modify the vocabulary of an existing model between training runs
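To make that KeyedVectors wishlist item concrete, here's a rough sketch of the manual workaround users repeat today, seeding a fresh model's vectors from pretrained KeyedVectors wherever the vocabularies overlap. There's no official API for this yet (that's the point of the item above); the file name & corpus are placeholders, and the attribute names follow the gensim-4.0-style KeyedVectors (key_to_index, vectors).

from gensim.models import Word2Vec, KeyedVectors

pretrained = KeyedVectors.load("pretrained_vectors.kv")   # placeholder path to existing word-vectors
my_corpus = [["human", "interface", "computer"], ["survey", "user", "computer", "system"]]

model = Word2Vec(vector_size=pretrained.vector_size, min_count=1)
model.build_vocab(my_corpus)

# copy pretrained vectors over the fresh random initialization, wherever words overlap
for word, idx in model.wv.key_to_index.items():
    if word in pretrained:
        model.wv.vectors[idx] = pretrained[word]

model.train(my_corpus, total_examples=model.corpus_count, epochs=model.epochs)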

There's been a little improvement here in #2698 and since (including #2944): far fewer methods & lines of code achieving the same things, and some things that used to fail (#2853) & segfault (#1019) no longer doing so. But nothing matching the specific hopes above.

Now:

Still, recently cleaning up FastText & related initialization led me to dust off some earlier false-start code, & I now have an approach I like that's working for Word2Vec & Doc2Vec (& should be working for FastText soon).

The top-level summary is that build_vocab() becomes, in the initial run, abstractly & essentially:

def build_vocab(self, corpus, **etc):
    survey = self.scan_vocab(corpus)  # do only corpus-surveying
    self.allocate_model(survey)  # do only model-initialization

The survey is an instance of a new vocab-and-other-corpus-stats utility class, currently named TokenSurvey, & mainly a frequency dict like the old raw_vocab, wrapped with a few other things. Notably, this can be used outside of the 2Vec classes, essentially to do the (long, costly) corpus-analysis first, once, and save it aside. Then it can be examined/altered in arbitrary ways, and also reused multiple times later.
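A rough sketch of that intended scan-once/reuse-many pattern; TokenSurvey, scan_vocab() returning a survey, and allocate_model() are proposed, not-yet-merged APIs, so every name & signature below is provisional:

from gensim.models import Word2Vec

corpus = [["human", "interface", "computer"], ["survey", "user", "computer", "system"]]

# pay the (long, costly) corpus-scan once, and save the result aside
survey = Word2Vec().scan_vocab(corpus)         # proposed: returns a TokenSurvey
survey.save("my_corpus.survey")

# later, possibly elsewhere: reload, inspect/alter as desired, and reuse for several models
survey = TokenSurvey.load("my_corpus.survey")  # hypothetical class, not yet in gensim
small = Word2Vec(vector_size=50, min_count=2)
large = Word2Vec(vector_size=300, min_count=5)
for model in (small, large):
    model.allocate_model(survey)               # proposed: model-initialization only, no re-scan
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)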

And, while TokenSurvey currently replicates the crude capped-size pruning (#2024) we've historically used to prevent a vocab-scan from overflowing memory, it'd be amenable to a workalike swap-out that instead uses disk scratch space to do a precise vocabulary count, or an approximate counting process (#1962). Similarly, it'd be amenable to users doing whatever other corpus-surveys they want (multithreaded, Hadoop-backed, whatever), knowing that if they can construct a TokenSurvey-workalike from their results, they can hand it to the *2Vec classes to make their model from that.
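And a sketch of that "bring your own survey" idea: construct a TokenSurvey-workalike from counts computed elsewhere (a Hadoop job, an approximate counter, etc.) and hand it straight to a *2Vec class. Again, every class & attribute name here (raw_vocab, total_words, corpus_count, allocate_model) is a placeholder for the proposed interface, not an existing API:

from gensim.models import Word2Vec

# token frequencies computed by some external process (Hadoop word-count, approximate counting, ...)
external_counts = {"human": 12345, "computer": 9876, "interface": 4321}

survey = TokenSurvey()                          # hypothetical: the proposed utility class
survey.raw_vocab = dict(external_counts)        # frequency dict, much like the old .raw_vocab
survey.total_words = sum(external_counts.values())
survey.corpus_count = 1_000_000                 # number of texts the external survey covered

model = Word2Vec(vector_size=100, min_count=5)
model.allocate_model(survey)                    # the model never has to re-scan the corpus itself

corpus_stream = ...                             # the actual corpus iterable, streamed only for training
model.train(corpus_stream, total_examples=survey.corpus_count, epochs=model.epochs)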

The API impacts are:

  • no external effect on people who are only using the corpus-in-constructor, or simple build_vocab()-then-train() patterns
  • people who were doing custom things before/during/after build_vocab(), especially related to the memory-estimations once available from prepare_vocab(), or doing any direct access to the old .raw_vocab, will need to update their code. But I think this is 1-in-a-hundred users, and more advanced users, who'll be happy with the new separable extension/save points.
  • slight back-compat work needed, as old models with a raw_vocab will need some hot-upgrade to survey objects wherever possible

PR forthcoming as soon as I have it working for FastText.

@piskvorky (Owner), quoting the above:

  I think this is 1-in-a-hundred users, and more advanced users, who'll be happy with the new separable extension/save points.

Agreed. This is one point where backward compatibility is no concern.

piskvorky commented Oct 6, 2020

Maybe it's too late at night and I'm hallucinating, but if we manage to squeeze in enough "goodies" into 4.0 (speed, memory, new functionality) that directly benefit the user (as opposed to "refactored to be prettier / more extensible internally / more robust"), we could get away with not supporting any old models at all.

I know it's a dramatic 180° from my previous position, but if the benefits warrant it, 4.0 is the place to do it. We have to carefully weigh the annoyance to us (coding up migrations on our side) vs the annoyance to users (coding up migrations & retraining on their side).
