Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] fix save_facebook_model failure after update-vocab & other initialization streamlining #2944

Merged
merged 17 commits into from
Oct 15, 2020

Conversation

gojomo
Copy link
Collaborator

@gojomo gojomo commented Sep 12, 2020

Fixes #2853 & fixes #2943.

Refactors a bunch of FastText/Word2Vec/Doc2Vec initialization & update code for clarity & less-duplication. Each of these classes now share the faster bulk style of vector-random-initalization previously only FastText used, a slight change in prior per-vector seeding behavior.

Seems to fix the test code for triggering ancient Doc2Vec update crash in #1019 - but there may still be problems there, and there's been no design/testing for build_vocab(..., update=True) on Doc2Vec, so current status there is still "not supported".

There was some purported (but possibly not useful, and not covered by unit-tests) mmap_path-specifying code in KeyedVectors that's been stubbed out here with FIXMEs/comments, pending intended (#2955) repair/rationalization of that functionality before 4.0.0 final release.

@piskvorky piskvorky added this to the 4.0.0 milestone Sep 13, 2020
@gojomo gojomo force-pushed the ft_save_after_update_vocab branch 4 times, most recently from cd2e326 to 52adad8 Compare September 14, 2020 03:54
@gojomo
Copy link
Collaborator Author

gojomo commented Sep 14, 2020

The paths for first, or post-vocab-expansion, allocation/initialization of vectors/vectors_vocab were more complex & redundant than necessary - streamlining into fewer methods also cleared up the discrepancy causing the save_facebook_model error.

There's a chance this also resolves very-old Doc2Vec-segfault #1019 - the test code there doesn't trigger a segfault anymore, but maybe it was other recent changes that fixed it, or maybe it's just become harder to trigger.

Making the Word2Vec & Doc2Vec vector-randomizations share the FastText code-path means they all now do the randomization in one bulk call, as FT has been doing, instead of the vector-at-a-time-seeded-by-key that W2V/D2V were doing. This should be faster.

But, at the moment the prior support for memmapped vector-arrays hasn't yet been maintained in this refactor, and the .uniform() call used (following the FT precedent) will 1st generate oversized np.float64numbers. Fixing those is next.

gensim/models/word2vec.py Show resolved Hide resolved
gensim/models/word2vec.py Outdated Show resolved Hide resolved
gensim/models/word2vec.py Outdated Show resolved Hide resolved
@gojomo
Copy link
Collaborator Author

gojomo commented Sep 23, 2020

When the larger #2939 is merged, I propose rebasing this on that & applying - it fixes the original goal, and then some.

But, mmap functionality is currently broken in KeyedVectors, and likely other 2Vec classes as well - though in ways not covered by tests. That would best be addressed in a separate focused issue/PR - for which I've started #2955.

@piskvorky
Copy link
Owner

piskvorky commented Sep 28, 2020

#2939 is merged now. @gojomo will you be able to finish this & the broken mmap?

I'd love to release this week. Only the documentation items in #2960 (for me) and this ticket (for you, hopefully) are left now.

Plus the decision around supporting pre-3.8.3 models, but that's best discussed live – see email.

@gojomo gojomo force-pushed the ft_save_after_update_vocab branch from 2387b5d to a21cad6 Compare September 28, 2020 22:59
@gojomo
Copy link
Collaborator Author

gojomo commented Sep 29, 2020

Subject to the provisos...

...this is ready-for-final-review & merging!

@gojomo gojomo changed the title [WIP] save_facebook_model after update-vocab [MRG] save_facebook_model after update-vocab Sep 29, 2020
gensim/models/fasttext.py Outdated Show resolved Hide resolved
gensim/models/fasttext.py Outdated Show resolved Hide resolved

def prep_vectors(target_shape, prior_vectors=None, seed=0, dtype=REAL):
"""Return a numpy array of the given shape. Reuse prior_vectors values instance or values
to extent possible. Initialize new values randomly if requested."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a public API function. We should document the parameters so they appear in the documentation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It ultimately may not be, pending the mmap work & other initialization clean-up, which might also jostle the internal names a bit. (At the moment, this is only called from resize_vectors() which may be the preferable public-entry point, because outside callers are less-likely to have a prior_vectors.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but this stuff will show up in the docs, right?

I can think of several better ways forward:

  • Make the docstring a code-comment so it doesn't show up in the docs
  • Mark the function as internal

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the PR description up to date? W.r.t. functionality contained within, and issues fixed by it.

PR titles and descriptions is what we have to go with for release notes. Plus that's what interested users see when they click through for more details on a change.

@piskvorky
Copy link
Owner

@gojomo is it still the case this doesn't fix #2853? Is the PR description up-to-date w.r.t. all the changes within?

@gojomo gojomo force-pushed the ft_save_after_update_vocab branch from 267594d to dda970e Compare October 6, 2020 07:54
@gojomo gojomo changed the title [MRG] save_facebook_model after update-vocab [MRG] fix save_facebook_model failure after update-vocab & other initialization streamlining Oct 6, 2020
@gojomo
Copy link
Collaborator Author

gojomo commented Oct 6, 2020

Title & top-comment updated to describe what's addressed. Any lingering TODOs/doc-comment/naming stuff will be addressed in a followup PR related to #2955 or others.

@gojomo gojomo changed the title [MRG] fix save_facebook_model failure after update-vocab & other initialization streamlining [debugging] fix save_facebook_model failure after update-vocab & other initialization streamlining Oct 6, 2020
@gojomo
Copy link
Collaborator Author

gojomo commented Oct 6, 2020

So, that one distant test-failing is a real head-scratcher, as noted in the code-comment I've added in 1edbb4c:

    rng = np.random.default_rng(seed=seed)  # use new instance of numpy's recommended generator/algorithm               
    # FIXME: `uniform` passes all tests, but generates temporary double-sized np.float64 array,                         
    # then cast-down ito right-sized np.float32, which means momentary 3x RAM usage on the model's                      
    # largest structure (often GB in size)                                                                              
    new_vectors = rng.uniform(-1.0, 1.0, target_shape).astype(dtype)
    # Meanwhile, this alternative, which by docs/reasoning/visual-inspection should be equivalent                       
    # while never creating the unneeded oversized np.float64 array, passes all *2Vec class                              
    # functional tests, but mysteriously (but reliably!) fails one obscure barely-sensible test                         
    # of a fringe downstream functionality: `TestBackMappingTranslationMatric.test_infer_vector`.                       
    # I'd adjust or jettison that test entirely *except* that the failure is *so* reliable, and                         
    # *so* mysterious, that it may be warning of something very subtle. So for now, very briefly,                       
    # sticking with the RAM-wasteful-but-all-tests-passing approach above, TODO debug/fix ASAP.                         
    # new_vectors = rng.random(target_shape, dtype=dtype)  # [0.0, 1.0)                                                 
    # new_vectors *= 2.0  # [0.0, 2.0)                                                                                  
    # new_vectors -= 1.0  # [-1.0, 1.0)                                                                                 
    new_vectors /= vector_size

So, in order to get this PR mergeable, I've reverted the code to what works-but-is-inefficient, with that blaring comment commitment to remedy as part of the next PR. (The inefficiency is what FastText has been doing all along, and what passes all tests, so no little harm to develop, but still something I'd like to fix before 4.0.0-final.)

@gojomo gojomo changed the title [debugging] fix save_facebook_model failure after update-vocab & other initialization streamlining [MRG] fix save_facebook_model failure after update-vocab & other initialization streamlining Oct 6, 2020
gensim/models/keyedvectors.py Outdated Show resolved Hide resolved
@gojomo
Copy link
Collaborator Author

gojomo commented Oct 8, 2020

On further investigation, the flimsiness of the failing test is the real problem, and it was only passing in the 1st place via lucky seeding. (Over a range of seeds, either code alternative makes that test fail 20-30% of the time.) So I'll disable that test, & created #2977 to eventually – not at all urgently – clean up the related classes/docs.

@piskvorky
Copy link
Owner

piskvorky commented Oct 9, 2020

Thanks! Merge will have to wait for @mpenkov's review, but I already branched off here for my tickets, so no more rebasing / force-pushing please (normal merges are OK).

Copy link
Collaborator

@mpenkov mpenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments. Please have a look @gojomo .

gensim/models/doc2vec.py Outdated Show resolved Hide resolved
Vector dimensions will default to `np.float32` (AKA `REAL` in some Gensim code) unless
another type is provided here.
mapfile_path : string, optional
TODO: UNDER CONSTRUCTION / SUBJECT TO CHANGE - pending mmap work
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we do about this parameter? I don't think leaving a TODO in publicly visible documentation is a good look.

Suggested change
TODO: UNDER CONSTRUCTION / SUBJECT TO CHANGE - pending mmap work
Currently unimplemented.

But really, if we're not currently using this parameter, let's just get rid of it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my comments 4 days ago, I'm not proposing any of these comments become part of the permanent codebase - they're all just temporary until #2955 is addressed, to prevent this PR from growing to become a diverse grab-bag of things unrelated to its title and initial focus. As @piskvorky has already started building on this in another PR, and I've got other functional things that depend on this, polishing comments here just risks making the tangle of PRs-on-PRs worse.

gensim/models/keyedvectors.py Outdated Show resolved Hide resolved

def prep_vectors(target_shape, prior_vectors=None, seed=0, dtype=REAL):
"""Return a numpy array of the given shape. Reuse prior_vectors values instance or values
to extent possible. Initialize new values randomly if requested."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but this stuff will show up in the docs, right?

I can think of several better ways forward:

  • Make the docstring a code-comment so it doesn't show up in the docs
  • Mark the function as internal

gensim/models/word2vec.py Outdated Show resolved Hide resolved
gensim/test/test_translation_matrix.py Outdated Show resolved Hide resolved
gojomo and others added 2 commits October 9, 2020 18:06
Co-authored-by: Michael Penkov <m@penkov.dev>
@gojomo
Copy link
Collaborator Author

gojomo commented Oct 10, 2020

For some reason, github isn't showing any option to reply to @mpenkov's comment about prep_vectors. To that:

So, I'd say: let's see how it settles in that other work.

@gojomo
Copy link
Collaborator Author

gojomo commented Oct 15, 2020

Does the approval mean it's OK for me to push the Github 'rebase & merge' button?

@piskvorky
Copy link
Owner

OK from me. Don't know about @mpenkov . Definitely no rebase though – just merge!

@gojomo
Copy link
Collaborator Author

gojomo commented Oct 15, 2020

I'm just looking at the default button in the Github interface (which I suspect results in the cleanest trunk history), and might not foul any history elsewhere... nor foul other similar merges). But: if you or @mpenkov can just merge this as soon as it's OK, that'll be quickest - in my opinion, this PR has long been functionally done, and I'd prefer for simplicity to base the followup PR off develop.

@piskvorky
Copy link
Owner

Okay. Merging here – @mpenkov please add any additional comments here / into @gojomo 's new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issue with calculation of new_words and pre_exist_words save_facebook_model() - AssertionError
3 participants