Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] 2Vec SaveLoad improvements #2892

Closed
wants to merge 7 commits into from

Conversation

gojomo
Copy link
Collaborator

@gojomo gojomo commented Jul 23, 2020

The 2Vec classes were doing serialization fixups in save()/load() that should really, by utils.SaveLoad design, be in _save_specials()/_load_specials() - to better handle cases when one SaveLoad contains another.

Also looks like there may be no existing functional tests of loading of gensim-3.8.3 FastText model, so it's possible that functionality is broken-with-an-unhelpful-failure. If so, I'll add a test & fix for that here, if possible.

(This builds atop #2891, which can & should be applied asap.)

@piskvorky
Copy link
Owner

@gojomo this branch has conflicts, so I resolved them in gojomo#11. Please review & merge that PR.

Anything else before this PR can be merged?

@gojomo
Copy link
Collaborator Author

gojomo commented Sep 8, 2020

Everything currently in this is probably a good/safe change, but the initial hope was to knock down some related issues in this PR as well, such as some or all of:

  • ensure there's at least one full test load of gensim-3.8.3 FastText/Word2Vec/Doc2Vec models that might not yet be tested
  • resolve whatever mysterious bloat is in the pickle file of pre-4.0-dev Word2Vec models
  • double-checking mmap options are functionally tested across 2Vec models, & fixing anything that might turn up
  • delete all tests of loading no-longer-supported older models – ideally, everything earlier than gensim-3.8.3 - and then potentially discarding support code thus no-longer needed

@piskvorky
Copy link
Owner

piskvorky commented Sep 8, 2020

Alright. I'll see what I can add in my PR. Please merge my PR gojomo#11 into this PR, to avoid divergence (I already had to resolve some conflicts).

@gojomo
Copy link
Collaborator Author

gojomo commented Sep 8, 2020

I know your PR-against-a-PR is trying to help, but your changes would work much better in their own cleanup PR.

My intent creating this PR was to early-apply some of the CI machine auto-testing as a convenience against changes I thought safe, but might've had subtle other problems. It was marked [WIP] so I'd have the freedom to rebase/squash/force-push etc to keep the commit-log clean & focused. (If we could adopt an even stronger label to indicate, "this may be squashed/reorged/rolled-back/etc", I'd use that.)

I also find it much easier to handle core functionality/coverage issues totally separately from superficial cleanup like docstring-formatting or py2-to-py3 changes - it keeps the diffs focused & easy-to-skim, without mixed-in other changes, and avoids confusion over what might be to blame for any test-result/functionality changes.

Also, reviewing-for-integration someone else's merge-from-develop is an inefficient distraction. I could rebase this atop any number of changes on develop quite easily, because I knew exactly the intent of this PR & the couple-dozen lines changed/moved. But your ~38 commits & 65-files-changed merge-from-develop is a giant interleaving of what I was focused on, with a bunch of miscellany: https://github.com/gojomo/gensim/pull/11/files. It's far harder to review for interactions. If it was just preamble to the one-newline-and-two-backticks of gojomo@49b35b7, those can and should come another way. Even if I just blindly trusted the merge, as you've almost certainly done the right conflict-resolutions, it would leave this PR less focused and reviewable for its original purpose.

I would prefer to rebase this on develop, then apply the one-newline-and-two-backticks, & then merge that into develop for you to do your other (py2/py3/etc) cleanup on separate PR(s) from there.

@piskvorky
Copy link
Owner

piskvorky commented Sep 8, 2020

I understand your preference for "private" PRs, but right now I'd prefer to finish up things that can be finished, in standard "shared" collaborative PRs. To unblock the release.

I cannot edit your PR, so please either finish it yourself, or I'll close this and merge my branch directly (hopefully after implementing some of your other suggestions from your checklist above).

The gojomo#11 diff only shows commits from develop because you haven't merged it yet – the 38 commits are new to your PR, but not to develop. Github will not show you already-existing commits once merged (I hope). One of the reasons to avoid rebasing.

@gojomo
Copy link
Collaborator Author

gojomo commented Sep 8, 2020

It's not so much "private" as both "focused" and "amenable to experimentation". And why on earth should a PR with the title "2Vec SaveLoad Improvements" get unrelated py2-to-py3 housekeeping edit mixed in?

I don't quite understand the reference to "blocking" 4.0 either. I haven't heard a target release date; I haven't heard a target set of features. I've not noticed any answers to questions I keep asking about whether it's OK to start dropping before-gensim-3.8.3 backward-compatibility.

Rather than declaring this little low-priority, nothing-functional-is-broken PR as somehow "critical path" for a release, and sending what are, in my opinion, superficial & unrelated changes into it that I should attend to before proceeding with its original goals, it would have been more efficient just to ignore it until/unless it became ready. Do other touchup/polish elsewhere. If such changes land in develop & then I have to resolve the conflicts to get this PR mergeable, fine – that's absolutely expected as part of having a separate exploratory branch. But the purposeful branch would have remained focused, & in an understood-by-me state, that I only had to compare with the trunk - not some other stream of inserted priorities. (To my mind, allowing such independent progress, only reconciled if/when needed, is nearly the whole point of the git branching model.)

Or alternatively, if your review (and/or my report) deemed this code as both ready & urgent for merging - just merge it. Don't complicate it with a PR-on-a-PR process, or defocus with other layers of changes that have to be reconciled-then-reconciled. I'd rather have to catch up to a trunk that's changed, for good reason, only when I think my changes are ready - rather than be put on the spot to both accept side-changes promptly before resuming my own priorites, and still not fall further behind any trunk changes. Efficient batching of attention/effort is really crucial.

I'll try merging your PR to see what Github's interfaces do, but if it winds up hiding my ability to see just the narrow intended changes of this branch/PR, I'm going to roll-back to my original plan.

@gojomo gojomo force-pushed the 2vec_saveload_fixes branch from b5794ee to 02ebe60 Compare September 8, 2020 22:53
@gojomo
Copy link
Collaborator Author

gojomo commented Sep 8, 2020

Tried the Github default 'squash & merge' (as the option to 'retain all commits' was "not enabled for this repository", & 'rebase & merge' seemed worse). Alas, that left this PR/branch appearing to have changed all 65 files with all the changes of the ~40 commits in Github's 'files changed' interface -- which would make this PR/branch useless. Github's 'revert' button appeared to create only an additive undo PR. So I force-pushed my working branch (gojomo:2vec_saveload_fixes) back to match my local work. Now I see no way to re-open the auto-closed-upon-merge gojomo#11.

So I suggest you ignore the gojomo#11 to here (#2892) to develop pathway, and instead commit to develop whatever you think belongs there - including anything from here, without reservations. (I'll happily adapt any other work-in-progress whatever's on develop trunk.)

Regarding the other potential-next-steps items listed at #2892 (comment) - I wasn't in-progress or imminently expecting to address any of them, so feel free to tackle any of them. (I suspect the Word2Vec model bloat is something very simple, like a single stray unintended property duplication.) To the extent I have some time in the next few days, I'd currently be more ready to look into the loss-reporting issues (#2922 & linked concerns).

@piskvorky
Copy link
Owner

piskvorky commented Sep 8, 2020

Github's "squash" = rebase, AFAIU. Anything but plain merge is bound to create trouble on collaborative PRs.

The outstanding tickets & PRs for 4.0.0 are marked in the 4.0.0 milestone. That's what I'm working off of – @gojomo @mpenkov if you see anything else to be included in 4.0.0, please add it there.

@piskvorky
Copy link
Owner

Continued in #2939, closing here.

@piskvorky piskvorky closed this Sep 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants