Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further focus/slim keyedvectors.py module #2873

Open
gojomo opened this issue Jul 6, 2020 · 8 comments
Open

Further focus/slim keyedvectors.py module #2873

gojomo opened this issue Jul 6, 2020 · 8 comments
Assignees

Comments

@gojomo
Copy link
Collaborator

gojomo commented Jul 6, 2020

Pre-#2698, keyedvectors.py was 2500+ lines, including functionality over-specific to other models, & redundant classes. Post-#2698, with some added generic functionality, it's still over 1800 lines.

It should shed some other grab-bag utility functions that have accumulated, & don't logically fit inside the KeyedVectors class.

In particular, the evaluation (analogies, word_ranks) helpers could move to their own module that takes a KV instance as an argument. (If other more-sophisticated evaluations can be contributed, as would be welcome, they should also live alongside those, rather than bloating KeyedVectors.)

The get_keras_embedding method, as its utilit is narrow to very specific uses, and is conditional on a not-necessarily install package, could go elsewhere too – either a kera-focused utilities module, or even just documentation/example code about how to convert to/from keras from `KeyedVectors.

Some of the more advanced word-vector-using calculations, like 'Word Mover's Distance' or 'Soft Cosine SImilarity', could move to method-specific modules that are then better documented/self-contained/optimized, without bloating the generic 'set of vectors' module. (They might be more discoverable, there, as well.)

And finally, some of the existing calculations could be unified/streamlined (especially the two variants of most_similar(), and some of the steps shared by multiple operations). My hope would be the module is eventually <1000 lines.

@piskvorky
Copy link
Owner

@gojomo do you see this as essential for 4.0.0 = API breaking?

Or can we leave it for a later release?

@gojomo
Copy link
Collaborator Author

gojomo commented Oct 6, 2020

This is low-risk and not-hard, but also low-priority.

Making the decision that some things should be relocated would be nicer to do in 4.0.0, along with other "update your imports/code/function-names" changes, but could wait.

@piskvorky
Copy link
Owner

@mpenkov do you feel like taking this up, reshuffling / organizing KeyedVectors?

I already resolved the get_keras_embedding part in #2937.

@gojomo
Copy link
Collaborator Author

gojomo commented Oct 6, 2020

My 1st thoughts would be:

  • move evaluation code to its own module
  • move WMD calcs to its own module (even if a passthrough method remains here indefinitely); the options/optimizations for WMD might improve more beneficially in its own space (and evolve in parallel with other extensions like 'soft cosine')

Next but lower-priority, some of the other utility methods may be amenable to more reuse. (Perhaps, the cosmul as an option to most_similar, or some methods being redefined in terms of fewer more-central operations.)

Finally, after everything else has settled, the methods should be reordered by importance & grouped by role, so the autogenerated documentation has the most-used stuff up top, and a casual top-to-bottom read makes more sense.

@mpenkov
Copy link
Collaborator

mpenkov commented Oct 17, 2020

@mpenkov do you feel like taking this up, reshuffling / organizing KeyedVectors?

@piskvorky Sorry I missed this. I'm in the middle of a house move, so I'd rather not get involved until things have settled down.

If this can wait a couple of weeks, then I'd be happy to pick it up then. It looks like my sort of thing, and I've done it a couple of times with gensim already.

@piskvorky
Copy link
Owner

piskvorky commented Oct 17, 2020

If this can wait a couple of weeks

Yes, it can. In order of urgency:

  1. Release 4.0.0beta – hopefully this coming week (only TODO left are release/migration docs = my task). I'd like you to do the release process though (or together).
  2. A few weeks to tie up the other loose ends in https://github.com/RaRe-Technologies/gensim/milestone/3 – hopefully including this one.
  3. Release 4.0.0rc1 / full 4.0.0.

@mpenkov
Copy link
Collaborator

mpenkov commented Feb 27, 2021

How about moving the high-level methods from keyedvectors.py to a separate wordtasks.py submodule? They could be pure functions there. For example:

  • keyedvectors.KeyedVectors.most_similar(self, ...) -> wordtasks.most_similar(model, ...)
  • keyedvectors.KeyedVectors.similar_by_word(self, ...) -> wordtasks.similar_by_word(model, ...)
  • keyedvectors.KeyedVectors.similar_by_key -> and so on...
  • keyedvectors.KeyedVectors.similar_by_vector
  • keyedvectors.KeyedVectors.wmdistance
  • keyedvectors.KeyedVectors.most_similar_cosmul
  • keyedvectors.KeyedVectors.rank_by_centrality
  • keyedvectors.KeyedVectors.doesnt_match
  • keyedvectors.KeyedVectors.cosine_similarities
  • keyedvectors.KeyedVectors.distances
  • keyedvectors.KeyedVectors.evaluate_word_pairs
  • keyedvectors.KeyedVectors.evaluate_word_analogies

All the above operations do not modify the keyedvectors model, they are read-only.

This would leave the lower-level IO stuff in keyedvectors.py, so serialization (loading/saving) shouldn't be affected, as far as I understand.

The name wordtasks comes from the keyedvectors docstring:

What can I do with word vectors? You can perform various syntactic/semantic NLP word tasks with the trained vectors.

WDYT?

@mpenkov
Copy link
Collaborator

mpenkov commented Mar 9, 2021

@gojomo Let's remove this from the 4.0 milestone and deal with it later

@piskvorky piskvorky removed this from the 4.0.0 milestone Mar 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants