[MRG] Poincare model keyedvectors #1700

Merged
merged 188 commits into from
Dec 4, 2017

jayantj
Contributor

@jayantj jayantj commented Nov 8, 2017

This branch includes the changes to KeyedVectors and PoincareKeyedVectors for the PoincareModel.

Includes commits from #1696; to be reviewed and merged after #1696.

TODOs:

  • Evaluating Gensim implementation and adding results to notebook
  • Adding KeyedVector methods like most_similar to PoincareKeyedVectors
  • Refactoring KeyedVectors into KeyedVectorsBase and EuclideanKeyedVectors for cleaner class hierarchy
  • Tests for PoincareKeyedVectors

@jayantj
Contributor Author

jayantj commented Nov 22, 2017

Nice catch for the integer division, fixed.

@jayantj
Contributor Author

jayantj commented Nov 22, 2017

Short summary and rationale for the changes to KeyedVectors in this PR:

  1. Refactoring of the previously existing KeyedVectors class into KeyedVectorsBase and EuclideanKeyedVectors. For backwards compatibility, the keyedvectors module contains an alias, KeyedVectors, which points to EuclideanKeyedVectors. It may be a good idea to rename KeyedVectors elsewhere in the codebase to EuclideanKeyedVectors.

  2. KeyedVectorsBase is simply a collection of vectors and associated labels. It supports the following existing methods:

    • __getitem__
    • __contains__
    • word_vec
    • similarity
    • most_similar_to_given
    • load_word2vec_format/save_word2vec_format

      Along with the following new methods:

    • distance
    • distances
    • words_closer_than
    • rank

      Note that the KeyedVectorsBase class does not provide definitions for the distance, distances and similarity methods; it is up to the child class to define them as appropriate.

  3. EuclideanKeyedVectors is derived from KeyedVectorsBase and contains functionality that is only relevant/meaningful for vectors in Euclidean space. Both word2vec and fasttext vectors fall into this category.

  4. PoincareKeyedVectors is a new class that holds the hyperbolic-space vectors of the Poincare model and supports operations specific to those vectors.

  5. The most_similar method conceptually makes sense for both PoincareKeyedVectors and EuclideanKeyedVectors. However, the existing most_similar API could not be supported in PoincareKeyedVectors, so PoincareKeyedVectors provides its own implementation of most_similar with a different API. For this reason, most_similar is not present in the KeyedVectorsBase class.

The PR also adds missing tests for some of the older KeyedVectors methods.
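
For readers who want a concrete picture of the hierarchy summarized above, here is a minimal, self-contained sketch. It is not the code from this PR: the constructor, the dict-based storage, the helper `_sq_norm`, and the choice of cosine distance for the Euclidean class are all assumptions made for illustration. Only the class names, the method names, and the idea that `distance`, `distances` and `similarity` are left to child classes come from the summary. The Poincare distance is the standard Poincare-ball formula.

```python
import math


def _sq_norm(u):
    # Squared Euclidean norm of a plain list of floats (hypothetical helper).
    return sum(a * a for a in u)


class KeyedVectorsBase:
    """Keys mapped to vectors; distance-like methods are left to child classes."""

    def __init__(self, vectors):
        # vectors: dict mapping a key (str or int) to a list of floats
        self.vectors = dict(vectors)

    def __getitem__(self, key):
        return self.vectors[key]

    def __contains__(self, key):
        return key in self.vectors

    def distance(self, w1, w2):
        raise NotImplementedError("defined by the child class")

    def distances(self, w1):
        raise NotImplementedError("defined by the child class")

    def words_closer_than(self, w1, w2):
        # Keys strictly closer to w1 than w2 is, excluding w1 itself.
        limit = self.distance(w1, w2)
        return [w for w, d in self.distances(w1).items() if w != w1 and d < limit]

    def rank(self, w1, w2):
        # 1-based rank of w2 among all keys ordered by distance from w1.
        return len(self.words_closer_than(w1, w2)) + 1


class EuclideanKeyedVectors(KeyedVectorsBase):
    def similarity(self, w1, w2):
        # Cosine similarity (an assumption for this sketch).
        u, v = self[w1], self[w2]
        dot = sum(a * b for a, b in zip(u, v))
        return dot / math.sqrt(_sq_norm(u) * _sq_norm(v))

    def distance(self, w1, w2):
        return 1.0 - self.similarity(w1, w2)

    def distances(self, w1):
        return {w: self.distance(w1, w) for w in self.vectors}


class PoincareKeyedVectors(KeyedVectorsBase):
    def distance(self, w1, w2):
        # Poincare-ball distance: arccosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))
        u, v = self[w1], self[w2]
        sq_diff = _sq_norm([a - b for a, b in zip(u, v)])
        return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - _sq_norm(u)) * (1.0 - _sq_norm(v))))

    def distances(self, w1):
        return {w: self.distance(w1, w) for w in self.vectors}


# Backwards-compatible alias, as described in point 1.
KeyedVectors = EuclideanKeyedVectors

kv = PoincareKeyedVectors({"root": [0.0, 0.0], "mammal": [0.5, 0.0], "dog": [0.9, 0.0]})
rank_of_dog = kv.rank("root", "dog")
```

Because `words_closer_than` and `rank` only call `distance` and `distances`, they work unchanged for either geometry, which is presumably the motivation for keeping them in the base class.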

@jayantj jayantj force-pushed the poincare_model_keyedvectors branch from 004b572 to 73ed696 Compare November 22, 2017 01:15
@jayantj jayantj changed the title [WIP] Poincare model keyedvectors [MRG] Poincare model keyedvectors Nov 22, 2017
Contributor

@menshikh-iv menshikh-iv left a comment

In general - very nice work 💣 🔥
Just a few questions (and I'll fix docstrings / PEP8 after your commits).

@@ -0,0 +1,187 @@
#!/usr/bin/env python
Contributor

@menshikh-iv menshikh-iv Nov 22, 2017

I think it would be better to put this code in an IPython notebook (or maybe move it to docs/notebooks and import it from there).
gensim.models isn't a suitable place for this file.

Contributor Author

I agree gensim.models isn't a good place for it. I don't want to have it in an IPython notebook or docs/notebooks though, since that would mean a user can't import and use it, and I think it is definitely useful for a user. Do you think creating a new package poincare in gensim/ would be a good idea? Other models do this too (e.g. topic_coherence).

Contributor

@menshikh-iv menshikh-iv Nov 23, 2017

I don't think this is a very good idea. After refactoring, topic_coherence will be moved/renamed deeper inside gensim.modules (it contains only internal/secondary functions, not public API), whereas in your case the API is public.

I agree about the imports: it's really non-trivial to import from docs/notebooks if you are in /randomfolder. I think it is only possible manually, with importlib.

We have many viz helpers (produced by @parulsethi during GSoC), and Parul is now working on a very nice viz for topic models. Potentially, we could create a separate repository (like gensim-data) and move all the viz helpers there, or, as you suggest, create a submodule gensim.viz and move all the viz stuff into it (not only your Poincare viz).

It's a hard question; I don't know which is better right now.

WDYT @piskvorky @janpom @parulsethi?
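
To illustrate why importing from an arbitrary folder is awkward, the following sketch shows what the manual importlib route could look like. The file name `poincare_viz_helper.py`, its contents, and the function `load_module_from_path` are hypothetical; a temporary directory stands in for docs/notebooks so that the example runs anywhere.

```python
import importlib.util
import os
import tempfile


def load_module_from_path(name, path):
    """Load a Python source file as a module, regardless of the current directory."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Stand-in for a helper script living under docs/notebooks (hypothetical content).
with tempfile.TemporaryDirectory() as tmp_dir:
    helper_path = os.path.join(tmp_dir, "poincare_viz_helper.py")
    with open(helper_path, "w") as handle:
        handle.write("def describe():\n    return 'loaded by path'\n")
    helper = load_module_from_path("poincare_viz_helper", helper_path)
    message = helper.describe()
```

Every user would need boilerplate like this plus the correct path to the file, which is the usability cost being weighed against placing the code in an importable package.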

Contributor Author

A submodule gensim.viz sounds good to me, keeping in mind we might have future visualizations too. I don't have a good enough perspective on this though, so whatever you decide is okay with me.

Contributor Author

@jayantj jayantj Nov 26, 2017

Hi @menshikh-iv, so is it okay if I create a gensim.viz package for this and any future gensim visualizations, and move the Poincare visualization there?

Contributor

A gensim.viz submodule would also be useful for #1616 in the future. The long code blocks for the network graph/dendrogram could also be wrapped in a function under this module, so that those visualizations can be produced simply by importing it.

Contributor

@jayantj sounds good.

Contributor Author

Great! Thanks for the feedback @parulsethi @menshikh-iv
Pushed changes with a new gensim.viz package.

@@ -514,6 +516,8 @@ def train(self, epochs, batch_size=10, print_every=1000, check_gradients_every=N
         """
         if self.workers > 1:
             raise NotImplementedError("Multi-threaded version not implemented yet")
         # Some divide-by-zero results are handled explicitly
         old_settings = np.seterr(divide='ignore', invalid='ignore')
Contributor

Why? Do you mean that division by zero is expected and you handle that situation in the code?

Contributor Author

Yeah, it happens in PoincareBatch. I'm setting it here to avoid repeated calls to np.seterr.
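
The pattern discussed in this thread can be sketched as follows. This is an illustration only: `train_sketch` and its batch format are hypothetical stand-ins, and restoring the saved settings in a `finally` block is an assumption about how `old_settings` would be used, not a description of the PR's code.

```python
import numpy as np


def train_sketch(batches):
    """Hypothetical training loop that tolerates x/0 and 0/0 inside batches."""
    # Silence the warnings once, up front, instead of once per batch.
    old_settings = np.seterr(divide='ignore', invalid='ignore')
    try:
        results = []
        for numerators, denominators in batches:
            ratios = numerators / denominators
            # Handle the divide-by-zero results explicitly.
            ratios[~np.isfinite(ratios)] = 0.0
            results.append(ratios)
        return results
    finally:
        # Restore whatever error handling the caller had configured.
        np.seterr(**old_settings)


outputs = train_sketch([
    (np.array([1.0, 4.0]), np.array([2.0, 0.0])),
    (np.array([0.0, 3.0]), np.array([0.0, 3.0])),
])
```

NumPy also offers `np.errstate` as a context manager for the same purpose; either way the settings change is scoped to the training call rather than repeated per batch.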


Parameters
----------
node : str
Contributor

Is this always str, or is int possible too? This is a more general question, about all methods that take a node argument.

Contributor Author

It could be an int too in theory, depending on what the vocab keys are. The most common case is str though. How would you prefer to handle this?

Contributor

Maybe "str or int" everywhere?

Contributor Author

Sounds good

Contributor Author

Done. Also made some other changes to docstrings for more clarity.
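
As an illustration of the convention agreed on here, a numpy-style docstring with the `str or int` type might look like the sketch below. The function `node_index` and the dict-based `vocab` are hypothetical and are not taken from the PR.

```python
def node_index(vocab, node):
    """Return the row index of a node.

    Parameters
    ----------
    vocab : dict
        Mapping from node key to row index (hypothetical structure).
    node : str or int
        Key for the node; the type matches whatever the vocab keys are.

    Returns
    -------
    int
        Row index of `node`.

    """
    if node not in vocab:
        raise KeyError("node %r not in vocabulary" % (node,))
    return vocab[node]


example_vocab = {"dog.n.01": 0, 7: 1}
```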

@jayantj
Contributor Author

jayantj commented Nov 23, 2017

Some completely unrelated tests seem to be failing on Travis (test_translation_matrix, test_lda_model). Not sure what that is about.

@piskvorky
Owner

@menshikh-iv we need this resolved & finished -- can you have a look? Cheers.

@menshikh-iv menshikh-iv merged commit 1ac5a26 into poincare Dec 4, 2017
@jayantj jayantj mentioned this pull request Dec 4, 2017
@menshikh-iv menshikh-iv deleted the poincare_model_keyedvectors branch July 5, 2018 17:03
5 participants