[MRG] Poincare Model implementation #1696

Merged
merged 116 commits into from
Nov 15, 2017

Conversation

jayantj (Contributor) commented Nov 6, 2017

Pure Python implementation of the Poincare model from [1].

TODO -

  • Unit tests
  • API conformity
  • More logging

Follow up PR: #1700

[1] Poincaré Embeddings for Learning Hierarchical Representations

Whether the input array contains any duplicates.

"""
seen = set()
Contributor

`return len(array) != len(set(array))` is simpler. Probably not even worth adding a method.

Contributor Author

Done
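
For reference, the one-liner suggested in this thread can be sketched as follows. The function name `has_duplicates` is a placeholder chosen for illustration, and the approach assumes every item in `array` is hashable.

```python
def has_duplicates(array):
    """Return True if `array` contains any repeated item.

    Assumes the items are hashable, since they are placed in a set.
    """
    # A set drops repeated items, so a length mismatch means duplicates.
    return len(array) != len(set(array))
```

Unlike a loop over a `seen` set, this version cannot exit early on the first duplicate, but it is shorter and usually fast enough for small inputs.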


Parameters
----------
train_data : iterable of (str, str)
Contributor

str is ambiguous for Python 2 vs 3. Better to say unicode or bytes instead.

Contributor Author

Both unicode and bytes are allowed here. Wherever this is true, I've used str; wherever a specific type is required or returned, I've used unicode/bytes. Does that sound okay?

Contributor

Yes, perfect. Are you sure it works correctly with bytes, though? I suppose if we train on bytes we'll end up with a bytes based model. I wonder if that's common for other gensim models. Won't that cause unexpected behavior with some KeyedVectors calls?
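
The concern about bytes-based models can be illustrated with a small sketch. It assumes Python 3, where `b"mammal"` and `u"mammal"` are different keys, and the helper `to_unicode` is a hypothetical stand-in written for this example, not a function from this PR.

```python
def to_unicode(text, encoding="utf8"):
    """Hypothetical helper: decode bytes, pass unicode strings through."""
    if isinstance(text, bytes):
        return text.decode(encoding)
    return text


# Training data that mixes bytes and unicode node names.
relations = [(b"dog", b"mammal"), (u"cat", u"mammal")]

# Without normalization, Python 3 treats b"mammal" and u"mammal" as two nodes.
mixed_vocab = {node for relation in relations for node in relation}

# Normalizing every node to unicode collapses them into one node.
normalized_vocab = {to_unicode(node) for relation in relations for node in relation}
```

Under these assumptions, a lookup with a unicode key such as `u"dog"` would miss in `mixed_vocab`, which is the kind of surprise the question above is pointing at.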

node_relations = defaultdict(set) # Mapping from node index to its related node indices

logger.info("Loading relations from train data..")
for hypernym_pair in self.train_data:
Contributor

Please rename hypernym_pair to something more generic such as relation.

Contributor Author

Done
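
A minimal sketch of the loop under review, using the generic name `relation`, might look like the following. The function name `build_node_relations` and the `indices` dictionary are assumptions made for this example; the actual method in the PR may be structured differently.

```python
from collections import defaultdict


def build_node_relations(train_data):
    """Map each node to an integer index and collect its related node indices."""
    indices = {}  # node name -> integer index, assigned in order of first appearance
    node_relations = defaultdict(set)  # node index -> set of related node indices
    for relation in train_data:
        if len(relation) != 2:
            raise ValueError("Relation %r should have exactly two items" % (relation,))
        for node in relation:
            indices.setdefault(node, len(indices))
        node_1, node_2 = relation
        node_relations[indices[node_1]].add(indices[node_2])
    return indices, node_relations
```

Nothing in this loop depends on the pairs being hypernyms, which is the motivation for the more generic variable name.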

Vectors of all nodes `u` in the batch.
Expected shape (batch_size, dim).
vectors_v : numpy.array
Vectors of all hypernym nodes `v` and negatively sampled nodes `v'`,
Contributor

just "nodes"

Contributor Author

Done
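
As background for the `u` and `v` vectors in this docstring, the distance that [1] defines between two points of the Poincaré ball can be sketched in plain Python. This is a single-pair illustration, not the batched NumPy code from the PR, and the name `poincare_distance` is a placeholder.

```python
import math


def poincare_distance(u, v):
    """Poincare distance between two points strictly inside the unit ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    # Both points must have norm < 1, otherwise the denominator is not positive.
    return math.acosh(1 + 2 * sq_dist / ((1 - sq_norm_u) * (1 - sq_norm_v)))
```

The formula treats `u` and `v` symmetrically, so it applies equally to related nodes and to negatively sampled ones.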



class PoincareRelations(object):
"""Class to stream hypernym relations for `PoincareModel` from a tsv-like file."""
Contributor

just "relations", here and elsewhere

Contributor Author

Done

"""Class to stream hypernym relations for `PoincareModel` from a tsv-like file."""

def __init__(self, file_path, encoding='utf8', delimiter='\t'):
"""Initialize instance from file containing one hypernym pair per line.
Contributor

hypernym pair -> relation (here and elsewhere)

Contributor Author

Done
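
To make the "one relation per line" format concrete, here is a rough sketch of a streaming class with the same constructor arguments as the signature quoted above. The class name `RelationStream` is hypothetical, the code assumes Python 3, and it is not the `PoincareRelations` implementation from this PR.

```python
import csv
import io


class RelationStream(object):
    """Stream relations from a tsv-like file, one relation per line."""

    def __init__(self, file_path, encoding='utf8', delimiter='\t'):
        self.file_path = file_path
        self.encoding = encoding
        self.delimiter = delimiter

    def __iter__(self):
        # Re-open the file on every iteration so the stream can be reused.
        with io.open(self.file_path, encoding=self.encoding, newline='') as handle:
            for row in csv.reader(handle, delimiter=self.delimiter):
                if row:  # skip blank lines
                    yield tuple(row)
```

Because `__iter__` re-opens the file, the same object can be iterated more than once, which matters if training makes several passes over the data.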

jayantj (Contributor Author) commented Nov 13, 2017

I've added the rst files and fixed some Python 2 bugs. The only failing test now is the one that requires autograd, because autograd is missing from the test dependencies. With autograd added to the test dependencies, the build fails instead (due to some MKL error, as you mentioned).

menshikh-iv (Contributor)

@jayantj maybe remove this test (because we can't run it correctly in CI)?

def __init__(
self, train_data, size=50, alpha=0.1, negative=10, workers=1,
epsilon=1e-5, burn_in=10, burn_in_alpha=0.01, init_range=(-0.001, 0.001), seed=0):
"""Initialize and train a Poincare embedding model from an iterable of transitive closure relations.
Contributor

Is the transitive closure a requirement? If not, let's just say "iterable of relations".

Contributor Author

Fixed.
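
For readers unfamiliar with the term, the transitive closure of a set of relations adds every pair that follows by chaining existing pairs. The sketch below is a naive illustration only; the function name is a placeholder, and this thread does not say whether the model needs closed input.

```python
def transitive_closure(relations):
    """Return the relations plus every pair implied by chaining them."""
    closure = set(relations)
    while True:
        # (a, b) and (b, d) together imply (a, d).
        new_pairs = {(a, d) for a, b in closure for c, d in closure if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs
```

For example, `("dog", "mammal")` and `("mammal", "animal")` together imply `("dog", "animal")`.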

jayantj (Contributor Author) commented Nov 14, 2017

@menshikh-iv I've instead added a skiptest in case autograd is not installed, that way we can continue to check if the test runs locally, making development easier. Does that seem okay?
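
A skip of this kind could be sketched roughly as below. The test class, test name, and the `AUTOGRAD_AVAILABLE` flag are assumptions for illustration (the actual test in the PR may differ), and `importlib.util.find_spec` requires Python 3.

```python
import importlib.util
import unittest

# True only when the optional autograd dependency can be imported.
AUTOGRAD_AVAILABLE = importlib.util.find_spec("autograd") is not None


class TestGradients(unittest.TestCase):  # hypothetical test case
    @unittest.skipUnless(AUTOGRAD_AVAILABLE, "autograd is not installed")
    def test_gradient_matches_autograd(self):
        # Imported inside the test so the module loads without autograd.
        import autograd.numpy as anp
        from autograd import grad

        def squared_norm(x):
            return anp.sum(x ** 2)

        # The gradient of sum(x^2) is 2 * x.
        gradient = grad(squared_norm)(anp.array([1.0, 2.0]))
        self.assertTrue(anp.allclose(gradient, [2.0, 4.0]))
```

With this pattern the test is reported as skipped in CI, where autograd is absent, and still runs on a developer machine that has it installed.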

@menshikh-iv menshikh-iv merged commit 0ae0f96 into poincare Nov 15, 2017
@jayantj jayantj mentioned this pull request Dec 4, 2017
@menshikh-iv menshikh-iv deleted the poincare_model branch July 5, 2018 17:03
3 participants