[MRG] Add poincare vectors, tests and evaluation (#1700)
* Initial classes and loading data for poincare model
* Initial implementation of training using autograd
* faster negative sampling, bugfix in vector updates
* allows poincare dist function to be differentiable by autograd
* batched gradient descent initial implementation
* minor changes to batch poincare distance computation
* Adds calculation of gradients for poincare model
* Correct implementation of clipping of updated vectors
* Fixes error in gradient computation
* Better messages while training
* Renames PoincareDistance to PoincareExample for clarity
* Compares computed gradients to autograd gradients every few iterations
* Avoids doing some numpy computations twice
* Avoids creating copies of numpy vectors
* Only calls nan_to_num when gamma has at least one value equal to 1
* Simply sets nan gradients to zero instead of nan_to_num
* Adds batch-wise implementation of training and gradient computations
* Minor correction in clipping
* Fixes typo in clip_vectors
* Prints average loss every few iterations instead of current loss
* Adds weighted negative sampling
* Ensures positive edges are not returned by negative sampling
* Poincare model stores node indices in relations instead of node keys
* Minor renaming; uses node indices for batch training instead of node keys
* Changes shapes of vectors passed to PoincareBatch
* Minor bugfixes related to batch size
* Corrects implementation of negative sampling for batch training
* Adds option to check gradients in batchwise training
* Checks gradients only every few iterations
* Handles multiple occurrence of same node across and within batches
* Removes unused section of code
* Implements slightly different clipping method
* Fixes bugs with wrong reshape in batchwise training
* Example-wise training takes into account multiple occurrences of same node in an example too
* Batchwise training prints average loss over many iterations instead of current batch
* Fixes bug in updating vector for batchwise training
* Faster implementation of negative sampling
* Negative sampling for a node follows different paths depending on fraction of positive relations
* Uses a buffer for negative samples to reduce calls to np.random.choice
* Cleans up poincare.py, removes unused code
* Adds shapes to PoincareBatch, more documentation
* Adds more documentation to PoincareModel
* Stores indices for nodes in a batch in PoincareBatch for better encapsulation
* More documentation for poincare module
* Implements burn-in for poincare model
* Slightly better logging for poincare model
* Uses np.random.random and np.searchsorted for random sampling rather than np.random.choice
* Removes duplicates in negative samples
* Moves helper classes in poincare after PoincareModel
* Change in PoincareModel API to allow initializing from an iterable, separate class for streaming from file
* Adds failing test for handling encoding in PoincareData
* Fixes encoding handling in PoincareData
* Adds docstrings to PoincareData, PoincareData streams tuples now
* More unittests for PoincareModel
* Changes handle_duplicates to staticmethod, adds test
* Adds batch size and print_every parameters to train method
* Renames print_check to should_print
* Adds separate parameter for checking gradients
* Minor fixes for coding style
* Removes default values from docstrings, redundant
* Adds example to PoincareModel init docstring
* Extracts buffer for negatives out into a separate class
* More detailed logging, fix to check_gradients
* Minor fixes to documentation in poincare.py
* Adds support for most_similar to PoincareKeyedVectors
* Refactors most_similar and loss_fn to use PoincareKeyedVectors.poincare_dists
* Adds tests for gradients checking
* Raise AssertionError if gradients check fails
* Adds failing tests for saving/loading PoincareModel instances
* Fixes bug with saving/loading PoincareModel to disk
* Adds test and fix for raising error on invalid input data
* Adds test and fix for no duplicates and positives in negative sample
* Bugfix with NegativesBuffer having less than items left
* Uses larger data for poincare tests, adds data files
* Bugfix with incorrect use of random state
* Minor fixes in documentation style
* Renames PoincareData to PoincareRelations
* Change in the order of conditions checked before resampling
* Imports datapath from test.utils instead of defining own
* Adds working examples and a more detailed description in docstring
* Renames term_relations to node_relations
* Removes unused imports
* Moves iter parameter to train instead of __init__, renames to epochs
* Fixes term_relations in tests
* Adds option to disable gradient check, disabled by default
* Extracts gradient checking code into a separate method
* Conditionally import autograd only if gradient checking is enabled
* Marks private methods in poincare module with leading underscore
* Adds init_range as an API parameter to PoincareModel
* Marks private properties with a leading underscore
* Fixes bug with burn-in happening on subsequent calls to train
* Adds test for training multiple times
* Adds autograd to test dependencies
* Renames wv to kv in PoincareModel
* add numpy==1.12 as test dependency
* add missing quote
* Moves methods for evaluating poincare embeddings to poincare.py
* Updates docstrings for newly added classes
* Moves trie-related methods to LexicalEntailmentEvaluation
* Moves code for loading PoincareEmbedding into notebook
* Removes PoincareEmbedding class, adds functionality to PoincareKeyedVectors
* Updates eval nb with code and evaluation results for gensim models
* Minor documentation updates + bugfix in distance
* Adds methods for rank and nodes_closer_than to PoincareKeyedVectors
* Adds methods to return closest child, parent, and ancestor and descendant chain for an input node
* Updates LE and reconstruction results for gensim models in eval nb
* Adds notebook detailing Poincare embedding operations and report
* Adds images for poincare embedding report
* Updates image links in poincare report nb
* try to run tests without autograd
* fix PEP8 in poincare.py
* fix PEP8 in test_poincare
* PoincareRelations handles python2 correctly
* Bugfix with int division for python2
* Imports mock module for tests correctly in python2
* Cleaner implementation of __iter__ for PoincareRelations
* Adds rst file and updates apiref.rst for poincare module
* Adds clarifying comment to PoincareRelations.__iter__
* Adds functions for visualization to poincare_visualization.py
* Suppresses certain numpy warnings while training model
* Updates rst file for poincare
* Updates poincare report nb with reduced code, section on training, better visualization labels and titles
* Renames hypernym pair to relations everywhere
* Simpler way of detecting duplicates
* Minor documentation updates in poincare.py
* Skips gradients test if autograd not installed, adds test for bytes input data
* Adds results of gensim models on link prediction to eval notebook
* Adds link prediction results to report, more information about training
* Adds further details to concept and motivation sections, section on future work, and images
* Fix flake8 (noqa + remove unused var)
* Fix missing mock dependency for win
* Fix links in docstrings
* Refactors KeyedVectors into KeyedVectorsBase and EuclideanKeyedVectors
* Changes error message for negative sampling failing
* Adds option to specify dtype for PoincareModel and corresponding unittest
* Extends test for dtype to check after training, updates docstring
* Adds tests for new methods in PoincareKeyedVectors
* Fixes bug in closest_child implementation
* Adds similarity and distance to KeyedVectorsBase interface, implementation and tests for similarity for PoincareKeyedVectors
* Minor fixes to Poincare report notebook
* Adds method to compute all distances to KeyedVectorsBase, moves most_similar from EuclideanKeyedVectors to KeyedVectorsBase
* Allows PoincareKeyedVectors.distances to accept an optional list of words
* Adds implementation of PoincareKeyedVectors.similarities and tests
* Adds restrict_vocab option to most_similar and tests for EuclideanKeyedVectors.most_similar
* Adds docstring for tests
* Adds implementation of EuclideanKeyedVectors.distances and tests, updates docstrings
* Moves most_similar_to_given to KeyedVectorsBase, adds tests
* Moves similar_by_vector and similar_by_word to KeyedVectorsBase, adds tests
* Adds failing tests for similar_by_word and similar_by_vector to PoincareKeyedVector tests
* Moves multiple methods out of KeyedVectorsBase back to EuclideanKeyedVectors, removes tests
* Adds test for most_similar with vector input for EuclideanKeyedVectors
* Adds failing test for vector input for most_similar for PoincareKeyedVectors
* Allows passing in vector input to most_similar and distances methods in PoincareKeyedVectors
* Removes precompute_max_distance and uses simpler formula for similarity in PoincareKeyedVectors
* Renames PoincareKeyedVectors.poincare_dists to PoincareKeyedVectors.poincare_distance_batch
* Fixes error with unclosed file in PoincareRelations
* Adds tests and method for computing poincare distance between two input vectors
* Adds methods and tests for finding position and difference in hierarchical positions of input vectors
* Fixes unused import, pep8 and docstring issues
* More intuitive naming of arguments for methods in PoincareKeyedVectors
* Uses w1 and w2 consistently across KeyedVectors methods
* Removes most_similar from KeyedVectorsBase
* Adds failing tests for words_closer_than and rank for EuclideanKeyedVectors and PoincareKeyedVectors
* Adds distances method to KeyedVectorsBase and EuclideanKeyedVectors, fixes tests
* Makes default argument for distances immutable
* Uses conditional import for pygtrie in LexicalEntailmentEvaluation
* Renames position_in_hierarchy to norm with minor change in behaviour, updates tests
* Renames poincare_distance and poincare_distance_batch to vector_distance and vector_distance_batch
* Forces float division for positive_fraction in _sample_negatives
* Removes unused method from PoincareKeyedVectors
* Updates report notebook with usage examples of new API methods
* Minor pep8 fix
* Fixes pep8 issues, unused imports and typo
* Adds example of saving and loading model to notebook
* Updates docstrings in poincare.py
* Moves poincare visualization methods to new gensim.viz module
* Updates rst files for poincare viz
* Adds newline at the end of poincare.py in viz package
* Adds link to original paper to poincare notebook
* fix viz.poincare & update docs dependencies
* add link to init file
* fix PEP8
* fixes for poincare.py