[MRG] Poincare model keyedvectors #1700

Merged
merged 188 commits into from
Dec 4, 2017
Changes from 163 commits
6afdd22
Initial classes and loading data for poincare model
jayantj Oct 23, 2017
a804006
Initial implementation of training using autograd
jayantj Oct 23, 2017
6bd0d4b
faster negative sampling, bugfix in vector updates
jayantj Oct 25, 2017
98f94a7
allows poincare dist function to be differentiable by autograd
jayantj Oct 25, 2017
b727523
batched gradient descent initial implementation
jayantj Oct 25, 2017
1e6aee1
minor changes to batch poincare distance computation
jayantj Oct 25, 2017
e286a0b
Adds calculation of gradients for poincare model
jayantj Oct 26, 2017
3e28e8b
Correct implementation of clipping of updated vectors
jayantj Oct 26, 2017
99a2270
Fixes error in gradient computation
jayantj Oct 26, 2017
2e9e31c
Better messages while training
jayantj Oct 26, 2017
d72cb10
Renames PoincareDistance to PoincareExample for clarity
jayantj Oct 27, 2017
d439501
Compares computed gradients to autograd gradients every few iterations
jayantj Oct 27, 2017
e1ed24d
Avoids doing some numpy computations twice
jayantj Oct 27, 2017
3b2a383
Avoids creating copies of numpy vectors
jayantj Oct 27, 2017
7d68aae
Only calls nan_to_num when gamma has at least one value equal to 1
jayantj Oct 27, 2017
ba82d42
Simply sets nan gradients to zero instead of nan_to_num
jayantj Oct 27, 2017
71f61d1
Adds batch-wise implementation of training and gradient computations
jayantj Oct 27, 2017
2a5a7fb
Minor correction in clipping
jayantj Oct 30, 2017
0c57aa1
Merge branch 'poincare' into poincare_model
jayantj Oct 30, 2017
9c51609
Fixes typo in clip_vectors
jayantj Oct 30, 2017
f22d9b2
Prints average loss every few iterations instead of current loss
jayantj Oct 31, 2017
7905c8c
Adds weighted negative sampling
jayantj Nov 2, 2017
075df25
Ensures positive edges are not returned by negative sampling
jayantj Nov 2, 2017
6060e56
Poincare model stores node indices in relations instead of node keys
jayantj Nov 2, 2017
8ea8f23
Minor renaming; uses node indices for batch training instead of node …
jayantj Nov 2, 2017
b8d77e3
Changes shapes of vectors passed to PoincareBatch
jayantj Nov 3, 2017
0011b93
Minor bugfixes related to batch size
jayantj Nov 3, 2017
b52ee2e
Corrects implementation of negative sampling for batch training
jayantj Nov 3, 2017
d247384
Adds option to check gradients in batchwise training
jayantj Nov 3, 2017
8c4f5a3
Checks gradients only every few iterations
jayantj Nov 3, 2017
34b0ad3
Handles multiple occurrence of same node across and within batches
jayantj Nov 3, 2017
1779cd7
Removes unused section of code
jayantj Nov 3, 2017
faacb43
Implements slightly different clipping method
jayantj Oct 31, 2017
c68088e
Fixes bugs with wrong reshape in batchwise training
jayantj Nov 3, 2017
0c2f2cb
Example-wise training takes into account multiple occurrences of same…
jayantj Nov 3, 2017
386f602
Batchwise training prints average loss over many iterations instead o…
jayantj Nov 3, 2017
f0fb9e9
Fixes bug in updating vector for batchwise training
jayantj Nov 3, 2017
7d8fbec
Faster implementation of negative sampling
jayantj Nov 3, 2017
315f95c
Negative sampling for a node follows different paths depending on fra…
jayantj Nov 6, 2017
0802dd5
Uses a buffer for negative samples to reduce calls to np.random.choice
jayantj Nov 6, 2017
a106191
Cleans up poincare.py, removes unused code
jayantj Nov 6, 2017
1aa586d
Adds shapes to PoincareBatch, more documentation
jayantj Nov 6, 2017
13b00dc
Adds more documentation to PoincareModel
jayantj Nov 6, 2017
5978af6
Stores indices for nodes in a batch in PoincareBatch for better encap…
jayantj Nov 6, 2017
e40c3e3
More documentation for poincare module
jayantj Nov 6, 2017
ec8b516
Implements burn-in for poincare model
jayantj Nov 6, 2017
86ae4d6
Slightly better logging for poincare model
jayantj Nov 6, 2017
ac51e9c
Uses np.random.random and np.searchsorted for random sampling rather …
jayantj Nov 7, 2017
5900c6f
Removes duplicates in negative samples
jayantj Nov 7, 2017
4ac4d2e
Moves helper classes in poincare after PoincareModel
jayantj Nov 7, 2017
9eb6f48
Change in PoincareModel API to allow initializing from an iterable, s…
jayantj Nov 7, 2017
2ded72b
Adds failing test for handling encoding in PoincareData
jayantj Nov 7, 2017
81960e1
Fixes encoding handling in PoincareData
jayantj Nov 7, 2017
5de194b
Adds docstrings to PoincareData, PoincareData streams tuples now
jayantj Nov 7, 2017
6dd6915
More unittests for PoincareModel
jayantj Nov 8, 2017
4b502af
Changes handle_duplicates to staticmethod, adds test
jayantj Nov 8, 2017
12be121
Adds batch size and print_every parameters to train method
jayantj Nov 8, 2017
29e799c
Renames print_check to should_print
jayantj Nov 8, 2017
b4ff1dd
Adds separate parameter for checking gradients
jayantj Nov 8, 2017
e2f72bc
Minor fixes for coding style
jayantj Nov 8, 2017
953b4a7
Removes default values from docstrings, redundant
jayantj Nov 8, 2017
eebc12a
Adds example to PoincareModel init docstring
jayantj Nov 8, 2017
21a1c82
Extracts buffer for negatives out into a separate class
jayantj Nov 8, 2017
f9325ea
More detailed logging, fix to check_gradients
jayantj Nov 8, 2017
5db8456
Minor fixes to documentation in poincare.py
jayantj Nov 8, 2017
e5c1a3b
Adds support for most_similar to PoincareKeyedVectors
jayantj Nov 8, 2017
c62da7a
Refactors most_similar and loss_fn to use PoincareKeyedVectors.poinca…
jayantj Nov 8, 2017
53030a0
Adds tests for gradients checking
jayantj Nov 8, 2017
db0d293
Raise AssertionError if gradients check fails
jayantj Nov 8, 2017
1adf81a
Adds failing tests for saving/loading PoincareModel instances
jayantj Nov 8, 2017
3898089
Fixes bug with saving/loading PoincareModel to disk
jayantj Nov 8, 2017
5cd913a
Adds test and fix for raising error on invalid input data
jayantj Nov 8, 2017
6305228
Adds test and fix for no duplicates and positives in negative sample
jayantj Nov 8, 2017
fb13eb5
Bugfix with NegativesBuffer having less than items left
jayantj Nov 8, 2017
ea2fd48
Uses larger data for poincare tests, adds data files
jayantj Nov 8, 2017
110fb1e
Bugfix with incorrect use of random state
jayantj Nov 8, 2017
0aeec2f
Minor fixes in documentation style
jayantj Nov 8, 2017
38feb7a
Renames PoincareData to PoincareRelations
jayantj Nov 9, 2017
0e7ebb3
Change in the order of conditions checked before resampling
jayantj Nov 9, 2017
52a1e57
Merge branch 'poincare' into poincare_model
jayantj Nov 9, 2017
630771d
Imports datapath from test.utils instead of defining own
jayantj Nov 9, 2017
7c6d972
Adds working examples and a more detailed description in docstring
jayantj Nov 9, 2017
16dcf0b
Renames term_relations to node_relations
jayantj Nov 9, 2017
3501d6f
Removes unused imports
jayantj Nov 9, 2017
d690a25
Moves iter parameter to train instead of __init__, renames to epochs
jayantj Nov 9, 2017
2383e82
Fixes term_relations in tests
jayantj Nov 9, 2017
3ed0bea
Adds option to disable gradient check, disabled by default
jayantj Nov 9, 2017
9f562cb
Extracts gradient checking code into a separate method
jayantj Nov 9, 2017
98e078d
Conditionally import autograd only if gradient checking is enabled
jayantj Nov 9, 2017
530146d
Marks private methods in poincare module with leading underscore
jayantj Nov 9, 2017
d17c075
Adds init_range as an API parameter to PoincareModel
jayantj Nov 9, 2017
be0249a
Marks private properties with a leading underscore
jayantj Nov 9, 2017
dc2ab95
Fixes bug with burn-in happening on subsequent calls to train
jayantj Nov 9, 2017
a306f20
Adds test for training multiple times
jayantj Nov 9, 2017
f9750e6
Adds autograd to test dependencies
jayantj Nov 9, 2017
b7212ff
Renames wv to kv in PoincareModel
jayantj Nov 9, 2017
3556ee4
add numpy==1.12 as test dependency
menshikh-iv Nov 10, 2017
4644eda
add missing quote
menshikh-iv Nov 10, 2017
6946b74
Merge branch 'poincare_model' into poincare_model_keyedvectors
jayantj Nov 11, 2017
770c5a9
Moves methods for evaluating poincare embeddings to poincare.py
jayantj Nov 11, 2017
f027f20
Updates docstrings for newly added classes
jayantj Nov 11, 2017
7eea9b7
Moves trie-related methods to LexicalEntailmentEvaluation
jayantj Nov 11, 2017
241f706
Moves code for loading PoincareEmbedding into notebook
jayantj Nov 11, 2017
3e6f0fe
Removes PoincareEmbedding class, adds functionality to PoincareKeyedV…
jayantj Nov 11, 2017
7a87d6a
Updates eval nb with code and evaluation results for gensim models
jayantj Nov 11, 2017
9d495a1
Minor documentation updates + bugfix in distance
jayantj Nov 12, 2017
63750e7
Adds methods for rank and nodes_closer_than to PoincareKeyedVectors
jayantj Nov 12, 2017
7ea9d13
Adds methods to return closest child, parent, and ancestor and descen…
jayantj Nov 12, 2017
748288d
Updates LE and reconstruction results for gensim models in eval nb
jayantj Nov 12, 2017
ad5f635
Adds notebook detailing Poincare embedding operations and report
jayantj Nov 12, 2017
9db3d87
Adds images for poincare embedding report
jayantj Nov 12, 2017
de31b3d
Updates image links in poincare report nb
jayantj Nov 12, 2017
94a2a18
try to run tests without autograd
menshikh-iv Nov 13, 2017
7a4ec79
fix PEP8 in poincare.py
menshikh-iv Nov 13, 2017
613ca38
fix PEP8 in test_poincare
menshikh-iv Nov 13, 2017
3029d41
PoincareRelations handles python2 correctly
jayantj Nov 13, 2017
055044c
Bugfix with int division for python2
jayantj Nov 13, 2017
f75491f
Imports mock module for tests correctly in python2
jayantj Nov 13, 2017
59fcf8b
Cleaner implementation of __iter__ for PoincareRelations
jayantj Nov 13, 2017
dcbe7aa
Adds rst file and updates apiref.rst for poincare module
jayantj Nov 13, 2017
b69f51f
Adds clarifying comment to PoincareRelations.__iter__
jayantj Nov 13, 2017
84d3e5e
Adds functions for visualization to poincare_visualization.py
jayantj Nov 13, 2017
e4c2d62
Suppresses certain numpy warnings while training model
jayantj Nov 13, 2017
001ec76
Updates rst file for poincare
jayantj Nov 13, 2017
b548464
Updates poincare report nb with reduced code, section on training, be…
jayantj Nov 13, 2017
9446a05
Renames hypernym pair to relations everywhere
jayantj Nov 13, 2017
930dfd4
Simpler way of detecting duplicates
jayantj Nov 13, 2017
355e521
Minor documentation updates in poincare.py
jayantj Nov 14, 2017
0d5175c
Skips gradients test if autograd not installed, adds test for bytes i…
jayantj Nov 14, 2017
68e872e
Adds results of gensim models on link prediction to eval notebook
jayantj Nov 14, 2017
53f6622
Adds link prediction results to report, more information about training
jayantj Nov 14, 2017
7f9337c
Adds further details to concept and motivation sections, section on f…
jayantj Nov 14, 2017
00ca7ab
Fix flake8 (noqa + remove unused var)
menshikh-iv Nov 14, 2017
8ff23ae
Fix missing mock dependency for win
menshikh-iv Nov 14, 2017
30ac3e6
Fix links in docstrings
menshikh-iv Nov 14, 2017
f0e15ee
Refactors KeyedVectors into KeyedVectorsBase and EuclideanKeyedVectors
jayantj Nov 15, 2017
a928ca1
Changes error message for negative sampling failing
jayantj Nov 15, 2017
dfc19cb
Adds option to specify dtype for PoincareModel and corresponding unit…
jayantj Nov 15, 2017
e967c54
Extends test for dtype to check after training, updates docstring
jayantj Nov 15, 2017
a39781b
Merge branch 'poincare_model' into poincare_model_keyedvectors
jayantj Nov 15, 2017
4920194
Adds tests for new methods in PoincareKeyedVectors
jayantj Nov 15, 2017
8765299
Fixes bug in closest_child implementation
jayantj Nov 15, 2017
9e3190f
Adds similarity and distance to KeyedVectorsBase interface, implement…
jayantj Nov 15, 2017
b1d5aa1
Minor fixes to Poincare report notebook
jayantj Nov 15, 2017
9cb4fa8
Adds method to compute all distances to KeyedVectorsBase, moves most_…
jayantj Nov 15, 2017
abbe77d
Allows PoincareKeyedVectors.distances to accept an optional list of w…
jayantj Nov 15, 2017
6db7b3b
Adds implementation of PoincareKeyedVectors.similarities and tests
jayantj Nov 16, 2017
97bb0a4
Adds restrict_vocab option to most_similar and tests for EuclideanKey…
jayantj Nov 16, 2017
4ba5c83
Adds docstring for tests
jayantj Nov 16, 2017
7fd2518
Adds implementation of EuclideanKeyedVectors.distances and tests, upd…
jayantj Nov 17, 2017
6e6c7df
Moves most_similar_to_given to KeyedVectorsBase, adds tests
jayantj Nov 17, 2017
68435b5
Moves similar_by_vector and similar_by_word to KeyedVectorsBase, adds…
jayantj Nov 17, 2017
8650910
Adds failing tests for similar_by_word and similar_by_vector to Poinc…
jayantj Nov 17, 2017
000b499
Moves multiple methods out of KeyedVectorsBase back to EuclideanKeyed…
jayantj Nov 17, 2017
b4f07bd
Adds test for most_similar with vector input for EuclideanKeyedVectors
jayantj Nov 17, 2017
c1f68e4
Adds failing test for vector input for most_similar for PoincareKeyed…
jayantj Nov 17, 2017
d6f743d
Allows passing in vector input to most_similar and distances methods …
jayantj Nov 17, 2017
9a0b64c
Removes precompute_max_distance and uses simpler formula for similari…
jayantj Nov 17, 2017
5b38f42
Renames PoincareKeyedVectors.poincare_dists to PoincareKeyedVectors.p…
jayantj Nov 17, 2017
db7def8
Fixes error with unclosed file in PoincareRelations
jayantj Nov 17, 2017
b4ae804
Adds tests and method for computing poincare distance between two inp…
jayantj Nov 17, 2017
a64f262
Adds methods and tests for finding position and difference in hierarc…
jayantj Nov 17, 2017
f10b0a7
Merge branch 'poincare' into poincare_model_keyedvectors
jayantj Nov 20, 2017
ed304eb
Fixes unused import, pep8 and docstring issues
jayantj Nov 21, 2017
5568a20
More intuitive naming of arguments for methods in PoincareKeyedVectors
jayantj Nov 21, 2017
ad965b4
Uses w1 and w2 consistently across KeyedVectors methods
jayantj Nov 21, 2017
2b982ab
Removes most_similar from KeyedVectorsBase
jayantj Nov 21, 2017
e31e816
Adds failing tests for words_closer_than and rank for EuclideanKeyedV…
jayantj Nov 21, 2017
d73c0e2
Adds distances method to KeyedVectorsBase and EuclideanKeyedVectors, …
jayantj Nov 22, 2017
235b643
Makes default argument for distances immutable
jayantj Nov 22, 2017
d0b8563
Uses conditional import for pygtrie in LexicalEntailmentEvaluation
jayantj Nov 22, 2017
cedd0e1
Renames position_in_hierarchy to norm with minor change in behaviour,…
jayantj Nov 22, 2017
0317189
Renames poincare_distance and poincare_distance_batch to vector_dista…
jayantj Nov 22, 2017
e693e64
Forces float division for positive_fraction in _sample_negatives
jayantj Nov 22, 2017
e931085
Removes unused method from PoincareKeyedVectors
jayantj Nov 22, 2017
3c8d9f2
Updates report notebook with usage examples of new API methods
jayantj Nov 22, 2017
73ed696
Minor pep8 fix
jayantj Nov 22, 2017
ee92be9
Fixes pep8 issues, unused imports and typo
jayantj Nov 23, 2017
46a7efb
Adds example of saving and loading model to notebook
jayantj Nov 23, 2017
291dac6
Updates docstrings in poincare.py
jayantj Nov 23, 2017
c532e6e
Moves poincare visualization methods to new gensim.viz module
jayantj Nov 27, 2017
c506b96
Updates rst files for poincare viz
jayantj Nov 27, 2017
b4ec393
Adds newline at the end of poincare.py in viz package
jayantj Nov 27, 2017
a7c3080
Adds link to original paper to poincare notebook
jayantj Dec 2, 2017
e53f487
fix viz.poincare & update docs dependencies
menshikh-iv Dec 4, 2017
4775f4d
add link to init file
menshikh-iv Dec 4, 2017
a22c601
fix PEP8
menshikh-iv Dec 4, 2017
6a2da73
fixes for poincare.py
menshikh-iv Dec 4, 2017
859 changes: 331 additions & 528 deletions docs/notebooks/Poincare Evaluation.ipynb

Large diffs are not rendered by default.

107,447 changes: 107,447 additions & 0 deletions docs/notebooks/Poincare Report.ipynb

Large diffs are not rendered by default.

Binary file added docs/notebooks/poincare/entailment_eval.png
Binary file added docs/notebooks/poincare/entailment_paper.png
Binary file added docs/notebooks/poincare/example_tree.png
Binary file added docs/notebooks/poincare/link_prediction_eval.png
Binary file added docs/notebooks/poincare/reconstruction_eval.png
Binary file added docs/notebooks/poincare/reconstruction_paper.png
291 changes: 193 additions & 98 deletions gensim/models/keyedvectors.py
@@ -73,6 +73,7 @@
double, array, vstack, fromstring, sqrt, newaxis,\
ndarray, sum as np_sum, prod, ascontiguousarray,\
argmax
import numpy as np

from gensim import utils, matutils # utility fnc for pickling, common scipy operations etc
from gensim.corpora.dictionary import Dictionary
@@ -103,28 +104,19 @@ def __str__(self):
return "%s(%s)" % (self.__class__.__name__, ', '.join(vals))


class KeyedVectors(utils.SaveLoad):

class KeyedVectorsBase(utils.SaveLoad):
Contributor review comment:

PEP8: too many blank lines

"""
Class to contain vectors and vocab for the Word2Vec training class and other w2v methods not directly
involved in training such as most_similar()
Base class to contain vectors and vocab for any set of vectors which are each associated with a key.

"""

def __init__(self):
self.syn0 = []
self.syn0norm = None
self.vocab = {}
self.index2word = []
self.vector_size = None

@property
def wv(self):
return self

def save(self, *args, **kwargs):
# don't bother storing the cached normalized vectors
kwargs['ignore'] = kwargs.get('ignore', ['syn0norm'])
super(KeyedVectors, self).save(*args, **kwargs)

def save_word2vec_format(self, fname, fvocab=None, binary=False, total_vec=None):
"""
Store the input-hidden weight matrix in the same format used by the original
@@ -263,6 +255,121 @@ def add_word(word, weights):
logger.info("loaded %s matrix from %s", result.syn0.shape, fname)
return result

def similarity(self, word_1, word_2):
Contributor review comment:

The parameters are called word_1, word_2 here, w1, w2 in Euclidean, and term_1, term_2 in Poincare. Some consistency would be nice. The same goes for distance().

"""
Compute similarity between vectors of two input words.
To be implemented by child class.
"""
raise NotImplementedError

def distance(self, word_1, word_2):
"""
Compute distance between vectors of two input words.
To be implemented by child class.
"""
raise NotImplementedError

def word_vec(self, word):
"""
Accept a single word as input.
Returns the word's representations in vector space, as a 1D numpy array.

Example::

>>> trained_model['office']
Contributor review comment:

Weird example. Should this be a word_vec call instead?

array([ -1.40128313e-02, ...])

"""
if word in self.vocab:
result = self.syn0[self.vocab[word].index]
result.setflags(write=False)
return result
else:
raise KeyError("word '%s' not in vocabulary" % word)

def __getitem__(self, words):
"""
Accept a single word or a list of words as input.

If a single word: returns the word's representations in vector space, as
a 1D numpy array.

Multiple words: return the words' representations in vector space, as a
2d numpy array: #words x #vector_size. Matrix rows are in the same order
as in input.

Example::

>>> trained_model['office']
array([ -1.40128313e-02, ...])

>>> trained_model[['office', 'products']]
array([ -1.40128313e-02, ...]
[ -1.70425311e-03, ...]
...)

"""
if isinstance(words, string_types):
# allow calls like trained_model['office'], as a shorthand for trained_model[['office']]
return self.word_vec(words)

return vstack([self.word_vec(word) for word in words])
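
The lookup behaviour of `word_vec` and `__getitem__` above can be sketched outside gensim roughly as follows. `TinyKeyedVectors` and its plain `vocab` dict of word to row index are hypothetical stand-ins (the code in this diff stores vocab objects with an `.index` attribute), and `str` is used in place of `string_types`; treat it as an illustration under those assumptions rather than the library code.

```python
import numpy as np


class TinyKeyedVectors(object):
    """Hypothetical stand-in: maps each word to one row of a matrix."""

    def __init__(self, words, vectors):
        self.syn0 = np.asarray(vectors, dtype=float)
        # Assumption: a plain dict of word -> row index
        self.vocab = {word: i for i, word in enumerate(words)}

    def word_vec(self, word):
        if word not in self.vocab:
            raise KeyError("word '%s' not in vocabulary" % word)
        result = self.syn0[self.vocab[word]]  # a view onto one row
        result.setflags(write=False)  # only this view becomes read-only
        return result

    def __getitem__(self, words):
        # A single string returns a 1D vector; any other iterable is stacked
        if isinstance(words, str):
            return self.word_vec(words)
        return np.vstack([self.word_vec(word) for word in words])


kv = TinyKeyedVectors(['office', 'products'], [[1.0, 2.0], [3.0, 4.0]])
single = kv['office']                  # 1D array of shape (2,)
stacked = kv[['products', 'office']]   # 2D array, rows in input order
```

Marking the returned view as read-only guards against callers accidentally modifying the shared matrix through a looked-up vector.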

def __contains__(self, word):
return word in self.vocab

def most_similar(self, word, topn=10, restrict_vocab=None):
"""
Find the top-N most similar words to the given word, sorted in increasing order of distance.
To be implemented by child classes

"""
raise NotImplementedError

def most_similar_to_given(self, w1, word_list):
"""Return the word from word_list most similar to w1.

Args:
w1 (str): a word
word_list (list): list of words containing a word most similar to w1

Returns:
the word in word_list with the highest similarity to w1

Raises:
KeyError: If w1 or any word in word_list is not in the vocabulary

Example::

>>> trained_model.most_similar_to_given('music', ['water', 'sound', 'backpack', 'mouse'])
'sound'

>>> trained_model.most_similar_to_given('snake', ['food', 'pencil', 'animal', 'phone'])
'animal'

"""
return word_list[argmax([self.similarity(w1, word) for word in word_list])]
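
The split between the base class and its children can be illustrated with a minimal sketch: the base class leaves `similarity` unimplemented, while shared helpers such as `most_similar_to_given` depend only on it. The class names `VectorsBase` and `CosineVectors` and the dict-based storage are hypothetical, chosen for illustration only.

```python
import numpy as np


class VectorsBase(object):
    """Hypothetical base class: subclasses supply `similarity`."""

    def __init__(self, vectors):
        # Assumption: a plain dict of word -> 1D numpy array
        self.vectors = vectors

    def similarity(self, w1, w2):
        raise NotImplementedError

    def most_similar_to_given(self, w1, word_list):
        # Shared logic relies only on the subclass-provided similarity
        scores = [self.similarity(w1, word) for word in word_list]
        return word_list[int(np.argmax(scores))]


class CosineVectors(VectorsBase):
    """Hypothetical child class using cosine similarity."""

    def similarity(self, w1, w2):
        v1, v2 = self.vectors[w1], self.vectors[w2]
        return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


kv = CosineVectors({
    'music': np.array([1.0, 0.1]),
    'sound': np.array([0.9, 0.2]),
    'water': np.array([0.0, 1.0]),
})
closest = kv.most_similar_to_given('music', ['water', 'sound'])
```

A child class with a different geometry only needs to override `similarity`; the helper is inherited unchanged.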


class EuclideanKeyedVectors(KeyedVectorsBase):
"""
Class to contain vectors and vocab for the Word2Vec training class and other w2v methods not directly
involved in training such as most_similar()
"""

def __init__(self):
super(EuclideanKeyedVectors, self).__init__()
self.syn0norm = None

@property
def wv(self):
return self

def save(self, *args, **kwargs):
# don't bother storing the cached normalized vectors
kwargs['ignore'] = kwargs.get('ignore', ['syn0norm'])
super(EuclideanKeyedVectors, self).save(*args, **kwargs)

def word_vec(self, word, use_norm=False):
"""
Accept a single word as input.
@@ -356,6 +463,44 @@ def most_similar(self, positive=None, negative=None, topn=10, restrict_vocab=Non
result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
return result[:topn]

def similar_by_word(self, word, topn=10, restrict_vocab=None):
"""
Find the top-N most similar words.

If topn is False, similar_by_word returns the vector of similarity scores.

`restrict_vocab` is an optional integer which limits the range of vectors which
are searched for most-similar values. For example, restrict_vocab=10000 would
only check the first 10000 word vectors in the vocabulary order. (This may be
meaningful if you've sorted the vocabulary by descending frequency.)

Example::

>>> trained_model.similar_by_word('graph')
[('user', 0.9999163150787354), ...]

"""
return self.most_similar(positive=[word], topn=topn, restrict_vocab=restrict_vocab)

def similar_by_vector(self, vector, topn=10, restrict_vocab=None):
"""
Find the top-N most similar words by vector.

If topn is False, similar_by_vector returns the vector of similarity scores.

`restrict_vocab` is an optional integer which limits the range of vectors which
are searched for most-similar values. For example, restrict_vocab=10000 would
only check the first 10000 word vectors in the vocabulary order. (This may be
meaningful if you've sorted the vocabulary by descending frequency.)

Example::

>>> trained_model.similar_by_vector([1,2])
[('survey', 0.9942699074745178), ...]

"""
return self.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)

def wmdistance(self, document1, document2):
"""
Compute the Word Mover's Distance between two documents. When using this
@@ -511,46 +656,6 @@ def most_similar_cosmul(self, positive=None, negative=None, topn=10):
result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
return result[:topn]

def similar_by_word(self, word, topn=10, restrict_vocab=None):
"""
Find the top-N most similar words.

If topn is False, similar_by_word returns the vector of similarity scores.

`restrict_vocab` is an optional integer which limits the range of vectors which
are searched for most-similar values. For example, restrict_vocab=10000 would
only check the first 10000 word vectors in the vocabulary order. (This may be
meaningful if you've sorted the vocabulary by descending frequency.)

Example::

>>> trained_model.similar_by_word('graph')
[('user', 0.9999163150787354), ...]

"""

return self.most_similar(positive=[word], topn=topn, restrict_vocab=restrict_vocab)

def similar_by_vector(self, vector, topn=10, restrict_vocab=None):
"""
Find the top-N most similar words by vector.

If topn is False, similar_by_vector returns the vector of similarity scores.

`restrict_vocab` is an optional integer which limits the range of vectors which
are searched for most-similar values. For example, restrict_vocab=10000 would
only check the first 10000 word vectors in the vocabulary order. (This may be
meaningful if you've sorted the vocabulary by descending frequency.)

Example::

>>> trained_model.similar_by_vector([1,2])
[('survey', 0.9942699074745178), ...]

"""

return self.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)

def doesnt_match(self, words):
"""
Which word from the given list doesn't go with the others?
@@ -574,36 +679,47 @@ def doesnt_match(self, words):
dists = dot(vectors, mean)
return sorted(zip(dists, used_words))[0][1]

def __getitem__(self, words):
@staticmethod
def cosine_similarities(vector_1, vectors_all):
"""
Accept a single word or a list of words as input.
Return cosine similarities between one vector and a set of other vectors.

If a single word: returns the word's representations in vector space, as
a 1D numpy array.
Parameters
----------
vector_1 : numpy.array
vector from which similarities are to be computed.
expected shape (dim,)
vectors_all : numpy.array
for each row in vectors_all, similarity to vector_1 is computed.
expected shape (num_vectors, dim)

Multiple words: return the words' representations in vector space, as a
2d numpy array: #words x #vector_size. Matrix rows are in the same order
as in input.
Returns
-------
numpy.array
Contains cosine similarity between vector_1 and each row in vectors_all.
shape (num_vectors,)

Example::
"""
norm = np.linalg.norm(vector_1)
all_norms = np.linalg.norm(vectors_all, axis=1)
dot_products = dot(vectors_all, vector_1)
similarities = dot_products / (norm * all_norms)
return similarities

>>> trained_model['office']
array([ -1.40128313e-02, ...])
def distance(self, w1, w2):
"""
Compute cosine distance between two words.

>>> trained_model[['office', 'products']]
array([ -1.40128313e-02, ...]
[ -1.70425311e-03, ...]
...)
Example::

"""
if isinstance(words, string_types):
# allow calls like trained_model['office'], as a shorthand for trained_model[['office']]
return self.word_vec(words)
>>> trained_model.distance('woman', 'man')
0.34

return vstack([self.word_vec(word) for word in words])
>>> trained_model.distance('woman', 'woman')
0.0

def __contains__(self, word):
return word in self.vocab
"""
return 1 - self.similarity(w1, w2)
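
As a standalone sketch of the relationship used here, the function below mirrors the `cosine_similarities` computation from this diff as a free function and derives cosine distance as `1 - similarity`. The sample vectors are made up for illustration.

```python
import numpy as np


def cosine_similarities(vector_1, vectors_all):
    """Cosine similarity between vector_1 (dim,) and each row of vectors_all (n, dim)."""
    norm = np.linalg.norm(vector_1)
    all_norms = np.linalg.norm(vectors_all, axis=1)
    dot_products = np.dot(vectors_all, vector_1)
    return dot_products / (norm * all_norms)


query = np.array([1.0, 0.0])
others = np.array([
    [2.0, 0.0],   # same direction as query
    [0.0, 3.0],   # orthogonal to query
    [-1.0, 0.0],  # opposite direction
])
similarities = cosine_similarities(query, others)
distances = 1 - similarities  # cosine distance, ranging from 0 to 2
```

Because similarity lies in [-1, 1], the derived distance lies in [0, 2], with 0 for vectors pointing the same way regardless of their length.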

def similarity(self, w1, w2):
"""
@@ -620,30 +736,6 @@ def similarity(self, w1, w2):
"""
return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))

def most_similar_to_given(self, w1, word_list):
"""Return the word from word_list most similar to w1.

Args:
w1 (str): a word
word_list (list): list of words containing a word most similar to w1

Returns:
the word in word_list with the highest similarity to w1

Raises:
KeyError: If w1 or any word in word_list is not in the vocabulary

Example::

>>> trained_model.most_similar_to_given('music', ['water', 'sound', 'backpack', 'mouse'])
'sound'

>>> trained_model.most_similar_to_given('snake', ['food', 'pencil', 'animal', 'phone'])
'animal'

"""
return word_list[argmax([self.similarity(w1, word) for word in word_list])]

def n_similarity(self, ws1, ws2):
"""
Compute cosine similarity between two sets of words.
@@ -873,3 +965,6 @@ def get_keras_embedding(self, train_embeddings=False):
weights=[weights], trainable=train_embeddings
)
return layer

# For backward compatibility
KeyedVectors = EuclideanKeyedVectors
Contributor review comment:

PEP8: no newline at the EOF
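
The backward-compatibility alias at the end of the diff follows a common renaming pattern, sketched below with a bare stand-in class (the real `EuclideanKeyedVectors` has far more behaviour; only the aliasing is illustrated here).

```python
class EuclideanKeyedVectors(object):
    """Stand-in for the renamed class; only the aliasing matters here."""

    def __init__(self):
        self.syn0norm = None


# Old name kept as an alias so existing imports keep working
KeyedVectors = EuclideanKeyedVectors

legacy = KeyedVectors()
class_name = type(legacy).__name__
```

Both names refer to the same class object, so `isinstance` checks against either name succeed, while instances report the new class name.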
