Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gensim.models.BaseKeyedVectors.add_entity method for fill KeyedVectors in manual way. Fix #1942 #1957

Conversation

persiyanov
Copy link
Contributor

Pull request related to this issue.

With these changes, it would be possible to add new word vectors into a KeyedVectors object.

@menshikh-iv please, take a look. I will write tests on it after addressing your comments.

Moreover, I have some questions/doubts:

  1. Method name: is it okay right now or maybe it should be named like add_entity/add_word?
  2. Weights parameter: should we check the type and shape of weights and raise an exception in a bad case?
  3. self.vectors contiguity: here vectors list is casted to C-contiguous array. Does numpy preserve C-contiguity after operations such as np.vstack? If so, I think I should add numpy.ascontiguousarray cast in my code.

@menshikh-iv
Copy link
Contributor

Hello @persiyanov

  1. I think both (add_word & add_entity) is OK for this case
  2. I don't think so (because you have no predefined length if you'll create the empty class). Anyway, this needs only for some special cases.

@menshikh-iv menshikh-iv changed the title [Fixes #1942]: Introduce BaseKeyedVectors.add(...) method Add BaseKeyedVectors.add for fill kv in manual way. Fix #1942 Mar 7, 2018
@@ -154,6 +154,19 @@ def get_vector(self, entity):
else:
raise KeyError("'%s' not in vocabulary" % entity)

def add(self, entity, weights):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's about re-using this function in (this is duplication right now from https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L182)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, need to add tests to check this functionality:

  • load some kv, add more vectors in a manual way and check that this added fine
  • create empty kv, fill it manually and check that all fine

Copy link
Contributor Author

@persiyanov persiyanov Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv

About reusing this function:

It's a bit difficult because in this function vstack is used to append new word vector to self.vectors, while add_word in utils_any2vec creates vectors array at first (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L180) and then it just inserts vectors into it (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L197).

While it's possible to follow DRY here, the interface of BaseKeyedVectors.add() method will be more complicated (or I can change the logic in utils_any2vec -- not to create vectors = np.zeros(...) but append each word to the array, but it could decrease the performance of load_word2vec_format function).

If some of these two options is okay, I'll implement it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, thanks for suggestion, let's stay it as is.

@@ -153,6 +153,27 @@ def test_wv_property(self):
"""Test that the deprecated `wv` property returns `self`. To be removed in v4.0.0."""
self.assertTrue(self.vectors is self.vectors.wv)

def test_add_word(self):
"""Test that adding word in a manual way works correctly."""
from numpy.random import randn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove import and use np.random.randn

@@ -154,6 +154,28 @@ def get_vector(self, entity):
else:
raise KeyError("'%s' not in vocabulary" % entity)

def add_entity(self, entity, weights):
"""Accept an entity specified by string tag and vector weights as 1D numpy array with shape (`vector_size`,).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use numpy-style docstrings

self.vocab[entity] = Vocab(index=entity_id, count=1)
self.index2entity.append(entity)

def add_word(self, entity, weights):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this doesn't need?

@menshikh-iv
Copy link
Contributor

CC: @gojomo wdyt?

@gojomo
Copy link
Collaborator

gojomo commented Mar 8, 2018

This sort of functionality is a natural and useful addition; ideally it'd also be joined with other new APIs to assist initial building of a KeyedVectors from external data sources (rather than the just the direct property-tampering of current usage).

There should be a bulk addition option: otherwise doing lots of single-adds in a loop requires many wasteful reallocations of 1-vector-larger-arrays.

Should perhaps use __setitem__() rather than a named method, or at least have that as an idiomatic option.

Longer-term, a KeyedVectors that actually splits its contents into multiple segments of smaller arrays for more efficient add/delete grow/shrinks would also make sense. But, that'd require more hiding of internal implementation details in a way that could break lots of the current direct-property-accesses by other code.

re: @persiyanov your questions - (1) I'd avoid 'word' in any method names, to stay loyal to intent that this class accepts keys other than just words. (2) if the vstack() exception which results from attempting a mismatched shape is already sufficiently descriptive, no need for extra checking here... but if it's unclear/confusing, then a compat-shape-check would make sense. (3) I don't know but that's worth checking.

@menshikh-iv menshikh-iv changed the title Add BaseKeyedVectors.add for fill kv in manual way. Fix #1942 Add gensim.models.BaseKeyedVectors.add_entity method for fill KeyedVectors in manual way. Fix #1942 Mar 9, 2018
@persiyanov
Copy link
Contributor Author

persiyanov commented Mar 9, 2018

@menshikh-iv @gojomo please, take a look.

I've implemented add_entities method and its alias __setitem__. Also, I've added bool flag replace which specifies vectors replacement strategy for those entities which are already in vocabulary.

@persiyanov
Copy link
Contributor Author

Ping @gojomo @menshikh-iv

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @persiyanov: +1:
I have only several nitpicks about docstrings + not sure about add_entity.

----------
entities : list of str
Entities specified by string tags.
weights: list of np.array or np.array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{list of numpy.ndarray, numpy.ndarray}

@@ -154,6 +154,75 @@ def get_vector(self, entity):
else:
raise KeyError("'%s' not in vocabulary" % entity)

def add_entity(self, entity, weights, replace=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove this method (add_entities looks enough, wdyt @gojomo?)

List of 1D np.array vectors or 2D np.array of vectors.
replace: bool, optional
Boolean flag indicating whether to replace vectors for entities which are already in the vocabulary.
Default, False, means that old vectors for those entities are keeped.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to duplicate "default" value for trivial case in docstring, maybe better to write something like

Flag indicating whether to replace vectors for entities which are already in the vocabulary, if True - replace vectors, otherwise - keep old vectors.

self.vectors[in_vocab_idxs] = weights[in_vocab_mask]

def __setitem__(self, entities, weights):
"""Idiomatic way to call `add_entities` with `replace=True`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to write full docstring

@menshikh-iv
Copy link
Contributor

@gojomo please have a look (for me LGTM, except add_entity method, because I don't see any reason for this alias).

@persiyanov
Copy link
Contributor Author

ping @gojomo

in_vocab_idxs = []
out_vocab_entities = []

for idx, entity in zip(range(len(entities)), entities):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enumerate() more idiomatic.

if len(self.vectors) == 0:
self.vectors = weights[~in_vocab_mask]
else:
self.vectors = vstack((self.vectors, weights[~in_vocab_mask]))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might this line work even in the case where len(self.vectors)==0, making the check/branch unnecessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not obvious how to do that, because when empty KeyedVectors object is created, self.vectors = [] is true. In that case, we can't use vstack(([], weights[~in_vocab_mask])) and ValueError: all the input array dimensions except for the concatenation axis must match exactly is raised.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for an empty KeyedVectors to have a self.vectors that is already a proper-dimensioned (0, vector_size) empty ndarray? (Not sure myself, but would simplify things in later places like this.)


in_vocab_mask = np.zeros(len(entities), dtype=np.bool)
in_vocab_idxs = []
out_vocab_entities = []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method might be simpler without separate in_vocab_idxs and out_vocab_entities – just driving those ops from the mask, using options like where() or nonzero().

@gojomo
Copy link
Collaborator

gojomo commented Mar 14, 2018

Functionality seems good.

I think class has a preexisting terminology issue with the use of word 'entity' where often 'key' would be more-consistent/more-specific. Also, as the 'items' inside this are definitionally 'vectors', generally vector/vectors better terms than weights . While I'm not sure this PR can/should fix all of that, I'd prefer add_entity() & add_entities() be replaced with a single add(keys, vectors, replace=True) (that could also for convenience tolerate a single key/vector).

@persiyanov
Copy link
Contributor Author

persiyanov commented Mar 16, 2018

@gojomo @menshikh-iv please, take a look.

I've removed add_entities & add_entity but kept add method which can be used for all cases. Also, I got rid of out_vocab_entities variable, doing several np.nonzero(...) calls.

I also think that operating with keys/vectors instead of entities/weights is better, but it's not related to this task and better to be done in another PR.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, great work @persiyanov 👍

replace: bool, optional
Flag indicating whether to replace vectors for entities which are already in the vocabulary,
if True - replace vectors, otherwise - keep old vectors.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: multiline docstring should ends with empty line, i.e.

"""
...
last text

"""

"""
if isinstance(entities, string_types):
entities = [entities]
weights = weights.reshape(1, -1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably, should be weights = np.array(weights).reshape(1, -1) for case if weights, for example, list of floats

@persiyanov
Copy link
Contributor Author

@menshikh-iv Fixed

@@ -73,7 +73,7 @@
PYEMD_EXT = False

from numpy import dot, zeros, float32 as REAL, empty, memmap as np_memmap, \
double, array, vstack, sqrt, newaxis, integer, \
double, array, zeros, vstack, sqrt, newaxis, integer, \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zeros imported twice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gojomo yeah, i've fixed it

@menshikh-iv
Copy link
Contributor

Congratz with first time contribution @persiyanov 👍 🥇

@menshikh-iv menshikh-iv merged commit 58d560b into piskvorky:develop Mar 20, 2018
@gojomo
Copy link
Collaborator

gojomo commented Mar 20, 2018

@persiyanov Yes, and thanks for your patience through all the subtle refinements!

@persiyanov persiyanov deleted the feature/add-word-method-to-keyed-vectors branch March 21, 2018 11:12
@piskvorky
Copy link
Owner

piskvorky commented Mar 27, 2018

Is the word entity still in? I'm strongly -1 on that, we should not merge that.

"Entity" has an established meaning in NLP, it comes with certain connotations. Introducing a new concept into gensim like this muddles the waters. Especially since the concept is not really introduced at all in the PR AFAICS—what is an "entity" here? I only skimmed the docstrings and they only mention the word, never explain it. This is confusing and inconsistent with our other docs.

@gojomo
Copy link
Collaborator

gojomo commented Mar 27, 2018

The widespread use of 'entity' was introduced in #1777; this PR is just following the code's existing practice. The code desperately needs a refactoring for consistency/clarity, even moreso since the #1777 attempt!

@menshikh-iv
Copy link
Contributor

@piskvorky this modification for BaseKeyedVectors class (that doesn't say anything about words, only abstract "entities" that can be "words" in the subclasses).

Check out other code from this class, for example

https://github.com/RaRe-Technologies/gensim/blob/2e08f4d3b218c9675d4f842f724af40a4f4ec1ee/gensim/models/keyedvectors.py#L124-L144

BaseKeyedVectors used as base class for PoincareKeyedVectors (where we have no "words")
https://github.com/RaRe-Technologies/gensim/blob/2e08f4d3b218c9675d4f842f724af40a4f4ec1ee/gensim/models/poincare.py#L772

@piskvorky
Copy link
Owner

piskvorky commented Mar 29, 2018

I agree with @gojomo , the code desperately needs a proper refactoring. The current post-#1777 situation seems untenable. @gojomo any chance you could take this up?

@menshikh-iv your snippet doesn't explain entity in any way, so not sure how that's helpful. I'm strongly -1 on introducing new fundamental concepts in this way.

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Mar 29, 2018

@piskvorky I don't explain it, I just show that the current code mimics the already existing and doesn't introduce any new concepts.

@piskvorky
Copy link
Owner

Understood. It's an issue with #1777 , not this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants