Add gensim.models.BaseKeyedVectors.add_entity method for fill KeyedVectors in manual way. Fix #1942 #1957
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -154,6 +154,75 @@ def get_vector(self, entity): | |
else: | ||
raise KeyError("'%s' not in vocabulary" % entity) | ||
|
||
def add_entity(self, entity, weights, replace=False): | ||
"""Add entity vector in a manual way. | ||
If `entity` is already in the vocabulary, the old vector is kept unless the `replace` flag is True. | |
|
||
Parameters | ||
---------- | ||
entity : str | ||
Entity specified by string tag. | ||
weights : np.array | ||
1D numpy array with shape (`vector_size`,) | ||
replace : bool, optional | |
Whether to replace the old vector if the entity is already in the vocabulary. | |
The default, False, means that the old vector is kept. | |
""" | ||
self.add_entities([entity], weights.reshape(1, -1), replace=replace) | ||
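The `reshape(1, -1)` call above turns a single 1D vector into a one-row 2D batch, so that `add_entities` can handle it like any other batch. A minimal sketch of that step, where `vector_size` and the sample values are assumptions chosen for illustration:

```python
import numpy as np

vector_size = 4  # hypothetical dimensionality, not taken from the diff

# A single entity vector, shape (vector_size,)
weights = np.arange(vector_size, dtype=np.float32)

# One-row batch, shape (1, vector_size); -1 lets NumPy infer the column count
batch = weights.reshape(1, -1)
```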
|
||
def add_entities(self, entities, weights, replace=False): | ||
"""Add entities and theirs vectors in a manual way. | ||
If an entity is already in the vocabulary, its old vector is kept unless the `replace` flag is True. | |
|
||
Parameters | ||
---------- | ||
entities : list of str | ||
Entities specified by string tags. | ||
weights : list of np.array or np.array | |
|
||
List of 1D np.array vectors or 2D np.array of vectors. | ||
replace : bool, optional | |
Whether to replace vectors for entities that are already in the vocabulary. | |
The default, False, means that the old vectors for those entities are kept. | |
Review comment: No need to duplicate the "default" value for a trivial case in the docstring; maybe better to write something like
|
||
""" | ||
Review comment: nitpick: a multiline docstring should end with an empty line, i.e.
|
||
if isinstance(weights, list): | ||
weights = np.array(weights) | ||
|
||
in_vocab_mask = np.zeros(len(entities), dtype=bool) | |
in_vocab_idxs = [] | ||
out_vocab_entities = [] | ||
Review comment: This method might be simpler without separate
||
|
||
for idx, entity in enumerate(entities): | |
|
||
if entity in self.vocab: | ||
in_vocab_mask[idx] = True | ||
in_vocab_idxs.append(self.vocab[entity].index) | ||
else: | ||
out_vocab_entities.append(entity) | ||
|
||
# add new entities to the vocab | ||
for entity in out_vocab_entities: | ||
entity_id = len(self.vocab) | ||
self.vocab[entity] = Vocab(index=entity_id, count=1) | ||
self.index2entity.append(entity) | ||
|
||
# add vectors for new entities | ||
if len(self.vectors) == 0: | ||
self.vectors = weights[~in_vocab_mask] | ||
else: | ||
self.vectors = vstack((self.vectors, weights[~in_vocab_mask])) | ||
Review comment: Might this line work even in the case where
Reply: I think it's not obvious how to do that, because when empty
Reply: Is it possible for an empty
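The truncated thread above appears to ask whether the `vstack` call could also cover the empty case. A minimal sketch of how `np.vstack` behaves there, assuming (the diff does not say) that an empty `self.vectors` is either a 1D `np.array([])` or a 2D array of shape `(0, vector_size)`:

```python
import numpy as np

vector_size = 3  # hypothetical dimensionality for illustration
new_rows = np.ones((2, vector_size), dtype=np.float32)

# Case 1: an empty 2D array with the right number of columns stacks cleanly.
empty_2d = np.empty((0, vector_size), dtype=np.float32)
stacked = np.vstack((empty_2d, new_rows))

# Case 2: an empty 1D array has no matching column count, so vstack raises.
empty_1d = np.array([], dtype=np.float32)
try:
    np.vstack((empty_1d, new_rows))
    vstack_failed = False
except ValueError:
    vstack_failed = True
```

Under that assumption, initializing the storage with shape `(0, vector_size)` would make the separate `len(self.vectors) == 0` branch unnecessary.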
||
|
||
# change vectors for in_vocab entities if `replace` flag is specified | ||
if replace: | ||
self.vectors[in_vocab_idxs] = weights[in_vocab_mask] | ||
|
||
def __setitem__(self, entities, weights): | ||
"""Idiomatic way to call `add_entities` with `replace=True`. | ||
Review comment: better to write full docstring
||
""" | ||
if not isinstance(entities, list): | ||
entities = [entities] | ||
weights = weights.reshape(1, -1) | ||
|
||
self.add_entities(entities, weights, replace=True) | ||
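The add-or-replace behavior of `add_entities` and `__setitem__` above can be tried in isolation. The following is only a sketch under stated assumptions: `TinyKeyedVectors` is a hypothetical stand-in rather than gensim's class, it stores plain row indexes in `vocab` instead of `Vocab` objects, and it starts from a `(0, vector_size)` array so that no empty-case branch is needed.

```python
import numpy as np


class TinyKeyedVectors:
    """Hypothetical stand-in for the class in the diff; not gensim's API."""

    def __init__(self, vector_size):
        self.vector_size = vector_size
        self.vocab = {}          # entity -> row index (the diff uses Vocab objects)
        self.index2entity = []   # row index -> entity
        self.vectors = np.empty((0, vector_size), dtype=np.float32)

    def add_entities(self, entities, weights, replace=False):
        # Accept a list of 1D arrays or a 2D array, as in the diff's docstring.
        weights = np.asarray(weights, dtype=np.float32).reshape(-1, self.vector_size)
        in_vocab_mask = np.zeros(len(entities), dtype=bool)
        in_vocab_idxs = []
        for idx, entity in enumerate(entities):
            if entity in self.vocab:
                in_vocab_mask[idx] = True
                in_vocab_idxs.append(self.vocab[entity])
            else:
                # New entity: assign the next free row index.
                self.vocab[entity] = len(self.vocab)
                self.index2entity.append(entity)
        # Append rows for new entities only.
        self.vectors = np.vstack((self.vectors, weights[~in_vocab_mask]))
        # Overwrite rows of known entities only when asked to.
        if replace:
            self.vectors[in_vocab_idxs] = weights[in_vocab_mask]

    def __setitem__(self, entities, weights):
        # Mirrors the diff: a single entity is wrapped into a one-item batch.
        if not isinstance(entities, list):
            entities = [entities]
            weights = np.asarray(weights).reshape(1, -1)
        self.add_entities(entities, weights, replace=True)


kv = TinyKeyedVectors(vector_size=3)
kv.add_entities(["a", "b"], [np.ones(3), np.zeros(3)])
kv.add_entities(["a", "c"], np.full((2, 3), 5.0))  # "a" keeps its old vector
kv["a"] = np.full(3, 7.0)                          # "a" is overwritten
```

In this sketch the second call leaves the row for `"a"` untouched because `replace` defaults to False, while the item assignment overwrites it.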
|
||
def __getitem__(self, entities): | ||
""" | ||
Accept a single entity (string tag) or list of entities as input. | ||
|
Review comment: maybe remove this method (`add_entities` looks enough, wdyt @gojomo?)