Allow initialization with max_final_vocab in lieu of min_count for gensim.models.Word2Vec. Fix #465 #1915

Merged Mar 22, 2018 (29 commits). Changes shown below are from 5 commits.
e249ed4
handle deprecation
aneesh-joshi Feb 8, 2018
62f6c82
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
aneesh-joshi Feb 14, 2018
1677e98
handle max_count
aneesh-joshi Feb 18, 2018
e8c08f8
change flag name
aneesh-joshi Feb 18, 2018
258d033
make flake8 compatible
aneesh-joshi Feb 18, 2018
875c65c
move max_vocab to prepare vocab
aneesh-joshi Feb 20, 2018
0aa8426
correct max_vocab semantics
aneesh-joshi Feb 20, 2018
390f333
remove unnecessary nextline
aneesh-joshi Feb 20, 2018
8c508c7
fix bug and make flake8 complaint
aneesh-joshi Feb 21, 2018
c826b19
refactor code and change sorting to key based
aneesh-joshi Feb 22, 2018
35dc681
add tests
aneesh-joshi Mar 5, 2018
67f6a14
introduce effective_min_count
aneesh-joshi Mar 5, 2018
7b1f612
make flake8 compliant
aneesh-joshi Mar 5, 2018
fafee70
remove clobbering of min_count
aneesh-joshi Mar 7, 2018
9d99660
remove min_count assertion
aneesh-joshi Mar 7, 2018
6c06fbc
.\gensim\models\word2vec.py
aneesh-joshi Mar 7, 2018
c5a0e6e
Revert ".\gensim\models\word2vec.py"
aneesh-joshi Mar 7, 2018
fdd2aab
rename max_vocab to max_final_vocab
aneesh-joshi Mar 7, 2018
974d587
update test to max_final_vocab
aneesh-joshi Mar 7, 2018
ddb3556
move and modify comment docs
aneesh-joshi Mar 7, 2018
c54d8a9
make flake8 compliant
aneesh-joshi Mar 7, 2018
f379616
refactor word2vec.py
aneesh-joshi Mar 8, 2018
46d3885
handle possible old model load errors
aneesh-joshi Mar 11, 2018
2cf5625
include effective_min_count tests
aneesh-joshi Mar 11, 2018
8578e3d
make flake compliant
aneesh-joshi Mar 11, 2018
a43fea3
remove check for max_final_vocab
aneesh-joshi Mar 13, 2018
340a8cf
include backward compat for 3.3 models
aneesh-joshi Mar 15, 2018
0b62407
remove unnecessary newline
aneesh-joshi Mar 15, 2018
5b7a6c2
add test case for max_final_vocab
aneesh-joshi Mar 19, 2018
31 changes: 28 additions & 3 deletions gensim/models/word2vec.py
@@ -425,7 +425,8 @@ class Word2Vec(BaseWordEmbeddingsModel):
def __init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=()):
trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=(),
use_max_vocab=False, max_vocab=None):
Review comment (Contributor):

Should it be implemented only for word2vec (or for other *2vec models too)?
CC: @gojomo

"""
Initialize the model from an iterable of `sentences`. Each sentence is a
list of words (unicode strings) that will be used for training.
@@ -510,14 +511,16 @@ def __init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
>>> say_vector = model['say'] # get vector for word

"""
self.use_max_vocab = use_max_vocab
self.max_vocab = max_vocab

self.callbacks = callbacks
self.load = call_on_class_only

self.wv = Word2VecKeyedVectors(size)
self.vocabulary = Word2VecVocab(
max_vocab_size=max_vocab_size, min_count=min_count, sample=sample,
sorted_vocab=bool(sorted_vocab), null_word=null_word)
sorted_vocab=bool(sorted_vocab), null_word=null_word, use_max_vocab=use_max_vocab, max_vocab=max_vocab)
self.trainables = Word2VecTrainables(seed=seed, vector_size=size, hashfxn=hashfxn)

super(Word2Vec, self).__init__(
@@ -1131,14 +1134,17 @@ def __iter__(self):


class Word2VecVocab(utils.SaveLoad):
def __init__(self, max_vocab_size=None, min_count=5, sample=1e-3, sorted_vocab=True, null_word=0):
def __init__(self, max_vocab_size=None, min_count=5, sample=1e-3, sorted_vocab=True, null_word=0,
use_max_vocab=False, max_vocab=None):
Review comment (Contributor):
No need to add two parameters; max_vocab alone is enough.
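The suggestion amounts to letting the presence of a max_vocab value act as the flag itself, instead of carrying a separate use_max_vocab boolean. A minimal sketch with illustrative names (not gensim's actual classes):

```python
class VocabSketch:
    """Illustrative only: a single optional parameter replaces the
    use_max_vocab/max_vocab pair from the diff above."""

    def __init__(self, max_vocab=None):
        self.max_vocab = max_vocab

    @property
    def capped(self):
        # "max_vocab is not None" is the flag; no second parameter needed.
        return self.max_vocab is not None
```

This also removes the error path where use_max_vocab=True is passed without a max_vocab value.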

self.max_vocab_size = max_vocab_size
self.min_count = min_count
self.sample = sample
self.sorted_vocab = sorted_vocab
self.null_word = null_word
self.cum_table = None # for negative sampling
self.raw_vocab = None
self.use_max_vocab = use_max_vocab
Review comment (Contributor):

Problem with backward compatibility, here and above: when you add a new attribute, you should also modify the load function for the case where a user loads an old model (saved without this attribute) with new code (which expects it).

Reply (Author):

Is this where I should make changes?

        try:
            return super(Word2Vec, cls).load(*args, **kwargs)
        except AttributeError:
            logger.info('Model saved using code from earlier Gensim Version. Re-loading old model in a compatible way.')
            from gensim.models.deprecated.word2vec import load_old_word2vec
            return load_old_word2vec(*args, **kwargs)
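The snippet above catches AttributeError at load time and falls back to a legacy loader. Another common pattern for the reviewer's point (sketched here with a hypothetical helper, not gensim code) is to backfill attributes that old pickles lack with sensible defaults:

```python
import pickle

# Attributes added in newer code, with defaults for models pickled
# before they existed. The names here are illustrative.
OLD_MODEL_DEFAULTS = {"max_final_vocab": None}

def load_with_defaults(path):
    """Unpickle a model and backfill any attribute an old pickle lacks."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    for attr, default in OLD_MODEL_DEFAULTS.items():
        if not hasattr(model, attr):
            setattr(model, attr, default)
    return model
```

With this approach the rest of the code can assume the attribute always exists, instead of guarding every access.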

self.max_vocab = max_vocab

def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
"""Do an initial scan of all words appearing in sentences."""
@@ -1176,6 +1182,25 @@ def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
)
corpus_count = sentence_no + 1
self.raw_vocab = vocab

if self.use_max_vocab:
import operator

if self.max_vocab is None:
raise ValueError('max_vocab not defined')

sorted_vocab = sorted(vocab.items(), key=operator.itemgetter(1), reverse=True)
curr_count = 0
final_vocab = {}
for item in sorted_vocab:
curr_count += item[1]
if curr_count < self.max_vocab:
final_vocab[item[0]] = item[1]
else:
break

self.raw_vocab = final_vocab

return total_words, corpus_count

def sort_vocab(self, wv):
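At this commit, scan_vocab keeps the most frequent words whose cumulative raw count stays under max_vocab; a later commit renames the parameter to max_final_vocab and changes the semantics to cap the number of distinct words instead. The trimming step from the diff, isolated as a plain function with illustrative names:

```python
import operator

def trim_by_cumulative_count(raw_vocab, max_vocab):
    """Mirror of the scan_vocab diff above: walk words from most to least
    frequent, keeping each one while the running total of raw counts
    stays strictly below max_vocab."""
    final_vocab = {}
    curr_count = 0
    for word, count in sorted(raw_vocab.items(),
                              key=operator.itemgetter(1), reverse=True):
        curr_count += count
        if curr_count >= max_vocab:
            break
        final_vocab[word] = count
    return final_vocab
```

Note the strict comparison: a word whose count pushes the running total to exactly max_vocab is dropped, since the diff keeps a word only when curr_count < max_vocab.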