[WIP] Computes training loss for Word2Vec model (fixes issue #999) #1201

Merged
Changes from 26 commits
Commits
27 commits
a3b57f3
computes training loss for skip gram
chinmayapancholi13 Mar 9, 2017
0cfc672
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
chinmayapancholi13 Apr 29, 2017
501647c
synced word2vec.py with gensim_main
chinmayapancholi13 Apr 29, 2017
03fff61
removed unnecessary keep_vocab_item import
chinmayapancholi13 Mar 24, 2017
ed78b06
synced word2vec.py with gensim_main
chinmayapancholi13 Apr 29, 2017
dcd80f2
Merge remote-tracking branch 'refs/remotes/origin/develop' into develop
chinmayapancholi13 May 15, 2017
c455d18
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
chinmayapancholi13 May 16, 2017
64ececd
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
chinmayapancholi13 May 23, 2017
dcae99d
Merge branch 'word2vec_skipgram_loss' of https://github.com/chinmayap…
chinmayapancholi13 May 23, 2017
0939b32
PEP8 changes
chinmayapancholi13 May 23, 2017
8949749
added Python-only implementation for skip-gram model
chinmayapancholi13 May 24, 2017
d2620fd
updated param name to 'compute_loss'
chinmayapancholi13 May 24, 2017
4d01f78
removed 'raise ImportError' statement from prev commit
chinmayapancholi13 May 24, 2017
3fdd2e9
[WIP] partial changes for loss computation for skipgram case
chinmayapancholi13 Jun 12, 2017
e0fc9f2
[WIP] updated cython code
chinmayapancholi13 Jun 13, 2017
ca4aa69
added unit test for training loss computation
chinmayapancholi13 Jun 13, 2017
96f28fc
added loss computation for neg sampling
chinmayapancholi13 Jun 13, 2017
4a686de
removed unnecessary 'raise ImportError' stmt
chinmayapancholi13 Jun 13, 2017
5ab89b0
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
chinmayapancholi13 Jun 13, 2017
c3db4fa
added .c and .pyx to flake8 ignore list
chinmayapancholi13 Jun 13, 2017
4e8ecac
added loss computation for CBOW model in Python path
chinmayapancholi13 Jun 13, 2017
e71401a
added loss computation for CBOW model in Cython path
chinmayapancholi13 Jun 13, 2017
b80e183
PEP8 (F811) fix due to var 'prod'
chinmayapancholi13 Jun 13, 2017
cc6e0ea
updated w2v ipynb for training loss computation and benchmarking
Jun 29, 2017
8c84680
resolved merge conflict in 'flake8_diff.sh'
Jun 29, 2017
dda1911
updated .c files
Jun 29, 2017
0acd3d6
added benchmark results
Jun 29, 2017
2 changes: 1 addition & 1 deletion continuous_integration/travis/flake8_diff.sh
@@ -134,6 +134,6 @@ check_files() {
if [[ "$MODIFIED_FILES" == "no_match" ]]; then
echo "No file has been modified"
else
check_files "$(echo "$MODIFIED_FILES" )" "--ignore=E501,E731,E12,W503 --exclude=*.sh,*.md,*.yml,*.rst,*.ipynb,*.txt,*.csv,Dockerfile*"
check_files "$(echo "$MODIFIED_FILES" )" "--ignore=E501,E731,E12,W503 --exclude=*.sh,*.md,*.yml,*.rst,*.ipynb,*.txt,*.csv,*.vec,Dockerfile*,*.c,*.pyx"
Collaborator:
Are we sure .pyx should be here? I didn't see what kind of warnings flake8 was generating, but since Cython syntax is mostly Python and most of our enforceable conventions should still be in effect, we may want some style enforcement there.

Contributor:

@gojomo flake8 can't correctly check .pyx files.

Contributor (Author):

@gojomo We were getting errors like these from flake8:

[screenshots: flake8 errors reported for the .pyx files]

So although I agree that there is some style checking we might still want to do in .pyx files (in the Python-like code), I thought it would be better to ignore .pyx files in the flake8 tests, to avoid errors like the ones above.

Collaborator:

Ah, I see. There's an SO answer that implies it may be possible to turn off just certain warnings for .pyx files – https://stackoverflow.com/questions/31269527/running-pep8-or-pylint-on-cython-code – though the link to the full example file in that answer is broken.

Contributor (Author):

Thanks for sharing this link. :) I can try the config specified in that answer and check whether it turns off all the undesired warnings/errors.
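
For reference, a minimal sketch of what such a config could look like, assuming flake8's standard filename/ignore settings; the error codes listed are placeholders that would need to be replaced with the codes actually reported for the .pyx files, and it is not verified here that this silences every Cython-related complaint:

    [flake8]
    # Hypothetical sketch: also lint .pyx files, but silence the checks that
    # misfire on Cython-specific syntax (the codes below are placeholders,
    # not a verified list).
    filename = *.py,*.pyx
    ignore = E501,E731,E12,W503,E225,E226,E227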

fi
echo -e "No problem detected by flake8\n"
526 changes: 346 additions & 180 deletions docs/notebooks/word2vec.ipynb

Large diffs are not rendered by default.

450 changes: 225 additions & 225 deletions gensim/models/doc2vec_inner.c

Large diffs are not rendered by default.

63 changes: 49 additions & 14 deletions gensim/models/word2vec.py
@@ -140,7 +140,7 @@
FAST_VERSION = -1
MAX_WORDS_IN_BATCH = 10000

def train_batch_sg(model, sentences, alpha, work=None):
def train_batch_sg(model, sentences, alpha, work=None, compute_loss=False):
"""
Update skip-gram model by training on a sequence of sentences.

@@ -163,11 +163,12 @@ def train_batch_sg(model, sentences, alpha, work=None):
for pos2, word2 in enumerate(word_vocabs[start:(pos + model.window + 1 - reduced_window)], start):
# don't train on the `word` itself
if pos2 != pos:
train_sg_pair(model, model.wv.index2word[word.index], word2.index, alpha)
train_sg_pair(model, model.wv.index2word[word.index], word2.index, alpha, compute_loss=compute_loss)

result += len(word_vocabs)
return result

def train_batch_cbow(model, sentences, alpha, work=None, neu1=None):
def train_batch_cbow(model, sentences, alpha, work=None, neu1=None, compute_loss=False):
"""
Update CBOW model by training on a sequence of sentences.

@@ -190,7 +191,7 @@ def train_batch_cbow(model, sentences, alpha, work=None, neu1=None):
l1 = np_sum(model.wv.syn0[word2_indices], axis=0) # 1 x vector_size
if word2_indices and model.cbow_mean:
l1 /= len(word2_indices)
train_cbow_pair(model, word, word2_indices, l1, alpha)
train_cbow_pair(model, word, word2_indices, l1, alpha, compute_loss=compute_loss)
result += len(word_vocabs)
return result

@@ -255,7 +256,7 @@ def score_sentence_cbow(model, sentence, alpha, work=None, neu1=None):


def train_sg_pair(model, word, context_index, alpha, learn_vectors=True, learn_hidden=True,
context_vectors=None, context_locks=None):
context_vectors=None, context_locks=None, compute_loss=False):
if context_vectors is None:
context_vectors = model.wv.syn0
if context_locks is None:
@@ -273,12 +274,19 @@ def train_sg_pair(model, word, context_index, alpha, learn_vectors=True, learn_h
if model.hs:
# work on the entire tree at once, to push as much work into numpy's C routines as possible (performance)
l2a = deepcopy(model.syn1[predict_word.point]) # 2d matrix, codelen x layer1_size
fa = expit(dot(l1, l2a.T)) # propagate hidden -> output
prod_term = dot(l1, l2a.T)
fa = expit(prod_term) # propagate hidden -> output
ga = (1 - predict_word.code - fa) * alpha # vector of error gradients multiplied by the learning rate
if learn_hidden:
model.syn1[predict_word.point] += outer(ga, l1) # learn hidden -> output
neu1e += dot(ga, l2a) # save error

# loss component corresponding to hierarchical softmax
if compute_loss:
sgn = (-1.0)**predict_word.code # `ch` function, 0 -> 1, 1 -> -1
lprob = -log(expit(-sgn * prod_term))
model.running_training_loss += sum(lprob)

if model.negative:
# use this word (label = 1) + `negative` other random words not from this sentence (label = 0)
word_indices = [predict_word.index]
@@ -287,28 +295,40 @@ def train_sg_pair(model, word, context_index, alpha, learn_vectors=True, learn_h
if w != predict_word.index:
word_indices.append(w)
l2b = model.syn1neg[word_indices] # 2d matrix, k+1 x layer1_size
fb = expit(dot(l1, l2b.T)) # propagate hidden -> output
prod_term = dot(l1, l2b.T)
fb = expit(prod_term) # propagate hidden -> output
gb = (model.neg_labels - fb) * alpha # vector of error gradients multiplied by the learning rate
if learn_hidden:
model.syn1neg[word_indices] += outer(gb, l1) # learn hidden -> output
neu1e += dot(gb, l2b) # save error

# loss component corresponding to negative sampling
if compute_loss:
model.running_training_loss -= sum(log(expit(-1 * prod_term[1:]))) # for the sampled words
model.running_training_loss -= log(expit(prod_term[0])) # for the output word

if learn_vectors:
l1 += neu1e * lock_factor # learn input -> hidden (mutates model.wv.syn0[word2.index], if that is l1)
return neu1e


def train_cbow_pair(model, word, input_word_indices, l1, alpha, learn_vectors=True, learn_hidden=True):
def train_cbow_pair(model, word, input_word_indices, l1, alpha, learn_vectors=True, learn_hidden=True, compute_loss=False):
neu1e = zeros(l1.shape)

if model.hs:
l2a = model.syn1[word.point] # 2d matrix, codelen x layer1_size
fa = expit(dot(l1, l2a.T)) # propagate hidden -> output
prod_term = dot(l1, l2a.T)
fa = expit(prod_term) # propagate hidden -> output
ga = (1. - word.code - fa) * alpha # vector of error gradients multiplied by the learning rate
if learn_hidden:
model.syn1[word.point] += outer(ga, l1) # learn hidden -> output
neu1e += dot(ga, l2a) # save error

# loss component corresponding to hierarchical softmax
if compute_loss:
sgn = (-1.0)**word.code # ch function, 0-> 1, 1 -> -1
model.running_training_loss += sum(-log(expit(-sgn * prod_term)))

if model.negative:
# use this word (label = 1) + `negative` other random words not from this sentence (label = 0)
word_indices = [word.index]
@@ -317,12 +337,18 @@ def train_cbow_pair(model, word, input_word_indices, l1, alpha, learn_vectors=Tr
if w != word.index:
word_indices.append(w)
l2b = model.syn1neg[word_indices] # 2d matrix, k+1 x layer1_size
fb = expit(dot(l1, l2b.T)) # propagate hidden -> output
prod_term = dot(l1, l2b.T)
fb = expit(prod_term) # propagate hidden -> output
gb = (model.neg_labels - fb) * alpha # vector of error gradients multiplied by the learning rate
if learn_hidden:
model.syn1neg[word_indices] += outer(gb, l1) # learn hidden -> output
neu1e += dot(gb, l2b) # save error

# loss component corresponding to negative sampling
if compute_loss:
model.running_training_loss -= sum(log(expit(-1 * prod_term[1:]))) # for the sampled words
model.running_training_loss -= log(expit(prod_term[0])) # for the output word

if learn_vectors:
# learn input -> hidden, here for all words in the window separately
if not model.cbow_mean and input_word_indices:
@@ -365,7 +391,7 @@ def __init__(
self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH):
trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False):
"""
Initialize the model from an iterable of `sentences`. Each sentence is a
list of words (unicode strings) that will be used for training.
@@ -471,6 +497,8 @@ def __init__(
self.sorted_vocab = sorted_vocab
self.batch_words = batch_words
self.model_trimmed_post_training = False
self.compute_loss = compute_loss
self.running_training_loss = 0
if sentences is not None:
if isinstance(sentences, GeneratorType):
raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
@@ -754,9 +782,9 @@ def _do_train_job(self, sentences, alpha, inits):
work, neu1 = inits
tally = 0
if self.sg:
tally += train_batch_sg(self, sentences, alpha, work)
tally += train_batch_sg(self, sentences, alpha, work, self.compute_loss)
else:
tally += train_batch_cbow(self, sentences, alpha, work, neu1)
tally += train_batch_cbow(self, sentences, alpha, work, neu1, self.compute_loss)
return tally, self._raw_word_count(sentences)

def _raw_word_count(self, job):
@@ -766,7 +794,7 @@ def _raw_word_count(self, job):
def train(self, sentences, total_examples=None, total_words=None,
epochs=None, start_alpha=None, end_alpha=None,
word_count=0,
queue_factor=2, report_delay=1.0):
queue_factor=2, report_delay=1.0, compute_loss=None):
"""
Update the model's neural weights from a sequence of sentences (can be a once-only generator stream).
For Word2Vec, each sentence must be a list of unicode strings. (Subclasses may accept other examples.)
Expand All @@ -792,6 +820,10 @@ def train(self, sentences, total_examples=None, total_words=None,
self.neg_labels = zeros(self.negative + 1)
self.neg_labels[0] = 1.

if compute_loss:
self.compute_loss = compute_loss
self.running_training_loss = 0

logger.info(
"training model with %i workers on %i vocabulary and %i features, "
"using sg=%s hs=%s sample=%s negative=%s window=%s",
@@ -1423,6 +1455,9 @@ def save_word2vec_format(self, fname, fvocab=None, binary=False):
"""Deprecated. Use model.wv.save_word2vec_format instead."""
raise DeprecationWarning("Deprecated. Use model.wv.save_word2vec_format instead.")

def get_latest_training_loss(self):
return self.running_training_loss


class BrownCorpus(object):
"""Iterate over sentences from the Brown corpus (part of NLTK data)."""
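
A minimal usage sketch of the interface added by this PR (the compute_loss flag and the get_latest_training_loss() method, both taken from the diff above). Here 'corpus.txt' is a hypothetical file, and the exact keyword arguments train() expects may differ slightly between gensim versions:

    from gensim.models import word2vec

    # Any iterable of tokenized sentences works; 'corpus.txt' is a stand-in path.
    sentences = word2vec.LineSentence('corpus.txt')

    # Option 1: enable loss accumulation at construction time.
    model = word2vec.Word2Vec(sentences, size=100, min_count=5, compute_loss=True)
    print(model.get_latest_training_loss())

    # Option 2: enable it for an explicit train() call.
    model2 = word2vec.Word2Vec(size=100, min_count=5)  # no training yet
    model2.build_vocab(sentences)
    model2.train(sentences, total_examples=model2.corpus_count,
                 epochs=model2.iter, compute_loss=True)
    print(model2.get_latest_training_loss())

Per the diff, running_training_loss is reset to 0 when train() is called with compute_loss enabled, so the value returned by get_latest_training_loss() reflects the loss accumulated since that point and should be read right after training if a per-run figure is wanted.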