
Loss through each iteration in skip gram #999

Closed · tmylk opened this issue Nov 9, 2016 · 56 comments
Labels: "difficulty medium" (Medium issue: requires good gensim understanding & Python skills), "feature" (Issue describes a new feature)

tmylk (Contributor) commented Nov 9, 2016

Keep track of loss and output to log.
Frequent request on the mailing list

tmylk added the "feature" and "difficulty medium" labels on Nov 9, 2016
devashishd12 (Contributor) commented Nov 9, 2016

@tmylk I'll take this up. Will be able to get to it after some time though.

devashishd12 (Contributor) commented Nov 22, 2016

@tmylk I've started work on this but need some help on how to go forward. I was thinking of adding an enable_loss_logging parameter which would enable a code snippet in train_sg_pair() to logger.info() the loss. This would be an initialization parameter, False by default. Is this fine?

tmylk (Contributor, Author) commented Nov 24, 2016

Sounds good.

Also see #686

gojomo (Collaborator) commented Nov 25, 2016

INFO-logging the loss on each train_sg_pair() would lead to an excessive amount of log-lines, an immense slowdown in training, and only affect the pure-Python code-path.

Also, it's probably not what people asking for this want. I suspect what they really want is some form of either (1) total/average loss on one text-example (or over some range of examples) – as a sort of indicator of 'fit to model'; or (2) running/cumulative loss over a range of examples, perhaps up to a full training epoch, as a sort of readout on how well training is progressing.

I'd suggest asking a few people who've asked about this to better understand their need, or trying to copy the running-loss readouts offered by some of the alternate word2vec implementations out there. (I think fastText and maybe some TensorFlow word2vec examples display a 'loss' while training is in progress.)

devashishd12 (Contributor) commented:

@gojomo yes, I was also sceptical about the points you mentioned. I looked at the TensorFlow example implementation of word2vec; it seems they evaluate an NCE loss on each pair of words. What would be the best way to move forward?

dietmar commented Nov 30, 2016

I found this issue while looking for a way to get (2), i.e., cumulative loss after each epoch, in order to decide on the number of epochs and alpha. In my case actually for doc2vec.

devashishd12 (Contributor) commented:

@dietmar thanks! @gojomo @tmylk should I start working on this then? Just to confirm: I would need to replicate the code in both the .py and the .pyx files to affect both paths, right?

tmylk (Contributor, Author) commented Dec 1, 2016

Confirm.

gojomo (Collaborator) commented Dec 1, 2016

A cumulative loss per epoch seems useful (if not too much of a performance drag), but it would be worth continuing to look for other expressions of this need/capability to understand what people really want. (IIRC the fastText running-loss display updates more frequently than once per pass, but not after every example...)

RishabGargeya commented:

Was looking for this functionality to plot the graph of loss per epoch during training (option 2). Would definitely be useful to also evaluate a loss value when providing an example to a pretrained model. @dsquareindia Any update on this feature?

gojomo (Collaborator) commented Jan 5, 2017

@RishabGargeya - I believe 'loss value when providing an example to a pretrained model' is essentially what the existing score() function does, at least for the mode (Word2Vec/hs) where it is implemented. (It'd be interesting but perhaps not quite as theoretically grounded to extend that to work in all modes of Word2Vec/Doc2Vec.)
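
For illustration, a minimal sketch of using the existing score() that way (the toy sentences are placeholders, and score() is only implemented for hierarchical-softmax models):

from gensim.models import Word2Vec

# score() requires a hierarchical-softmax skip-gram model (hs=1, negative=0)
sentences = [["human", "interface", "computer"], ["survey", "user", "system"]]
model = Word2Vec(sentences, sg=1, hs=1, negative=0, min_count=1)

# one log-probability per sentence, under the trained model
log_probs = model.score(sentences, total_sentences=len(sentences))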

chinmayapancholi13 (Contributor) commented:

Hey @tmylk, I would like to work on this issue. I have gone through the relevant code in the train_sg_pair function and also seen the earlier discussions in this thread as well as the Google group mailing list. This is the solution that I think could work.
We can add a boolean parameter enable_loss_logging to the function train_sg_pair to toggle whether we want to log the loss. The corresponding changes to compute and display the loss value could look something like:

# module-level names assumed from gensim's word2vec.py:
import logging
from copy import deepcopy

from numpy import zeros, dot, outer, log, sum as np_sum
from scipy.special import expit

logger = logging.getLogger(__name__)


def train_sg_pair(model, word, context_index, alpha, learn_vectors=True, learn_hidden=True,
                  context_vectors=None, context_locks=None, enable_loss_logging=False):
    if context_vectors is None:
        context_vectors = model.wv.syn0
    if context_locks is None:
        context_locks = model.syn0_lockf

    if word not in model.wv.vocab:
        return
    predict_word = model.wv.vocab[word]  # target word (NN output)

    l1 = context_vectors[context_index]  # input word (NN input/projection layer)
    lock_factor = context_locks[context_index]

    neu1e = zeros(l1.shape)

    if enable_loss_logging:
        train_error_value = 0.0

    if model.hs:
        # work on the entire tree at once, to push as much work into numpy's C routines as possible (performance)
        l2a = deepcopy(model.syn1[predict_word.point])  # 2d matrix, codelen x layer1_size
        fa = expit(dot(l1, l2a.T))  # propagate hidden -> output
        ga = (1 - predict_word.code - fa) * alpha  # vector of error gradients multiplied by the learning rate
        if learn_hidden:
            model.syn1[predict_word.point] += outer(ga, l1)  # learn hidden -> output
        neu1e += dot(ga, l2a)  # save error

        if enable_loss_logging:
            # map Huffman codes {0, 1} to signs {+1, -1}, so each loss term
            # below is -log(sigmoid(+/- dot(l1, syn1[node])))
            sign_l2a = 1. - predict_word.code
            sign_l2a[sign_l2a == 0.0] = -1.0

            sign_adjusted_l2a = sign_l2a.reshape(len(sign_l2a), 1) * model.syn1[predict_word.point]
            train_error_value -= np_sum(log(expit(dot(l1, sign_adjusted_l2a.T))))

    if model.negative:
        # use this word (label = 1) + `negative` other random words not from this sentence (label = 0)
        word_indices = [predict_word.index]
        while len(word_indices) < model.negative + 1:
            w = model.cum_table.searchsorted(model.random.randint(model.cum_table[-1]))
            if w != predict_word.index:
                word_indices.append(w)
        l2b = model.syn1neg[word_indices]  # 2d matrix, k+1 x layer1_size
        prod_term = dot(l1, l2b.T)
        fb = expit(prod_term)  # propagate hidden -> output
        gb = (model.neg_labels - fb) * alpha  # vector of error gradients multiplied by the learning rate
        if learn_hidden:
            model.syn1neg[word_indices] += outer(gb, l1)  # learn hidden -> output
        neu1e += dot(gb, l2b)  # save error

        if enable_loss_logging:
            # negative-sampling loss: -log(sigmoid(pos score)) - sum(log(sigmoid(-neg scores)))
            train_error_value -= np_sum(log(expit(-prod_term[1:])))
            train_error_value -= log(expit(prod_term[0]))
            logger.info("current training error : %f", train_error_value)

    if learn_vectors:
        l1 += neu1e * lock_factor  # learn input -> hidden (mutates model.wv.syn0[word2.index], if that is l1)
    return neu1e

Here, I have computed the loss value for the particular pair for which the function is called (the error has two components, corresponding to hierarchical softmax and negative sampling).

However, as mentioned by @gojomo above, printing the loss value for each pair would lead to an excessive number of log lines. To take care of this, we could add a parameter print_freq to the function train_batch_sg to specify how often we log the loss value; if print_freq is 10000 (say), we would log the loss after every 10000 pairs. (A sketch of this throttling appears below.)

Could you please guide me regarding this approach and tell me if I am on the right track, so that it can be refined further? Also, this only affects the Python path, so if the approach is correct I will make the appropriate changes to cover the other paths as well.
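
A minimal, self-contained sketch of that throttling (all names hypothetical; loss_iter stands in for the stream of per-pair loss values that train_sg_pair would have to return):

import logging

logger = logging.getLogger(__name__)

def log_running_loss(loss_iter, print_freq=10000):
    # accumulate per-pair losses, logging only every `print_freq` pairs
    running_loss, pairs_seen = 0.0, 0
    for loss in loss_iter:
        running_loss += loss
        pairs_seen += 1
        if pairs_seen % print_freq == 0:
            logger.info("training loss after %i pairs: %f", pairs_seen, running_loss)
    return running_loss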

tmylk (Contributor, Author) commented Mar 9, 2017

@chinmayapancholi13 Thanks for your suggestion. Could you please add comments to your code and submit a PR? It's easier to leave comments that way.
Also what is the reason for not using the existing score_sentence_* code?

gojomo (Collaborator) commented Mar 9, 2017

@chinmayapancholi13 - I would ask some of the people who've requested variants of this for what kind of info they expect. I'm not sure if it's just an occasional log-line of a single skip-gram example – it may be more of a running-average, including something that can be read from the model, as opposed to just logged.

@tmylk - the score code may work as a model for inspiration, but using it directly wouldn't provide the during-bulk-training running-indication that people seem to want.

Also as a general note, to truly address this feature need, it should be an option for CBOW mode as well, and work when using the cython-optimized paths. And, for Doc2Vec.

chinmayapancholi13 (Contributor) commented:

@gojomo I agree. But the previous relevant discussions (both on GitHub and on the Google mailing list) that I have come across haven't been conclusive. Once we decide exactly how the loss value should be used, we can progress further.

gojomo (Collaborator) commented Mar 9, 2017

Yes, more clarity is required. I think that only people who've been positively craving this feature, for specific uses, can provide it – so we've got to find them and have them express their motivations in more detail. Otherwise we may dream up something that's a waste of effort, or even worse, an extra bit of drag/complication in the code that fits no-one's actual need.

tmylk (Contributor, Author) commented Mar 13, 2017

Anything that helps users decide on the number of epochs and alpha is useful.

Two people in this thread and one person on the mailing list have requested a "cumulative loss after each epoch". @gojomo What more information is needed from them? Starting with log output is a good first step. If there are later requests to store it in a dict, Keras-style, that is an easy update.

gojomo (Collaborator) commented Mar 13, 2017

"Cumulative loss after each epoch" is a good start at more specifics! Do they want it just logged, or stored somewhere in the model for programmatic access? (I think many times people want this to drive their own alpha-decisions or choice-to-stop.) Is a comparable value printed from another implementation (like say fasttext), such that when running in a similar mode, we'd receive confirmation the number was meaningful if it was similar from the other code?

tmylk (Contributor, Author) commented Mar 13, 2017

Let's just have it logged at first. We don't yet have any standard way in gensim to report back on loss/perplexity during training. Getting a dict back as in Keras would be good, but that's another PR/issue.

chinmayapancholi13 (Contributor) commented:

@tmylk @gojomo Here we want to compute the loss value for each epoch, i.e. the granularity we are looking for is one epoch. However, for training the model we currently first create jobs (using the function job_producer), where each job is a collection of sentences with at most MAX_WORDS_IN_BATCH words, and then self.workers threads take these jobs from the job queue and train the model. Doing it this way, there is no demarcation between epochs: an individual job can contain sentences spanning different epochs, and we currently have no way to tell them apart. This suffices for training, where we only care about one sentence at a time, but it is a problem when we want to sum the training loss for all sentences epoch-wise.

So I suggest that while adding each sentence to a job batch, we attach a marker to the sentence indicating which epoch it belongs to (see the sketch below). Then, after calculating the training loss for the sentence, we can add that loss to the total for the sentence's epoch.

Does this sound good to you? Please let me know if I am missing / misunderstanding anything here.
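
A rough, hypothetical sketch of the marker idea (names invented for illustration; the real job_producer batches sentences rather than yielding them one at a time):

from collections import defaultdict

def produce_jobs(sentences, epochs):
    # tag each sentence with the epoch it belongs to, so the marker
    # travels with the job to the worker threads
    for epoch in range(epochs):
        for sentence in sentences:
            yield (epoch, sentence)

def accumulate_epoch_losses(jobs, sentence_loss):
    # sum each sentence's training loss into its epoch's bucket
    epoch_losses = defaultdict(float)
    for epoch, sentence in jobs:
        epoch_losses[epoch] += sentence_loss(sentence)
    return epoch_losses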

gojomo (Collaborator) commented Mar 23, 2017

I'm not sure that the people who want this need it strictly "per epoch" as opposed to "recently" or "for the last call to train()", etc. – so before implementation focuses too much on that level-of-granularity, I'd prefer more feedback from people eager to use this value.

Currently we don't provide any way to perform reporting/other-tasks between epochs, unless the user is calling train() repeatedly themselves. The idea of a callback for this is mentioned in #1139; making it work would require some refactoring of the already somewhat twisty current worker-thread code.

tmylk (Contributor, Author) commented Mar 23, 2017

Let's log it for the last call to train() first, and then see if there is any feedback after that.

chinmayapancholi13 (Contributor) commented:

@tmylk So does "last call to train()" mean that we sum the loss for all the epochs together and log this overall value?

dietmar commented Mar 24, 2017

IMHO it is reasonable that you can only get this if you have your own loop that repeatedly calls train(). The "shortcut" of just calling the constructor, which does everything for you, is there for convenience; people who want more control or more detail can live with giving up some of that convenience.

I would prefer the possibility of getting the loss value programmatically over just producing a log line, because I would want to plot the loss over time. Otherwise I'd have to parse the log lines. Also possible, but not as elegant.

Concretely, I'd like to be able to do something like this:

d2v = Doc2Vec(iter=1)
d2v.build_vocab(my_data_generator())
losses = []
alpha = ALPHASTART
for epoch in range(EPOCHS):
    d2v.train(my_data_generator(), alpha=alpha)
    losses.append(d2v.get_latest_loss())  # get_latest_loss() is the accessor I'd like to have
    my_plot_function(losses)
    alpha = my_update_alpha_method(alpha, epoch)

Because iter is 1, "for the last call to train()" and "per epoch" are the same thing here, so: great.

@chinmayapancholi13 good question, I don't know what's the most reasonable thing to do when people call train() with iter > 1. For my purposes it doesn't matter, because I wouldn't do that.

gojomo (Collaborator) commented Mar 25, 2017

@chinmayapancholi13 If some sort of running-average loss is simply accumulated in one place, oblivious to the epoch, whatever that value is when train() ends may be what's of use. (There's no epoch-specific losses to be summed.)

menshikh-iv pushed a commit that referenced this issue Jun 29, 2017

* computes training loss for skip gram
* synced word2vec.py with gensim_main
* removed unnecessary keep_bocab_item import
* synced word2vec.py with gensim_main
* PEP8 changes
* added Python-only implementation for skip-gram model
* updated param name to 'compute_loss'
* removed 'raise ImportError' statement from prev commit
* [WIP] partial changes for loss computation for skipgram case
* [WIP] updated cython code
* added unit test for training loss computation
* added loss computation for neg sampling
* removed unnecessary 'raise ImportError' stmt
* added .c and .pyx to flake8 ignore list
* added loss computation for CBOW model in Python path
* added loss computation for CBOW model in Cython path
* PEP8 (F811) fix due to var 'prod'
* updated w2v ipynb for training loss computation and benchmarking
* updated .c files
* added benchmark results

menshikh-iv (Contributor) commented:

Fixed in #1201

gojomo (Collaborator) commented Jun 29, 2017

As noted in my comment on #1201, I have no confidence this implementation actually provides value to the people who requested it. It's more like exploratory progress, a first step toward meeting the professed need.

menshikh-iv (Contributor) commented:

I agree with you @gojomo. I hope we will discuss this feature with the people interested in it.

jmhessel (Contributor) commented Jul 6, 2017

By the way, in case it is of interest (I actually spent several hours debugging this): for computing the HSM loss, I've noticed that numpy's logaddexp function is far more numerically stable than computing log(exp(x) + exp(y)) directly. In my fork of gensim I was trying to get the numpy/cython outputs to match, and it turned out that, at least in my case, I had to use a more numerically stable version to do so. I used the one here: http://software.ligo.org/docs/lalsuite/lalinference/logaddexp_8h_source.html

jmhessel (Contributor) commented Jul 6, 2017

If you want to see how I implemented the numerically stable version in cython (matching the numpy output), you can check out my scoring functions here. Without the numerically stable log-add-exp, I've found that the numpy and cython paths can give extremely different answers; with it, they match to about 5-6 significant figures. You could modify the scoring function to use a precomputed log table if runtime is a concern, but this works for me.

chinmayapancholi13 (Contributor) commented:

Hi @jmhessel! Yes, while working on #1201 I did notice a difference between the values obtained from the Cython and Python paths. However, I attributed it to the fact that we use the precomputed tables (EXP_TABLE and LOG_TABLE) rather than numpy/scipy functions like logaddexp, expit and log (as you mention above), plus the randomness issues discussed earlier in this thread.
I see that you have also created #1472 related to this. I'll be following the comments there as well as on this issue. If required, I'll be happy to make a change to resolve this difference in the score and loss values.

jmhessel (Contributor) commented Jul 6, 2017

I'll be looking into it more. It might result from the tables. But the output log probabilities differ by as much as 1.0 per token! That's quite a bit if you're trying to do a perplexity evaluation.

jmhessel (Contributor) commented Jul 6, 2017

BTW -- I implemented all of the doc2vec scoring functions in numpy and cython. The outputs match when I use the more numerically stable version.

chinmayapancholi13 (Contributor) commented:

Calling logaddexp many times would be slower than looking those values up in an already-populated table, IMO. Still, if there is a big difference in the score/loss values obtained, this probably needs to be updated.

Also, if I understand correctly, the output score values match (to ~5-6 significant digits) when you use the logaddexp function in the Cython code as well (as is done in the Python code), instead of EXP_TABLE and LOG_TABLE. Is that change alone sufficient to get matching values?

BTW, have you checked whether, for some cases in your code path, conditions like "if f <= -MAX_EXP or f >= MAX_EXP" become true? If so, then because of the continue statement, the updates to work[0] would be skipped for those iterations.

jmhessel (Contributor) commented Jul 6, 2017

Let's think... this is the version of logaddexp that matches numpy, and it's always called with x=0.

cdef REAL_t logaddexp(REAL_t x, REAL_t y) nogil:
    cdef REAL_t tmp
    tmp = x - y
    if tmp > 0:
        return x + log1p(exp(-tmp))
    elif tmp <= 0:
        return y + log1p(exp(tmp))
    else:
        return x + y  # only reached when tmp is NaN

So tmp is always -y, which can be positive or negative depending on the dot product. log1p returns the log of 1 plus its argument. So there could be precomputed tables for log1p(exp(x)) for a variety of x. Or there may be some simplification possible here, given that x=0 is always true.
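
For example, since x is always 0, logaddexp(0, y) reduces to the numerically stable softplus, log(1 + exp(y)) — a sketch in plain Python:

from math import exp, log1p

def logaddexp0(y):
    # logaddexp(0, y) == log(1 + exp(y)), computed without overflow
    if y > 0:
        return y + log1p(exp(-y))  # exp(-y) <= 1 for y > 0
    return log1p(exp(y))           # exp(y) <= 1 for y <= 0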

jmhessel (Contributor) commented Jul 6, 2017

And the difference exists even for the MAX_EXP checking.

chinmayapancholi13 (Contributor) commented:

@jmhessel I was trying out the logaddexp function you shared above, and I was curious: when you say "The outputs match when I use the more numerically stable version", what corpus/data are you using?

jmhessel (Contributor) commented Jul 7, 2017

I don't think the corpus matters too much -- I modified word2vec/doc2vec.py's scoring function to call both the numpy and cython versions for each input sentence. Without the more stable version, the outputs matched to around 1-3 significant figures; with it, that became 4-6. When you sum over a large number of sentences these differences add up. So they never match perfectly, but this helps a lot.

saparina pushed a commit to saparina/gensim that referenced this issue Jul 9, 2017 (same commit list as above)

dietmar commented Jul 26, 2017

I have played with the new functionality implemented in #1201. Unfortunately, it is not yet implemented for doc2vec (right?), so I experimented with word2vec. Generally, it does what I wanted, so great job and thank you!

I did find a weird problem though: with small learning rates, the loss actually increases over time. Here is my test script with some description and plots: https://github.com/dietmar/gensim_word2vec_loss. It uses my German One Million Posts corpus.

Am I using the new functionality right? It shouldn't happen that too small a learning rate makes the algorithm diverge, right? My expectation is that if the learning rate is too small, the loss still goes down, just very slowly, so that you would need a huge number of epochs to get anywhere.
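
For reference, a minimal sketch of the kind of per-epoch loop I mean (my_corpus and EPOCHS are placeholders):

from gensim.models import Word2Vec

model = Word2Vec(my_corpus, compute_loss=True, iter=1)  # tally losses from the start
losses = [model.get_latest_training_loss()]
for epoch in range(EPOCHS - 1):
    model.train(my_corpus, total_examples=model.corpus_count,
                epochs=1, compute_loss=True)
    # the value appears to be a running tally across train() calls, so the
    # per-epoch loss is the difference between successive readings
    losses.append(model.get_latest_training_loss())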

gojomo (Collaborator) commented Jul 26, 2017

@dietmar Thanks for testing! The rise in summed error is odd – I would suspect some problem with alpha management (though it looks roughly correct from a quick look at your script). Perhaps have the test script show the effective alpha values being used for each train() call? Also, given how you've precalculated the alpha ranges for each pass, you could be using a smoother alpha->min_alpha decay each pass, though I don't know if that'd make much difference in practice.

It could be interesting to probe for what ranges of starting-alphas, in your setup, show the expected error-improvement, versus the unexpected error-worsening.

I also think that if the corpus itself was non-randomly ordered – for example all short docs, or all positive-sentiment docs, etc at the front or back, making early and late documents very different from each other – then there might be non-intuitive effects on error trends, and those might be more pronounced if operating always at a tiny learning-rate. (Still, I wouldn't expect the reverse trend of the magnitude your logs show.) Shuffling the texts, even if just once before all training, might rule out this sort of effect.

dietmar commented Jul 27, 2017

@gojomo Some additional findings from preparing the above:

  • Keeping alpha fixed across all epochs still showed the increasing behavior
  • My corpus data is ordered chronologically (as users submitted their comments), so quasi-randomly ordered in terms of document properties
  • I also tried feeding in the English Moby Dick from Project Gutenberg (about 200k words), which resulted in the same behavior
  • skip-gram (sg=1) does not seem to show this behavior

Yes, printing the alpha values is a good idea. What do you mean by "smoother" alpha decay? Logarithmic instead of linear decay?

I might test some more if/when I have time.

gojomo (Collaborator) commented Jul 27, 2017

Re: alpha

If you were to let the Word2Vec/Doc2Vec train() method do all the iterations for you, it would linearly decay the effective rate from the starting 0.025 value to the ending 0.0001 value with each internal batch fed to a worker thread. These batches are typically/preferably smaller than one full pass, so training in fact occurs at many, many steps between those values. Your code uses exactly 5 step-values, and the entire last epoch uses the 0.001 value, rather than (in the train()-natively-managed case) just the last batch being at 0.001+epsilon. This may not make a big difference, or explain the rising error, but you could call train() each time with an interval (like start_alpha=0.025, end_alpha=0.020 on the 1st pass, etc.), so that each epoch is still linear rather than a single value.
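
A rough sketch of that per-pass interval idea (EPOCHS, corpus and model are placeholders; start_alpha/end_alpha are train() parameters):

alpha_start, alpha_end = 0.025, 0.001
step = (alpha_start - alpha_end) / EPOCHS
for epoch in range(EPOCHS):
    # each pass decays linearly over its own slice of the overall interval
    a = alpha_start - epoch * step
    b = alpha_start - (epoch + 1) * step
    model.train(corpus, total_examples=model.corpus_count, epochs=1,
                start_alpha=a, end_alpha=b)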

That skip-gram doesn't show the behavior is curious; is a contrast with CBOW evident across many choices of alpha/window/iterations/etc? (If so, maybe a CBOW-specific tallying bug.)

dietmar commented Jul 28, 2017

I see, thanks for the explanation with regard to alpha.

I will need to experiment more to make a confident statement about skip-gram vs. CBOW, stay tuned.

giahung24 commented Nov 17, 2017

Hello,

I have read the whole thread; may I ask why compute_loss can't be added to the doc2vec model?

I am working with doc2vec and really need to know the loss, so I tried to modify fast_document_dm_neg() by following fast_sentence_cbow_neg(), but it doesn't work easily that way.

Can you tell me why this feature can't be added to doc2vec (yet)?
(I am using gensim 3.0.1 from conda)

Thank you.

menshikh-iv (Contributor) commented:

@chinmayapancholi13, this question is for you.

jonrosner commented Feb 26, 2018

Hi,

I have the same question as @giahung24. Is it now possible to access the loss of a doc2vec model, e.g. via callbacks?

(Using 3.3.0 on macOS)

menshikh-iv (Contributor) commented:

@JohnRGermany yes, the callback receives the model as an input parameter, so you have access to all its fields (including the running_training_loss that you need).
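
A minimal sketch (the model must be created with compute_loss=True for the tally to be populated; my_corpus is a placeholder):

from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class LossLogger(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        # running_training_loss is a cumulative tally across epochs
        print("epoch %i: running loss %.2f" % (self.epoch, model.running_training_loss))
        self.epoch += 1

model = Word2Vec(my_corpus, compute_loss=True, callbacks=[LossLogger()])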

jonrosner commented:

@menshikh-iv but there is no running_training_loss in doc2vec yet, right?

menshikh-iv (Contributor) commented:

@JohnRGermany right, this is only for w2v.
