Loss through each iteration in skip gram #999
Comments
@tmylk I'll take this up. I will be able to get to it after some time, though.
@tmylk I've started work on this but need some help on how to go forward. I was thinking of having an […]
Sounds good. Also see #686.
INFO-logging the loss on each […]. Also, it's probably not what people asking for this want. I suspect what they really want is some form of either (1) total/average loss on one text-example (or over some range of examples) – as a sort of indicator of 'fit to model'; or (2) running/cumulative loss over a range of examples, perhaps up to a full training epoch, as a sort of readout on how well training is progressing. I'd suggest asking a few people who've asked about this to better understand their need, or trying to copy the running-loss readouts offered by some of the alternate word2vec implementations out there. (I think fastText and maybe some TensorFlow word2vec examples display a 'loss' while training is in progress.)
@gojomo Yes, I was also sceptical about the points you mentioned. I saw the TensorFlow example implementation of word2vec. It seems they evaluate […]
I found this issue while looking for a way to get (2), i.e. cumulative loss after each epoch, in order to decide on the number of epochs and alpha. In my case it's actually for doc2vec.
Confirmed.
A cumulative loss per epoch seems useful (if not too much of a performance drag), but it may be useful to keep looking for other expressions of this need/capability to understand what people really want. (IIRC the fastText running-loss display updates more frequently than once-per-pass, but not after every example...)
Was looking for this functionality to plot the graph of loss per epoch during training (option 2). It would definitely also be useful to evaluate a loss value when providing an example to a pretrained model. @dsquareindia Any update on this feature?
@RishabGargeya - I believe 'loss value when providing an example to a pretrained model' is essentially what the existing […]
Hey @tmylk, I would like to work on this issue. I have gone through the relevant code in `train_sg_pair`:

```python
import logging
from copy import deepcopy

from numpy import dot, log, outer, zeros, sum as np_sum
from scipy.special import expit

logger = logging.getLogger(__name__)


def train_sg_pair(model, word, context_index, alpha, learn_vectors=True, learn_hidden=True,
                  context_vectors=None, context_locks=None, enable_loss_logging=False):
    if context_vectors is None:
        context_vectors = model.wv.syn0
    if context_locks is None:
        context_locks = model.syn0_lockf

    if word not in model.wv.vocab:
        return
    predict_word = model.wv.vocab[word]  # target word (NN output)

    l1 = context_vectors[context_index]  # input word (NN input/projection layer)
    lock_factor = context_locks[context_index]

    neu1e = zeros(l1.shape)
    if enable_loss_logging:
        train_error_value = 0

    if model.hs:
        # work on the entire tree at once, to push as much work into numpy's C routines as possible (performance)
        l2a = deepcopy(model.syn1[predict_word.point])  # 2d matrix, codelen x layer1_size
        fa = expit(dot(l1, l2a.T))  # propagate hidden -> output
        ga = (1 - predict_word.code - fa) * alpha  # vector of error gradients multiplied by the learning rate
        if learn_hidden:
            model.syn1[predict_word.point] += outer(ga, l1)  # learn hidden -> output
        neu1e += dot(ga, l2a)  # save error
        if enable_loss_logging:
            # map Huffman code bits {1, 0} to signs {+1, -1} for the loss terms
            sign_l2a = 1. - predict_word.code
            sign_l2a[sign_l2a == 0.0] = -1.0
            sign_adjusted_l2a = sign_l2a.reshape(len(sign_l2a), 1) * deepcopy(model.syn1[predict_word.point])
            train_error_value -= np_sum(log(expit(dot(l1, sign_adjusted_l2a.T))))

    if model.negative:
        # use this word (label = 1) + `negative` other random words not from this sentence (label = 0)
        word_indices = [predict_word.index]
        while len(word_indices) < model.negative + 1:
            w = model.cum_table.searchsorted(model.random.randint(model.cum_table[-1]))
            if w != predict_word.index:
                word_indices.append(w)
        l2b = model.syn1neg[word_indices]  # 2d matrix, k+1 x layer1_size
        prod_term = dot(l1, l2b.T)
        fb = expit(prod_term)  # propagate hidden -> output
        gb = (model.neg_labels - fb) * alpha  # vector of error gradients multiplied by the learning rate
        if learn_hidden:
            model.syn1neg[word_indices] += outer(gb, l1)  # learn hidden -> output
        neu1e += dot(gb, l2b)  # save error
        if enable_loss_logging:
            # loss terms for the negative samples (label 0) and the true word (label 1)
            train_error_value -= np_sum(log(expit(-prod_term[1:])))
            train_error_value -= log(expit(prod_term[0]))

    if enable_loss_logging:
        logger.info("current training error : %f", train_error_value)

    if learn_vectors:
        l1 += neu1e * lock_factor  # learn input -> hidden (mutates model.wv.syn0[word2.index], if that is l1)
    return neu1e
```

Here, I have computed the loss value for the particular pair for which the function is called (the error has two components, corresponding to hierarchical softmax and negative sampling). However, as mentioned by @gojomo above, printing the loss value for each pair would lead to an excessive number of log lines. So to take care of this, we could add a parameter […]. Could you please guide me regarding this approach and tell me if I am on the right track, so that it can be refined further? Also, this only affects the Python path, so if the above approach is correct I will make the appropriate changes so that the other paths are handled as well.
@chinmayapancholi13 Thanks for your suggestion. Could you please add comments to your code and submit a PR? It's easier to leave comments that way.
@chinmayapancholi13 - I would ask some of the people who've requested variants of this what kind of info they expect. I'm not sure if it's just an occasional log line for a single skip-gram example – it may be more of a running average, including something that can be read from the model, as opposed to just logged. @tmylk - the […]. Also, as a general note: to truly address this feature need, it should be an option for CBOW mode as well, work when using the cython-optimized paths, and also cover Doc2Vec.
@gojomo I agree. But the previous relevant discussions (both on GitHub and on the Google mailing list) that I have come across haven't been conclusive. As soon as we decide on the exact manner in which the loss value should be used, we can progress further.
Yes, more clarity is required. I think that only people who've been positively craving this feature, for specific uses, can provide it – so we've got to find them and have them express their motivations in more detail. Otherwise we may dream up something that's a waste of effort, or even worse, an extra bit of drag/complication in the code that fits no-one's actual need.
Anything that helps users decide on the number of epochs and alpha is useful. Two people in this thread and one person on the mailing list have requested a "cumulative loss after each epoch". @gojomo What more information is needed from them? Starting with log output is a good first step. If there are later requests to store it Keras-style as a dict, then that is an easy update.
"Cumulative loss after each epoch" is a good start at more specifics! Do they want it just logged, or stored somewhere in the model for programmatic access? (I think people often want this to drive their own alpha decisions or choice to stop.) Is a comparable value printed by another implementation (like, say, fastText), such that when running in a similar mode we'd get confirmation the number was meaningful if it matched the other code?
Let's just have it logged at first. We don't have any other standard way yet in gensim to report back on loss/perplexity during training. Getting a dict back as in Keras would be good, but that's another PR/issue.
@tmylk @gojomo Here, we want to compute the loss value for EACH EPOCH, i.e. the granularity we are looking for is one epoch. However, currently for training the model we first create jobs (using the function […]). So, I suggest that while adding each sentence to a job batch, we could attach some additional data/marker which determines which epoch the sentence belongs to. Then, after calculating the training loss for the sentence, we could add this loss value to the total for the sentence's corresponding epoch. Does this sound good to you? Please let me know if I am missing or misunderstanding anything here.
I'm not sure that the people who want this need it strictly "per epoch" as opposed to "recently" or "for the last call to `train()`". Currently we don't provide any way to perform reporting or other tasks between epochs, unless the user is calling `train()` in their own loop.
Let's log it for the last call to `train()`.
@tmylk So does "last call to `train()`" mean […]?
IMHO it is reasonable that you can only get this if you have your own loop that repeatedly calls `train()`. I would prefer the possibility of getting the loss value programmatically over just producing a log line, because I would want to plot the loss over time. Otherwise I'd have to parse the log lines; also possible, but not as elegant. Concretely, I'd like to be able to do something like this:
Because […]. @chinmayapancholi13, good question; I don't know what the most reasonable thing to do is when people call […].
@chinmayapancholi13 If some sort of running-average loss is simply accumulated in one place, oblivious to the epoch, then whatever that value is when […]
* computes training loss for skip gram
* synced word2vec.py with gensim_main
* removed unnecessary keep_bocab_item import
* synced word2vec.py with gensim_main
* PEP8 changes
* added Python-only implementation for skip-gram model
* updated param name to 'compute_loss'
* removed 'raise ImportError' statement from prev commit
* [WIP] partial changes for loss computation for skipgram case
* [WIP] updated cython code
* added unit test for training loss computation
* added loss computation for neg sampling
* removed unnecessary 'raise ImportError' stmt
* added .c and .pyx to flake8 ignore list
* added loss computation for CBOW model in Python path
* added loss computation for CBOW model in Cython path
* PEP8 (F811) fix due to var 'prod'
* updated w2v ipynb for training loss computation and benchmarking
* updated .c files
* added benchmark results
Fixed in #1201
As noted in my comment on #1201, I have no confidence this implementation actually provides value to the people who requested it. It's more like exploratory progress that's a 1st step to meeting the professed need.
I agree with you @gojomo. I hope we will discuss this feature with interested persons.
By the way, in case it is of interest (I spent several hours debugging this, actually): for computing the HSM loss, I've noticed that numpy's `logaddexp` function is far more numerically stable than doing `log(exp(x) + exp(y))`. In my fork of gensim, I was trying to get the numpy/cython outputs to match, and it turns out that, at least in my case, I had to use a more numerically stable version. I used the one here: http://software.ligo.org/docs/lalsuite/lalinference/logaddexp_8h_source.html
If you want to see how I implemented the numerically stable version in cython (which matches the numpy output), you can check out my scoring functions. If you don't use the numerically stable version of log-add-exp, I've found that you can get extremely different answers from numpy/cython; this change makes them match (to ~5-6 significant figures). You could modify the scoring function to use a precomputed log table if runtime is a concern, but this works for me.
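The instability is easy to reproduce with plain numpy. This sketch compares the naive formula with the shifted form (mathematically equivalent to `numpy.logaddexp`), using large negative arguments of the kind that arise when summing log-probabilities:

```python
import numpy as np

def naive_logaddexp(x, y):
    # underflows: exp() of a large negative argument rounds to 0,
    # so the log becomes -inf even though the true result is finite
    return np.log(np.exp(x) + np.exp(y))

def stable_logaddexp(x, y):
    # factor out the max so the remaining exponent is always <= 0
    m = np.maximum(x, y)
    return m + np.log1p(np.exp(-np.abs(x - y)))

print(naive_logaddexp(-2.0, -3.0), np.logaddexp(-2.0, -3.0))    # both fine
print(naive_logaddexp(-1000.0, -1001.0))                        # -inf: exp underflowed to 0
print(stable_logaddexp(-1000.0, -1001.0), np.logaddexp(-1000.0, -1001.0))
```

For moderate arguments all three agree; once the exponent underflows, only the shifted form (and `np.logaddexp` itself) returns the correct finite value.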
Hi @jmhessel! Yes, while working on #1201, I did notice a difference in the values obtained from the Cython/Python paths. However, I attributed it to the fact that we are using the initially populated tables […]
I'll be looking into it more. It might result from the tables. But the output log probabilities differ by as much as 1.0 per token! That's quite a bit if you're trying to do a perplexity evaluation.
BTW, I implemented all of the doc2vec scoring functions in numpy and cython. The outputs match when I use the more numerically stable version.
But calling […]. Also, if I understand correctly, the output score values match (up to ~5-6 significant digits) when you use […] (as done in the Python code). BTW, have you checked whether, for some cases in your codepath, conditions like […] hold?
Let's think... this is the version of logaddexp that matches numpy, and it's always called with […]
So, […]
And the difference exists even for the […]
@jmhessel I was trying to use the […]
I don't think the corpus matters too much. I modified word2vec/doc2vec.py's scoring function to call both the numpy and cython versions for each input sentence. When I didn't use the more stable version, the outputs matched to around 1-3 significant figures; with the numerically stable version, that became 4-6. When you sum over a large number of sentences, these differences add up. So they never match perfectly, but this seems to help a lot.
I have played with the new functionality implemented in #1201. Unfortunately, it is not yet implemented for doc2vec (right?), so I experimented with word2vec. Generally, it does what I wanted, so great job and thank you! I did find a weird problem, though: with small learning rates, the loss actually increases over time. Here is my test script with some description and plots: https://github.com/dietmar/gensim_word2vec_loss. It uses my German One Million Posts corpus. Am I using the new functionality correctly? Too small a learning rate shouldn't make the algorithm diverge, right? My expectation is that if the learning rate is too small, the loss should still go down, just very slowly, such that you would probably need a huge number of epochs to get anywhere.
@dietmar Thanks for testing! The rise in summed error is odd – I would suspect some problem with alpha management (though it looks roughly correct from a quick look at your script). Perhaps have the test script show the effective `alpha` values. It could be interesting to probe which ranges of starting alphas, in your setup, show the expected error improvement versus the unexpected error worsening. I also think that if the corpus itself were non-randomly ordered – for example, all short docs or all positive-sentiment docs at the front or back, making early and late documents very different from each other – then there might be non-intuitive effects on error trends, and those might be more pronounced when operating always at a tiny learning rate. (Still, I wouldn't expect a reverse trend of the magnitude your logs show.) Shuffling the texts, even if just once before all training, might rule out this sort of effect.
@gojomo Some additional findings from preparing the above: […]
Yes, printing the alpha values is a good idea. What do you mean by "smoother" alpha decay? Logarithmic instead of linear decay? I might test some more if/when I have time.
Re: alpha. If you were to let the Word2Vec/Doc2Vec […]. That skip-gram doesn't show the behavior is curious; is the contrast with CBOW evident across many choices of alpha/window/iterations/etc.? (If so, maybe a CBOW-specific tallying bug.)
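For reference, when the model manages the schedule itself, alpha decays linearly from the starting `alpha` to `min_alpha` over all requested training. A minimal sketch of that interpolation (simplified; the real implementation decays continuously by words processed, not per epoch):

```python
def effective_alpha(alpha, min_alpha, progress):
    """Linearly interpolated learning rate at `progress` in [0, 1].

    Simplified sketch of a linear alpha schedule; gensim's actual code
    updates alpha continuously based on corpus progress.
    """
    return alpha - (alpha - min_alpha) * progress

epochs = 5
for e in range(epochs):
    print(e, effective_alpha(0.025, 0.0001, e / epochs))
```

This also illustrates the pitfall above: calling `train()` once per epoch in your own loop without passing `start_alpha`/`end_alpha` restarts this schedule on each call, producing a sawtooth learning rate that can distort loss trends.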
I see, thanks for the explanation with regard to alpha. I will need to experiment more to make a confident statement about skip-gram vs. CBOW, stay tuned.
Hello, I have read the whole thread, and may I ask why compute_loss can't be added to the doc2vec model? I am working with doc2vec and really need to know the loss, so I have tried to modify fast_document_dm_neg() by following fast_sentence_cbow_neg(), but it doesn't work easily that way. Can you tell me why this feature can't be added to doc2vec (yet)? Thank you.
@chinmayapancholi13, this question is for you.
Hi, I have the same question as @giahung24. Is it now possible to access the loss of a doc2vec model, e.g. via callbacks? Using 3.3.0 on macOS.
@JohnRGermany Yes, callbacks receive […]
@menshikh-iv But there is no […]
@JohnRGermany Right, this is only for w2v.
Keep track of loss and output to log. Frequent request on the mailing list.