RMSProp optimization support for sparse tensors #464
Comments
Based on reading the code, I think it is expected: there is not yet support for SparseTensors with those three optimizers, since the _apply_sparse() function isn't implemented. Turning this into a feature request.
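For context, a rough sketch of the dispatch in question (a toy optimizer, not TensorFlow's actual code): `apply_gradients()` sends ordinary `Tensor` gradients to `_apply_dense()` and `IndexedSlices` gradients to `_apply_sparse()`, and an optimizer that does not override the latter cannot handle sparse updates.

```python
import tensorflow as tf

class MinimalSGD(tf.train.Optimizer):
    """Toy optimizer illustrating the two methods apply_gradients() dispatches to."""

    def __init__(self, learning_rate, use_locking=False, name="MinimalSGD"):
        super(MinimalSGD, self).__init__(use_locking, name)
        self._lr = learning_rate

    def _apply_dense(self, grad, var):
        # Ordinary Tensor gradients end up here.
        return tf.assign_sub(var, self._lr * grad)

    def _apply_sparse(self, grad, var):
        # IndexedSlices gradients (e.g. from tf.nn.embedding_lookup) end up here;
        # optimizers that leave this unimplemented reject sparse gradients.
        return tf.scatter_sub(var, grad.indices, self._lr * grad.values)
```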
OK, thank you. But since ADAM works for this, there should not be any major issue preventing the use of RMSProp as well, right? (I would be quite interested in using RMSProp with nce_loss.) Also, if this is expected, I think you should document it: I do not think the docs mention that only 3 out of the 6 default optimizers support sparse updates. The doc of nce_loss (and of sampled softmax as well, I suppose) could also mention it.
Assigning to someone who knows more about this part of the codebase.
After checking bug #505, it seems that there are actually two bugs going on here. One has to do with tf.reshape + Adagrad/Momentum, and the other has to do with RMSPropOptimizer not handling sparse gradient updates (which happens in cases of embeddings or sampled loss, I guess). Here is a self-contained example demonstrating the RMSProp error (just a slight modification of the one I posted for bug #505):
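A minimal sketch of that kind of reproduction (hypothetical shapes and names, written against the TF API of that era, not the exact snippet from the comment): an embedding lookup whose `IndexedSlices` gradient is handed to `RMSPropOptimizer`.

```python
import tensorflow as tf

# An embedding lookup makes the gradient w.r.t. `embeddings` an IndexedSlices.
embeddings = tf.Variable(tf.random_uniform([100, 16], -1.0, 1.0))
ids = tf.placeholder(tf.int32, shape=[None])
vectors = tf.nn.embedding_lookup(embeddings, ids)
loss = tf.reduce_sum(tf.square(vectors))

# At the time of this issue, RMSPropOptimizer had no _apply_sparse(),
# so applying this sparse gradient failed.
train_op = tf.train.RMSPropOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    sess.run(train_op, feed_dict={ids: [1, 2, 3]})
```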
I think the issue here is related to a bug in …
Thank you for fixing this :-) However, will this also fix the RMSProp error? It seems to be of a slightly different nature...
Indeed, it will only fix the issues with Adagrad and Momentum (and other optimizers that support sparse data using embeddings).
What kind of behavior would you expect from sparse tensors in RMSProp? The two options that come to mind are to ignore the momentum for the entries that receive no gradient, or to apply the momentum to the whole embedding at every step.
The first might not be correct (though I am not sure), and the second is very slow. What I typically do now is to split my variables into two groups, and train the dense part with any optimizer I want and the rest with GradientDescent or AdaGrad.
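A sketch of that two-group workaround (a toy graph with hypothetical names; the split and the learning rates are illustrative): give each optimizer its own `var_list` and group the two training ops.

```python
import tensorflow as tf

# Toy model: an embedding table (sparse gradients) feeding a small dense layer.
embeddings = tf.Variable(tf.random_uniform([50, 8]), name="embedding_table")
weights = tf.Variable(tf.random_uniform([8, 1]), name="dense_weights")
ids = tf.constant([1, 2, 3])
loss = tf.reduce_sum(tf.matmul(tf.nn.embedding_lookup(embeddings, ids), weights))

# Dense part with RMSProp, embedding (sparse) part with Adagrad.
train_dense = tf.train.RMSPropOptimizer(0.001).minimize(loss, var_list=[weights])
train_sparse = tf.train.AdagradOptimizer(0.1).minimize(loss, var_list=[embeddings])
train_op = tf.group(train_dense, train_sparse)
```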
The gradient function was previously generating an invalid IndexedSlices, whereby `IndexedSlices.indices` tensor was not a vector. This change reshapes the indices and gradient so that they can correctly be interpreted as an IndexedSlices and applied to the embedding variable. Added a multi-dimensional gradient test in embedding_ops_test.py. Fixes #505. Partially addresses #464. Change: 110364370
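To illustrate the invariant the commit describes (hypothetical shapes, not the commit's own code): `IndexedSlices.indices` must be a rank-1 vector, with one row of `values` per index.

```python
import tensorflow as tf

# A well-formed IndexedSlices gradient for 3 rows of a [100, 16] variable.
indices = tf.constant([4, 7, 42])                     # rank-1 vector of row indices
values = tf.ones([3, 16])                             # one slice per index
dense_shape = tf.constant([100, 16], dtype=tf.int64)  # shape of the full variable
grad = tf.IndexedSlices(values=values, indices=indices, dense_shape=dense_shape)
```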
It is quite possible that I do not fully understand the issue, but I was not expecting RMSProp to be much more difficult to use than Adagrad. I will try to briefly state how I see things (and please excuse me if I write something obviously stupid).
As for the two options you gave, I take it that you are discussing the update of the running average of the squared gradient, right? "Ignore the momentum" would consist in not updating the running average for dimensions for which the gradient is zero due to sparsity, and "apply momentum for the whole embedding" would be to actually update the running average of all dimensions at each iteration.
Then I think there is a third option. You could keep track of the last iteration in which the running average was updated for each dimension. Let us call last(d) the iteration step at which the running average of dimension d was last updated. Then, when you want to update dimension d at iteration step i, you can update the running average of the squared gradient of d by rms(d) = rms(d) * 0.9^(i - last(d)) * 0.9 + 0.1 * grad_i(d)^2, and then set last(d) = i. The 0.9^(i - last(d)) factor accounts for the decay the running average should have undergone since the last time we saw a non-zero gradient for dimension d (a null gradient for i - last(d) steps). The values of last(d) and rms(d) only need to be updated when there is a non-null gradient for dimension d, so the updates should be efficient in a sparse context.
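A minimal NumPy sketch of this lazy-decay idea, applied per embedding row (illustrative names and hyperparameters; the update follows the formula in the comment above):

```python
import numpy as np

decay, lr, eps = 0.9, 0.01, 1e-8
num_rows, dim = 100, 16
params = np.random.randn(num_rows, dim)
rms = np.zeros((num_rows, dim))        # running average of squared gradients
last = np.zeros(num_rows, dtype=int)   # step at which each row was last updated

def lazy_rmsprop_update(step, row_ids, grads):
    """Update only the rows with a non-zero gradient, catching up the missed decay."""
    for r, g in zip(row_ids, grads):
        # Decay for the steps where this row's gradient was zero, then fold in the
        # current squared gradient: rms = rms * 0.9^(step - last) * 0.9 + 0.1 * g^2.
        rms[r] = rms[r] * decay ** (step - last[r]) * decay + (1 - decay) * g ** 2
        last[r] = step
        params[r] -= lr * g / np.sqrt(rms[r] + eps)

# Example: at step 10, only rows 3 and 7 receive gradients.
lazy_rmsprop_update(10, [3, 7], [np.ones(dim), 0.5 * np.ones(dim)])
```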
You're right, the third option you suggested is probably strictly better than the second one. That's not how it's implemented in Adam, though (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py#L138). For Adagrad, we're only updating non-zero elements, and I think this is a common way to do it (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/training_ops.cc#L381).
Yes, for Adagrad it is enough not to update the squared gradient sum (and to keep track of the global total number of updates). Would you consider implementing this? I could probably do it myself at some point when I have time, by looking at the existing code for RMSProp and Adam. But given my lack of time and the annoyance of getting a Corporate Contributor Agreement, I would probably not contribute it anytime soon... In any case, if nothing is changed, I think the docs should mention that ADAM is slower than Adagrad on sparse updates, and that RMSProp is not compatible with them.
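For comparison, a NumPy sketch of the sparse Adagrad update (illustrative names, not the actual C++ kernel): since there is no decay, only the rows that actually receive a gradient need to be touched, and no catch-up step is required.

```python
import numpy as np

lr, eps = 0.1, 1e-8
num_rows, dim = 100, 16
var = np.random.randn(num_rows, dim)
accum = np.zeros((num_rows, dim))  # per-element sum of squared gradients

def sparse_adagrad_update(row_ids, grads):
    """Update only the rows with a non-zero gradient; untouched rows stay exact."""
    for r, g in zip(row_ids, grads):
        accum[r] += g ** 2
        var[r] -= lr * g / np.sqrt(accum[r] + eps)
```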
Renaming this bug since the AdaGrad and Momentum optimizers should now work.
Hey, what is the status of this?
I'm going to mark this contributions welcome, since I don't know of anyone working on it.
@fabiencro, is this still an issue? If not, or if it is not important, I will close it in a few days.
Nagging Awaiting Response: It has been 14 days with no activity and the …
It has been 14 days with no activity and the …
Closing as per previous comment: #464 (comment)
It seems that tf.nce_loss is not compatible with the RMSProp, ADAGRAD, and Momentum optimizers (while SGD, ADAM, and FTRL work fine).
When using rmsprop, I get this error:
When using adagrad or momentum, I get this error:
Is that expected?
The exact same code works perfectly fine with adam or sgd optimizers, so I do not think I made a mistake when constructing the graph.