Padam optimizer #46
Conversation
I enjoyed reading the paper and the results look compelling. I'm curious how they would generalize across different networks and problems.
I guess that since we have documentation in function_types.md and optimizers.md, we should probably update those with the information for Padam before we merge. Also, do you want to add a note to HISTORY.md?
* Construct the Padam update policy with the given parameters.
*
* @param epsilon The epsilon value used to initialise the squared gradient
*     parameter.
Would it be more accurate to say something like "Epsilon is the minimum allowed gradient"? My understanding of epsilon is that it's necessary to prevent dividing by very small values in the case where v_t is very small. (Perhaps that might be a documentation update to fix in the AdamUpdate and other similar optimizers---not sure?)
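For context, a minimal sketch of how epsilon usually enters an Adam-style step (the names mHat, vHat, and stepSize here are placeholders for illustration, not taken from this patch):

// epsilon keeps the denominator bounded away from zero when the
// second-moment estimate vHat is very small; it is a numerical safeguard
// rather than a bound on the gradient itself.
iterate -= stepSize * mHat / (arma::sqrt(vHat) + epsilon);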
vImproved = arma::max(vImproved, v);

iterate -= (stepSize * std::sqrt(biasCorrection2) / biasCorrection1) *
    m / arma::pow(arma::sqrt(vImproved) + epsilon, partial * 2);
It's possible I have misunderstood something here, but my reading of Algorithm 1 in the paper suggests that this could be m / arma::pow(vImproved + epsilon, partial). But I think this depends on how the epsilon correction should be applied. Personally I don't think it makes much of a difference whether it is applied to the square root of vImproved or not (the suggestion I wrote would be a little quicker in practice). But maybe there is some reason I don't know of to write it the way you did? (In the end I'm indifferent about how it gets implemented; mostly I am just curious here.)
Good point; this is what we did for the AMSGrad update rule. It shouldn't make much of a difference, but I will change it.
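For concreteness, a sketch of the two variants being discussed, using the variable names from the diff above (not the final committed code):

// Current form: epsilon is added inside the root, so the step divides by
// (sqrt(vImproved) + epsilon)^(2 * partial).
iterate -= (stepSize * std::sqrt(biasCorrection2) / biasCorrection1) *
    m / arma::pow(arma::sqrt(vImproved) + epsilon, partial * 2);

// Suggested form: epsilon is added to vImproved directly, so the step
// divides by (vImproved + epsilon)^partial; this skips one element-wise
// sqrt and follows Algorithm 1 in the paper more literally.
iterate -= (stepSize * std::sqrt(biasCorrection2) / biasCorrection1) *
    m / arma::pow(vImproved + epsilon, partial);

Either way the denominator stays bounded away from zero; the two only differ in where the epsilon safeguard enters relative to the exponent.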
++iteration;

// And update the iterate.
m *= beta1;
Do we also need to update beta1 as beta1 *= lambda, like the theory for Theorem 4.2 requires for convergence? I guess that Algorithm 1 does not include it, so perhaps it is not necessary. It also seems like they did not decay beta1 in the experiments.
Just checked the reference implementation, and it looks like they don't use a decay rate for the beta value, so at this point I think it's fine to leave it as it is.
Sounds good to me. It can always be changed later if someone wants.
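If someone does pick it up later, a minimal sketch of what the decay could look like inside the update step (lambda would be a new hyperparameter; it does not exist in this patch):

// Hypothetical beta1 decay, following the assumption behind Theorem 4.2:
// shrink the momentum coefficient a little on every iteration.
beta1 *= lambda;  // lambda is a decay factor in (0, 1)

Since neither Algorithm 1 nor the reference implementation decays beta1, leaving it constant here matches the paper's experiments.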
include/ensmallen_bits/adam/adam.hpp (outdated)
@@ -186,6 +187,8 @@ using NadaMax = AdamType<NadaMaxUpdate>;

using OptimisticAdam = AdamType<OptimisticAdamUpdate>;

using Padam = AdamType<PadamUpdate>;
I guess that the user does not have any easy way through the constructor to set partial; do you think it is worth writing an extra constructor or something like this?
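A rough sketch of what that could look like from the user's side (the parameter order and names here are assumptions for illustration, not the actual AdamType interface):

// Hypothetical constructor that also takes the partial parameter and
// forwards it to the PadamUpdate policy.
Padam optimizer(0.001,   // stepSize
                32,      // batchSize
                0.9,     // beta1
                0.999,   // beta2
                1e-8,    // epsilon
                100000,  // maxIterations
                1e-5,    // tolerance
                true,    // shuffle
                0.25);   // partial

Alternatively, a Partial() getter/setter on PadamUpdate would let users adjust it after construction without another constructor overload.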
Agreed, will do that.
Also, we should probably notify the authors of the paper that we have their optimizer implemented, just so that they know. (I guess we should probably do that in general!)
Looks great to me. We should make this part of the 1.12.0 release, which can also include #60.
Implementation of "Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks" by Jinghui Chen and Quanquan Gu.
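For context, the core change relative to AMSGrad is a partially adaptive exponent: instead of dividing by the square root of the second-moment estimate, Padam raises it to a power p between 0 and 1/2 (the partial parameter in this patch). Roughly, in the notation of the code above (a sketch with bias corrections omitted, not the exact committed line):

// Padam step: p = 1/2 behaves like AMSGrad, while smaller p pushes the
// update toward SGD with momentum; vImproved is the running element-wise
// maximum of the second-moment estimate.
iterate -= stepSize * m / arma::pow(vImproved, partial);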