Padam optimizer #46

Merged (7 commits into mlpack:master, Dec 7, 2018)

Conversation

@zoq (Member) commented Nov 6, 2018

Implementation of "Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks" by Jinghui Chen and Quanquan Gu.
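
For quick reference, this is the core update as I read Algorithm 1 of the paper (p here is the `partial` parameter, with p in (0, 1/2]; p = 1/2 recovers an AMSGrad-style step and smaller p moves towards SGD with momentum):

```latex
\begin{align*}
  m_t          &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
  v_t          &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
  \hat{v}_t    &= \max(\hat{v}_{t-1}, v_t) \\
  \theta_{t+1} &= \theta_t - \alpha_t \, m_t / \hat{v}_t^{\,p}
\end{align*}
```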

@rcurtin (Member) left a comment:

I enjoyed reading the paper and the results look compelling. I'm curious how they would generalize across different networks and problems.

Since we already have documentation in function_types.md and optimizers.md, we should probably update those with information about Padam before we merge.

Also, do you want to add a note to HISTORY.md?

* Construct the Padam update policy with the given parameters.
*
* @param epsilon The epsilon value used to initialise the squared gradient
* parameter.

@rcurtin (Member):

Would it be more accurate to say something like "Epsilon is the minimum allowed gradient"? My understanding of epsilon is that it's necessary to prevent dividing by very small values in the case where v_t is very small. (Perhaps that might be a documentation update to fix in the AdamUpdate and other similar optimizers---not sure?)
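
Roughly what I mean, as a toy standalone sketch (the values are made up just to show the effect; this is not code from the PR):

```cpp
#include <armadillo>

int main()
{
  // When an element of v is essentially zero, the Adam-style denominator
  // sqrt(v) + epsilon stays bounded away from zero, so the step for that
  // element is capped near 1 / epsilon instead of blowing up.
  const double epsilon = 1e-8;
  arma::vec v = {1e-20, 1e-4, 1.0};
  arma::vec g = {1.0, 1.0, 1.0};

  arma::vec stepNoEps = g / arma::sqrt(v);              // first element ~1e10
  arma::vec stepEps   = g / (arma::sqrt(v) + epsilon);  // first element ~1e8

  stepNoEps.print("without epsilon:");
  stepEps.print("with epsilon:");
}
```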

vImproved = arma::max(vImproved, v);

iterate -= (stepSize * std::sqrt(biasCorrection2) / biasCorrection1) *
    m / arma::pow(arma::sqrt(vImproved) + epsilon, partial * 2);

@rcurtin (Member):

It's possible I have misunderstood something here, but my reading of Algorithm 1 in the paper suggests that this could be m / arma::pow(vImproved + epsilon, partial). But I think this depends on how the epsilon correction should be applied. Personally I don't think it makes much of a difference whether it is applied to the square root of vImproved or not (the suggestion I wrote would be a little quicker in practice). But maybe there is some reason I don't know of to write it the way you did?

(In the end I'm indifferent on how it gets implemented, mostly I am just curious here.)
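
To illustrate, a quick standalone comparison of the two forms (the numbers are arbitrary; this is only meant to show how close the two denominators are):

```cpp
#include <armadillo>
#include <iostream>

int main()
{
  // Arbitrary values just for the comparison.
  const double epsilon = 1e-8, partial = 0.25;
  arma::mat vImproved(3, 3, arma::fill::randu);

  // As written in the PR: epsilon added after the element-wise square root.
  arma::mat denomA = arma::pow(arma::sqrt(vImproved) + epsilon, partial * 2);

  // Reading of Algorithm 1: epsilon added to vImproved directly, with the
  // partial power applied once (one fewer element-wise operation).
  arma::mat denomB = arma::pow(vImproved + epsilon, partial);

  // The two only differ noticeably where vImproved is close to zero.
  std::cout << "max difference: " << arma::abs(denomA - denomB).max()
      << std::endl;
}
```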

@zoq (Member Author):

Good point; this is what we did for the AMSGrad update rule. It shouldn't make much of a difference, but I will change it.

++iteration;

// And update the iterate.
m *= beta1;

@rcurtin (Member):

Do we also need to update beta1 as beta1 *= lambda like the theory for Theorem 4.2 requires for convergence? I guess that Algorithm 1 does not include it, so perhaps it is not necessary. It also seems like they did not decay beta1 in the experiments.

@zoq (Member Author):

Just checked the reference implementation, and it looks like they don't use a decay rate for the beta value, so at this point I think it's fine to leave it as it is.

@rcurtin (Member):

Sounds good to me. It can always be changed later if someone wants.
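
For reference, if someone does pick this up later, I'd expect the change to look roughly like the following sketch (lambda would be a new decay parameter; this is not part of this PR):

```cpp
#include <armadillo>

// Sketch only: where a beta1 decay (as in Theorem 4.2) would slot into the
// first-moment update. `lambda` is a hypothetical new parameter; the code in
// this PR intentionally does not do this, matching Algorithm 1 and the
// reference implementation.
void MomentUpdateWithDecay(arma::mat& m,
                           const arma::mat& gradient,
                           double& beta1,
                           const double lambda)
{
  beta1 *= lambda;                // hypothetical decay step
  m *= beta1;                     // existing first-moment update
  m += (1.0 - beta1) * gradient;  // assumed continuation, as in Adam
}
```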

@@ -186,6 +187,8 @@ using NadaMax = AdamType<NadaMaxUpdate>;

using OptimisticAdam = AdamType<OptimisticAdamUpdate>;

using Padam = AdamType<PadamUpdate>;

@rcurtin (Member):

I guess that the user does not have any easy way through the constructor to set partial; do you think it is worth writing an extra constructor or something like this?

@zoq (Member Author):

Agreed, will do that.
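
Something along these lines from the user's side, perhaps (the argument names, order, and defaults below are only a guess at what the new constructor could look like, not the final API):

```cpp
// Hypothetical usage once `partial` is exposed through the constructor; every
// argument name and default here is an assumption, not the merged signature.
Padam optimizer(0.001,   // step size
                32,      // batch size
                0.9,     // beta1 (exponential decay rate for the mean)
                0.999,   // beta2 (exponential decay rate for the variance)
                1e-8,    // epsilon
                100000,  // maximum number of iterations
                1e-5,    // tolerance
                true,    // shuffle
                0.25);   // partial
```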

@rcurtin (Member) commented Dec 3, 2018

Also, we should probably notify the authors of the paper that we have their optimizer implemented, just so that they know. (I guess we should probably do that in general!)

@rcurtin (Member) left a comment:

Looks great to me. We should make this part of the 1.12.0 release which can also include #60.

A further review comment on doc/optimizers.md was marked as resolved.
@zoq merged commit d067ce4 into mlpack:master on Dec 7, 2018.