Padam optimizer #46
Conversation
I enjoyed reading the paper and the results look compelling. I'm curious how they would generalize across different networks and problems.
I guess that since we have documentation in function_types.md and optimizers.md, we should probably update those with the information for Padam before we merge. Also, do you want to add a note to HISTORY.md?
* Construct the Padam update policy with the given parameters.
*
* @param epsilon The epsilon value used to initialise the squared gradient
*     parameter.
Would it be more accurate to say something like "Epsilon is the minimum allowed gradient"? My understanding of epsilon is that it's necessary to prevent dividing by very small values in the case where v_t is very small. (Perhaps that might be a documentation update to fix in the AdamUpdate and other similar optimizers---not sure?)
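For context, a minimal sketch of how epsilon usually enters an Adam-style step (the names mHat, vHat, and stepSize here are placeholders for illustration, not taken from this patch):

// epsilon keeps the denominator bounded away from zero when the
// second-moment estimate vHat is very small; it is a numerical safeguard
// rather than a bound on the gradient itself.
iterate -= stepSize * mHat / (arma::sqrt(vHat) + epsilon);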
vImproved = arma::max(vImproved, v);

iterate -= (stepSize * std::sqrt(biasCorrection2) / biasCorrection1) *
    m / arma::pow(arma::sqrt(vImproved) + epsilon, partial * 2);
It's possible I have misunderstood something here, but my reading of Algorithm 1 in the paper suggests that this could be m / arma::pow(vImproved + epsilon, partial). But I think this depends on how the epsilon correction should be applied. Personally I don't think it makes much of a difference whether it is applied to the square root of vImproved or not (the suggestion I wrote would be a little quicker in practice). But maybe there is some reason I don't know of to write it the way you did? (In the end I'm indifferent about how it gets implemented; mostly I am just curious here.)
Good point; this is what we did for the AMSGrad update rule. It shouldn't make much of a difference, but I will change it.
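For concreteness, a sketch of the two variants being discussed, using the variable names from the diff above (not the final committed code):

// Current form: epsilon is added inside the root, so the step divides by
// (sqrt(vImproved) + epsilon)^(2 * partial).
iterate -= (stepSize * std::sqrt(biasCorrection2) / biasCorrection1) *
    m / arma::pow(arma::sqrt(vImproved) + epsilon, partial * 2);

// Suggested form: epsilon is added to vImproved directly, so the step
// divides by (vImproved + epsilon)^partial; this skips one element-wise
// sqrt and follows Algorithm 1 in the paper more literally.
iterate -= (stepSize * std::sqrt(biasCorrection2) / biasCorrection1) *
    m / arma::pow(vImproved + epsilon, partial);

Either way the denominator stays bounded away from zero; the two only differ in where the epsilon safeguard enters relative to the exponent.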
++iteration;

// And update the iterate.
m *= beta1;
Do we also need to update beta1 as beta1 *= lambda, like the theory for Theorem 4.2 requires for convergence? I guess that Algorithm 1 does not include it, so perhaps it is not necessary. It also seems like they did not decay beta1 in the experiments.
Just checked the reference implementation, and it looks like they don't use a decay rate for the beta value, so at this point I think it's fine to leave it as it is.
Sounds good to me. It can always be changed later if someone wants.
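If someone does pick it up later, a minimal sketch of what the decay could look like inside the update step (lambda would be a new hyperparameter; it does not exist in this patch):

// Hypothetical beta1 decay, following the assumption behind Theorem 4.2:
// shrink the momentum coefficient a little on every iteration.
beta1 *= lambda;  // lambda is a decay factor in (0, 1)

Since neither Algorithm 1 nor the reference implementation decays beta1, leaving it constant here matches the paper's experiments.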
include/ensmallen_bits/adam/adam.hpp (outdated)
@@ -186,6 +187,8 @@ using NadaMax = AdamType<NadaMaxUpdate>;

using OptimisticAdam = AdamType<OptimisticAdamUpdate>;

using Padam = AdamType<PadamUpdate>;
I guess that the user does not have any easy way through the constructor to set partial; do you think it is worth writing an extra constructor or something like this?
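A rough sketch of what that could look like from the user's side (the parameter order and names here are assumptions for illustration, not the actual AdamType interface):

// Hypothetical constructor that also takes the partial parameter and
// forwards it to the PadamUpdate policy.
Padam optimizer(0.001,   // stepSize
                32,      // batchSize
                0.9,     // beta1
                0.999,   // beta2
                1e-8,    // epsilon
                100000,  // maxIterations
                1e-5,    // tolerance
                true,    // shuffle
                0.25);   // partial

Alternatively, a Partial() getter/setter on PadamUpdate would let users adjust it after construction without another constructor overload.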
Agreed, will do that.
Also, we should probably notify the authors of the paper that we have their optimizer implemented, just so that they know. (I guess we should probably do that in general!)
Looks great to me. We should make this part of the 1.12.0 release, which can also include #60.
Implementation of "Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks" by Jinghui Chen and Quanquan Gu.
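For context, the core change relative to AMSGrad is a partially adaptive exponent: instead of dividing by the square root of the second-moment estimate, Padam raises it to a power p between 0 and 1/2 (the partial parameter in this patch). Roughly, in the notation of the code above (a sketch with bias corrections omitted, not the exact committed line):

// Padam step: p = 1/2 behaves like AMSGrad, while smaller p pushes the
// update toward SGD with momentum; vImproved is the running element-wise
// maximum of the second-moment estimate.
iterate -= stepSize * m / arma::pow(vImproved, partial);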