Different usage of eps between "A quick look at the algorithm" and the code #32
Comments
Hi, thanks a lot for the comment. Please refer to the PyTorch code whenever there is a difference, because it is what I used for the experiments. In the algorithm, when eps is added to s_t, it eventually passes to \hat{s_t} in the update, which is used in the denominator. I would take the code to be correct for now. (Also note that in the algorithm, eps is added directly to s_t before bias correction, which matches the code. It might be worth thinking more carefully about whether eps should be added before or after bias correction.) Similarly for max_exp_avg_var: eps is added to max_exp_avg_var because it is the denominator, though adding it to exp_avg_var (s_t) would also eventually pass to the denominator (I have not tested that yet). Also note that in our experiments amsgrad never helped AdaBelief, so I never carefully tested amsgrad in later experiments. PS: the implementation in TensorFlow is different from PyTorch, even for Adam. That's why they use different default eps values.
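For concreteness, a minimal sketch of the two eps placements being compared, in the thread's notation (this is a reading of the discussion, not the verbatim README or library code):

README ("A quick look at the algorithm") placement, where eps touches only the denominator and never feeds back into s_{t-1}:

\[
s_t = \beta_2 s_{t-1} + (1-\beta_2)(g_t - m_t)^2, \qquad
\hat{s}_t = \frac{s_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \frac{\alpha \hat{m}_t}{\sqrt{\hat{s}_t} + \epsilon}
\]

PyTorch code placement (in place), where eps is folded into s_t itself and is therefore also part of the s_{t-1} used at the next step:

\[
s_t = \beta_2 s_{t-1} + (1-\beta_2)(g_t - m_t)^2 + \epsilon, \qquad
\hat{s}_t = \frac{s_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \frac{\alpha \hat{m}_t}{\sqrt{\hat{s}_t} + \epsilon}
\]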
Thank you for your reply. My point is about whether eps is added to the s_{t-1} used in the next update, not about bias correction. In "A quick look at the algorithm", eps is added to \hat{s_t} instead of s_t, so the s_{t-1} of the next update has not had eps added. Here is a script that sets eps to inf and checks whether eps is added to exp_avg_*; because eps is inf, any state tensor that eps has been added to in place shows up as inf:

import sys
import torch
import math
from torch.optim import Adam
from adabelief_pytorch import AdaBelief


def test(opt_class, state_name, amsgrad=False):
    # One scalar parameter with a constant gradient of 1.
    param = torch.nn.Parameter(torch.ones([]))
    # eps = inf makes any in-place eps addition show up as inf in the state.
    optim = opt_class([param], lr=1, betas=(0.0, 0.999), eps=math.inf, amsgrad=amsgrad)
    print(f"with {opt_class.__name__}, amsgrad = {amsgrad}", file=sys.stderr)
    for i in range(3):
        param.grad = torch.ones([])
        optim.step()
        state = optim.state[param]
        msg = f"grad = {param.grad}, {state_name} = {state[state_name]}"
        if amsgrad:
            msg += f", max_{state_name} = {state['max_' + state_name]}"
        print(msg, file=sys.stderr)


test(Adam, 'exp_avg_sq', amsgrad=False)
test(AdaBelief, 'exp_avg_var', amsgrad=False)
test(Adam, 'exp_avg_sq', amsgrad=True)
test(AdaBelief, 'exp_avg_var', amsgrad=True)

The stderr output is shown below.
The following states have eps added:
No eps has been added to the following states:
I used the following version:
PS: I didn't know the Adam implementation in TensorFlow is different from PyTorch. Thank you.
Thanks for the clarification. Please take the code with the in-place operation as the correct one, since I have not tested the non-in-place version. This does not affect the theoretical analysis, though, since in theory we only assume the denominator is lower bounded; I'm not sure what the difference is in practice.
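A quick sanity check of the "denominator is lower bounded" remark under the in-place placement sketched earlier (my own arithmetic, not a claim from the paper): unrolling s_t = \beta_2 s_{t-1} + (1-\beta_2)(g_t - m_t)^2 + \epsilon shows that the accumulated eps contribution inside s_t is

\[
\epsilon \sum_{k=0}^{t-1} \beta_2^{k}, \qquad\text{with}\qquad
\epsilon \;\le\; \epsilon \sum_{k=0}^{t-1} \beta_2^{k} \;\le\; \frac{\epsilon}{1-\beta_2},
\]

so s_t \ge \epsilon and the denominator \sqrt{\hat{s}_t} + \epsilon is bounded below by \epsilon in either version, while the in-place version carries at most an extra \epsilon/(1-\beta_2) of mass inside s_t.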
Thank you for your reply. I got it; I will use the in-place operation version. However, I think that amsgrad is broken in the in-place version.
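If the pre-0.2.0 amsgrad path adds eps in place to max_exp_avg_var on every step (as mentioned earlier in the thread), then the running maximum keeps absorbing eps with no decay and grows roughly linearly with the step count. A toy illustration of that suspicion, not the library code:

import torch

eps, beta2 = 1e-8, 0.999
exp_avg_var = torch.zeros(1)
max_exp_avg_var = torch.zeros(1)
for step in range(10_000):
    grad_residual = torch.zeros(1)                  # pretend (g_t - m_t)^2 == 0
    exp_avg_var.mul_(beta2).add_(grad_residual, alpha=1 - beta2)
    torch.max(max_exp_avg_var, exp_avg_var, out=max_exp_avg_var)
    max_exp_avg_var.add_(eps)                       # eps accumulates here with no decay
print(max_exp_avg_var)                              # roughly 1e-4 == 10_000 * eps, far above eps

Whether this materially affects training depends on eps and the number of steps, but it does make the amsgrad state drift away from what the algorithm intends.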
Thanks for the suggestion. This could be the reason why amsgrad never helps in my experiments; I'll try to fix it and see if the new version of amsgrad helps.
@tatsuhiko-inoue We have updated the amsgrad version in adabelief-pytorch==0.2.0 and adabelief-tf==0.2.0. The eps is added to exp_avg_var with an in-place operation, then max_exp_avg_var is the element-wise maximum with exp_avg_var. It behaves similarly to the AdaBelief version without amsgrad. Hopefully this helps resolve the issue.
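Taken literally, the described 0.2.0 ordering looks like the sketch below (variable names follow the thread; the bias-correction and final denominator details are assumptions, not the verbatim library code):

import torch

def amsgrad_denom_sketch(grad, exp_avg, exp_avg_var, max_exp_avg_var,
                         beta2=0.999, eps=1e-16):
    # Second-moment ("belief") update on the gradient residual.
    grad_residual = grad - exp_avg
    exp_avg_var.mul_(beta2).addcmul_(grad_residual, grad_residual, value=1 - beta2)
    # eps is folded into exp_avg_var in place ...
    exp_avg_var.add_(eps)
    # ... and only then is the element-wise maximum taken, so eps no longer
    # piles up inside max_exp_avg_var step after step.
    torch.max(max_exp_avg_var, exp_avg_var, out=max_exp_avg_var)
    return max_exp_avg_var.sqrt().add_(eps)         # assumed form of the denominator

This is consistent with the comment that the amsgrad path now behaves much like plain AdaBelief, since max_exp_avg_var simply tracks the running maximum of the same eps-augmented exp_avg_var.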
I confirmed this fix. Thank you! Will you change "A quick look at the algorithm" in README.md to match the PyTorch code?
Yeah, we will fix the readme and the paper.
Fixed.
I confirmed the change in README.md. Thank you!
Hi
I have a question.
In "A quick look at the algorithm" in README.md, eps is added to shat_t.
But in the code, eps is added to s_t(exp_avg_var) instead of shat_t.
Also only the code for pytorch, if amsgrad is True, eps is added to max_exp_avg_var instead of exp_arg_var(s_t) or shat_t.
Which behavior is correct?
pytorch code
tensorflow code