
Loss suddenly becomes NaN and acc drops to 0 during training #17

Open

isunLt opened this issue Oct 24, 2020 · 9 comments

isunLt commented Oct 24, 2020

Hello, and thank you for sharing the code. During training, the loss suddenly became NaN and the accuracy dropped to 0. I restarted training from scratch twice, and the same problem occurred both times.
My training environment:

  • Nvidia RTX 2080Ti, CUDA 10.1 + cuDNN 7.6.3
  • Python 3.8.5 + PyTorch 1.5.1
  • Dataset: ILSVRC2015
  • Since the GPU has only 11 GB of memory, I changed batch_size to 64 in the config
  • Everything else unchanged

Do you know what the cause might be? Was the ImageNet-LT you used extracted from ILSVRC2015?
[Screenshots: loss2nan, loss2nan2]

isunLt (Author) commented Oct 25, 2020

Sorry for the trouble: after switching to Python 3.7 and PyTorch 1.6, the problem went away.

@isunLt isunLt closed this as completed Oct 25, 2020
@isunLt isunLt reopened this Oct 26, 2020
isunLt (Author) commented Oct 26, 2020

Sorry, I'm back again: even after switching to Python 3.7 and PyTorch 1.6, the old problem still appeared near the end of training.

[Screenshot: 微信截图_20201026093554]

KaihuaTang (Owner) commented

Sorry, I haven't encountered a similar problem, so I don't know why this happens.

KaihuaTang (Owner) commented

It might be because you changed the batch size; the learning rate may need to be adjusted accordingly.
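The suggestion above (scale the learning rate when the batch size changes) is often applied via the linear scaling rule, which keeps lr/batch_size constant. A minimal sketch, with made-up placeholder numbers (the repo's actual config defaults are not shown in this thread):

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: keep lr/batch_size constant when batch size changes."""
    return base_lr * new_batch / base_batch

# Hypothetical example: if the config shipped with batch_size=128 and lr=0.1,
# halving the batch size to 64 for an 11 GB GPU suggests halving the lr too.
print(scaled_lr(0.1, 128, 64))  # → 0.05
```

Some setups also warm the learning rate up over the first few epochs when applying this rule, which can further reduce early-training instability.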

KaihuaTang (Owner) commented

Another possibility: add 1e-9 or 1e-12 to the denominator wherever you normalize. For some unknown reason, the norm in the denominator can be trained down to a value that is too small, though I never ran into this myself.

[Screenshot]
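The fix being suggested can be sketched as follows (a plain-Python illustration of the idea, not the repo's actual code):

```python
import math

def l2_normalize(v, eps=1e-9):
    # Add a tiny eps (1e-9 or 1e-12, as suggested above) to the denominator so
    # a near-zero norm can no longer produce inf/NaN in the normalized output.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

# A zero vector would otherwise divide by zero; with eps it simply stays zero.
print(l2_normalize([0.0, 0.0, 0.0]))  # → [0.0, 0.0, 0.0]
print(l2_normalize([3.0, 4.0]))       # → approximately [0.6, 0.8]
```

For what it's worth, PyTorch's own `torch.nn.functional.normalize` takes an `eps` argument (default 1e-12) for exactly this reason, so if the model normalizes features by hand-written division, those divisions are the places to patch.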

isunLt (Author) commented Oct 26, 2020

> Another possibility: add 1e-9 or 1e-12 to the denominator wherever you normalize. For some unknown reason, the norm in the denominator can be trained down to a value that is too small, though I never ran into this myself.

Thank you, I'll give it a try.

deepkun commented Jul 23, 2021

Did you ever solve this? I changed the norm but still get NaN. My loss drops very quickly and becomes NaN within a single epoch.
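One generic way to narrow down a failure like this (a debugging sketch, not part of this repo): fail fast on the first non-finite loss so the triggering step can be inspected, instead of letting training continue silently until the accuracy collapses.

```python
import math

def check_finite(loss_value: float, step: int) -> float:
    # Raise on the first NaN/inf loss so the offending batch can be examined.
    if not math.isfinite(loss_value):
        raise RuntimeError(f"non-finite loss {loss_value!r} at step {step}")
    return loss_value

# check_finite(0.73, step=100) passes; check_finite(float("nan"), 101) raises.
```

In PyTorch specifically, `torch.autograd.set_detect_anomaly(True)` can additionally report which backward operation first produced the NaN, at the cost of slower training, which helps decide whether the normalization denominator really is the culprit.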

isunLt (Author) commented Jul 23, 2021

> Did you ever solve this? I changed the norm but still get NaN. My loss drops very quickly and becomes NaN within a single epoch.

It's been too long, I'm afraid I've forgotten. Sorry.

yufu commented Oct 6, 2021

> Another possibility: add 1e-9 or 1e-12 to the denominator wherever you normalize. For some unknown reason, the norm in the denominator can be trained down to a value that is too small, though I never ran into this myself.

I had the same problem and I fixed it by following Tang's advice. That's really helpful, thx.
