The loss doesn't decrease when using multi nodes. #30
Comments
Same issue here, loss stuck at around 6.23. Everything works fine when training on a single node.
I have the same multi-node loss issue. Is there any solution for this problem?
Hi guys, do you only have the problem with multiple nodes? I get the same issue even on a single node with multiple processes (ranks). Any suggestions?
Same issue here. Could you please tell me whether you have found a solution?
Maybe there is a bug in class NT_Xent(nn.Module) when using multiple GPUs. The
I agree. To make the implementation work on multiple nodes or multiple processes, I think the GatherLayer should be applied to z_i and z_j independently before the concatenation (
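To illustrate why the order of gathering and concatenating matters, here is a minimal pure-Python sketch. It is not the repository's code: string labels stand in for embedding rows, `all_gather` is a hypothetical stand-in for a gather followed by concatenation along the batch axis, and it assumes the loss looks for the positive of row k at row k + batch_size * world_size.

```python
# Hypothetical illustration: labels such as "i:r0s1" mean "view i, rank 0, sample 1".

def all_gather(per_rank):
    """Simulate gathering from every rank and concatenating along the batch axis."""
    return [row for rank_rows in per_rank for row in rank_rows]

def positive_pairs(z, half):
    """Pairs (z[k], z[k + half]) that an offset-based loss would treat as positives."""
    return [(z[k], z[k + half]) for k in range(half)]

def same_sample(a, b):
    """Two labels are views of the same sample when the part after ':' matches."""
    return a.split(":")[1] == b.split(":")[1]

world_size, batch_size = 2, 2
z_i = [[f"i:r{r}s{s}" for s in range(batch_size)] for r in range(world_size)]
z_j = [[f"j:r{r}s{s}" for s in range(batch_size)] for r in range(world_size)]
half = batch_size * world_size  # assumed offset between a row and its positive

# Order A: concatenate z_i and z_j on each rank, then gather across ranks.
concat_then_gather = all_gather([z_i[r] + z_j[r] for r in range(world_size)])

# Order B: gather z_i and z_j independently, then concatenate.
gather_then_concat = all_gather(z_i) + all_gather(z_j)

pairs_a = positive_pairs(concat_then_gather, half)
pairs_b = positive_pairs(gather_then_concat, half)
aligned_a = all(same_sample(a, b) for a, b in pairs_a)
aligned_b = all(same_sample(a, b) for a, b in pairs_b)

print("concat then gather aligned:", aligned_a)
print("gather then concat aligned:", aligned_b)
```

Under these assumptions, order A pairs rows from different samples, so the "positives" carry no signal and the loss would be expected to sit near chance level. Order B keeps each row next to its true second view.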
Hey guys, I have adjusted some code in the forward function of class NT_Xent and now it works, but I just found that the multi-GPU performance is much worse than with only one GPU. Do you know the reason?
OK, I think this question has been solved. By mistake, the DDP model did not replace the original one, so training did not work well. With the training model set properly, this function works well for multi-GPU training with DDP.
@wooozihui, can you elaborate on where/how you replaced or properly set the training model?
Hello guys, I've been plagued by the inexplicable code in nt_xent.py for a long time too. Finally I found this issue. I agree with @dltkddn0525's opinion. The I've made a pull request for this issue, hope this will help!
When I use one node, the code runs well. However, when I use 2 nodes and set batch_size to 64, the loss stays around 5.545 and doesn't decrease. Since 5.545 ≈ ln(256), which is the loss of a uniform guess over 256 candidates, it seems that the network never learns anything during training. I have checked that the parameters are being updated. I think there may be something wrong with the GatherLayer, but I cannot find it. Has anyone else run into this problem?
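As a quick sanity check on the plateau values reported in this thread, the cross-entropy of a uniform prediction over K candidates is ln(K). The short sketch below only does that arithmetic; the helper name `chance_level_loss` is made up for illustration, and which K applies to a given run depends on batch size and number of processes, which is an assumption here.

```python
import math

def chance_level_loss(num_candidates):
    """Cross-entropy of a uniform prediction over num_candidates classes."""
    return math.log(num_candidates)

loss_256 = chance_level_loss(256)
loss_512 = chance_level_loss(512)

print(f"ln(256) = {loss_256:.3f}")
print(f"ln(512) = {loss_512:.3f}")
```

A loss that stays pinned at one of these values suggests the model is doing no better than random guessing among that many candidates.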