DDP --sync-bn bug with torch 1.9.0 #3998
Got the same problem a week ago; training would get stuck if I use `--sync-bn`.
Encountered the same problem here. 🌗
@simba0703 @wudashuo @imyhxy thanks for the notice, guys. Yes, `--sync-bn` is broken with torch 1.9.0, though I can't figure out what the problem is :( If you guys find a solution, please let us know! In the meantime I'll add an assert to let users know this is a known issue. You can still train DDP normally, however, which I would recommend anyway, as all of the official models were trained without `--sync-bn`.
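For anyone who wants to keep training while this is open, the workaround is simply to drop the flag. For example, the reporter's launch command below with `--sync-bn` omitted and everything else unchanged:

```bash
python -m torch.distributed.launch --nproc_per_node 3 train.py \
    --batch-size 12 --data data/coco128.yaml --weights yolov5m6.pt \
    --device 1,2,3 --adam
```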
This may be the cause: pytorch/pytorch#37930
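For context, here is a minimal sketch of what a `--sync-bn`-style flag typically does (my illustration, not YOLOv5's exact code; the helper name `build_ddp_model` is hypothetical): BatchNorm layers are converted to `SyncBatchNorm` before the model is wrapped in DDP, so every training step performs a cross-process collective that can deadlock if ranks fall out of step, as described in the linked issue.

```python
# Illustrative sketch only: how a --sync-bn style flag is commonly wired up.
# build_ddp_model is a hypothetical helper, not a YOLOv5 internal.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def build_ddp_model(model: torch.nn.Module, local_rank: int, sync_bn: bool = False) -> DDP:
    if sync_bn:
        # Replace every nn.BatchNorm*d layer with nn.SyncBatchNorm, which
        # synchronizes batch statistics across all DDP processes each step.
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.to(local_rank)
    # One process per GPU; gradients are all-reduced across processes.
    return DDP(model, device_ids=[local_rank], output_device=local_rank)
```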
@simba0703 @wudashuo @imyhxy @jfpuget good news 😃! Your original issue may now be fixed ✅ in PR #4615. We discovered the cause of the DDP `--sync-bn` hang and resolved it there. This means DDP training now works without issue with or without `--sync-bn`. To receive this update:
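For example, with a typical git-based install (Docker and PyTorch Hub users should pull or force-reload the latest equivalents):

```bash
# update an existing clone to the latest master, which includes PR #4615
cd yolov5 && git pull

# or start over from a fresh clone
git clone https://github.com/ultralytics/yolov5
```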
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
When I use `python -m torch.distributed.launch --nproc_per_node 3 train.py --batch-size 12 --data data/coco128.yaml --weights yolov5m6.pt --device 1,2,3 --adam --sync-bn`, the training process gets blocked at epoch 0. If I do not use `--sync-bn`, training proceeds normally.