Multi-GPU Training error #2461
Comments
@blueskywwc sorry to hear about your training problems! The --image-weights argument has not been tested on Multi-GPU, so it's possible the two are incompatible. I'll add a TODO here to investigate, but since this is not a common use case we may not get around to fixing it for a while, unfortunately. If you could help debug and figure out a good fix to help everyone else, it would be much appreciated!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@blueskywwc good news 😃! Your original issue may now have been fixed ✅ in PR #3275. This provides improved error handling to notify the user that DDP is not compatible with the
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy training with YOLOv5 🚀!
Thanks, I will try and update!
@glenn-jocher 1. Multi-GPU Training: File "train.py", line 529, in
They are still not compatible, thanks!
@blueskywwc yes, everything is working as intended now! You are correct that the two arguments are not compatible. The improved error handling now makes the cause of the problem clear, so users can avoid this pairing.
@blueskywwc so the natural solution is to train with --image-weights on a single GPU.
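For readers who want to see what this kind of error handling can look like, here is a minimal sketch of an early compatibility check. It is not the code from PR #3275; the function names `parse_opt`, `world_size_from_env`, and `check_ddp_compat` are hypothetical, and it assumes the PyTorch distributed launcher exposes the process count through the `WORLD_SIZE` environment variable.

```python
import argparse
import os


def parse_opt(argv=None):
    # Hypothetical, trimmed-down argument parser with only the two flags discussed here
    parser = argparse.ArgumentParser()
    parser.add_argument("--image-weights", action="store_true",
                        help="use weighted image selection for training")
    parser.add_argument("--device", default="", help="cuda device, e.g. 0 or 0,1")
    return parser.parse_args(argv)


def world_size_from_env():
    # Assumption: the distributed launcher sets WORLD_SIZE; default to 1 (single process)
    return int(os.getenv("WORLD_SIZE", 1))


def check_ddp_compat(opt, world_size):
    # Fail before training starts, with a message that names the conflicting options
    if world_size > 1 and opt.image_weights:
        raise ValueError(
            "--image-weights is not compatible with Multi-GPU DDP training; "
            "train on a single GPU instead, e.g. --device 0"
        )


# Single-GPU run with --image-weights passes the check
check_ddp_compat(parse_opt(["--image-weights", "--device", "0"]), world_size=1)

# Two-process DDP run with --image-weights is rejected up front
try:
    check_ddp_compat(parse_opt(["--image-weights", "--device", "0,1"]), world_size=2)
except ValueError as err:
    print(err)
```

Checking once at startup means the user sees a message about the flag pairing rather than a low-level tensor error from inside the training loop.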
Thanks, I see.
1. Multi-GPU Training:
python -m torch.distributed.launch --master_port 42342 --nproc_per_node 2 train.py --device 0,1
When I set image-weights to true, I get the error: Tensors must be CUDA and dense
When I set image-weights to false, training runs normally.
2. Single-GPU Training:
python train.py --device 0
When I set image-weights to true, training runs normally.
Why can't image-weights be set to true during Multi-GPU training? Thank you!
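As background for the question, the sketch below illustrates the general idea of weighted image selection: each image gets a weight, and training indices are drawn in proportion to those weights. This is an illustrative assumption about how such a feature can work, not the code in train.py; `image_weights` and `sample_indices` are hypothetical helpers. The error text itself is what PyTorch's NCCL backend reports when a collective operation receives a CPU tensor, which would be consistent with per-process sampling state being shared across GPUs, though this thread does not confirm the root cause.

```python
import random


def image_weights(labels_per_image, class_weights):
    # Hypothetical helper: weight an image by the summed weights of its labelled classes
    return [sum(class_weights[c] for c in labels) for labels in labels_per_image]


def sample_indices(weights, k, seed=0):
    # Draw k image indices with replacement, proportionally to their weights
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=k)


# Three images: two contain only the common class 0, one contains the rare class 1
labels = [[0], [0, 0], [1]]
weights = image_weights(labels, class_weights={0: 0.1, 1: 1.0})
indices = sample_indices(weights, k=10)
print(weights, indices)
```

In a single process the sampled index list is simply used as-is. With several processes, each one would need the same list, which is the extra coordination step that single-GPU training avoids.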