Multi-GPU Training error #2461

Closed
blueskywwc opened this issue Mar 14, 2021 · 8 comments · Fixed by #3275
Labels
question Further information is requested Stale Stale and schedule for closing soon

Comments

@blueskywwc

1. Multi-GPU training:
python -m torch.distributed.launch --master_port 42342 --nproc_per_node 2 train.py --device 0,1

When I set image-weights to true, I get the error: Tensors must be CUDA and dense

When I set image-weights to false, training runs normally.

2. Single-GPU training:
python train.py --device 0

When I set image-weights to true, training runs normally.

Why can't image-weights be set to true during multi-GPU training? Thank you!

@blueskywwc blueskywwc added the question Further information is requested label Mar 14, 2021
@glenn-jocher
Member

@blueskywwc sorry to hear about your training problems! The --image-weights argument has not been tested on Multi-GPU, so it's possible the two are incompatible. I'll add a TODO here to investigate, but since this is not a common use case, we unfortunately may not get around to fixing it for a while.

If you could help debug and figure out a good fix to help everyone else, it would be much appreciated!
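
For context, --image-weights enables weighted image selection during training. The sketch below illustrates the general idea only: images that contain under-represented classes are drawn more often. The helper `image_sampling_weights`, the toy labels, and the inverse-frequency class weights are assumptions made for illustration; this is not the YOLOv5 implementation.

```python
import random
from collections import Counter


def image_sampling_weights(image_labels, class_weights):
    """Weight each image by the summed weights of the classes it contains."""
    return [sum(class_weights[c] for c in labels) for labels in image_labels]


# Toy dataset (hypothetical): each entry lists the class ids present in one image.
image_labels = [[0], [0], [0], [1]]

# Inverse-frequency class weights: rarer classes get larger weights.
counts = Counter(c for labels in image_labels for c in labels)
class_weights = {c: 1.0 / n for c, n in counts.items()}

weights = image_sampling_weights(image_labels, class_weights)

# Draw one epoch of image indices with replacement, biased toward rare classes.
random.seed(0)
indices = random.choices(range(len(image_labels)), weights=weights, k=len(image_labels))
print(weights, indices)
```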

@glenn-jocher glenn-jocher added the TODO High priority items label Mar 14, 2021
@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the Stale Stale and schedule for closing soon label Apr 14, 2021
@glenn-jocher glenn-jocher reopened this May 21, 2021
@glenn-jocher glenn-jocher linked a pull request May 21, 2021 that will close this issue
@glenn-jocher
Member

glenn-jocher commented May 21, 2021

@blueskywwc good news 😃! Your original issue may now have been fixed ✅ in PR #3275. This provides improved error handling to notify the user that DDP is not compatible with the --image-weights training argument. To receive this update you can:

  • git pull from within your yolov5/ directory
  • git clone https://github.com/ultralytics/yolov5 again
  • Force-reload PyTorch Hub: model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • View our updated notebooks (Colab and Kaggle)

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to report any other issues you discover or feature requests that come to mind. Happy training with YOLOv5 🚀!

@glenn-jocher glenn-jocher removed the TODO High priority items label May 21, 2021
@blueskywwc
Author

Thanks, I will try it and update!

@blueskywwc
Author

blueskywwc commented May 24, 2021

@glenn-jocher
I ran git clone https://github.com/ultralytics/yolov5 again.
Environment: Python 3.8.5, torch 1.7.1+cu101

1. Multi-GPU training:
python -m torch.distributed.launch --master_port 42342 --nproc_per_node 2 train.py --device 0,1
With image-weights set to true, I got the error:

File "train.py", line 529, in
assert not opt.image_weights, '--image-weights argument is not compatible with DDP training'
AssertionError: --image-weights argument is not compatible with DDP training

They are still not compatible, thanks!

@glenn-jocher
Member

@blueskywwc yes everything is working as intended now!

You are correct, the two arguments are not compatible. The error handling is now improved so that users can better understand the cause of the problem and avoid this pairing.
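
The improved error handling amounts to a guard that fails fast when both options are combined. A minimal sketch follows, assuming DDP mode is signalled by a `local_rank` other than -1. The function name `check_ddp_compatibility` and the `local_rank` convention are illustrative assumptions and are not taken from train.py; only the assertion message comes from the traceback posted above.

```python
import argparse


def check_ddp_compatibility(opt: argparse.Namespace) -> None:
    """Fail fast if --image-weights is combined with DDP training.

    Assumption: local_rank == -1 means single-process training, and any
    other value means the script was started by a distributed launcher.
    """
    if opt.local_rank != -1:
        assert not opt.image_weights, \
            '--image-weights argument is not compatible with DDP training'


# Single-GPU run with image weights: passes silently.
check_ddp_compatibility(argparse.Namespace(local_rank=-1, image_weights=True))

# DDP run with image weights: raises AssertionError carrying the message.
try:
    check_ddp_compatibility(argparse.Namespace(local_rank=0, image_weights=True))
except AssertionError as err:
    print(err)
```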

@glenn-jocher
Member

@blueskywwc so the natural solution is to train with --image-weights on a single GPU.

@blueskywwc
Author

Thanks, I see.
