Support for running on arbitrary CUDA device. #537
Conversation
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Hi,
Thanks for the PR! This is indeed something that should be fixed, but I never got around to doing it because I always run the code with `CUDA_VISIBLE_DEVICES=1`, which works fine.
I have an ask: could you use the `CUDAGuard` from c10 instead of manually calling `cudaSetDevice` etc.? This is the new pattern we are using in PyTorch 1.0.
Here is an example of utilization https://github.com/pytorch/pytorch/blob/1154506533bfe9428600fd69fa2e71dd172b7fec/aten/src/ATen/native/cuda/Copy.cu#L181
Thanks again!
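For readers unfamiliar with the pattern: `CUDAGuard` is an RAII object that saves the current device, switches to the requested one, and restores the original device when it goes out of scope, even if an exception is thrown. A minimal sketch of those semantics in Python, using mock `get_device`/`set_device` stand-ins rather than the real CUDA runtime:

```python
from contextlib import contextmanager

# Mock "current device" state, standing in for cudaGetDevice/cudaSetDevice.
_current_device = 0

def get_device():
    return _current_device

def set_device(index):
    global _current_device
    _current_device = index

@contextmanager
def device_guard(index):
    """Switch to `index` for the enclosed block, then restore the
    previous device on exit -- even if the block raises (RAII-style)."""
    previous = get_device()
    set_device(index)
    try:
        yield
    finally:
        set_device(previous)

with device_guard(1):
    assert get_device() == 1  # work inside the block targets device 1
assert get_device() == 0      # original device restored on exit
```

This is why the guard is safer than a bare `cudaSetDevice` call: the restore step cannot be skipped by an early return or an error path.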
Hi Massa,
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
done.
It would be better to add a description in README.md. I also want to run on an arbitrary CUDA device, but I still don't know how to use this. Thanks
Yeah, it's necessary to document the usage for users.
Here is an example for Mask R-CNN R-50 FPN quick on the second device (CUDA:1):
```bash
# for training
python tools/train_net.py --config-file=configs/quick_schedules/e2e_mask_rcnn_R_50_FPN_quick.yaml MODEL.DEVICE cuda:1
```
I would not recommend this as the first method for running on a different GPU.
Indeed, this used to still allocate some memory on the first GPU. Nowadays that is still the case but much less (~10 MB from what I tried), while using `CUDA_VISIBLE_DEVICES` really doesn't use anything.
After you make this change and first add a note on this env var, I think this PR is good to go.
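The reason `CUDA_VISIBLE_DEVICES` allocates nothing on GPU 0 is that it masks devices before the process sees them: with `CUDA_VISIBLE_DEVICES=1`, physical GPU 1 becomes `cuda:0` inside the process, and GPU 0 simply does not exist for it. The remapping CUDA performs can be illustrated with a small pure-Python helper (`visible_to_physical` is a hypothetical function for illustration; CUDA does this mapping internally):

```python
import os

def visible_to_physical(logical_index, env=None):
    """Map a logical CUDA device index (what the process sees) to the
    physical GPU index, following CUDA_VISIBLE_DEVICES semantics."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return logical_index  # no masking: identity mapping
    # The variable lists the physical GPUs exposed to the process, in order.
    physical = [int(i) for i in visible.split(",") if i.strip() != ""]
    return physical[logical_index]

# With CUDA_VISIBLE_DEVICES=1, cuda:0 inside the process is physical GPU 1.
assert visible_to_physical(0, {"CUDA_VISIBLE_DEVICES": "1"}) == 1
```

So `CUDA_VISIBLE_DEVICES=1 python tools/train_net.py ...` runs entirely on physical GPU 1 without touching GPU 0 at all.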
Thanks!
Indeed, we should offer suitable methods and tell users the pros and cons of each.
* support for any one cuda device
* Revert "support for any one cuda device" (this reverts commit 0197e4e)
* support running on any one cuda device
* use safe CUDAGuard rather than intrinsic cudaSetDevice
* supplement a header dependency (test passed)
* Support for arbitrary GPU device.
* Support for arbitrary GPU device.
* add docs for two methods to control devices
…ch#537)" (facebookresearch#608) This reverts commit f031879.
The current version cannot run on an arbitrary CUDA device (e.g. cuda:1).
The program shows the following error if we only add the extra flag
`MODEL.DEVICE cuda:1`:
```
2019-03-06 19:41:24,027 maskrcnn_benchmark.trainer INFO: Start training
THCudaCheck FAIL file=/home/dl/KJ/dl/core/maskrcnn-benchmark/maskrcnn_benchmark/csrc/cuda/nms.cu line=103 error=77 : an illegal memory access was encountered
```
The issue is that the current CUDA device is device 0, while the memory of the tensors (boxes, etc.) is allocated on device 1. Therefore, we need to temporarily switch from device 0 to device 1.
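The failure mode can be modeled in a few lines: a kernel launched on the current device can only touch memory allocated on that device, so launching on device 0 against tensors on device 1 faults, while temporarily switching devices around the launch succeeds. All names here are illustrative mocks, not the real `nms.cu` code:

```python
# Mock model of the bug: a "kernel" may only access memory on the
# device that is current at launch time.
current_device = 0

class Tensor:
    def __init__(self, device):
        self.device = device

def launch_kernel(tensor):
    if tensor.device != current_device:
        raise RuntimeError("an illegal memory access was encountered")
    return "ok"

boxes = Tensor(device=1)       # allocated on cuda:1 via MODEL.DEVICE cuda:1
try:
    launch_kernel(boxes)       # current device is still 0 -> fails
except RuntimeError as e:
    print(e)

# The fix: temporarily switch to the tensor's device around the launch,
# then restore the previous device (what CUDAGuard does on destruction).
previous, current_device = current_device, boxes.device
result = launch_kernel(boxes)  # now succeeds
current_device = previous
```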
After fixing the issue:
```
2019-03-06 19:51:47,698 maskrcnn_benchmark.trainer INFO: Start training
2019-03-06 19:51:55,449 maskrcnn_benchmark.trainer INFO: eta: 3 days, 5:29:26  iter: 20  loss: 1.8457 (2.2689)  loss_classifier: 0.4143 (0.8737)  loss_box_reg: 0.0372 (0.0519)  loss_mask: 0.7941 (0.8378)  loss_objectness: 0.3711 (0.4117)  loss_rpn_box_reg: 0.0559 (0.0938)  time: 0.3500 (0.3875)  data: 0.0087 (0.0445)  lr: 0.001793  max mem: 3362
2019-03-06 19:52:02,983 maskrcnn_benchmark.trainer INFO: eta: 3 days, 4:24:40  iter: 40  loss: 1.3515 (1.9323)  loss_classifier: 0.3346 (0.6586)  loss_box_reg: 0.0855 (0.0689)  loss_mask: 0.7014 (0.7690)  loss_objectness: 0.2152 (0.3520)  loss_rpn_box_reg: 0.0460 (0.0838)  time: 0.3729 (0.3821)  data: 0.0100 (0.0277)  lr: 0.001927  max mem: 3466
```
This support can help developers run the program in environments where the choice of GPU is constrained.