Support for running on arbitrary CUDA device. #537
Conversation
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Hi,
Thanks for the PR! This is indeed something that should be fixed, but I never got around to doing it because I always run the code with `CUDA_VISIBLE_DEVICES=1`, which works fine.
I have an ask: could you use the `CUDAGuard` from c10 instead of manually calling `cudaSetDevice` etc.? This is the new pattern we are using in PyTorch 1.0.
Here is an example of utilization https://github.com/pytorch/pytorch/blob/1154506533bfe9428600fd69fa2e71dd172b7fec/aten/src/ATen/native/cuda/Copy.cu#L181
Thanks again!
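For readers unfamiliar with the pattern: `CUDAGuard` is an RAII object that saves the current device, switches to the requested one, and restores the original device when it goes out of scope, even if an exception is thrown. A minimal sketch of those semantics in Python, using mock `get_device`/`set_device` stand-ins rather than the real CUDA runtime:

```python
from contextlib import contextmanager

# Mock "current device" state, standing in for cudaGetDevice/cudaSetDevice.
_current_device = 0

def get_device():
    return _current_device

def set_device(index):
    global _current_device
    _current_device = index

@contextmanager
def device_guard(index):
    """Switch to `index` for the enclosed block, then restore the
    previous device on exit -- even if the block raises (RAII-style)."""
    previous = get_device()
    set_device(index)
    try:
        yield
    finally:
        set_device(previous)

with device_guard(1):
    assert get_device() == 1  # work inside the block targets device 1
assert get_device() == 0      # original device restored on exit
```

This is why the guard is safer than a bare `cudaSetDevice` call: the restore step cannot be skipped by an early return or an error path.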
Hi Massa,
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
done.
It would be better to add a description in README.md. I also want to run on an arbitrary CUDA device, but I still don't know how to use this. Thanks
Yeah, it's necessary to document the usage for users.
Here is an example for Mask R-CNN R-50 FPN quick on the second device (CUDA:1):
```bash
# for training
python tools/train_net.py --config-file=configs/quick_schedules/e2e_mask_rcnn_R_50_FPN_quick.yaml MODEL.DEVICE cuda:1
```
I would not recommend this as the first method for running on a different GPU.
Indeed, this used to still allocate some memory on the first GPU. Nowadays that is still the case but much less (~10 MB from what I tried), while using `CUDA_VISIBLE_DEVICES` really doesn't use anything.
After you make this change and first add a note on this env var, I think this PR is good to go.
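The reason `CUDA_VISIBLE_DEVICES` allocates nothing on GPU 0 is that it masks devices before the process sees them: with `CUDA_VISIBLE_DEVICES=1`, physical GPU 1 becomes `cuda:0` inside the process, and GPU 0 simply does not exist for it. The remapping CUDA performs can be illustrated with a small pure-Python helper (`visible_to_physical` is a hypothetical function for illustration; CUDA does this mapping internally):

```python
import os

def visible_to_physical(logical_index, env=None):
    """Map a logical CUDA device index (what the process sees) to the
    physical GPU index, following CUDA_VISIBLE_DEVICES semantics."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return logical_index  # no masking: identity mapping
    # The variable lists the physical GPUs exposed to the process, in order.
    physical = [int(i) for i in visible.split(",") if i.strip() != ""]
    return physical[logical_index]

# With CUDA_VISIBLE_DEVICES=1, cuda:0 inside the process is physical GPU 1.
assert visible_to_physical(0, {"CUDA_VISIBLE_DEVICES": "1"}) == 1
```

So `CUDA_VISIBLE_DEVICES=1 python tools/train_net.py ...` runs entirely on physical GPU 1 without touching GPU 0 at all.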
Thanks!
Indeed, we should offer suitable methods and tell users the pros and cons of each.
* support for any one cuda device
* Revert "support for any one cuda device" (this reverts commit 0197e4e)
* support running on any one cuda device
* use safe CUDAGuard rather than intrinsic cudaSetDevice
* supplement a header dependency (test passed)
* Support for arbitrary GPU device.
* Support for arbitrary GPU device.
* add docs for two methods to control devices
…ch#537)" (facebookresearch#608) This reverts commit f031879.
The current version cannot run on an arbitrary CUDA device (e.g. cuda:1).
The program shows the following error if we only add the extra flag
`MODEL.DEVICE cuda:1`:
```
2019-03-06 19:41:24,027 maskrcnn_benchmark.trainer INFO: Start training
THCudaCheck FAIL file=/home/dl/KJ/dl/core/maskrcnn-benchmark/maskrcnn_benchmark/csrc/cuda/nms.cu line=103 error=77 : an illegal memory access was encountered
```
The issue is that the current CUDA device is device 0, while the memory of the tensors (boxes, etc.) is allocated on device 1. Therefore, we need to temporarily switch from device 0 to device 1.
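The failure mode can be modeled in a few lines: a kernel launched on the current device can only touch memory allocated on that device, so launching on device 0 against tensors on device 1 faults, while temporarily switching devices around the launch succeeds. All names here are illustrative mocks, not the real `nms.cu` code:

```python
# Mock model of the bug: a "kernel" may only access memory on the
# device that is current at launch time.
current_device = 0

class Tensor:
    def __init__(self, device):
        self.device = device

def launch_kernel(tensor):
    if tensor.device != current_device:
        raise RuntimeError("an illegal memory access was encountered")
    return "ok"

boxes = Tensor(device=1)       # allocated on cuda:1 via MODEL.DEVICE cuda:1
try:
    launch_kernel(boxes)       # current device is still 0 -> fails
except RuntimeError as e:
    print(e)

# The fix: temporarily switch to the tensor's device around the launch,
# then restore the previous device (what CUDAGuard does on destruction).
previous, current_device = current_device, boxes.device
result = launch_kernel(boxes)  # now succeeds
current_device = previous
```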
After fixing the issue:
```
2019-03-06 19:51:47,698 maskrcnn_benchmark.trainer INFO: Start training
2019-03-06 19:51:55,449 maskrcnn_benchmark.trainer INFO: eta: 3 days, 5:29:26  iter: 20  loss: 1.8457 (2.2689)  loss_classifier: 0.4143 (0.8737)  loss_box_reg: 0.0372 (0.0519)  loss_mask: 0.7941 (0.8378)  loss_objectness: 0.3711 (0.4117)  loss_rpn_box_reg: 0.0559 (0.0938)  time: 0.3500 (0.3875)  data: 0.0087 (0.0445)  lr: 0.001793  max mem: 3362
2019-03-06 19:52:02,983 maskrcnn_benchmark.trainer INFO: eta: 3 days, 4:24:40  iter: 40  loss: 1.3515 (1.9323)  loss_classifier: 0.3346 (0.6586)  loss_box_reg: 0.0855 (0.0689)  loss_mask: 0.7014 (0.7690)  loss_objectness: 0.2152 (0.3520)  loss_rpn_box_reg: 0.0460 (0.0838)  time: 0.3729 (0.3821)  data: 0.0100 (0.0277)  lr: 0.001927  max mem: 3466
```
This support can help developers run the program in environments where the choice of GPU is constrained.