This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Support for running on arbitrary CUDA device. #537

Merged 8 commits into facebookresearch:master Mar 26, 2019

Conversation

@atranitell (Contributor)

The current version cannot run normally on an arbitrary CUDA device (e.g. cuda:1). The program shows the following error if we only add the extra flag MODEL.DEVICE cuda:1:

2019-03-06 19:41:24,027 maskrcnn_benchmark.trainer INFO: Start training
THCudaCheck FAIL file=/home/dl/KJ/dl/core/maskrcnn-benchmark/maskrcnn_benchmark/csrc/cuda/nms.cu line=103 error=77 : an illegal memory access was encountered

The issue is that the currently active CUDA device is device 0, while the tensor data (boxes, etc.) is allocated on device 1. Therefore, we need to temporarily switch from device 0 to device 1.
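A minimal sketch of that temporary switch using the raw CUDA runtime API (names are illustrative, not the actual patch):

```cpp
// Illustrative sketch only: save the active device, switch to the device
// that owns the tensor data, do the work, then restore the original device.
#include <cuda_runtime.h>

void run_on_tensor_device(int data_device /* e.g. 1 for cuda:1 */) {
  int prev_device = 0;
  cudaGetDevice(&prev_device);   // remember the active device (e.g. cuda:0)
  cudaSetDevice(data_device);    // switch to the device holding the tensors
  // ... allocate workspace and launch kernels here ...
  cudaSetDevice(prev_device);    // restore the previously active device
}
```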

After fixing the issue:

2019-03-06 19:51:47,698 maskrcnn_benchmark.trainer INFO: Start training
2019-03-06 19:51:55,449 maskrcnn_benchmark.trainer INFO: eta: 3 days, 5:29:26 iter: 20 loss: 1.8457 (2.2689) loss_classifier: 0.4143 (0.8737) loss_box_reg: 0.0372 (0.0519) loss_mask: 0.7941 (0.8378) loss_objectness: 0.3711 (0.4117) loss_rpn_box_reg: 0.0559 (0.0938) time: 0.3500 (0.3875) data: 0.0087 (0.0445) lr: 0.001793 max mem: 3362
2019-03-06 19:52:02,983 maskrcnn_benchmark.trainer INFO: eta: 3 days, 4:24:40 iter: 40 loss: 1.3515 (1.9323) loss_classifier: 0.3346 (0.6586) loss_box_reg: 0.0855 (0.0689) loss_mask: 0.7014 (0.7690) loss_objectness: 0.2152 (0.3520) loss_rpn_box_reg: 0.0460 (0.0838) time: 0.3729 (0.3821) data: 0.0100 (0.0277) lr: 0.001927 max mem: 3466
This support can help developers run the program in constrained environments.

@facebook-github-bot

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@fmassa (Contributor) left a comment


Hi,

Thanks for the PR! This is indeed something that should be fixed, but I never got into doing this because I always run the code with CUDA_VISIBLE_DEVICES=1, which works fine.

I have an ask: could you use the CUDAGuard from c10 instead of manually calling cudaSetDevice etc.? This is the new pattern we are using in PyTorch 1.0.
Here is an example of utilization https://github.com/pytorch/pytorch/blob/1154506533bfe9428600fd69fa2e71dd172b7fec/aten/src/ATen/native/cuda/Copy.cu#L181
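For reference, a minimal sketch of what this pattern looks like in a custom op such as nms.cu (the signature and names below are assumptions for illustration, not the exact code of this PR):

```cpp
// Assumed example: an RAII guard pins all CUDA work in this scope to the
// device that owns the input tensor and restores the old device on exit.
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAGuard.h>

at::Tensor nms_cuda(const at::Tensor& boxes, float nms_overlap_thresh) {
  at::cuda::CUDAGuard device_guard(boxes.device());
  // ... allocations and kernel launches here now target boxes.device(),
  // e.g. cuda:1, instead of the process-default cuda:0 ...
  return at::empty({0}, boxes.options());  // placeholder return
}
```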

Thanks again!

@atranitell (Contributor, Author) commented Mar 7, 2019

Hi Massa,
In the past I also set an environment variable to control device usage. CUDAGuard is indeed an elegant way to manage the active device. I have added it to the commit.
Best,
Kai J

@facebook-github-bot added the CLA Signed label Mar 7, 2019
@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@atranitell (Contributor, Author) left a comment


done.

@jario-jin (Contributor)

It would be better to add a description in README.md. I also want to run on an arbitrary CUDA device, but I still don't know how to do it. Thanks.

@atranitell (Contributor, Author)

> It would be better to add a description in README.md. I also want to run on an arbitrary CUDA device, but I still don't know how to do it. Thanks.

Yeah, it's necessary to document the usage for users.

Here is an example for the Mask R-CNN R-50 FPN quick schedule on the second device (cuda:1):
```bash
# for training
python tools/train_net.py --config-file=configs/quick_schedules/e2e_mask_rcnn_R_50_FPN_quick.yaml MODEL.DEVICE cuda:1
```
@fmassa (Contributor) commented:

I would not recommend this as the first method for running on a different GPU.

Indeed, this approach used to allocate some memory on the first GPU. Nowadays it still does, but much less (~10 MB from what I tried), while using CUDA_VISIBLE_DEVICES really doesn't allocate anything there.

Once you make this change and first add a note on this env var, I think this PR is good to go.

@fmassa (Contributor) left a comment

Thanks!

@fmassa merged commit f031879 into facebookresearch:master Mar 26, 2019
@atranitell (Contributor, Author) left a comment


Indeed, we should offer suitable methods and explain the pros and cons to users.

fmassa added a commit that referenced this pull request Mar 26, 2019
eskjorg pushed a commit to eskjorg/maskrcnn-benchmark that referenced this pull request Mar 27, 2019
* support for any one cuda device

* Revert "support for any one cuda device"

This reverts commit 0197e4e.

* support runnning for anyone cuda device

* using safe CUDAGuard rather than intrinsic CUDASetDevice

* supplement a header dependency (test passed)

* Support for  arbitrary GPU device.

* Support for arbitrary GPU device.

* add docs for two method to control devices
Lyears pushed a commit to Lyears/maskrcnn-benchmark that referenced this pull request Jun 28, 2020