
Problem with multi-GPU training #58

robpal1990 opened this issue Oct 29, 2018 · 23 comments

@robpal1990 commented Oct 29, 2018

Hello,

I have successfully built maskrcnn_benchmark on Ubuntu 16.04. My workstation has 4x 1080Ti (CUDA 9.2, cuDNN 7, Nvidia drivers 410.48) and I tried to train on the COCO dataset on multiple GPUs. I used the script provided in the "Perform training on COCO dataset" section.

One GPU worked fine with:

python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1

Then I used

export NGPUS=2
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 4 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1

(the same train_net.py file and the same config, changed images per batch to 4) and everything worked fine.

Next I tried the same thing for 3 GPUs (changed NGPUS=3, images per batch to 6) and the training gets stuck during the first iteration. I have the following logging information and it does not change:

2018-10-29 17:20:55,722 maskrcnn_benchmark.trainer INFO: Start training
2018-10-29 17:20:58,453 maskrcnn_benchmark.trainer INFO: eta: 22 days, 18:04:18  iter: 0  loss: 6.7175 (6.7175)  loss_classifier: 4.4688 (4.4688)  loss_box_reg: 0.0044 (0.0044)  loss_mask: 1.4084 (1.4084)  loss_objectness: 0.7262 (0.7262)  loss_rpn_box_reg: 0.1097 (0.1097)  time: 2.7304 (2.7304)  data: 2.4296 (2.4296)  lr: 0.000833  max mem: 1749

The GPU memory is used, the temperature goes up, but nothing is happening (I tried multiple times and then gave up).
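
For reference, the 3-GPU attempt used the same launch command with just those two values changed:

export NGPUS=3
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 6 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1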

Any ideas? I'd be grateful for help.

@sneakerkg

I met the same issue. I'm using an 8-GPU machine and it works fine with NGPUS=2, but it gets stuck after the first iteration when using NGPUS=4 or 8.
torch (1.0.0a0+ff608a9)
torchvision (0.2.1)
CUDA 9.2

@fmassa (Contributor) commented Oct 29, 2018

Hi,

@robpal1990 So the training gets stuck once you use 3 GPUs, is that right? Did you try using 4 GPUs as well?

In the past we had hangs as well, but those were due to a driver bug that was fixed in versions 384, 390 and 396.

The driver versions with the fix are >=384.139 (for CUDA 9.0) and >=396.26 (for CUDA 9.2). They are out already. If you can, it's better to move to 396; if not, update 384.

Could you check that your drivers satisfy those requirements?
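
If it helps, a quick way to print the installed driver version is:

nvidia-smi --query-gpu=driver_version --format=csv,noheader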

@zxphistory

I also met the same issue; I am using a 3-GPU machine. It works fine when I use 2 GPUs, but gets stuck when using 3 GPUs.

PyTorch version: 1.0.0.dev20181029
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V
GPU 2: TITAN V

Nvidia driver version: 390.59
cuDNN version: Could not collect

@fmassa (Contributor) commented Oct 30, 2018

I'll try reproducing this and report back

@robpal1990 (Author)

Hi,

@robpal1990 So the training gets stuck once you use 3 GPUs, is that right? Did you try using 4 GPUs as well?

In the past we had hangs as well, but those were due to a driver bug that was fixed in versions 384, 390 and 396.

The driver versions with the fix are >=384.139 (for CUDA 9.0) and >=396.26 (for CUDA 9.2). They are out already. If you can, it's better to move to 396; if not, update 384.

Could you check that your drivers satisfy those requirements?

I tried using 4 and ran into the same issue.
As I mentioned in the first post, I have the 410.48 version of the drivers. I will try downgrading to a 396.xx version.

Thanks a lot for your help.

@robpal1990 (Author)

@fmassa
I first upgraded the drivers to the latest 410.xx version and the training on 3 GPUs did not work.
I then re-compiled CUDA 9.2 with 396.54 drivers and the training on 3 GPUs works.

You need to do this manually, from the local .deb file found on the Nvidia website; the cuda-9-2 package in the Ubuntu repo automatically installs the nvidia-410 and nvidia-410-dev packages, overwriting the 396 drivers.

Clearly a driver issue; it seems the framework is not compatible with 410.
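
For anyone else hitting this, the route I mean is roughly the following; treat it as a sketch, since the exact .deb filename and local repo path depend on the release you download from the Nvidia site:

# remove the repo packages that pull in the 410 driver (adjust names to what apt actually installed)
sudo apt remove cuda-9-2 nvidia-410 nvidia-410-dev
# install from the local CUDA 9.2 repo .deb downloaded from the Nvidia site (filename will differ)
sudo dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.148-1_amd64.deb
sudo apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
sudo apt update
sudo apt install cuda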

@fmassa (Contributor) commented Oct 30, 2018

@robpal1990 thanks a lot for the info.

So this seems to be related to the hangs that we were facing in the past. They were due to a problem with the driver when some cudnn convolution was selected.

@slayton58 @ngimel Are you aware of this hang with 410.48 drivers?

@chengyangfu (Contributor) commented Oct 31, 2018

I have the same problem too.

Environment:
Python: 3.5
GPU: 4x 1080Ti
CUDA : 9.0 (with all the patches)
CuDNN: 7.1
NCCL2: download from Nvidia
Nvidia Driver: 390, 396, 410
PyTorch: Compiled from the source ( v1.0rc0, and v1.0rc1)
Ubuntu : 16.04

The bug is weird to me. If I use only two GPUs, everything is fine. If I try to use 4 GPUs, it sometimes occurs.
P.S. I also found that when I use Nvidia driver 410, the frequency is much lower.

@robpal1990 (Author) commented Oct 31, 2018

A follow-up, since it seems that the issue is still present in some settings.

I didn't mention in my first post that I have everything installed without conda (simply with pip3 / python3). Training on 3 GPUs works.

Then I created a Docker build, extending the image pytorch/pytorch:nightly-devel-cuda9.2-cudnn7 (this one has conda inside). I followed the instructions as before, but in the Docker container training on 3 GPUs gets stuck on the first iteration again (1 and 2 GPUs work fine), and as of now I don't know how to fix it. From what I know, Docker shares the drivers (396.54) with the host machine, so this is surprising.
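
For completeness, this is roughly how I run the container (my-maskrcnn-image is just a placeholder for whatever tag you build on top of pytorch/pytorch:nightly-devel-cuda9.2-cudnn7; --ipc=host gives the dataloader workers access to the host's shared memory):

docker run --runtime=nvidia --ipc=host -it my-maskrcnn-image bash
# inside the container, same launch as on the host
export NGPUS=3
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 6 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1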

@yelantf (Contributor) commented Nov 12, 2018

Well, in my situation, training on 3 GPUs works fine, while training on 4 GPUs gets stuck. The code is run on a server with 4 Titan Xp cards. I hope this issue can be fixed soon.

@fmassa (Contributor) commented Nov 12, 2018

I think this is a bad mix of CUDA / cuDNN and the Nvidia driver, and I'm not sure there is anything we can do. You might need to update a few things on your system.

For info, here is the setup I use and which works fine:

PyTorch version: 1.0.0a0+dd2c487
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.2.88
GPU models and configuration: 8 Tesla V100
Nvidia driver version: 396.51

CUDNN 7.0
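
(This listing is just the output of PyTorch's environment script, so you can generate the same report on your machine with:)

python -m torch.utils.collect_env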

@yelantf (Contributor) commented Nov 12, 2018

Aren't CUDA and cuDNN built into PyTorch? I thought the versions I installed manually on my server would have no influence on the ones built into PyTorch itself. Here is my setup info.
@fmassa

PyTorch version: 1.0.0.dev20181108
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration: 
GPU 0: TITAN Xp
GPU 1: TITAN Xp
GPU 2: TITAN Xp
GPU 3: TITAN Xp

Nvidia driver version: 390.87
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect

@fmassa (Contributor) commented Nov 12, 2018

So, I think that 390.87 is a driver for CUDA 9.1, and it might be prior to the fix that I mentioned.

Updating the nvidia driver to the latest release should fix the issue.

@yelantf (Contributor) commented Nov 15, 2018

So, I think that 390.87 is a driver for CUDA 9.1, and it might be prior to the fix that I mentioned.

Updating the nvidia driver to the latest release should fix the issue.

It's weird. I did not touch the driver but only installed all the patches for my CUDA (9.0). After that was done, this issue was gone. I installed PyTorch using pip, and I thought CUDA is bundled with PyTorch when it is installed that way. If that is true, how would my own local CUDA patches influence the behavior of PyTorch code?
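
For reference, this is what I run to see which CUDA/cuDNN versions PyTorch itself was built with, as opposed to the ones installed on the system (these are standard PyTorch attributes, as far as I know):

python -c 'import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())'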

@zimenglan-sysu-512 (Contributor)

hi @robpal1990
how do you manually downgrade the driver?

@robpal1990 (Author)

hi @robpal1990
how do you manually downgrade the driver?

I had it installed via apt on Ubuntu, so simply type sudo apt remove nvidia-3xx nvidia-3xx-dev (fill in xx with your version).

Then I downloaded and installed the drivers from the Nvidia website (https://www.nvidia.com/Download/index.aspx). In my case I also had to install CUDA together with these drivers. Download CUDA from the Nvidia website as well, since the package in the apt repo will overwrite your drivers with the 410 version.
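
Concretely, the driver part looked something like this; the .run filename depends on the version you pick on the download page, so adjust it:

# remove whichever driver packages apt installed (fill in your version, e.g. nvidia-410)
sudo apt remove nvidia-410 nvidia-410-dev
# then run the .run installer downloaded from the Nvidia site (stop the display manager / X first)
sudo sh NVIDIA-Linux-x86_64-396.54.run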

@zimenglan-sysu-512 (Contributor)

hi @fmassa
is CUDA 10 supported right now?

@fmassa (Contributor) commented Nov 16, 2018

PyTorch supports CUDA 10, but you need to compile PyTorch from source, I think.
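
The usual outline is something like this (see the PyTorch README for the full dependency list; this is just the skeleton):

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
python setup.py install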

@zimenglan-sysu-512 (Contributor)

hi @chengyangfu @fmassa
I find that after installing the new version of NCCL, the hang doesn't occur. It seems to be solved.
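
To check which NCCL version PyTorch is actually using, something like this should work (torch.cuda.nccl.version() is a standard call, as far as I know):

python -c 'import torch; print(torch.cuda.nccl.version())'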

@Godricly (Contributor)

May I ask what CUDA and driver versions you are using? I'm stuck with this issue on CUDA 8.0.61, cuDNN 7102, driver 390.97, even with 2 1080Ti cards. I tried both the nightly and stable versions of PyTorch.

@Godricly (Contributor)

I solved my case. When no positive example is present in training, it blows up. 😞 I think it's related to the following issue.
One tricky solution is to increase your batch size. 👿
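
That is, the same launch as earlier in this thread but with more images per GPU, for example (only SOLVER.IMS_PER_BATCH changed here; the learning rate and schedule probably want rescaling as well):

export NGPUS=2
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 8 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1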

@chengyangfu (Contributor)

Recently, I updated my PyTorch to v1.0.0 and it solved this problem.

Driver Version: 415.27
CUDA version: cuda_9.2.148_396.37 + patch 1
CUDNN version: cudnn-9.2-linux-x64-v7.3.1
NCCL version: nccl_2.3.7-1+cuda9.2

@lanfeng4659

hi @chengyangfu @fmassa
I find that after installing the new version of NCCL, the hang doesn't occur. It seems to be solved.

Could you tell me the versions of those packages (PyTorch, CUDA, Nvidia driver, and NCCL)? Thank you!
