multi GPU training problem #208
Comments
Hi, thanks,
Update -
When training, it's fine to use 1, 2, or 3 GPUs, but with 4 or 8 GPUs it easily hangs. So I tried printing something in the do_train function and found that the procedure gets stuck on those lines. It looks like something is wrong in the dataloader. I set
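Not from the thread, but one way to see exactly where a hung worker is stuck is Python's built-in faulthandler: register a signal handler near the top of the training script and send the stuck process a signal (e.g. `kill -USR1 <pid>`) to get a stack dump for every thread. A minimal sketch:

```python
# Hypothetical diagnostic, not part of maskrcnn-benchmark: after this runs,
# `kill -USR1 <pid>` makes the process print the Python stack of every thread,
# which shows whether it is stuck in the dataloader or in a collective op.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)

# ... the rest of the training script (e.g. the do_train loop) runs as usual ...
```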
@zimenglan-sysu-512 what's your system configuration? Please copy and paste the output from the environment collection script. You can get the script and run it with:
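For reference, roughly the same environment report can also be produced from within Python (a sketch, assuming a PyTorch build that ships torch.utils.collect_env):

```python
# Prints roughly the same environment report as the standalone collect_env.py script.
from torch.utils.collect_env import get_pretty_env_info

print(get_pretty_env_info())
```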
hi @fmassa
btw, I don't use
Can you launch your jobs with
and paste the result here?
Output as below:
Can you paste the rest of the stack trace?
I don't have an idea yet of what it could be.
I also hit the same problem on 4 x P40 and 4 x V100 servers. It seems related to data loading: on a server where the disk is slow, it is more likely to hang, but I am not totally sure data loading is the cause. I have tried sleeping before main and using spawn, but neither solves the problem.
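A rough sketch of what the two workarounds mentioned above look like (the dataset and batch size below are placeholders, and neither change is guaranteed to fix the hang):

```python
# Sketch of the workarounds tried above: force the "spawn" start method and
# reduce/disable DataLoader worker processes.
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Force the "spawn" start method; this must happen before any workers are created.
    mp.set_start_method("spawn", force=True)

    dataset = TensorDataset(torch.arange(16).float())  # stand-in for the real detection dataset
    # num_workers=0 keeps loading in the main process, which rules out worker deadlocks.
    loader = DataLoader(dataset, batch_size=4, num_workers=0)
    for (batch,) in loader:
        pass
```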
This is very weird. I've not experienced deadlocks for a while.
The PyTorch version is:
Can you try changing the distributed backend to use the new c10d backend?
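For readers following along, the suggested change amounts to swapping the deprecated module for the current torch.distributed package. A minimal sketch of the process-group initialization, assuming the usual env:// rendezvous set up by torch.distributed.launch:

```python
# Old, deprecated backend (what the code in this thread was using):
#   import torch.distributed.deprecated as dist
# New c10d backend:
import torch.distributed as dist

# Assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are set in the
# environment, e.g. by torch.distributed.launch.
dist.init_process_group(backend="nccl", init_method="env://")
```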
hi @fmassa
@fmassa After replacing torch.distributed.deprecated with torch.distributed, I ran the previous experiments 3 times and did not encounter the deadlocks. I will run more experiments to see whether c10d solves the problem.
Awesome, thanks for the information! If it indeed solves all the problems for you, would you mind sending a PR replacing the deprecated backend with the new one?
After replacing all instances of
Great, I'll be merging the PR that replaces
I encountered a similar issue, but torch.distributed.deprecated has already been replaced with torch.distributed. I thought it was a PyTorch issue and posted it here. Any more suggestions on a fix? https://discuss.pytorch.org/t/distributed-training-hangs/46263
❓ Questions and Help
Hi,
I tried to run the benchmark on a 4x P100 GPU machine several times.
Most of the time (six out of eight tries) the GPUs were stuck - utilization at 100%, memory occupied, seemingly working - but nothing happened (not a single iteration log was printed, no checkpoint was saved).
In two of my tries the network trained normally (a regular successful run, with logs, checkpoints, etc.).
I noticed that on the successful tries the "Start training" line was printed just before the training started. Example:
creating index...
index created!
Done (t=4.16s)
creating index...
index created!
Done (t=4.23s)
creating index...
index created!
2018-11-24 07:51:25,647 maskrcnn_benchmark.trainer INFO: Start training
2018-11-24 07:51:33,212 maskrcnn_benchmark.trainer INFO: eta: 1 day, 13:48:52 iter: 20 loss: 1.1634 (1.5222) loss_classifier: 0.5153 (0.8516) loss_box_reg: 0.0399 (0.0403) loss_objectness: 0.4983 (0.4964) loss_rpn_box_reg: 0.1370 (0.1338) time: 0.2600 (0.3782) data: 0.0060 (0.1302) lr: 0.001793 max mem: 1840
and on the unsuccessful runs the location of the "Start training" line was different:
loading annotations into memory...
loading annotations into memory...
Done (t=3.61s)
creating index...
Done (t=3.59s)
creating index...
index created!
index created!
2018-11-24 07:42:22,458 maskrcnn_benchmark.trainer INFO: Start training
Done (t=4.03s)
creating index...
Done (t=4.06s)
creating index...
index created!
index created!
end of log
Is this problem known?
Can I bypass it somehow?
I'm working with:
Nvidia driver version: 396.26
CUDA used to build PyTorch: 9.0.176
OS: CentOS Linux 7 (Core)
Thanks