
training across multiple nodes does not work #22

Closed
forever208 opened this issue Jan 8, 2022 · 4 comments

forever208 commented Jan 8, 2022

If the number of GPUs is greater than 8 (each node has 8 GPUs), then I have to train across several nodes.

In this case, running mpiexec -n 16 python scripts/image_train.py doesn't work.

It fails with an NCCL error.


forever208 commented Jan 12, 2022

Instead of using mpiexec -n 16, the following Slurm settings work on 2 nodes with 16 GPUs:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8 # 8 gpus for each node

Then launch with mpirun python scripts/image_train.py.
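
For reference, here is a minimal sketch of a complete Slurm batch script built from the settings above; the environment activation line and the data path are placeholders rather than values from this thread:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8 # 8 gpus for each node

# Placeholder: activate the Python environment that has this repo's dependencies installed.
# source activate guided-diffusion

# Slurm defines 2 x 8 = 16 tasks (one MPI rank per GPU), so mpirun needs no -n flag here.
# Append the model/diffusion/training flags from the repo README to this command.
mpirun python scripts/image_train.py --data_dir /path/to/images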


forever208 commented Jun 17, 2022

Another issue people might run into: remember to change the parameter GPUS_PER_NODE in guided_diffusion/dist_util.py if your cluster does not have 8 GPUs per node.

The author's default value is 8.
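
A quick way to check this before launching (a sketch; the grep line only echoes the default noted above):

# Confirm the per-node GPU count assumed by the code
grep -n "GPUS_PER_NODE" guided_diffusion/dist_util.py
# The repo's default is GPUS_PER_NODE = 8; edit that constant to match your
# hardware, e.g. 4 on a cluster with 4 GPUs per node.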

@furkan-celik

Hi, I am having some trouble with mpiexec and this repo. I installed the libopenmpi package in an Ubuntu Docker image and I am trying to run with multiple GPUs. However, when I run mpiexec -n 2 python scripts/image_train.py, training doesn't start and the process gets stuck. I wonder whether I need a specific setting in the Docker image. Can you help me with this?

@Germany321

Hi, I tried your mpirun method with 2 nodes and 16 GPUs. However, an NCCL error occurs when I do this:

File "scripts/image_train.py", line 59, in main batch_size_sample=args.batch_size_sample, File "/mnt/lustre/data/research/workspace/guided-diffusion-main/guided_diffusion/train_util.py", line 75, in __init__ self._load_and_sync_parameters() File "/mnt/lustre/data/research/workspace/guided-diffusion-main/guided_diffusion/train_util.py", line 130, in _load_and_sync_parameters dist_util.sync_params(self.model.parameters()) File "/mnt/lustre/data/research/workspace/guided-diffusion-main/guided_diffusion/dist_util.py", line 83, in sync_params dist.broadcast(p, 0) File "/mnt/cache/anaconda3/envs/improved_diffusion/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast work = default_pg.broadcast([tensor], opts) RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
