I have some questions about multi-node training.
Specifically, I tried your script (mpiexec -n 16, and also mpirun) on 2 nodes with 16 GPUs for ImageNet, but an NCCL error still occurs.
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Hi @zkcys001, have you changed the parameter to GPUS_PER_NODE = 8 in the script dist_util.py?
If so, please try the guided-diffusion repo for 2-node, 16-GPU training and see whether you hit the same NCCL issue.
If you do, please also search the guided-diffusion repo for a similar issue and solution.
For example, this discussion might be helpful for your problem: issue #22.
I think the NCCL problem is usually caused by the GPU cluster setup and the NCCL version.
Personally, I also ran into some issues with multi-node training when I used their code.
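To make the GPUS_PER_NODE remark concrete: dist_util-style launchers typically pin each MPI rank to a local GPU by taking the rank modulo GPUS_PER_NODE, so the value must match the GPUs actually present per node. The function below is a hypothetical standalone sketch of that mapping, not the repo's exact code (the real script reads the rank from mpi4py's MPI.COMM_WORLD):

```python
# Hypothetical standalone sketch of the rank-to-GPU mapping that
# dist_util-style scripts perform. The real code obtains the rank
# from MPI.COMM_WORLD rather than taking it as an argument.
def local_gpu_index(global_rank: int, gpus_per_node: int = 8) -> int:
    """Map a global MPI rank to the CUDA device index on its node."""
    return global_rank % gpus_per_node

# With GPUS_PER_NODE = 8 on 2 nodes x 8 GPUs:
# ranks 0-7 map to GPUs 0-7 on node 1, ranks 8-15 to GPUs 0-7 on node 2.
```

If GPUS_PER_NODE were left at a wrong value, two ranks could be assigned the same device, which is one way to trigger invalid-usage NCCL failures.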
Thanks for your good work!
Script:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
mpirun python scripts/image_train.py --data_dir /input/datasets/imagenet/train.zip --attention_resolutions 32,16,8 --class_cond True \
    --diffusion_steps 1000 --image_size 128 --learn_sigma True --noise_schedule linear --num_channels 256 --num_heads 4 \
    --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --lr 1e-4 --batch_size 8 --logger_dir '/input/guide_diffusion/image128con'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
mpiexec -n 16 python scripts/image_train.py --data_dir /input/datasets/imagenet/train.zip --attention_resolutions 32,16,8 --class_cond True \
    --diffusion_steps 1000 --image_size 128 --learn_sigma True --noise_schedule linear --num_channels 256 --num_heads 4 \
    --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --lr 1e-4 --batch_size 8 --logger_dir '/input/guide_diffusion/image128con'
Error:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
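When chasing an error like this, a common first step (generic NCCL/PyTorch debugging, not specific to this repo) is to enable NCCL's own logging before the process group is created. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables; they can also be exported in the sbatch script instead of set in Python:

```python
# Enable NCCL debug logging; these must be set before the first
# torch.distributed init / collective call to take effect.
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"

# ... then initialize as usual, e.g.
# torch.distributed.init_process_group(backend="nccl", ...)
```

With this enabled, every rank prints its NCCL version and network setup at init time, which makes version mismatches between nodes easy to spot.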