Multi-node training does not work #2

Closed
zkcys001 opened this issue Feb 3, 2023 · 1 comment

zkcys001 commented Feb 3, 2023

Thanks for your good work!

I have some questions about multi-node training.
Specifically, I tried your script (mpiexec -n 16 or mpirun) on 2 nodes with 16 GPUs for ImageNet, but an NCCL error still occurs.

Scripts (mpirun and mpiexec):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8

export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1

mpirun python scripts/image_train.py --data_dir /input/datasets/imagenet/train.zip --attention_resolutions 32,16,8 --class_cond True \
    --diffusion_steps 1000 --image_size 128 --learn_sigma True --noise_schedule linear --num_channels 256 --num_heads 4 \
    --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --lr 1e-4 --batch_size 8 --logger_dir '/input/guide_diffusion/image128con'

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8

export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1

mpiexec -n 16 python scripts/image_train.py --data_dir /input/datasets/imagenet/train.zip --attention_resolutions 32,16,8 --class_cond True \
    --diffusion_steps 1000 --image_size 128 --learn_sigma True --noise_schedule linear --num_channels 256 --num_heads 4 \
    --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --lr 1e-4 --batch_size 8 --logger_dir '/input/guide_diffusion/image128con'

Error:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
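
The NCCL version shown in the error is the one this PyTorch build was compiled against; a quick way to confirm it on each node (a minimal check, assuming a standard CUDA-enabled PyTorch install):

# Print the PyTorch version and the NCCL version it ships with.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("NCCL version:", torch.cuda.nccl.version())  # e.g. (2, 7, 8)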

forever208 (Owner) commented Feb 3, 2023

Hi @zkcys001, have you changed the parameter GPUS_PER_NODE to 8 in dist_util.py?
If you have, please try the original guided-diffusion repo with 2 nodes / 16 GPUs and see whether the same NCCL issue appears.
If it does, also search the guided-diffusion repo for a similar issue and solution; for example, the discussion in guided-diffusion issue #22 might be helpful.

I think the NCCL problem is usually caused by the GPU cluster setup and the NCCL version.
Personally, I also had some issues with multi-node training when I used their code.
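
For reference, here is a minimal sketch of the MPI-driven setup that GPUS_PER_NODE controls, in the style of guided-diffusion's dist_util.py (an assumed layout for illustration; the actual file may differ in details such as port selection):

# Minimal sketch of an MPI-based process-group setup in the style of
# guided-diffusion's dist_util.py (assumed layout; details may differ).
import os
import socket

import torch as th
import torch.distributed as dist
from mpi4py import MPI

GPUS_PER_NODE = 8  # must match the number of GPUs physically present on each node


def setup_dist():
    """Initialise torch.distributed from the MPI environment."""
    if dist.is_initialized():
        return

    # Pin each MPI rank to one local GPU. With GPUS_PER_NODE = 8 and 16 ranks
    # filled node by node, ranks 0-7 use GPUs 0-7 of the first node and
    # ranks 8-15 use GPUs 0-7 of the second node.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(MPI.COMM_WORLD.Get_rank() % GPUS_PER_NODE)

    comm = MPI.COMM_WORLD
    backend = "nccl" if th.cuda.is_available() else "gloo"

    # Rank 0 broadcasts its hostname so every rank rendezvous at the same address.
    hostname = "localhost" if backend == "gloo" else socket.gethostbyname(socket.getfqdn())
    os.environ["MASTER_ADDR"] = comm.bcast(hostname, root=0)
    os.environ["RANK"] = str(comm.rank)
    os.environ["WORLD_SIZE"] = str(comm.size)
    os.environ.setdefault("MASTER_PORT", "29500")  # fixed port here just for the sketch

    dist.init_process_group(backend=backend, init_method="env://")

If GPUS_PER_NODE does not match the hardware, two ranks can end up selecting the same GPU, which is one common way to hit an ncclInvalidUsage error like the one above.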
