
training across multiple nodes does not work #22

Closed
forever208 opened this issue Jan 8, 2022 · 4 comments

forever208 commented Jan 8, 2022

If the number of GPUs is greater than 8 (each node has 8 GPUs), then I have to train across several nodes.

In this case, running mpiexec -n 16 python scripts/image_train.py doesn't work.

It fails with an NCCL error.


forever208 commented Jan 12, 2022

Instead of using mpiexec -n 16, the following Slurm settings work on 2 nodes with 16 GPUs:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8 # 8 gpus for each node

Then launch with mpirun python scripts/image_train.py.
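
For reference, here is a minimal sketch of a complete Slurm batch script built from the settings above; the environment activation line and the data path are placeholders rather than values from this thread:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8 # 8 gpus for each node

# Placeholder: activate the Python environment that has this repo's dependencies installed.
# source activate guided-diffusion

# Slurm defines 2 x 8 = 16 tasks (one MPI rank per GPU), so mpirun needs no -n flag here.
# Append the model/diffusion/training flags from the repo README to this command.
mpirun python scripts/image_train.py --data_dir /path/to/images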


forever208 commented Jun 17, 2022

Another issue people might run into: remember to change the parameter GPUS_PER_NODE in guided_diffusion/dist_util.py if your cluster does not have 8 GPUs per node.

The author's default value is 8.
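
A quick way to check this before launching (a sketch; the grep line only echoes the default noted above):

# Confirm the per-node GPU count assumed by the code
grep -n "GPUS_PER_NODE" guided_diffusion/dist_util.py
# The repo's default is GPUS_PER_NODE = 8; edit that constant to match your
# hardware, e.g. 4 on a cluster with 4 GPUs per node.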

@furkan-celik

Hi, I am having some trouble with mpiexec and this repo. I installed the libopenmpi package in an Ubuntu Docker image and I am trying to run with multiple GPUs. However, when I run mpiexec -n 2 python scripts/image_train.py, training doesn't start and the process gets stuck. I wonder whether I need a specific setting in the Docker image. Can you help me with this?

@Germany321

Hi, I tried your mpirun method with 2 nodes and 16 GPUs. However, an NCCL error occurs when I do this:

File "scripts/image_train.py", line 59, in main batch_size_sample=args.batch_size_sample, File "/mnt/lustre/data/research/workspace/guided-diffusion-main/guided_diffusion/train_util.py", line 75, in __init__ self._load_and_sync_parameters() File "/mnt/lustre/data/research/workspace/guided-diffusion-main/guided_diffusion/train_util.py", line 130, in _load_and_sync_parameters dist_util.sync_params(self.model.parameters()) File "/mnt/lustre/data/research/workspace/guided-diffusion-main/guided_diffusion/dist_util.py", line 83, in sync_params dist.broadcast(p, 0) File "/mnt/cache/anaconda3/envs/improved_diffusion/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast work = default_pg.broadcast([tensor], opts) RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
