Training across multiple nodes does not work #22
Comments
Instead of using #SBATCH --nodes=2, use
Another issue people might run into: remember to change the parameter. The default value, set by the author, is 8.
Hi, I am having some trouble with mpiexec in this repo. I installed the libopenmpi package in a Docker image based on Ubuntu and am now trying to run on multiple GPUs. However, when I run mpiexec -n 2 python script/image_train.py, training does not start and the process gets stuck somewhere. I wonder whether I need a specific setting in the Docker image. Can you help me with this?
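One way to narrow down a hang like this is to check whether mpiexec launches the processes at all inside the container, before any training code runs. The sketch below assumes Open MPI is the launcher (the comment mentions libopenmpi) and reads the environment variables Open MPI exports to each process it starts. The function name describe_rank and the file name check_mpi.py are hypothetical and not part of the repository.

```python
import os
import socket


def describe_rank(env=None):
    """Summarize the launch info Open MPI exports to each process it starts."""
    env = os.environ if env is None else env
    rank = env.get("OMPI_COMM_WORLD_RANK")
    size = env.get("OMPI_COMM_WORLD_SIZE")
    local_rank = env.get("OMPI_COMM_WORLD_LOCAL_RANK")
    if rank is None:
        # The variable is missing when the script is started without Open MPI.
        return "not launched by Open MPI (OMPI_COMM_WORLD_RANK is unset)"
    return f"rank {rank} of {size}, local rank {local_rank}"


if __name__ == "__main__":
    # flush=True so the line appears even if a later step were to hang.
    print(f"{socket.gethostname()}: {describe_rank()}", flush=True)
```

If this is saved as check_mpi.py and run with mpiexec -n 2 python check_mpi.py, one line per rank would suggest that the launcher itself works and that the hang is more likely in the distributed setup of the training script.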
Hi, I tried your mpirun method and set it to 2 nodes with 16 GPUs in total. However, an NCCL error occurs when I do this.
If the number of GPUs is greater than 8 (each node has 8 GPUs), I have to train across several nodes.
In this case, running
mpiexec -n 16 python script/image_train.py
does not work. It reports an NCCL error.
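To make the 16-process, 8-GPUs-per-node layout concrete, here is a minimal sketch of how a global rank could be mapped to a node and a local GPU index. It assumes block placement (consecutive ranks fill one node before the next), which depends on how the launcher maps processes. The helper names are hypothetical and are not claimed to match what the repository does.

```python
def node_index(rank, gpus_per_node=8):
    """Node that hosts this rank, assuming consecutive ranks fill a node first."""
    return rank // gpus_per_node


def local_device_index(rank, gpus_per_node=8):
    """GPU index on the hosting node; always in range(gpus_per_node)."""
    return rank % gpus_per_node


if __name__ == "__main__":
    # 16 ranks over 2 nodes with 8 GPUs each, as in the report above.
    for rank in range(16):
        print(f"rank {rank}: node {node_index(rank)}, "
              f"gpu {local_device_index(rank)}")
```

Under these assumptions, ranks 0 to 7 land on the first node and ranks 8 to 15 on the second, each using local GPUs 0 to 7. A per-node GPU count that does not match the hardware would make some ranks pick a device index that does not exist or that another rank already uses.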