Commit
fix rdzv_id
rdzv_id may not be equal across nodes in a multi-node setup when using $RANDOM
Mddct authored May 6, 2024
1 parent e48aaeb commit d23693a
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions distributed/ddp-tutorial-series/slurm/sbatch_run.sh
@@ -14,10 +14,11 @@ head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
 echo Node IP: $head_node_ip
 export LOGLEVEL=INFO
 
+job_id=2024
 srun torchrun \
 --nnodes 4 \
 --nproc_per_node 1 \
---rdzv_id $RANDOM \
+--rdzv_id ${jobid} \
 --rdzv_backend c10d \
 --rdzv_endpoint $head_node_ip:29500 \
-/shared/examples/multinode_torchrun.py 50 10
+/shared/examples/multinode_torchrun.py 50 10
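
As shown, the diff assigns `job_id` but expands `${jobid}`. In bash an unset variable expands to an empty string (unless `set -u` is active), so `--rdzv_id` would not receive 2024. The following is a minimal sketch, not the commit's code, of one way to keep the two names consistent. It assumes the script runs inside a Slurm batch job, where Slurm sets `SLURM_JOB_ID`, and falls back to the fixed value 2024 from the diff otherwise. The helper names `pick_rdzv_id` and `rdzv_args` are hypothetical.

```shell
# Sketch only: choose one rendezvous id that every node in the job shares.
# SLURM_JOB_ID is set by Slurm inside a batch job; 2024 is the fallback
# value taken from the diff (an assumption, any fixed id would do).
pick_rdzv_id() {
  echo "${SLURM_JOB_ID:-2024}"
}

# Define and reference the SAME variable name (job_id, not jobid).
job_id="$(pick_rdzv_id)"

# Hypothetical helper: print the rendezvous arguments instead of
# launching torchrun, so the expansion can be inspected first.
rdzv_args() {
  printf '%s\n' "--rdzv_id ${job_id} --rdzv_backend c10d"
}

rdzv_args
```

In the sbatch script itself, the corresponding line would then read `--rdzv_id "${job_id}" \`, matching the variable assigned above it.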