Freeze on InfiniBand with Slurm #1489
Update: it turns out there's something wrong with the processing of the command line. If I create a YAML file with the same parameters per node, I get it to run just fine. So, this works:

#!/bin/bash
#SBATCH --account=training2306
#SBATCH --nodes=2
#SBATCH --job-name=ai-multi-gpu
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --output=out-distrib.%j
#SBATCH --error=err-distrib.%j
#SBATCH --time=00:20:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:4
# Without this, srun does not inherit cpus-per-task from sbatch.
export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"
# so processes know who to talk to
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
# Allow communication over InfiniBand cells.
MASTER_ADDR="${MASTER_ADDR}i"
# Get IP for hostname.
export MASTER_ADDR="$(nslookup "$MASTER_ADDR" | grep -oP '(?<=Address: ).*')"
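# Note (an alternative, not part of the original script): getent avoids
# depending on nslookup's output format for the same lookup:
#   export MASTER_ADDR="$(getent hosts "${MASTER_ADDR}" | awk '{print $1}')"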
export MASTER_PORT=7010
export GPUS_PER_NODE=4
# Make sure we are in the right directory
cd $HOME/2023-may-intro-to-supercompting-jsc/src
# This loads modules and python packages
source sc_venv_template/activate.sh
# Set up accelerate config.
export ACCELERATE_CONFIG_YAML=accelerate_config_"$SLURM_JOB_ID".yaml
srun bash -c "((\$SLURM_PROCID)) || cat <<EOT > \"\$ACCELERATE_CONFIG_YAML\"
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: \$SLURM_NODEID
main_process_ip: '\$MASTER_ADDR'
main_process_port: \$MASTER_PORT
main_training_function: main
mixed_precision: 'no'
num_machines: \$SLURM_JOB_NUM_NODES
num_processes: \$((SLURM_JOB_NUM_NODES * GPUS_PER_NODE))
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
EOT"
# Run the demo
time srun bash -c 'accelerate launch \
--config_file=$ACCELERATE_CONFIG_YAML \
distrib.py'
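As a sanity check (a hedged suggestion, not part of the job script above), each task can print the rendezvous-related lines of the config it will actually use before the real launch:

# Hypothetical check, run between config generation and the accelerate launch:
# every task reports its hostname plus the fields that must differ/agree per node.
srun bash -c 'echo "rank $SLURM_PROCID on $(hostname):"; grep -E "machine_rank|main_process_ip|main_process_port|num_machines" "$ACCELERATE_CONFIG_YAML"'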
cc @muellerzr for the CLI

The issue is that the command-line settings don't properly set the rdzv_backend (which defaults to static).
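A quick way to see whether the installed accelerate CLI exposes the rendezvous option at all (an illustrative check, not from the thread) is to grep its help output:

# If this prints nothing, the installed accelerate version does not advertise
# an --rdzv_backend option in its launch help.
accelerate launch --help | grep -i rdzv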
@surak can you try again, installing accelerate with

Thanks @muellerzr for the quick response, this was really lacking from
Np @janEbert! It was on my list to get to. With how our CLI works, as long as our argparser knows the param to use, that's usually all it takes to get it up and going (you can see I literally added a single line in the PR here: #1490). I also want to mildly change the internals so it's not as static as it is right now, but that's a when-I-have-time thing.
I tried with
And I tried the script like this:

time srun bash -c 'accelerate launch \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--multi_gpu \
--mixed_precision no \
--num_processes=$(($SLURM_JOB_NUM_NODES * 4)) \
--dynamo_backend=no \
--num_machines=$SLURM_JOB_NUM_NODES \
--machine_rank=$SLURM_NODEID \
--rdzv_backend c10d \
--same_network false \
distrib.py'

While it does set the backend correctly, there are things missing which stop it from working. This is the log when using the generated YAML file described in the previous comments. This worked:

entrypoint : distrib.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 4
run_id : none
rdzv_backend : static
rdzv_endpoint : 10.13.31.231:7010
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}

And this is with the command-line parameters above. This DOES NOT work:

entrypoint : false
min_nodes : 2
max_nodes : 2
nproc_per_node : 4
run_id : none
rdzv_backend : c10d
rdzv_endpoint :
rdzv_configs : {'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}

Two things:
What's really happening here I believe is the new version fixed a broken issue, aka

To test our theory, run your code with raw torchrun please 😄 :

srun bash -c 'torchrun \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
--nproc_per_node=4 \
--nnodes=$NNODES \
--rdzv_backend c10d \
distrib.py'

This should fail, and only work when you use

srun bash -c 'torchrun \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
--nproc_per_node=4 \
--nnodes=$NNODES \
--rdzv_backend static \
distrib.py'
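For what it's worth, here is a sketch of how these test commands could be wired up under Slurm. The NNODES mapping and the explicit c10d endpoint/id are assumptions of mine, not something stated in the thread (the failing dump above shows an empty rdzv_endpoint), and whether c10d works at all over this machine's InfiniBand setup is exactly the open question here:

# Assumptions, not from the thread: NNODES comes from Slurm's node count, and
# the c10d rendezvous is given an explicit endpoint and id.
export NNODES="$SLURM_JOB_NUM_NODES"
srun bash -c 'torchrun \
    --nnodes=$NNODES \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    --rdzv_id=$SLURM_JOB_ID \
    distrib.py'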
@muellerzr I'm pretty sure it's correct that

The issue I linked to is patched with the PyTorch module we supply on the machine, but it still requires usage of

Got it, I'm with you now.
In summary, the command @surak posted should be the following (just removing the
If we can verify that works, then I'll go ahead and merge the PR :)

Works for me indeed! Thanks a lot!

Does that warrant a new point release? :-)

@surak we'll have a new release soon on our usual release cycle
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)

Reproduction
Fails with multi-node.
This machine has the InfiniBand interfaces suffixed with i, so a compute node responds to hostname with something like juwels07, but the right interface is juwels07i. There's some script magic for that in the script.

The Slurm launch is:
This is the error output, and it stays like this until the job times out (the identical job but with only one node works):
This is the tail of the NCCL_DEBUG output:
The full error and output are here https://gist.github.com/surak/5f3f236616e5db48f19d31df457b4350
Expected behavior
A similar script works in an ethernet cluster. I would like to see what is actually frozen, but there's no output other than that.
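The NCCL_DEBUG output above already covers the NCCL side; a hedged suggestion (not from the thread) for seeing where the Python side is stuck is to add torch.distributed logging and attach py-spy to a hung rank. The jobid/node/pid values are placeholders:

# Illustrative debugging knobs, set before launching the job.
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra consistency checks/logging in torch.distributed
export TORCH_CPP_LOG_LEVEL=INFO        # surface c10d/rendezvous log messages
# Once it hangs, hop onto a compute node of the running job (newer Slurm) and
# dump the Python stack of a stuck rank:
#   srun --jobid=<jobid> -w <node> --overlap --pty bash
#   py-spy dump --pid <pid-of-the-hung-python-process>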