
Freeze on infiniband with slurm #1489

Closed · 4 tasks
surak opened this issue May 30, 2023 · 17 comments

Labels
bug Something isn't working · solved The bug or feature request has been solved, but the issue is still opened

Comments

surak commented May 30, 2023

System Info

- `Accelerate` version: 0.19.0
- Platform: Linux-4.18.0-425.13.1.el8_7.x86_64-x86_64-with-glibc2.28
- Python version: 3.10.4
- Numpy version: 1.22.3
- PyTorch version (GPU?): 1.12.0 (False)
- System RAM: 503.49 GB
- `Accelerate` default config:
	Not found

Using the command-line parameters:

srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision=no \
    --num_processes=$(($NNODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$NNODES  \
    --machine_rank=$SLURM_NODEID \
    --rdzv_conf "rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT,rdzv_backend=c10d" \
    distrib.py'

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Fails with multi-node.

This machine has the InfiniBand interfaces suffixed with i, so a compute node responds to hostname with something like juwels07, but the right interface is juwels07i. There's some script magic for that in the script below.

Slurm launch is:

#!/bin/bash -x
#SBATCH --account=training2306
#SBATCH --nodes=2
#SBATCH --job-name=ai-multi-gpu
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --output=out-distrib.%j
#SBATCH --error=err-distrib.%j
#SBATCH --time=00:20:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:4

# srun does not inherit cpus-per-task from sbatch
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
# so processes know who to talk to
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
# Allow communication over InfiniBand cells.
MASTER_ADDR="${MASTER_ADDR}i"
# Get IP for hostname.
export MASTER_ADDR="$(nslookup "$MASTER_ADDR" | grep -oP '(?<=Address: ).*')"

export MASTER_PORT=7010
export GPUS_PER_NODE=4
export NNODES=$SLURM_JOB_NUM_NODES
# Do not remove this workaround, or the training will hang and nodes will be lost
export CUDA_LAUNCH_BLOCKING=1

# hide duplicated errors using this hack - will be properly fixed in pt-1.12
export TORCHELASTIC_ERROR_FILE=/tmp/torch-elastic-error.json

# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1

# handle timeouts
export NCCL_IB_TIMEOUT=20

# Make sure we are in the right directory
cd $HOME/2023-may-intro-to-supercompting-jsc/src

# This loads modules and python packages
source sc_venv_template/activate.sh

export LOGLEVEL=INFO
# Run the demo
time srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision=no \
    --num_processes=$(($NNODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$NNODES  \
    --machine_rank=$SLURM_PROCID \
    --rdzv_conf "rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT rdzv_backend=c10d" \
    distrib.py'

This is the error output; it stays like this until the job times out (the identical job works with only one node):

INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : distrib.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 10.13.23.78:7010
  rdzv_configs     : {'rdzv_endpoint': '10.13.23.78:7010', 'rdzv_backend': 'c10d', 'rank': 1, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : distrib.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 10.13.23.78:7010
  rdzv_configs     : {'rdzv_endpoint': '10.13.23.78:7010', 'rdzv_backend': 'c10d', 'rank': 0, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic__7nlgv08/none_jscf2i4f
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=10.13.23.78
  master_port=7010
  group_rank=1
  group_world_size=2
  local_ranks=[0, 1, 2, 3]
  role_ranks=[4, 5, 6, 7]
  global_ranks=[4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/3/error.json
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_j88hm8vm/none_tgket3up
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=10.13.23.78
  master_port=7010
  group_rank=0
  group_world_size=2
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/3/error.json
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).

This is the tail of the NCCL_DEBUG output:

jwb0093:11954:12012 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
jwb0093:11954:12012 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
jwb0093:11953:12015 [1] NCCL INFO Connected all trees
jwb0093:11953:12015 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
jwb0093:11953:12015 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
jwb0092:26778:26839 [2] NCCL INFO comm 0x3792dff0 rank 2 nranks 8 cudaDev 2 busId 84000 - Init COMPLETE
jwb0092:26776:26834 [0] NCCL INFO comm 0x37fb73a0 rank 0 nranks 8 cudaDev 0 busId 3000 - Init COMPLETE
jwb0092:26777:26840 [1] NCCL INFO comm 0x37e42e20 rank 1 nranks 8 cudaDev 1 busId 44000 - Init COMPLETE
jwb0093:11955:12013 [3] NCCL INFO comm 0x36e7c930 rank 7 nranks 8 cudaDev 3 busId c4000 - Init COMPLETE
jwb0092:26779:26841 [3] NCCL INFO comm 0x38ee4b30 rank 3 nranks 8 cudaDev 3 busId c4000 - Init COMPLETE
jwb0093:11953:12015 [1] NCCL INFO comm 0x379b7bc0 rank 5 nranks 8 cudaDev 1 busId 44000 - Init COMPLETE
jwb0093:11952:12014 [0] NCCL INFO comm 0x36f49380 rank 4 nranks 8 cudaDev 0 busId 3000 - Init COMPLETE
jwb0093:11954:12012 [2] NCCL INFO comm 0x36be8130 rank 6 nranks 8 cudaDev 2 busId 84000 - Init COMPLETE

The full error and output are here: https://gist.github.com/surak/5f3f236616e5db48f19d31df457b4350

Expected behavior

A similar script works on an Ethernet cluster. I would like to see what is actually frozen, but there is no output other than the above.
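
For reference, a minimal distrib.py-style diagnostic that prints progress around initialization and the first collective makes it easier to see where a run like this freezes. This is only a sketch assuming distrib.py does something similar; it is not the actual script used here:

# minimal_distrib_check.py -- hypothetical diagnostic, not the distrib.py from this issue
import os
import torch
import torch.distributed as dist

def log(msg):
    print(f"[rank {os.environ.get('RANK', '?')} on {os.uname().nodename}] {msg}", flush=True)

log("before init_process_group")
# torchrun / accelerate launch provide MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE via the environment.
dist.init_process_group(backend="nccl")
log(f"init done, world_size={dist.get_world_size()}")

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
x = torch.ones(1, device="cuda")
log("before first all_reduce")
dist.all_reduce(x)  # a broken cross-node NCCL path typically hangs here
log(f"all_reduce done, result={x.item()}")

dist.destroy_process_group()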


surak commented May 30, 2023

Update:

It turns out there's something wrong with the processing of the command-line parameters.

If I create a YAML file with the same parameters on each node, it runs just fine. So, this works:

#!/bin/bash
#SBATCH --account=training2306
#SBATCH --nodes=2
#SBATCH --job-name=ai-multi-gpu
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --output=out-distrib.%j
#SBATCH --error=err-distrib.%j
#SBATCH --time=00:20:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:4

# Without this, srun does not inherit cpus-per-task from sbatch.
export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"

# so processes know who to talk to
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
# Allow communication over InfiniBand cells.
MASTER_ADDR="${MASTER_ADDR}i"
# Get IP for hostname.
export MASTER_ADDR="$(nslookup "$MASTER_ADDR" | grep -oP '(?<=Address: ).*')"

export MASTER_PORT=7010

export GPUS_PER_NODE=4

# Make sure we are in the right directory
cd $HOME/2023-may-intro-to-supercompting-jsc/src

# This loads modules and python packages
source sc_venv_template/activate.sh

# Set up accelerate config.
export ACCELERATE_CONFIG_YAML=accelerate_config_"$SLURM_JOB_ID".yaml
srun bash -c "((\$SLURM_PROCID)) || cat <<EOT > \"\$ACCELERATE_CONFIG_YAML\"
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: \$SLURM_NODEID
main_process_ip: '\$MASTER_ADDR'
main_process_port: \$MASTER_PORT
main_training_function: main
mixed_precision: 'no'
num_machines: \$SLURM_JOB_NUM_NODES
num_processes: \$((SLURM_JOB_NUM_NODES * GPUS_PER_NODE))
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
EOT"

# Run the demo
time srun bash -c 'accelerate launch \
    --config_file=$ACCELERATE_CONFIG_YAML \
    distrib.py'


sgugger commented May 30, 2023

cc @muellerzr for the CLI

@muellerzr muellerzr self-assigned this May 30, 2023
@muellerzr muellerzr added the bug Something isn't working label May 30, 2023

surak commented May 30, 2023

The issue is that the command-line settings don't properly set the rdzv_backend (which defaults to static), while the YAML file does.
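
A quick way to compare the command-line launch against the YAML launch is to dump, at the top of the training script, the variables the launcher hands to each worker. This is a hypothetical sketch, not part of the actual distrib.py:

# hypothetical snippet for the top of the training script
import os

for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK",
            "WORLD_SIZE", "GROUP_RANK", "TORCHELASTIC_RUN_ID"):
    print(f"{key}={os.environ.get(key)}", flush=True)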

@muellerzr
Collaborator

@surak can you try again, installing accelerate with pip install git+https://github.com/huggingface/accelerate@rdzv-endpoint? You should be able to pass and use --rdzv_backend now

@janEbert

Thanks @muellerzr for the quick response; this was really lacking from accelerate. :)
There were other issues regarding --rdzv_backend that were closed (e.g. #1337), but it makes total sense to have this parameter.


muellerzr commented May 30, 2023

Np @janEbert! It was on my list to get to. With how our CLI works, as long as our argparser knows the param to use, that is usually all it takes to get it up and going (you can see I literally added a single line in the PR here: #1490).

(I also want to mildly change the internals so they're not as static as they are right now, but that's a when-I-have-time thing.)
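
For illustration only (hypothetical code, not the actual diff in #1490): the kind of one-line addition described here is just teaching the launcher's argparser about the option, so that its value can be forwarded to torch.distributed.run like any other launch setting:

import argparse

parser = argparse.ArgumentParser("launch")
# Hypothetical: expose the rendezvous backend instead of always falling back to "static".
parser.add_argument("--rdzv_backend", type=str, default="static",
                    help="Rendezvous backend to forward to torchrun, e.g. 'static' or 'c10d'.")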


surak commented May 30, 2023

I tried with:

branch 'rdzv-endpoint' set up to track 'origin/rdzv-endpoint'.
  Resolved https://github.com/huggingface/accelerate to commit 37b368e8cc34dffbb49e3ee3d5c021fabfa64a50

And I tried the script like this:

time srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision no \
    --num_processes=$(($SLURM_JOB_NUM_NODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$SLURM_JOB_NUM_NODES \
    --machine_rank=$SLURM_NODEID \
    --rdzv_backend c10d \
    --same_network false \
    distrib.py'

While it does set the backend correctly, there are things missing which stop it from working.

This is the log when using the generated YAML file described in the previous comments. This worked:

  entrypoint       : distrib.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 10.13.31.231:7010
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

And this is with the command-line parameters above. This DOES NOT work:

  entrypoint       : false
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : c10d
  rdzv_endpoint    : 
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

Two things:

  • The virtual environment with the new branch broke the previously working YAML-based script, for whatever reason.
  • Shouldn't the rdzv_endpoint, rdzv_configs, and entrypoint settings be filled in properly by the command-line parameters, just as they are by the YAML file?

@muellerzr
Collaborator

What I believe is really happening here is that the new version fixed a broken behavior: c10d is the cause of your timeout, and your working version has always been using static, because Accelerate wasn't actually applying the c10d setting.

To test this theory, please run your code with raw torchrun 😄:

srun bash -c 'torchrun \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --nproc_per_node=4 \
    --nnodes=$NNODES  \
    --rdzv_backend c10d \
    distrib.py'

This should fail, and only work when you use static as the backend.

srun bash -c 'torchrun \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --nproc_per_node=4 \
    --nnodes=$NNODES  \
    --rdzv_backend static \
    distrib.py'

@janEbert

--same_network is an argument with action=store_true, so it takes no value. If it is specified, rdzv_endpoint will not be set.
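
A quick self-contained argparse sketch (a stand-in, not Accelerate's actual parser) also shows why the stray false token ends up as the positional entrypoint, matching the entrypoint : false in the broken launch config above:

import argparse

# Stand-in launcher parser (hypothetical), just to demonstrate store_true behaviour.
parser = argparse.ArgumentParser()
parser.add_argument("--same_network", action="store_true")
parser.add_argument("training_script")  # the entrypoint
parser.add_argument("training_script_args", nargs=argparse.REMAINDER)

args = parser.parse_args(["--same_network", "false", "distrib.py"])
# store_true takes no value, so "false" is consumed as the positional entrypoint instead:
print(args.same_network)          # True
print(args.training_script)       # 'false'
print(args.training_script_args)  # ['distrib.py']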

@janEbert

@muellerzr I'm pretty sure it's correct that rdzv_backend was the issue. Our machine has had trouble since forever with torch.distributed.run (which accelerate uses): pytorch/pytorch#73656
That's the reason why I forced c10d usage via the YAML and it started working.

@janEbert

The issue I linked to is patched in the PyTorch module we supply on the machine, but it still requires using --rdzv_backend c10d and --rdzv_endpoint [...], IIRC. That's why --rdzv_backend static can't work on our machine. It's complex and nasty, but that's how we manage to work around the issue. :p

@muellerzr
Collaborator

Got it, I'm with you now.

@janEbert

In summary, the command @surak posted should be the following (just removing the --same_network line):

time srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision no \
    --num_processes=$(($SLURM_JOB_NUM_NODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$SLURM_JOB_NUM_NODES \
    --machine_rank=$SLURM_NODEID \
    --rdzv_backend c10d \
    distrib.py'

@muellerzr
Collaborator

If we can verify that works, then I'll go ahead and merge the PR :)


surak commented May 31, 2023

Works for me indeed! Thanks a lot!

@muellerzr muellerzr added the solved The bug or feature request has been solved, but the issue is still opened label May 31, 2023
@surak surak closed this as completed Jun 1, 2023

surak commented Jun 1, 2023

Does that warrant a new point release? :-)

@muellerzr
Collaborator

@surak we'll have a new release soon on our usual release cycle
