
Freeze on infiniband with slurm #1489

Closed · 4 tasks
surak opened this issue May 30, 2023 · 17 comments

Labels
bug Something isn't working · solved The bug or feature request has been solved, but the issue is still opened

Comments

surak commented May 30, 2023

System Info

- `Accelerate` version: 0.19.0
- Platform: Linux-4.18.0-425.13.1.el8_7.x86_64-x86_64-with-glibc2.28
- Python version: 3.10.4
- Numpy version: 1.22.3
- PyTorch version (GPU?): 1.12.0 (False)
- System RAM: 503.49 GB
- `Accelerate` default config:
	Not found

Using the command-line parameters:

srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision=no \
    --num_processes=$(($NNODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$NNODES  \
    --machine_rank=$SLURM_NODEID \
    --rdzv_conf "rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT,rdzv_backend=c10d" \
    distrib.py'

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Fails with multi-node.

This machine has the InfiniBand interfaces suffixed with i, so a compute node responds to hostname with something like juwels07, but the right interface is juwels07i. There's some script magic for that in the script below.

Slurm launch is:

#!/bin/bash -x
#SBATCH --account=training2306
#SBATCH --nodes=2
#SBATCH --job-name=ai-multi-gpu
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --output=out-distrib.%j
#SBATCH --error=err-distrib.%j
#SBATCH --time=00:20:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:4

# srun does not inherit cpus-per-task from sbatch
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
# so processes know who to talk to
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
# Allow communication over InfiniBand cells.
MASTER_ADDR="${MASTER_ADDR}i"
# Get IP for hostname.
export MASTER_ADDR="$(nslookup "$MASTER_ADDR" | grep -oP '(?<=Address: ).*')"

export MASTER_PORT=7010
export GPUS_PER_NODE=4
export NNODES=$SLURM_JOB_NUM_NODES
# Do not remove this workaround, or the training will hang and nodes will be lost
export CUDA_LAUNCH_BLOCKING=1

# hide duplicated errors using this hack - will be properly fixed in pt-1.12
export TORCHELASTIC_ERROR_FILE=/tmp/torch-elastic-error.json

# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1

# handle timeouts
export NCCL_IB_TIMEOUT=20

# Make sure we are in the right directory
cd $HOME/2023-may-intro-to-supercompting-jsc/src

# This loads modules and python packages
source sc_venv_template/activate.sh

export LOGLEVEL=INFO
# Run the demo
time srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision=no \
    --num_processes=$(($NNODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$NNODES  \
    --machine_rank=$SLURM_PROCID \
    --rdzv_conf "rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT rdzv_backend=c10d" \
    distrib.py'

This is the error output; it stays like this until the job times out (the identical job works with only one node):

INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : distrib.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 10.13.23.78:7010
  rdzv_configs     : {'rdzv_endpoint': '10.13.23.78:7010', 'rdzv_backend': 'c10d', 'rank': 1, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : distrib.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 10.13.23.78:7010
  rdzv_configs     : {'rdzv_endpoint': '10.13.23.78:7010', 'rdzv_backend': 'c10d', 'rank': 0, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic__7nlgv08/none_jscf2i4f
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=10.13.23.78
  master_port=7010
  group_rank=1
  group_world_size=2
  local_ranks=[0, 1, 2, 3]
  role_ranks=[4, 5, 6, 7]
  global_ranks=[4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic__7nlgv08/none_jscf2i4f/attempt_0/3/error.json
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_j88hm8vm/none_tgket3up
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=10.13.23.78
  master_port=7010
  group_rank=0
  group_world_size=2
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_j88hm8vm/none_tgket3up/attempt_0/3/error.json
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [jwb0092i.juwels]:7010 (errno: 97 - Address family not supported by protocol).

This is the tail of the NCCL_DEBUG output:

jwb0093:11954:12012 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
jwb0093:11954:12012 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
jwb0093:11953:12015 [1] NCCL INFO Connected all trees
jwb0093:11953:12015 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
jwb0093:11953:12015 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
jwb0092:26778:26839 [2] NCCL INFO comm 0x3792dff0 rank 2 nranks 8 cudaDev 2 busId 84000 - Init COMPLETE
jwb0092:26776:26834 [0] NCCL INFO comm 0x37fb73a0 rank 0 nranks 8 cudaDev 0 busId 3000 - Init COMPLETE
jwb0092:26777:26840 [1] NCCL INFO comm 0x37e42e20 rank 1 nranks 8 cudaDev 1 busId 44000 - Init COMPLETE
jwb0093:11955:12013 [3] NCCL INFO comm 0x36e7c930 rank 7 nranks 8 cudaDev 3 busId c4000 - Init COMPLETE
jwb0092:26779:26841 [3] NCCL INFO comm 0x38ee4b30 rank 3 nranks 8 cudaDev 3 busId c4000 - Init COMPLETE
jwb0093:11953:12015 [1] NCCL INFO comm 0x379b7bc0 rank 5 nranks 8 cudaDev 1 busId 44000 - Init COMPLETE
jwb0093:11952:12014 [0] NCCL INFO comm 0x36f49380 rank 4 nranks 8 cudaDev 0 busId 3000 - Init COMPLETE
jwb0093:11954:12012 [2] NCCL INFO comm 0x36be8130 rank 6 nranks 8 cudaDev 2 busId 84000 - Init COMPLETE

The full error and output are here: https://gist.github.com/surak/5f3f236616e5db48f19d31df457b4350

Expected behavior

A similar script works on an Ethernet cluster. I would like to see what is actually frozen, but there is no output other than the above.
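
For reference, a minimal distrib.py-style diagnostic that prints progress around initialization and the first collective makes it easier to see where a run like this freezes. This is only a sketch assuming distrib.py does something similar; it is not the actual script used here:

# minimal_distrib_check.py -- hypothetical diagnostic, not the distrib.py from this issue
import os
import torch
import torch.distributed as dist

def log(msg):
    print(f"[rank {os.environ.get('RANK', '?')} on {os.uname().nodename}] {msg}", flush=True)

log("before init_process_group")
# torchrun / accelerate launch provide MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE via the environment.
dist.init_process_group(backend="nccl")
log(f"init done, world_size={dist.get_world_size()}")

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
x = torch.ones(1, device="cuda")
log("before first all_reduce")
dist.all_reduce(x)  # a broken cross-node NCCL path typically hangs here
log(f"all_reduce done, result={x.item()}")

dist.destroy_process_group()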


surak commented May 30, 2023

Update:

It turns out there's something wrong with the processing of the command-line parameters.

If I create a YAML file with the same parameters on each node, it runs just fine. So, this works:

#!/bin/bash
#SBATCH --account=training2306
#SBATCH --nodes=2
#SBATCH --job-name=ai-multi-gpu
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --output=out-distrib.%j
#SBATCH --error=err-distrib.%j
#SBATCH --time=00:20:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:4

# Without this, srun does not inherit cpus-per-task from sbatch.
export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"

# so processes know who to talk to
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
# Allow communication over InfiniBand cells.
MASTER_ADDR="${MASTER_ADDR}i"
# Get IP for hostname.
export MASTER_ADDR="$(nslookup "$MASTER_ADDR" | grep -oP '(?<=Address: ).*')"

export MASTER_PORT=7010

export GPUS_PER_NODE=4

# Make sure we are in the right directory
cd $HOME/2023-may-intro-to-supercompting-jsc/src

# This loads modules and python packages
source sc_venv_template/activate.sh

# Set up accelerate config.
export ACCELERATE_CONFIG_YAML=accelerate_config_"$SLURM_JOB_ID".yaml
srun bash -c "((\$SLURM_PROCID)) || cat <<EOT > \"\$ACCELERATE_CONFIG_YAML\"
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: \$SLURM_NODEID
main_process_ip: '\$MASTER_ADDR'
main_process_port: \$MASTER_PORT
main_training_function: main
mixed_precision: 'no'
num_machines: \$SLURM_JOB_NUM_NODES
num_processes: \$((SLURM_JOB_NUM_NODES * GPUS_PER_NODE))
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
EOT"

# Run the demo
time srun bash -c 'accelerate launch \
    --config_file=$ACCELERATE_CONFIG_YAML \
    distrib.py'


sgugger commented May 30, 2023

cc @muellerzr for the CLI

@muellerzr muellerzr self-assigned this May 30, 2023
@muellerzr muellerzr added the bug Something isn't working label May 30, 2023

surak commented May 30, 2023

The issue is that the command-line settings don't properly set the rdzv_backend (which defaults to static), while the YAML file does.
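
A quick way to compare the command-line launch against the YAML launch is to dump, at the top of the training script, the variables the launcher hands to each worker. This is a hypothetical sketch, not part of the actual distrib.py:

# hypothetical snippet for the top of the training script
import os

for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK",
            "WORLD_SIZE", "GROUP_RANK", "TORCHELASTIC_RUN_ID"):
    print(f"{key}={os.environ.get(key)}", flush=True)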

@muellerzr
Collaborator

@surak can you try again, installing accelerate with pip install git+https://github.com/huggingface/accelerate@rdzv-endpoint? You should be able to pass and use --rdzv_backend now

@janEbert

Thanks @muellerzr for the quick response; this was really lacking from accelerate. :)
There were other issues regarding --rdzv_backend that were closed (e.g. #1337), but it makes total sense to have this parameter.


muellerzr commented May 30, 2023

Np @janEbert! It was on my list to get to. With how our CLI works, as long as our argparser knows the param to use, that is usually all it takes to get it up and going (you can see I literally added a single line in the PR here: #1490).

(I also want to mildly change the internals so they're not as static as they are right now, but that's a when-I-have-time thing.)
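
For illustration only (hypothetical code, not the actual diff in #1490): the kind of one-line addition described here is just teaching the launcher's argparser about the option, so that its value can be forwarded to torch.distributed.run like any other launch setting:

import argparse

parser = argparse.ArgumentParser("launch")
# Hypothetical: expose the rendezvous backend instead of always falling back to "static".
parser.add_argument("--rdzv_backend", type=str, default="static",
                    help="Rendezvous backend to forward to torchrun, e.g. 'static' or 'c10d'.")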


surak commented May 30, 2023

I tried with:

branch 'rdzv-endpoint' set up to track 'origin/rdzv-endpoint'.
  Resolved https://github.com/huggingface/accelerate to commit 37b368e8cc34dffbb49e3ee3d5c021fabfa64a50

And I tried the script like this:

time srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision no \
    --num_processes=$(($SLURM_JOB_NUM_NODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$SLURM_JOB_NUM_NODES \
    --machine_rank=$SLURM_NODEID \
    --rdzv_backend c10d \
    --same_network false \
    distrib.py'

While it does set the backend correctly, there are things missing which stop it from working.

This is the log when using the generated YAML file described in the previous comments. This worked:

  entrypoint       : distrib.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 10.13.31.231:7010
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

And this is with the command-line parameters above. This DOES NOT work:

  entrypoint       : false
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : c10d
  rdzv_endpoint    : 
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

Two things:

  • The virtual environment with the new branch broke the previously working YAML-based script, for whatever reason.
  • Shouldn't the rdzv_endpoint, rdzv_configs, and entrypoint settings be filled in properly by the command-line parameters, just as they are by the YAML file?

@muellerzr
Collaborator

What I believe is really happening here is that the new version fixed a broken behavior: c10d is the cause of your timeout, and your working version has always been using static, because Accelerate wasn't actually applying the c10d setting.

To test this theory, please run your code with raw torchrun 😄:

srun bash -c 'torchrun \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --nproc_per_node=4 \
    --nnodes=$NNODES  \
    --rdzv_backend c10d \
    distrib.py'

This should fail, and only work when you use static as the backend.

srun bash -c 'torchrun \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --nproc_per_node=4 \
    --nnodes=$NNODES  \
    --rdzv_backend static \
    distrib.py'

@janEbert

--same_network is an argument with action=store_true, so it takes no value. If it is specified, rdzv_endpoint will not be set.
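
A quick self-contained argparse sketch (a stand-in, not Accelerate's actual parser) also shows why the stray false token ends up as the positional entrypoint, matching the entrypoint : false in the broken launch config above:

import argparse

# Stand-in launcher parser (hypothetical), just to demonstrate store_true behaviour.
parser = argparse.ArgumentParser()
parser.add_argument("--same_network", action="store_true")
parser.add_argument("training_script")  # the entrypoint
parser.add_argument("training_script_args", nargs=argparse.REMAINDER)

args = parser.parse_args(["--same_network", "false", "distrib.py"])
# store_true takes no value, so "false" is consumed as the positional entrypoint instead:
print(args.same_network)          # True
print(args.training_script)       # 'false'
print(args.training_script_args)  # ['distrib.py']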

@janEbert

@muellerzr I'm pretty sure it's correct that rdzv_backend was the issue. Our machine has had trouble since forever with torch.distributed.run (which accelerate uses): pytorch/pytorch#73656
That's the reason why I forced c10d usage via the YAML and it started working.

@janEbert

The issue I linked to is patched in the PyTorch module we supply on the machine, but it still requires using --rdzv_backend c10d and --rdzv_endpoint [...], IIRC. That's why --rdzv_backend static can't work on our machine. It's complex and nasty, but that's how we manage to work around the issue. :p

@muellerzr
Collaborator

Got it, I'm with you now.

@janEbert

In summary, the command @surak posted should be the following (just removing the --same_network line):

time srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --multi_gpu \
    --mixed_precision no \
    --num_processes=$(($SLURM_JOB_NUM_NODES * 4)) \
    --dynamo_backend=no \
    --num_machines=$SLURM_JOB_NUM_NODES \
    --machine_rank=$SLURM_NODEID \
    --rdzv_backend c10d \
    distrib.py'

@muellerzr
Collaborator

If we can verify that works, then I'll go ahead and merge the PR :)


surak commented May 31, 2023

Works for me indeed! Thanks a lot!

@muellerzr muellerzr added the solved The bug or feature request has been solved, but the issue is still opened label May 31, 2023
@surak surak closed this as completed Jun 1, 2023

surak commented Jun 1, 2023

Does that warrant a new point release? :-)

@muellerzr
Collaborator

@surak we'll have a new release soon on our usual release cycle
