add multinode support via slurm trainer, large scale race condition fix (pytorch#63)

This PR:
1. Adds multi-node training support via a multinode_trainer.slurm file. Verified with Llama 7B on 20 nodes / 160 A100s.
2. Fixes a race condition that can occur in profiling during larger-scale training: process 1 checks whether the trace directory exists and finds it missing, another process creates the directory in the interim, and process 1's subsequent attempt to create it then errors out because the directory already exists. The fix is simply to pass exist_ok=True to both makedirs calls (dump folder, trace folder).
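
In shell terms, the fix has the same effect as switching from `mkdir` to `mkdir -p` (a minimal illustration with a hypothetical path):
```
mkdir outputs/traces      # errors if another rank already created the directory
mkdir -p outputs/traces   # succeeds whether or not the directory already exists
```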

<img width="1047" alt="Screenshot 2024-02-15 at 10 53 18 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/20378637-4adb-425b-91d8-7fd36289d3b5">
<img width="545" alt="Screenshot 2024-02-15 at 10 55 02 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/28658614-cff6-42b5-ab57-bac578393d5c">
lessw2020 authored Feb 22, 2024
1 parent 8a74077 commit a1f2dff
Showing 3 changed files with 87 additions and 3 deletions.
27 changes: 27 additions & 0 deletions README.md
@@ -40,3 +40,30 @@ tensorboard --logdir=./torchtrain/outputs/tb
```

4. go to the URL it provides OR to http://localhost:6006/

## Multi-Node Training
For training on ParallelCluster/Slurm type configurations, you can use the multinode_trainer.slurm file to submit your sbatch job.<br>
Note that you will need to adjust the number of nodes and the GPU count to match your cluster configuration.<br>
<b>To adjust total nodes:</b>
```
#SBATCH --ntasks=2
#SBATCH --nodes=2
```
should both be set to your total node count.
Then update the srun launch parameters to match:
```
srun torchrun --nnodes 2
```
where nnodes is your total node count, matching the sbatch node count above.
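
For example, on a hypothetical 4-node cluster, the three values would be kept in sync like this (the remaining srun arguments are copied from multinode_trainer.slurm):
```
#SBATCH --ntasks=4
#SBATCH --nodes=4

srun torchrun --nnodes 4 --nproc_per_node 8 --rdzv_id 101 --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:29500" ./train.py --steps 10
```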

<b>To adjust GPU count per node:</b>

If your GPU count per node is not 8, adjust:

```--nproc_per_node```

in the torchrun command and

```#SBATCH --gpus-per-task```

in the SBATCH command section.
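
For example, on a hypothetical cluster with 4 GPUs per node, the two settings would change together:
```
#SBATCH --gpus-per-task=4

srun torchrun --nnodes 2 --nproc_per_node 4 --rdzv_id 101 --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:29500" ./train.py --steps 10
```
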
56 changes: 56 additions & 0 deletions multinode_trainer.slurm
@@ -0,0 +1,56 @@
#!/bin/bash

# --- This script is optimized for AWS with EFA
# --- adjust NCCL_BUFFSIZE if you encounter memory
# --- constraint issues or to tune for improved performance.
# ---

#SBATCH --job-name=torchtrain_multi_node

#SBATCH --ntasks=2

#SBATCH --nodes=2

#SBATCH --gpus-per-task=8

#SBATCH --cpus-per-task=96

#SBATCH --partition=train


# derive the head node (the first node in the Slurm allocation) and its IP address;
# torchrun below uses it as the c10d rendezvous endpoint
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

echo Node IP: $head_node_ip
export LOGLEVEL=INFO
# Enable for A100
export FI_PROVIDER="efa"
# Ensure that P2P is available
# export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

# debugging flags (optional)
export NCCL_DEBUG=WARN
export PYTHONFAULTHANDLER=1
# optional debug settings
# export NCCL_DEBUG=INFO
# NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV

export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH
export CUDA_LAUNCH_BLOCKING=0

# on your cluster you might need these:
# set the network interface
export NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond"
export NCCL_BUFFSIZE=2097152
#export TORCH_DIST_INIT_BARRIER=1
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0

# pause DCGM profiling metrics collection for the duration of the run (resumed below)
dcgmi profile --pause
# adjust sbatch --ntasks and sbatch --nodes above and --nnodes below
# to your specific node count, and update target launch file.
srun torchrun --nnodes 2 --nproc_per_node 8 --rdzv_id 101 --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:29500" ./train.py --steps 10
dcgmi profile --resume
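
To submit the job, a usage sketch (this assumes you launch from the repository root, where train.py and multinode_trainer.slurm live, and that the `train` partition exists as configured above):
```
sbatch multinode_trainer.slurm
squeue -u $USER    # confirm the job is queued or running
```
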
7 changes: 4 additions & 3 deletions torchtrain/profiling.py
@@ -1,8 +1,9 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

import contextlib
import os

import torch

try:
@@ -47,15 +48,15 @@ def trace_handler(prof):
curr_trace_dir_name = "iteration_" + str(_global_iter_count)
curr_trace_dir = os.path.join(trace_dir, curr_trace_dir_name)
if not os.path.exists(curr_trace_dir):
os.makedirs(curr_trace_dir)
os.makedirs(curr_trace_dir, exist_ok=True)
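# exist_ok=True avoids a crash when another rank creates the directory between the exists() check above and this call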
rank0_log(f"exporting profile traces to {curr_trace_dir}")

prof.export_chrome_trace(f"{curr_trace_dir}/rank{rank}_trace.json")

rank0_log(f"Profiling active. Traces will be saved at {trace_dir}")

if not os.path.exists(trace_dir):
os.makedirs(trace_dir)
os.makedirs(trace_dir, exist_ok=True)

with torch.profiler.profile(
activities=[
