Add Distributed NHC and supporting scripts. #35

Merged · 72 commits · Sep 15, 2023

Commits
7680c37
add onetouch_nhc.sh
Jul 17, 2023
2f5f42c
begin adding in error detection
Jul 17, 2023
4a2f739
logging
Jul 18, 2023
fd4e0cb
simplify
Jul 18, 2023
8dbeabf
add distributed_nhc slurm and pssh
Jul 19, 2023
336f9d7
updates
Jul 19, 2023
f018606
leave logs on node
Jul 19, 2023
690141b
leave full execution logs on nodes, adding argument support, expandin…
Jul 20, 2023
64b14a7
report nodes that didn't report results
Jul 20, 2023
1c329e9
fix bug expanding nodelist
Jul 20, 2023
9e87e71
add timestamp to slurm health output, have pssh output just the healt…
Jul 24, 2023
31a9983
Merge pull request #1 from mpwillia/micwilli/onebutton-nhc
mpwillia Jul 24, 2023
98c4d2d
fix git ignore
Jul 24, 2023
2745e09
add gpu count test
Jul 26, 2023
8fe1635
remove old .sbatch
Jul 26, 2023
8dc4975
add git option
Jul 26, 2023
7c00ce6
spaces
Jul 26, 2023
77dcbb8
bruh syntax error
Jul 26, 2023
55a8be0
add topology check
Jul 27, 2023
263981c
fix nodelist
Jul 27, 2023
a036e0d
try fix
Jul 27, 2023
e2816c5
Merge pull request #2 from mpwillia/micwilli/more-tests
mpwillia Jul 27, 2023
c2f5485
chmod +x
Jul 27, 2023
0af7340
Merge branch 'main' of https://github.com/mpwillia/azurehpc-health-ch…
Jul 27, 2023
f2485f2
remind people that it will take a few minutes to complete
Jul 28, 2023
f3585b1
updating stdout logs, export to kusto
Jul 28, 2023
dcf3384
errors
Jul 28, 2023
e92b634
Merge pull request #3 from mpwillia/micwilli/kusto-export
mpwillia Jul 28, 2023
68d54e0
remove old confs
Jul 28, 2023
74aabe9
add requirements.txt
Jul 28, 2023
e2486c9
make nhc very verbose
Jul 28, 2023
83e39cc
very verbose nhc
Jul 28, 2023
049efac
update tests with more info
Jul 28, 2023
7be8e3e
debug logs and debug log export
Jul 29, 2023
9f39581
fix pip install
Jul 29, 2023
761399e
add pinned vbios check
Aug 3, 2023
e0b0db9
fixes
Aug 3, 2023
e663887
try fix, don't call on sourcing :)
Aug 3, 2023
9070b50
Merge pull request #4 from mpwillia/micwilli/vbios-check
mpwillia Aug 9, 2023
a8e5ca8
adjust directory structure
Aug 9, 2023
adfcf3e
Merge pull request #5 from mpwillia/micwilli/dir-structure
mpwillia Aug 9, 2023
66e38b8
make executable again
Aug 11, 2023
97d24e1
update confs, verified on nd96amsr_a100_v4
Aug 11, 2023
9b63f98
Merge pull request #6 from mpwillia/micwilli/a100-tests
mpwillia Aug 11, 2023
4016d24
add proper options to run-health-checks.sh
Aug 11, 2023
48c9378
debugging bad custom test copying
Aug 11, 2023
82700d3
pipe wget straight into tar so we don't have the .tar.gz on the files…
Aug 11, 2023
04bd05b
try fix
Aug 11, 2023
6521b16
standardize nhc args
Aug 11, 2023
163d902
fix -V verbose flag here
Aug 11, 2023
07a093d
-V verbose for dnhc
Aug 11, 2023
deb097a
comment out vbios check
Aug 11, 2023
b75fc75
don't create debug file if we don't have debug output
Aug 11, 2023
b47fc7e
Merge pull request #7 from mpwillia/micwilli/optional-verbose
mpwillia Aug 11, 2023
43e6d05
export py supports arguments
Aug 12, 2023
e2337ba
fix table name missing comma
Aug 14, 2023
6bd8459
rework argument parsing
Aug 15, 2023
64f5c5b
cleanup
Aug 15, 2023
e1c465a
verified that kusto export works and is optional
Aug 15, 2023
a35f8cb
Merge pull request #8 from mpwillia/micwilli/opt-kusto-export
mpwillia Aug 15, 2023
e74d084
cleanup
Aug 15, 2023
1460e7a
cleanup comments
Aug 15, 2023
afdad45
update readmes
Aug 15, 2023
5762efd
Merge pull request #9 from mpwillia/micwilli/cleanup-and-docs
mpwillia Aug 15, 2023
197e108
Merge remote-tracking branch 'upstream/main'
Aug 15, 2023
0611432
pr feedback
Sep 6, 2023
208e316
use lstopo-no-graphics, add installation
Sep 7, 2023
0cc5b83
fix bad merge
Sep 8, 2023
dcb46db
Merge pull request #10 from mpwillia/micwilli/lstopo-feedback
mpwillia Sep 8, 2023
de37cc3
Merge remote-tracking branch 'upstream/main'
Sep 8, 2023
836583f
adding extra empty lines
Sep 14, 2023
9e5f254
Merge remote-tracking branch 'upstream/main'
Sep 14, 2023
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@ lbnl-nhc-1.4.3/
*.deb
*stream.c
*health.log
.vscode
6 changes: 6 additions & 0 deletions README.md
@@ -37,6 +37,12 @@ Usage
### _References_ ###
- [LBNL Node Health Checks](https://github.com/mej/nhc)
- [Azure HPC Images](https://github.com/Azure/azhpc-images)

## Distributed NHC
AzureHPC Node Health Checks also comes bundled with a distributed version of NHC, which is designed to run on a cluster of machines and report back to a central location. This is useful for running health checks on a large cluster with dozens or hundreds of nodes.

See [Distributed NHC](./distributed-nhc/README.md) for more information.

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
2 changes: 2 additions & 0 deletions conf/nd96amsr_a100_v4.conf
@@ -41,12 +41,14 @@
* || check_hw_eth ib6
* || check_hw_eth docker0
* || check_hw_eth ib0
* || check_hw_topology /opt/microsoft/ndv4-topo.xml


#######################################################################
#####
##### GPU checks
#####
* || check_gpu_count 8
* || check_gpu_xid
* || check_nvsmi_healthmon
* || check_cuda_bw 24
2 changes: 2 additions & 0 deletions conf/nd96asr_v4.conf
@@ -41,12 +41,14 @@
* || check_hw_eth ib2
* || check_hw_eth eth0
* || check_hw_eth ib1
* || check_hw_topology /opt/microsoft/ndv4-topo.xml


########################################################
####
#### GPU checks
####
* || check_gpu_count 8
* || check_gpu_xid
* || check_nvsmi_healthmon
* || check_cuda_bw 24
4 changes: 3 additions & 1 deletion conf/nd96isr_h100_v5.conf
@@ -35,11 +35,13 @@
* || check_hw_eth ib6
* || check_hw_eth ib7
* || check_hw_eth docker0
* || check_hw_topology /opt/microsoft/ndv5-topo.xml

#######################################################################
####
#### GPU checks
####
* || check_gpu_count 8
* || check_nvsmi_healthmon
* || check_gpu_xid
* || check_cuda_bw 52
@@ -52,6 +54,6 @@
####
#### Additional IB checks
####
* || check_ib_bw_gdr 380 nd96isr_v5
* || check_ib_bw_gdr 375 nd96isr_v5
* || check_nccl_allreduce_ib_loopback 40.0 1 /opt/microsoft/ndv5-topo.xml 16G
* || check_ib_link_flapping 6
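
For readers unfamiliar with the conf syntax above: each LBNL NHC configuration line pairs a host-match expression with a check, and the asterisk target matches every host, so every listed check runs on all nodes using that conf. A minimal sketch (not part of this diff) showing how the checks added in this PR are wired in, with the thresholds and topology path taken from the diffs above:

* || check_hw_topology /opt/microsoft/ndv5-topo.xml
* || check_gpu_count 8
* || check_cuda_bw 52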
260 changes: 137 additions & 123 deletions customTests/azure_cuda_bandwidth.nhc
@@ -10,139 +10,153 @@
#Catch error codes that may be thrown by the executable passed as the first
#input, and if an error code is tripped throw the second input as a message
catch_error() {
declare -g output
output=$($1)
err_code=$?
if [ $err_code -ne 0 ]; then
die 1 "\t $2 $err_code" >&2
return 1
fi
return 0
declare -g output
output=$($1)
err_code=$?
if [ $err_code -ne 0 ]; then
die 1 "\t $2 $err_code" >&2
return 1
fi
return 0

}


function cleanup {
dbg "Unlocking graphics clock before exit..."
sudo timeout 3m nvidia-smi -rgc > /dev/null 2>&1
dbg "Unlocking graphics clock before exit..."
sudo timeout 3m nvidia-smi -rgc > /dev/null 2>&1
}



function check_cuda_bw()
{

#Set expected BW to the default value if the argument is empty
EXP_CUDA_BW=$1
if [[ -z "$EXP_CUDA_BW" ]]; then
EXP_CUDA_BW=24
fi

# location of executables, must match setup location
EXE_DIR=$2
if [[ -z "$EXE_DIR" ]]; then
EXE_DIR=/opt/azurehpc/test/nhc
fi
#Count the number of gpu-name nvidia-smi outputs.
error_smi="**Fail** nvidia-smi failed with error code"
#Lock graphics clocks to max freq to eliminate any time for the GPUs to boost.
#This likely isn't important for performance here, but we will do it anyway
#to be safe.
SKU=$( curl -H Metadata:true --max-time 10 -s "http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-01-01&format=text")
SKU="${SKU,,}"
lock_clocks=
if echo "$SKU" | grep -q "nd96asr_v4"; then
lock_clocks="sudo nvidia-smi -lgc 1400"
elif echo "$SKU" | grep -q "nd96amsr_a100_v4"; then
lock_clocks="sudo nvidia-smi -lgc 1400"
elif echo "$SKU" | grep -q "nd96isr_h100_v5"; then
lock_clocks="sudo nvidia-smi -lgc 2619"
fi

if [[ -n "$lock_clocks" ]]; then
if ! catch_error "$lock_clocks" "$error_smi"; then
return 0
fi
fi

#exit function to unlock clocks on exit
trap cleanup EXIT

#Count the GPUs.
gpu_list="timeout 3m nvidia-smi --query-gpu=name --format=csv,noheader"
if ! catch_error "$gpu_list" "$error_smi"; then
return 0
fi
ngpus=$(echo "$output" | wc -l)

#Run device to host bandwidth test.
exec_htod="timeout 3m $EXE_DIR/gpu-copy --size 134217728 --htod"
error_htod="**Fail** The htod gpu_copy test failed to execute."
error_htod+="It exited with error code"
if ! catch_error "$exec_htod" "$error_htod"; then
return 0
fi
x_htod=$(echo "$output")

#Run host to device bandwidth test.
exec_dtoh="timeout 3m $EXE_DIR/gpu-copy --size 134217728 --dtoh"
error_dtoh="**Fail** The dtoh gpu_copy test failed to execute."
error_dtoh+="It exited with error code"
if ! catch_error "$exec_dtoh" "$error_dtoh"; then
return 0
fi
x_dtoh=$(echo "$output")
pass=1

#Loop over all of the detected GPUs.
for i in $(seq 0 $((ngpus-1))); do
#Collect host to device bandwidths computed in each numa zone.
bw_htod=$(echo "$x_htod" | grep "gpu$i" | cut -d' ' -f2 | cut -d. -f1)
max_htodbw=0
min_bw=100
#Loop over the bandwidths observed in each numa zone and find max.
for bw in $bw_htod; do
if [ $max_htodbw -lt $bw ]; then
max_htodbw=$bw
fi
done

#Collect device to host bandwidths computed in each numa zone.
bw_dtoh=$(echo "$x_dtoh" | grep "gpu$i" | cut -d' ' -f2 | cut -d. -f1)
max_dtohbw=0
#Loop over bandwidths observed in each numa zone and find max.
for bw in $bw_dtoh; do
if [ $max_dtohbw -lt $bw ]; then
max_dtohbw=$bw
fi
done
#Find minimum of the htod and dtoh bandwidths.
if [ $max_htodbw -lt $max_dtohbw ]; then
min_bw=$max_htodbw
else
min_bw=$max_dtohbw
fi

#If the min bandwidth is too low the test has failed.
if [ $min_bw -lt $EXP_CUDA_BW ]; then
die 1 "Bandwidth is low on device $i. Reported bandwidth is"\
"$min_bw GB/s."
pass=0
return 0
fi
done
#Unlock the graphics clock.
unlock_clocks="sudo timeout 3m nvidia-smi -rgc"

if ! catch_error "$unlock_clocks" "$error_smi"; then
return 0
fi

if [ $pass -ne 1 ]; then
die 1 -e "\t **Fail** At least one device reported low htod or dtoh"\
"bandwidth."
return 0
else
return 0
fi
#Set expected BW to the default value if the argument is empty
EXP_CUDA_BW=$1
if [[ -z "$EXP_CUDA_BW" ]]; then
EXP_CUDA_BW=24
fi

# location of executables, must match setup location
EXE_DIR=$2
if [[ -z "$EXE_DIR" ]]; then
EXE_DIR=/opt/azurehpc/test/nhc
fi
#Count the number of gpu-name nvidia-smi outputs.
error_smi="**Fail** nvidia-smi failed with error code"
#Lock graphics clocks to max freq to eliminate any time for the GPUs to boost.
#This likely isn't important for performance here, but we will do it anyway
#to be safe.
SKU=$( curl -H Metadata:true --max-time 10 -s "http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-01-01&format=text")
SKU="${SKU,,}"
lock_clocks=
if echo "$SKU" | grep -q "nd96asr_v4"; then
lock_clocks="sudo nvidia-smi -lgc 1400"
elif echo "$SKU" | grep -q "nd96amsr_a100_v4"; then
lock_clocks="sudo nvidia-smi -lgc 1400"
elif echo "$SKU" | grep -q "nd96isr_h100_v5"; then
lock_clocks="sudo nvidia-smi -lgc 2619"
fi

if [[ -n "$lock_clocks" ]]; then
if ! catch_error "$lock_clocks" "$error_smi"; then
return 0
fi
fi

#exit function to unlock clocks on exit
trap cleanup EXIT

#Count the GPUs.
gpu_list="timeout 3m nvidia-smi --query-gpu=name --format=csv,noheader"
if ! catch_error "$gpu_list" "$error_smi"; then
return 0
fi
ngpus=$(echo "$output" | wc -l)

#Run device to host bandwidth test.
exec_htod="timeout 3m $EXE_DIR/gpu-copy --size 134217728 --htod"
error_htod="**Fail** The htod gpu_copy test failed to execute."
error_htod+="It exited with error code"
if ! catch_error "$exec_htod" "$error_htod"; then
return 0
fi
x_htod=$(echo "$output")

#Run host to device bandwidth test.
exec_dtoh="timeout 3m $EXE_DIR/gpu-copy --size 134217728 --dtoh"
error_dtoh="**Fail** The dtoh gpu_copy test failed to execute."
error_dtoh+="It exited with error code"
if ! catch_error "$exec_dtoh" "$error_dtoh"; then
return 0
fi
x_dtoh=$(echo "$output")
pass=1

#Loop over all of the detected GPUs.

low_bw_devices=()
for i in $(seq 0 $((ngpus-1))); do
#Collect host to device bandwidths computed in each numa zone.
bw_htod=$(echo "$x_htod" | grep "gpu$i" | cut -d' ' -f2 | cut -d. -f1)
max_htodbw=0
min_bw=100
#Loop over the bandwidths observed in each numa zone and find max.
for bw in $bw_htod; do
if [ $max_htodbw -lt $bw ]; then
max_htodbw=$bw
fi
done

dbg "Device $i Host to Device reported bandwidth is $max_htodbw GB/s"

#Collect device to host bandwidths computed in each numa zone.
bw_dtoh=$(echo "$x_dtoh" | grep "gpu$i" | cut -d' ' -f2 | cut -d. -f1)
max_dtohbw=0
#Loop over bandwidths observed in each numa zone and find max.
for bw in $bw_dtoh; do
if [ $max_dtohbw -lt $bw ]; then
max_dtohbw=$bw
fi
done

dbg "Device $i Device to Host reported bandwidth is $max_dtohbw GB/s"

#Find minimum of the htod and dtoh bandwidths.
if [ $max_htodbw -lt $max_dtohbw ]; then
min_bw=$max_htodbw
else
min_bw=$max_dtohbw
fi

#If the min bandwidth is too low the test has failed.
if [ $min_bw -lt $EXP_CUDA_BW ]; then
low_bw_devices+=("$i-$min_bw")
pass=0
fi
done
#Unlock the graphics clock.
unlock_clocks="sudo timeout 3m nvidia-smi -rgc"

if ! catch_error "$unlock_clocks" "$error_smi"; then
return 0
fi

if [ $pass -ne 1 ]; then

formatted_low_bw=()
for item in "${low_bw_devices[@]}"
do
deviceid=$(echo $item | awk -F'-' '{print $1}')
bw=$(echo $item | awk -F'-' '{print $2}')
formatted_low_bw+=(" Device $deviceid reports low bandwidth of $bw GB/s")
done

low_bw_str=$(IFS=',' ; echo "${formatted_low_bw[*]}")
die 1 "$FUNCNAME: Low bandwidth reported on one or more devices!$low_bw_str"
return 0
else
return 0
fi
}
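
As the argument handling at the top of check_cuda_bw shows, the first parameter is the expected bandwidth in GB/s (default 24) and the optional second parameter is the directory containing the gpu-copy executable (default /opt/azurehpc/test/nhc, which must match the setup location). A hedged conf-line sketch, not part of this diff, using the H100 threshold from the conf above and spelling out the default install directory explicitly:

* || check_cuda_bw 52 /opt/azurehpc/test/nhc

When the second argument is omitted, as in the A100 confs above (check_cuda_bw 24), the default EXE_DIR is used.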
8 changes: 8 additions & 0 deletions customTests/azure_gpu_count.nhc
@@ -0,0 +1,8 @@
#!/bin/bash
function check_gpu_count() {
EXPECTED_NUM_GPU="$1"
gpu_count=$(nvidia-smi --list-gpus | wc -l)
if [ "$gpu_count" -ne "$1" ]; then
die 1 "$FUNCNAME: Expected to see $EXPECTED_NUM_GPU but found $gpu_count"
fi
}
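
Usage mirrors the conf changes above: the expected GPU count is passed as the single argument, for example on the 8-GPU ND-series SKUs:

* || check_gpu_count 8

Because the count comes from nvidia-smi --list-gpus, a missing driver yields a count of 0 and the check fails rather than passing silently.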
11 changes: 11 additions & 0 deletions customTests/azure_gpu_vbios.nhc
@@ -0,0 +1,11 @@
#!/bin/bash
function check_vbios_version() {
expected_version="$1"
uniq_vbios_versions=$(nvidia-smi -q | grep "VBIOS Version" | cut -d ':' -f 2 | sed 's/ //g' | sort -u) # sort -u yields one line per distinct version

if [ "$(echo "$uniq_vbios_versions" | wc -l)" -ne 1 ]; then
die 1 "$FUNCNAME: More than 1 VBIOS version found on GPUs! Found '$uniq_vbios_versions' but expected just '$expected_version'"
elif ! echo "$uniq_vbios_versions" | grep -qw "$expected_version"; then
die 1 "$FUNCNAME: GPU VBIOS version does not match the expected '$expected_version', instead got '$uniq_vbios_versions'"
fi
}
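
A hedged usage sketch for the new check; the version string below is a placeholder for illustration only, not a value taken from this PR (the commit history above notes the VBIOS check was later commented out of the confs):

# placeholder VBIOS version for illustration, substitute the value expected for the SKU
* || check_vbios_version 92.00.45.00.06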