JIT build fails for ROCM 6.0 #5474

Xzensi · 2024-04-28T04:15:04Z

Am I safe to assume that DeepSpeed does not yet support ROCm 6.0? A whole lot of errors during JIT build of transformer_inference.

$ pip show torch
Name: torch
Version: 2.3.0+rocm6.0

HIPCC call arguments:

[1/5] /opt/rocm/bin/hipcc  -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/nexus/.conda/envs/xtts/lib/python3.11/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/home/nexus/.conda/envs/xtts/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -isystem /home/nexus/.conda/envs/xtts/lib/python3.11/site-packages/torch/include -isystem /home/nexus/.conda/envs/xtts/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/nexus/.conda/envs/xtts/lib/python3.11/site-packages/torch/include/TH -isystem /home/nexus/.conda/envs/xtts/lib/python3.11/site-packages/torch/include/THC -isystem /home/nexus/.conda/envs/xtts/lib/python3.11/site-packages/torch/include/THH -isystem /opt/rocm/include -isystem /home/nexus/.conda/envs/xtts/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -D__HIP_PLATFORM_AMD__=1 -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++17 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=6 -DROCM_VERSION_MINOR=0 --offload-arch=gfx1100 -fno-gpu-rdc -c /home/nexus/.conda/envs/xtts/lib/python3.11/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip -o apply_rotary_pos_emb.cuda.o

FAILED: apply_rotary_pos_emb.cuda.o

fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated when compiling for gfx1100

FAILED: rms_norm.cuda.o

1 warning and 16 errors generated when compiling for gfx1100.

FAILED: layer_norm.cuda.o

1 warning and 16 errors generated when compiling for gfx1100.

FAILED: pt_binding_hip.o

...

CoquiEngine: Error initializing main coqui engine model: Error building extension 'transformer_inference'

The text was updated successfully, but these errors were encountered:

loadams · 2024-04-29T16:30:21Z

FYI @rraminen and @jithunnair-amd

@jithunnair-amd

This PR enables building the below extensions for AMD GPUs with warp size 32. - transformer_inference - quantizer - random_ltd This PR works stand-alone for torch version <=2.0. For the latest versions, #5401 is required to be merged in addition to this PR. Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on NAVI3x: **transformer_inference:** pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference Before this PR: ===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s (0:01:09) ===== After this PR: ========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s ========== **quantizer:** pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer Before this PR: ==== 244 failed, 8 warnings in 30.53s ==== After this PR: ====== 186 failed, 58 passed, 8 warnings in 8.89s ====== I could not find random_ltd related unit tests to run. Fixes: #4753 #5474 ROCm#68 cc: @jithunnair-amd --------- Co-authored-by: rraminen@amd.com <rraminen> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

@jithunnair-amd

This PR enables building the below extensions for AMD GPUs with warp size 32. - transformer_inference - quantizer - random_ltd This PR works stand-alone for torch version <=2.0. For the latest versions, microsoft#5401 is required to be merged in addition to this PR. Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on NAVI3x: **transformer_inference:** pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference Before this PR: ===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s (0:01:09) ===== After this PR: ========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s ========== **quantizer:** pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer Before this PR: ==== 244 failed, 8 warnings in 30.53s ==== After this PR: ====== 186 failed, 58 passed, 8 warnings in 8.89s ====== I could not find random_ltd related unit tests to run. Fixes: microsoft#4753 microsoft#5474 ROCm#68 cc: @jithunnair-amd --------- Co-authored-by: rraminen@amd.com <rraminen> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

rraminen mentioned this issue Apr 29, 2024

Rocm warp size fix #5402

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT build fails for ROCM 6.0 #5474

JIT build fails for ROCM 6.0 #5474

Xzensi commented Apr 28, 2024

loadams commented Apr 29, 2024

JIT build fails for ROCM 6.0 #5474

JIT build fails for ROCM 6.0 #5474

Comments

Xzensi commented Apr 28, 2024

loadams commented Apr 29, 2024