
[Bug]: Error: Failed to initialize the TMA descriptor 700 for LLaMa 3.1 405B on 8*H100 -- prefill error? #6870

Closed
pseudotensor opened this issue Jul 28, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@pseudotensor

Your current environment

latest docker image

docker stop llama31-405b  ; docker remove llama31-405b
docker pull vllm/vllm-openai:latest
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=0,1,2,3,4,5,6,7"' \
    --shm-size=10.24gb \
    -p 5020:5020 \
    -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name llama31-405b \
    vllm/vllm-openai:latest \
        --port=5020 \
        --host=0.0.0.0 \
        --model=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
        --seed 1234 \
        --tensor-parallel-size=8 \
        --max-log-len=100 \
        --max-model-len=65536 \
        --max-num-batched-tokens=512 \
        --max_num_seqs=16 \
        --gpu-memory-utilization 0.98 \
        --enable_chunked_prefill=True \
        --enforce-eager \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.llama31_405b2.txt

🐛 Describe the bug

Complete logs

llama31-405b.log.zip

e.g.


Error: Failed to initialize the TMA descriptor 700
[rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 6] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e095a9b5897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7e095a965b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7e095aa8d718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7e095bc8a8e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7e095bc8e9e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7e095bc9405c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7e095bc94dcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7e09a774bdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7e09a880d609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7e09a8947353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
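
The traceback above suggests rerunning with CUDA_LAUNCH_BLOCKING=1 so the illegal memory access is reported at the offending call rather than asynchronously. A minimal sketch of doing that with the setup above, trimmed to the flags needed to illustrate the change (the container name llama31-405b-debug is made up here; the remaining flags are taken from the command in the environment section):

# Debug-only rerun: blocking CUDA launches make the failing kernel show up
# at the right place in the stack trace, at a noticeable cost to throughput.
docker run -d \
    --runtime=nvidia \
    --gpus '"device=0,1,2,3,4,5,6,7"' \
    --shm-size=10.24gb \
    -e CUDA_LAUNCH_BLOCKING=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    --network host \
    --name llama31-405b-debug \
    vllm/vllm-openai:latest \
        --port=5020 \
        --model=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
        --tensor-parallel-size=8 \
        --max-model-len=65536 \
        --enforce-eager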

@robertgshaw2-neuralmagic
Collaborator

Thanks for reporting this. We have resolved the issue with:

This will be in the next release of vLLM (ideally this week). You can use the nightlies to unblock yourself in the meantime.

@hsubbaraj

@pseudotensor Just to confirm, did building from source (main) work for you? I'm running into the same error at runtime with pretty much the same setup as yours.

@pseudotensor
Author

Yes, I built a Docker image from source about 4 days ago. It looks like I used 3eeb148.
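
For anyone else unblocking the same way, a rough sketch of building an image at a pinned commit (the vllm-openai build target comes from the Dockerfile in the vLLM repo; treat the exact invocation as an assumption and check the repo's docs for the current one):

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 3eeb148    # pin to the commit mentioned above
# Build the OpenAI-compatible server image from the repo's Dockerfile.
DOCKER_BUILDKIT=1 docker build . \
    --target vllm-openai \
    --tag vllm/vllm-openai:3eeb148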

@soodrohit

We are seeing the same error when using the Llama-3.1-70B-Instruct model. Am I correct in assuming that the fix will cover the 70B model as well?

@YouNeedCryDear

@robertgshaw2-neuralmagic We are encountering the same issue when serving Llama-3.1-70B-Instruct-FP8 on 2xH100. I can reproduce it consistently once the number of concurrent requests reaches 256, with all engine arguments at their defaults except tensor parallel size, which is set to 2. Do you think this could be an edge case that remains even after the fix for the 405B model?

@chapter544

chapter544 commented Oct 17, 2024

We are also having this issue with Qwen-32B-Instruct-FP8

Error: Failed to initialize the TMA descriptor 700
INFO 10-17 12:26:04 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241017-122604.pkl...
WARNING 10-17 12:26:04 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
WARNING 10-17 12:26:04 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

@robertgshaw2-neuralmagic
Collaborator

@robertgshaw2-neuralmagic We are encountering the same issue when serving Llama-3.1-70B-Instruct-FP8 on 2xH100. I can reproduce it consistently once the number of concurrent requests reaches 256, with all engine arguments at their defaults except tensor parallel size, which is set to 2. Do you think this could be an edge case that remains even after the fix for the 405B model?

What version of vLLM are you running?

@chapter544

As for the vLLM version, we are using 0.6.2 and 0.6.3, and we see the same issue with both. Thanks.
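
One quick way to confirm which build is actually running is to print the package version from inside the container (the container name below is just a placeholder):

# Print the vLLM version reported by the package inside the running container.
docker exec <your-vllm-container> python3 -c "import vllm; print(vllm.__version__)"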

@robertgshaw2-neuralmagic
Collaborator

As for the vLLM version, we are using 0.6.2 and 0.6.3, and we see the same issue with both. Thanks.

Can you share reproduction instructions?

@chapter544

Hi,
We are not sure how to reliably reproduce this error; if you can provide some instructions or hints, we are happy to gather that information. In our case, we started the OpenAI server and sent data through it. Sometimes it takes days and sometimes only a few hours before we see this exception.

Thanks.

Please see the attached file for the error log.

vllm-error-log-10-17-2024.txt

@YouNeedCryDear

YouNeedCryDear commented Oct 18, 2024

@robertgshaw2-neuralmagic This is the command that I use to spin up the vLLM server:

docker run -tid --gpus \"device=4,5\" --shm-size 10g \
    -v /mnt/data/models:/models \
    --ulimit nofile=65535:65535 \
    --name vllm-v0.6.2-llama3.1-70b-instruct-128k-pre-fp8 \
    --network benchmark-network \
    vllm/vllm-openai:v0.6.2 \
        --model=/models/Meta-Llama-3.1-70B-Instruct-FP8 \
        --tensor-parallel-size=2 \
        --served-model-name=vllm-model \
        --port=8080 \
        --disable-log-requests

Then, within the benchmark bridge network, I spin up a Locust server with 128 users constantly sending requests to the vLLM server above. Each request has an input prompt of around 100 tokens, and max_tokens is 100 as well. Most of the time the server crashes with the above error within 30 seconds.
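
For anyone without a Locust setup, a rough shell-only approximation of this load, assuming the server's port 8080 is reachable on localhost and using the served model name vllm-model from the command above (the prompt is arbitrary, and 128 background loops stand in for the 128 Locust users, so treat this as a sketch rather than an exact reproduction):

# Fire 128 concurrent completion loops against the server above.
for i in $(seq 1 128); do
  (
    while true; do
      curl -s http://localhost:8080/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "vllm-model", "prompt": "Write a short paragraph about large language model serving.", "max_tokens": 100}' \
        > /dev/null
    done
  ) &
done
wait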
