
[Bug]: Error: Failed to initialize the TMA descriptor 700 for LLaMa 3.1 405B on 8*H100 -- prefill error? #6870

Closed
pseudotensor opened this issue Jul 28, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@pseudotensor

Your current environment

latest docker image

docker stop llama31-405b  ; docker remove llama31-405b
docker pull vllm/vllm-openai:latest
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=0,1,2,3,4,5,6,7"' \
    --shm-size=10.24gb \
    -p 5020:5020 \
    -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name llama31-405b \
    vllm/vllm-openai:latest \
        --port=5020 \
        --host=0.0.0.0 \
        --model=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
        --seed 1234 \
        --tensor-parallel-size=8 \
        --max-log-len=100 \
        --max-model-len=65536 \
        --max-num-batched-tokens=512 \
        --max_num_seqs=16 \
        --gpu-memory-utilization 0.98 \
        --enable_chunked_prefill=True \
        --enforce-eager \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.llama31_405b2.txt

🐛 Describe the bug

Complete logs

llama31-405b.log.zip

e.g.


Error: Failed to initialize the TMA descriptor 700
[rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 6] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e095a9b5897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7e095a965b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7e095aa8d718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7e095bc8a8e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7e095bc8e9e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7e095bc9405c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7e095bc94dcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7e09a774bdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7e09a880d609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7e09a8947353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
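
The traceback above suggests rerunning with CUDA_LAUNCH_BLOCKING=1 so the illegal memory access is reported at the offending call rather than asynchronously. A minimal sketch of doing that with the setup above, trimmed to the flags needed to illustrate the change (the container name llama31-405b-debug is made up here; the remaining flags are taken from the command in the environment section):

# Debug-only rerun: blocking CUDA launches make the failing kernel show up
# at the right place in the stack trace, at a noticeable cost to throughput.
docker run -d \
    --runtime=nvidia \
    --gpus '"device=0,1,2,3,4,5,6,7"' \
    --shm-size=10.24gb \
    -e CUDA_LAUNCH_BLOCKING=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    --network host \
    --name llama31-405b-debug \
    vllm/vllm-openai:latest \
        --port=5020 \
        --model=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
        --tensor-parallel-size=8 \
        --max-model-len=65536 \
        --enforce-eager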

@robertgshaw2-neuralmagic
Collaborator

Thanks for reporting this. We have resolved the issue with:

This will be in the next release of vLLM (ideally this week). You can use the nightlies to unblock yourself in the meantime.

@hsubbaraj

@pseudotensor Just to confirm, did building from source (main) work for you? I'm running into the same error at runtime with pretty much the same setup as yours.

@pseudotensor
Author

Yes, I built a Docker image from source about 4 days ago. It looks like I used 3eeb148.
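
For anyone else unblocking the same way, a rough sketch of building an image at a pinned commit (the vllm-openai build target comes from the Dockerfile in the vLLM repo; treat the exact invocation as an assumption and check the repo's docs for the current one):

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 3eeb148    # pin to the commit mentioned above
# Build the OpenAI-compatible server image from the repo's Dockerfile.
DOCKER_BUILDKIT=1 docker build . \
    --target vllm-openai \
    --tag vllm/vllm-openai:3eeb148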

@soodrohit

We are seeing the same error when using the Llama-3.1-70B-Instruct model. Am I correct in assuming that the fix will cover the 70B model as well?

@YouNeedCryDear

@robertgshaw2-neuralmagic We are encountering the same issue when serving Llama-3.1-70B-Instruct-FP8 on 2xH100. I can reproduce it consistently once the number of concurrent requests reaches 256, with all engine arguments at their defaults except tensor parallel size, which is set to 2. Do you think this could be an edge case that remains even after the fix for the 405B model?

@chapter544

chapter544 commented Oct 17, 2024

We are also having this issue with Qwen-32B-Instruct-FP8

Error: Failed to initialize the TMA descriptor 700
INFO 10-17 12:26:04 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241017-122604.pkl...
WARNING 10-17 12:26:04 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
WARNING 10-17 12:26:04 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

@robertgshaw2-neuralmagic
Collaborator

@robertgshaw2-neuralmagic We are encountering the same issue when serving Llama-3.1-70B-Instruct-FP8 on 2xH100. I can reproduce it consistently once the number of concurrent requests reaches 256, with all engine arguments at their defaults except tensor parallel size, which is set to 2. Do you think this could be an edge case that remains even after the fix for the 405B model?

What version of vLLM are you running?

@chapter544

As for the vLLM version, we are using 0.6.2 and 0.6.3, and we see the same issue with both. Thanks.
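
One quick way to confirm which build is actually running is to print the package version from inside the container (the container name below is just a placeholder):

# Print the vLLM version reported by the package inside the running container.
docker exec <your-vllm-container> python3 -c "import vllm; print(vllm.__version__)"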

@robertgshaw2-neuralmagic
Collaborator

As for the vLLM version, we are using 0.6.2 and 0.6.3, and we see the same issue with both. Thanks.

Can you share reproduction instructions?

@chapter544

Hi,
We are not sure how to reliably reproduce this error; if you can provide some instructions or hints, we are happy to gather that information. In our case, we started the OpenAI server and sent data through it. Sometimes it takes days and sometimes only a few hours before we see this exception.

Thanks.

Please see the attached file for the error log.

vllm-error-log-10-17-2024.txt

@YouNeedCryDear

YouNeedCryDear commented Oct 18, 2024

@robertgshaw2-neuralmagic This is the command that I use to spin up the vLLM server:

docker run -tid --gpus \"device=4,5\" --shm-size 10g \
    -v /mnt/data/models:/models \
    --ulimit nofile=65535:65535 \
    --name vllm-v0.6.2-llama3.1-70b-instruct-128k-pre-fp8 \
    --network benchmark-network \
    vllm/vllm-openai:v0.6.2 \
        --model=/models/Meta-Llama-3.1-70B-Instruct-FP8 \
        --tensor-parallel-size=2 \
        --served-model-name=vllm-model \
        --port=8080 \
        --disable-log-requests

Then, within the benchmark bridge network, I spin up a Locust server with 128 users constantly sending requests to the vLLM server above. Each request has an input prompt of around 100 tokens, and max_tokens is 100 as well. Most of the time the server crashes with the above error within 30 seconds.
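
For anyone without a Locust setup, a rough shell-only approximation of this load, assuming the server's port 8080 is reachable on localhost and using the served model name vllm-model from the command above (the prompt is arbitrary, and 128 background loops stand in for the 128 Locust users, so treat this as a sketch rather than an exact reproduction):

# Fire 128 concurrent completion loops against the server above.
for i in $(seq 1 128); do
  (
    while true; do
      curl -s http://localhost:8080/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "vllm-model", "prompt": "Write a short paragraph about large language model serving.", "max_tokens": 100}' \
        > /dev/null
    done
  ) &
done
wait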
