
[Bug]: RuntimeError: CUDA error: an illegal memory access was encountered #8025

Closed
1 task done
chenchunhui97 opened this issue Aug 30, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@chenchunhui97

Your current environment

vLLM image: v0.5.4
hardware: RTX 4090
GPU driver: 550.78
model: qwen1.5-14b-chat-awq
launch cmd: --enable-prefix-caching
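
For context, a launch along these lines matches the environment above (a minimal sketch only: the report gives just the image tag and --enable-prefix-caching, so the Hugging Face model ID, quantization flag, and port mapping are assumptions):

```bash
# Hypothetical reconstruction of the reported setup; model ID and port are assumptions.
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:v0.5.4 \
    --model Qwen/Qwen1.5-14B-Chat-AWQ \
    --quantization awq \
    --enable-prefix-caching
```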

🐛 Describe the bug

2024-08-30T15:30:57.763092820+08:00 INFO 08-30 15:30:57 async_llm_engine.py:175] Added request chat-1b1cbff0e55642b5a6823f983103f9fd.

2024-08-30T15:30:57.881850637+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58] Engine background task failed

2024-08-30T15:30:57.881886624+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58] Traceback (most recent call last):

2024-08-30T15:30:57.881901781+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 48, in _log_task_completion

2024-08-30T15:30:57.881912691+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return_value = task.result()

2024-08-30T15:30:57.881922782+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 648, in run_engine_loop

2024-08-30T15:30:57.881933110+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     result = task.result()

2024-08-30T15:30:57.881943849+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 591, in engine_step

2024-08-30T15:30:57.881954523+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     request_outputs = await self.engine.step_async(virtual_engine)

2024-08-30T15:30:57.881965522+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 255, in step_async

2024-08-30T15:30:57.881993715+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     output = await self.model_executor.execute_model_async(

2024-08-30T15:30:57.882004030+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 159, in execute_model_async

2024-08-30T15:30:57.882013953+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     output = await make_async(self.driver_worker.execute_model

2024-08-30T15:30:57.882024970+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run

2024-08-30T15:30:57.882034958+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     result = self.fn(*self.args, **self.kwargs)

2024-08-30T15:30:57.882045318+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model

2024-08-30T15:30:57.882055179+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     output = self.model_runner.execute_model(

2024-08-30T15:30:57.882065997+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context

2024-08-30T15:30:57.882076466+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return func(*args, **kwargs)

2024-08-30T15:30:57.882087922+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1363, in execute_model

2024-08-30T15:30:57.882098462+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     hidden_or_intermediate_states = model_executable(

2024-08-30T15:30:57.882108444+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.882118244+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.882127981+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.882137910+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.882148090+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 360, in forward

2024-08-30T15:30:57.882158272+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     hidden_states = self.model(input_ids, positions, kv_caches,

2024-08-30T15:30:57.882168547+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.882178709+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.882188699+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.882198448+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.882208435+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 276, in forward

2024-08-30T15:30:57.882218784+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     hidden_states, residual = layer(

2024-08-30T15:30:57.882229019+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.882239021+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.882281647+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.882296432+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.882307307+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 210, in forward

2024-08-30T15:30:57.882317596+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     hidden_states = self.self_attn(

2024-08-30T15:30:57.882327958+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.882338523+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.882349148+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.882359604+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.882369923+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 157, in forward

2024-08-30T15:30:57.882379833+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)

2024-08-30T15:30:57.882389814+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.882400459+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.882410492+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.882420918+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.882431960+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 98, in forward

2024-08-30T15:30:57.882441261+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return self.impl.forward(query,

2024-08-30T15:30:57.882450978+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 539, in forward

2024-08-30T15:30:57.882460480+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     output[:num_prefill_tokens] = flash_attn_varlen_func(

2024-08-30T15:30:57.882471462+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1154, in flash_attn_varlen_func

2024-08-30T15:30:57.882480943+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return FlashAttnVarlenFunc.apply(

2024-08-30T15:30:57.882490613+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply

2024-08-30T15:30:57.882500755+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return super().apply(*args, **kwargs)  # type: ignore[misc]

2024-08-30T15:30:57.882510257+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 632, in forward

2024-08-30T15:30:57.882520220+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(

2024-08-30T15:30:57.882541902+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 90, in _flash_attn_varlen_forward

2024-08-30T15:30:57.882552115+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(

2024-08-30T15:30:57.882562259+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58] RuntimeError: CUDA error: an illegal memory access was encountered

2024-08-30T15:30:57.882572543+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

2024-08-30T15:30:57.882582869+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58] 

2024-08-30T15:30:57.883627766+08:00 Exception in callback _log_task_completion(error_callback=<bound method...7f74c9effd00>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:38

2024-08-30T15:30:57.883652558+08:00 handle: <Handle _log_task_completion(error_callback=<bound method...7f74c9effd00>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:38>

2024-08-30T15:30:57.883660984+08:00 Traceback (most recent call last):

2024-08-30T15:30:57.883668722+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 48, in _log_task_completion

2024-08-30T15:30:57.883675968+08:00     return_value = task.result()

2024-08-30T15:30:57.883683376+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 648, in run_engine_loop

2024-08-30T15:30:57.883690266+08:00     result = task.result()

2024-08-30T15:30:57.883697904+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 591, in engine_step

2024-08-30T15:30:57.883705047+08:00     request_outputs = await self.engine.step_async(virtual_engine)

2024-08-30T15:30:57.883711872+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 255, in step_async

2024-08-30T15:30:57.883718684+08:00     output = await self.model_executor.execute_model_async(

2024-08-30T15:30:57.883725435+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 159, in execute_model_async

2024-08-30T15:30:57.883732431+08:00     output = await make_async(self.driver_worker.execute_model

2024-08-30T15:30:57.883739257+08:00   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run

2024-08-30T15:30:57.883745978+08:00     result = self.fn(*self.args, **self.kwargs)

2024-08-30T15:30:57.883753514+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model

2024-08-30T15:30:57.883760527+08:00     output = self.model_runner.execute_model(

2024-08-30T15:30:57.883767238+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context

2024-08-30T15:30:57.883774012+08:00     return func(*args, **kwargs)

2024-08-30T15:30:57.883780959+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1363, in execute_model

2024-08-30T15:30:57.883789520+08:00     hidden_or_intermediate_states = model_executable(

2024-08-30T15:30:57.883796396+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.883803361+08:00     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.883810157+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.883816949+08:00     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.883823624+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 360, in forward

2024-08-30T15:30:57.883846483+08:00     hidden_states = self.model(input_ids, positions, kv_caches,

2024-08-30T15:30:57.883853475+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.883857676+08:00 INFO 08-30 15:30:57 async_llm_engine.py:182] Aborted request chat-79d2e9a5de194b4dbbbd72255f6181cb.

2024-08-30T15:30:57.883860506+08:00     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.883890996+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.883903365+08:00     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.883914288+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 276, in forward

2024-08-30T15:30:57.883924748+08:00     hidden_states, residual = layer(

2024-08-30T15:30:57.883935228+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.883945611+08:00     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.883956356+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.883966767+08:00     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.883977205+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 210, in forward

2024-08-30T15:30:57.883987732+08:00     hidden_states = self.self_attn(

2024-08-30T15:30:57.883998134+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.884008304+08:00     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.884018520+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.884028912+08:00     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.884039270+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 157, in forward

2024-08-30T15:30:57.884051129+08:00     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)

2024-08-30T15:30:57.884061444+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.884071978+08:00     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.884082719+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.884093054+08:00     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.884103732+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 98, in forward

2024-08-30T15:30:57.884113978+08:00     return self.impl.forward(query,

2024-08-30T15:30:57.884141341+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 539, in forward

2024-08-30T15:30:57.884153444+08:00     output[:num_prefill_tokens] = flash_attn_varlen_func(

2024-08-30T15:30:57.884166199+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1154, in flash_attn_varlen_func

2024-08-30T15:30:57.884176485+08:00     return FlashAttnVarlenFunc.apply(

2024-08-30T15:30:57.884186954+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply

2024-08-30T15:30:57.884197635+08:00     return super().apply(*args, **kwargs)  # type: ignore[misc]

2024-08-30T15:30:57.884207997+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 632, in forward

2024-08-30T15:30:57.884231053+08:00     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(

2024-08-30T15:30:57.884241778+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 90, in _flash_attn_varlen_forward

2024-08-30T15:30:57.884252120+08:00     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(

2024-08-30T15:30:57.884263623+08:00 RuntimeError: CUDA error: an illegal memory access was encountered

2024-08-30T15:30:57.884274183+08:00 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
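
Because CUDA reports illegal memory accesses asynchronously, the frame shown above (flash_attn_cuda.varlen_fwd) is not necessarily the kernel that actually faulted. A common way to get a more reliable trace is to relaunch with CUDA_LAUNCH_BLOCKING=1, as the full PyTorch error message usually suggests (a sketch, reusing the assumed launch command from the environment section; TORCH_USE_CUDA_DSA additionally requires a PyTorch build with device-side assertions enabled):

```bash
# Synchronous kernel launches make the Python traceback point at the faulting kernel.
# Model ID and port are assumptions, not taken from the report.
docker run --gpus all -p 8000:8000 \
    -e CUDA_LAUNCH_BLOCKING=1 \
    vllm/vllm-openai:v0.5.4 \
    --model Qwen/Qwen1.5-14B-Chat-AWQ \
    --quantization awq \
    --enable-prefix-caching
```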

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@robertgshaw2-neuralmagic
Collaborator

Could you share the access pattern you are using?

E.g. the client script that generates the issue? This would really help us to reproduce and solve it.

@chenchunhui97
Author

> Could you share the access pattern you are using?
>
> E.g. the client script that generates the issue? This would really help us to reproduce and solve it.

I use the benchmark scripts from v0.4.0 and set request_rate=6 for this deployment. Do you mean the tokens I sent to the model?
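
For reference, the benchmark invocation described here could look roughly like this (a sketch only: the flags of benchmarks/benchmark_serving.py changed between vLLM versions, and the dataset path, prompt count, host, and port are placeholders; only request_rate=6 comes from the comment):

```bash
# Hypothetical reproduction command; only --request-rate 6 is taken from the report.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model Qwen/Qwen1.5-14B-Chat-AWQ \
    --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 6 \
    --num-prompts 1000 \
    --host localhost \
    --port 8000
```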

@zoltan-fedor

zoltan-fedor commented Sep 2, 2024

I get a similar error.
vLLM version 0.5.4, using the stock Docker image from your Docker Hub repo.

Parameters used:

```
      - "--model"
      - "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"
      - "--tensor-parallel-size"
      - "4"
      - "--gpu-memory-utilization"
      - "0.95"
      - "--enforce-eager"
      - "--trust-remote-code"
      - "--worker-use-ray"
      - "--enable-prefix-caching"
      - "--dtype"
      - "half"
      - "--max-model-len"
      - "32768"
```
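
For anyone reproducing this outside Kubernetes, the same arguments as a plain docker run would be roughly (a sketch: the image tag, port, and GPU mapping are assumptions; the engine flags are copied from the list above):

```bash
# Engine flags taken from the args list above; image tag and port are assumptions.
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:v0.5.4 \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager \
    --trust-remote-code \
    --worker-use-ray \
    --enable-prefix-caching \
    --dtype half \
    --max-model-len 32768
```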

It also crashes without the prefix caching.

│ await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app │
│ raise exc │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app │
│ await app(scope, receive, sender) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call
│ await self.middleware_stack(scope, receive, send) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app │
│ await route.handle(scope, receive, send) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle │
│ await self.app(scope, receive, send) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app │
│ await wrap_app_handling_exceptions(app, request)(scope, receive, send) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app │
│ raise exc │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app │
│ await app(scope, receive, sender) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app │
│ response = await func(request) │
│ File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app │
│ raw_response = await run_endpoint_function( │
│ File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function │
│ return await dependant.call(**values) │
│ File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 204, in create_completion │
│ generator = await openai_serving_completion.create_completion( │
│ File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 170, in create_completion │
│ async for i, res in result_generator: │
│ File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 346, in consumer │
│ raise e │
│ File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 337, in consumer │
│ raise item │
│ File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 312, in producer │
│ async for item in iterator: │
│ File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate │
│ raise request_output │
│ RuntimeError: CUDA error: an illegal memory access was encountered │
│ CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. │
│ For debugging consider passing CUDA_LAUNCH_BLOCKING=1 │
│ Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. │
│ │
│ [2024-09-02 12:11:54,463 E 61 3464] logging.cc:115: Stack trace: │
│ /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10b96aa) [0x7f3da67f26aa] ray::operator<<() │
│ /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10bc932) [0x7f3da67f5932] ray::TerminateHandler() │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f3eecc6e37c] │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f3eecc6e3e7] │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f3eecc6e36f] │
│ /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe5ab35) [0x7f3e9f182b35] c10d::ProcessGroupNCCL::ncclCommWatchdog() │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f3eecc9adf4] │
│ /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f3eede5c609] start_thread │
│ /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f3eedf96353] __clone │
│ │
│ *** SIGABRT received at time=1725279114 on cpu 19 *** │
│ PC: @ 0x7f3eedeba00b (unknown) raise │
│ @ 0x7f3eedeba090 3216 (unknown) │
│ @ 0x7f3eecc6e37c (unknown) (unknown) │
│ @ 0x7f3eecc6e090 (unknown) (unknown) │
│ [2024-09-02 12:11:54,465 E 61 3464] logging.cc:440: *** SIGABRT received at time=1725279114 on cpu 19 *** │
│ [2024-09-02 12:11:54,465 E 61 3464] logging.cc:440: PC: @ 0x7f3eedeba00b (unknown) raise │
│ [2024-09-02 12:11:54,465 E 61 3464] logging.cc:440: @ 0x7f3eedeba090 3216 (unknown) │
│ [2024-09-02 12:11:54,465 E 61 3464] logging.cc:440: @ 0x7f3eecc6e37c (unknown) (unknown) │
│ [2024-09-02 12:11:54,466 E 61 3464] logging.cc:440: @ 0x7f3eecc6e090 (unknown) (unknown) │
│ Fatal Python error: Aborted │

@TangJiakai

same error in 0.6.1

@mkulariya

+1 0.6.2 GPU:L4

@Jeffrey-JDong

+1 0.6.2 GPU: A800 model: awq

@DaBossCoda

+1 2x4090 on awq

@nightflight-dk

nightflight-dk commented Nov 13, 2024

+1: 8xA100 on Azure, v0.6.3.post1, with Mistral Nemo, Codestral, and Small 22B.

@chenchunhui97
Author

Seems fixed in later versions, starting from v0.6.3.

@kldzj

kldzj commented Dec 18, 2024

Still facing this issue with Llama 3.3 70B awq_marlin at max-model-len=8192
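
A setup like the one described here might be launched roughly as follows (a sketch: the model path is a placeholder and the tensor-parallel size is an assumption; only the awq_marlin quantization and the 8192 context length come from the comment):

```bash
# Placeholder model path; only --quantization awq_marlin and --max-model-len 8192
# are taken from the comment above.
vllm serve <llama-3.3-70b-instruct-awq-model> \
    --quantization awq_marlin \
    --max-model-len 8192 \
    --tensor-parallel-size 2
```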
