
[Bug]: RuntimeError: CUDA error: an illegal memory access was encountered #8025

Closed
1 task done
chenchunhui97 opened this issue Aug 30, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@chenchunhui97

Your current environment

vLLM image: v0.5.4
hardware: RTX 4090
GPU driver: 550.78
model: qwen1.5-14b-chat-awq
launch cmd: --enable-prefix-caching
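
For context, a launch along these lines matches the environment above (a minimal sketch only: the report gives just the image tag and --enable-prefix-caching, so the Hugging Face model ID, quantization flag, and port mapping are assumptions):

```bash
# Hypothetical reconstruction of the reported setup; model ID and port are assumptions.
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:v0.5.4 \
    --model Qwen/Qwen1.5-14B-Chat-AWQ \
    --quantization awq \
    --enable-prefix-caching
```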

🐛 Describe the bug

2024-08-30T15:30:57.763092820+08:00 INFO 08-30 15:30:57 async_llm_engine.py:175] Added request chat-1b1cbff0e55642b5a6823f983103f9fd.

2024-08-30T15:30:57.881850637+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58] Engine background task failed

2024-08-30T15:30:57.881886624+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58] Traceback (most recent call last):

2024-08-30T15:30:57.881901781+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 48, in _log_task_completion

2024-08-30T15:30:57.881912691+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return_value = task.result()

2024-08-30T15:30:57.881922782+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 648, in run_engine_loop

2024-08-30T15:30:57.881933110+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     result = task.result()

2024-08-30T15:30:57.881943849+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 591, in engine_step

2024-08-30T15:30:57.881954523+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     request_outputs = await self.engine.step_async(virtual_engine)

2024-08-30T15:30:57.881965522+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 255, in step_async

2024-08-30T15:30:57.881993715+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     output = await self.model_executor.execute_model_async(

2024-08-30T15:30:57.882004030+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 159, in execute_model_async

2024-08-30T15:30:57.882013953+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     output = await make_async(self.driver_worker.execute_model

2024-08-30T15:30:57.882024970+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run

2024-08-30T15:30:57.882034958+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     result = self.fn(*self.args, **self.kwargs)

2024-08-30T15:30:57.882045318+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model

2024-08-30T15:30:57.882055179+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     output = self.model_runner.execute_model(

2024-08-30T15:30:57.882065997+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context

2024-08-30T15:30:57.882076466+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return func(*args, **kwargs)

2024-08-30T15:30:57.882087922+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1363, in execute_model

2024-08-30T15:30:57.882098462+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     hidden_or_intermediate_states = model_executable(

2024-08-30T15:30:57.882108444+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.882118244+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.882127981+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.882137910+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.882148090+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 360, in forward

2024-08-30T15:30:57.882158272+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     hidden_states = self.model(input_ids, positions, kv_caches,

2024-08-30T15:30:57.882168547+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.882178709+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.882188699+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.882198448+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.882208435+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 276, in forward

2024-08-30T15:30:57.882218784+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     hidden_states, residual = layer(

2024-08-30T15:30:57.882229019+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.882239021+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.882281647+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.882296432+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.882307307+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 210, in forward

2024-08-30T15:30:57.882317596+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     hidden_states = self.self_attn(

2024-08-30T15:30:57.882327958+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.882338523+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.882349148+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.882359604+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.882369923+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 157, in forward

2024-08-30T15:30:57.882379833+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)

2024-08-30T15:30:57.882389814+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.882400459+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.882410492+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.882420918+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.882431960+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 98, in forward

2024-08-30T15:30:57.882441261+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return self.impl.forward(query,

2024-08-30T15:30:57.882450978+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 539, in forward

2024-08-30T15:30:57.882460480+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     output[:num_prefill_tokens] = flash_attn_varlen_func(

2024-08-30T15:30:57.882471462+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1154, in flash_attn_varlen_func

2024-08-30T15:30:57.882480943+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return FlashAttnVarlenFunc.apply(

2024-08-30T15:30:57.882490613+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply

2024-08-30T15:30:57.882500755+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     return super().apply(*args, **kwargs)  # type: ignore[misc]

2024-08-30T15:30:57.882510257+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 632, in forward

2024-08-30T15:30:57.882520220+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(

2024-08-30T15:30:57.882541902+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 90, in _flash_attn_varlen_forward

2024-08-30T15:30:57.882552115+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58]     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(

2024-08-30T15:30:57.882562259+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58] RuntimeError: CUDA error: an illegal memory access was encountered

2024-08-30T15:30:57.882572543+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

2024-08-30T15:30:57.882582869+08:00 ERROR 08-30 15:30:57 async_llm_engine.py:58] 

2024-08-30T15:30:57.883627766+08:00 Exception in callback _log_task_completion(error_callback=<bound method...7f74c9effd00>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:38

2024-08-30T15:30:57.883652558+08:00 handle: <Handle _log_task_completion(error_callback=<bound method...7f74c9effd00>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:38>

2024-08-30T15:30:57.883660984+08:00 Traceback (most recent call last):

2024-08-30T15:30:57.883668722+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 48, in _log_task_completion

2024-08-30T15:30:57.883675968+08:00     return_value = task.result()

2024-08-30T15:30:57.883683376+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 648, in run_engine_loop

2024-08-30T15:30:57.883690266+08:00     result = task.result()

2024-08-30T15:30:57.883697904+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 591, in engine_step

2024-08-30T15:30:57.883705047+08:00     request_outputs = await self.engine.step_async(virtual_engine)

2024-08-30T15:30:57.883711872+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 255, in step_async

2024-08-30T15:30:57.883718684+08:00     output = await self.model_executor.execute_model_async(

2024-08-30T15:30:57.883725435+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 159, in execute_model_async

2024-08-30T15:30:57.883732431+08:00     output = await make_async(self.driver_worker.execute_model

2024-08-30T15:30:57.883739257+08:00   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run

2024-08-30T15:30:57.883745978+08:00     result = self.fn(*self.args, **self.kwargs)

2024-08-30T15:30:57.883753514+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model

2024-08-30T15:30:57.883760527+08:00     output = self.model_runner.execute_model(

2024-08-30T15:30:57.883767238+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context

2024-08-30T15:30:57.883774012+08:00     return func(*args, **kwargs)

2024-08-30T15:30:57.883780959+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1363, in execute_model

2024-08-30T15:30:57.883789520+08:00     hidden_or_intermediate_states = model_executable(

2024-08-30T15:30:57.883796396+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.883803361+08:00     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.883810157+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.883816949+08:00     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.883823624+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 360, in forward

2024-08-30T15:30:57.883846483+08:00     hidden_states = self.model(input_ids, positions, kv_caches,

2024-08-30T15:30:57.883853475+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.883857676+08:00 INFO 08-30 15:30:57 async_llm_engine.py:182] Aborted request chat-79d2e9a5de194b4dbbbd72255f6181cb.

2024-08-30T15:30:57.883860506+08:00     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.883890996+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.883903365+08:00     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.883914288+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 276, in forward

2024-08-30T15:30:57.883924748+08:00     hidden_states, residual = layer(

2024-08-30T15:30:57.883935228+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.883945611+08:00     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.883956356+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.883966767+08:00     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.883977205+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 210, in forward

2024-08-30T15:30:57.883987732+08:00     hidden_states = self.self_attn(

2024-08-30T15:30:57.883998134+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.884008304+08:00     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.884018520+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.884028912+08:00     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.884039270+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 157, in forward

2024-08-30T15:30:57.884051129+08:00     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)

2024-08-30T15:30:57.884061444+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl

2024-08-30T15:30:57.884071978+08:00     return self._call_impl(*args, **kwargs)

2024-08-30T15:30:57.884082719+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl

2024-08-30T15:30:57.884093054+08:00     return forward_call(*args, **kwargs)

2024-08-30T15:30:57.884103732+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 98, in forward

2024-08-30T15:30:57.884113978+08:00     return self.impl.forward(query,

2024-08-30T15:30:57.884141341+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 539, in forward

2024-08-30T15:30:57.884153444+08:00     output[:num_prefill_tokens] = flash_attn_varlen_func(

2024-08-30T15:30:57.884166199+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1154, in flash_attn_varlen_func

2024-08-30T15:30:57.884176485+08:00     return FlashAttnVarlenFunc.apply(

2024-08-30T15:30:57.884186954+08:00   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply

2024-08-30T15:30:57.884197635+08:00     return super().apply(*args, **kwargs)  # type: ignore[misc]

2024-08-30T15:30:57.884207997+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 632, in forward

2024-08-30T15:30:57.884231053+08:00     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(

2024-08-30T15:30:57.884241778+08:00   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 90, in _flash_attn_varlen_forward

2024-08-30T15:30:57.884252120+08:00     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(

2024-08-30T15:30:57.884263623+08:00 RuntimeError: CUDA error: an illegal memory access was encountered

2024-08-30T15:30:57.884274183+08:00 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
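
Because CUDA reports illegal memory accesses asynchronously, the frame shown above (flash_attn_cuda.varlen_fwd) is not necessarily the kernel that actually faulted. A common way to get a more reliable trace is to relaunch with CUDA_LAUNCH_BLOCKING=1, as the full PyTorch error message usually suggests (a sketch, reusing the assumed launch command from the environment section; TORCH_USE_CUDA_DSA additionally requires a PyTorch build with device-side assertions enabled):

```bash
# Synchronous kernel launches make the Python traceback point at the faulting kernel.
# Model ID and port are assumptions, not taken from the report.
docker run --gpus all -p 8000:8000 \
    -e CUDA_LAUNCH_BLOCKING=1 \
    vllm/vllm-openai:v0.5.4 \
    --model Qwen/Qwen1.5-14B-Chat-AWQ \
    --quantization awq \
    --enable-prefix-caching
```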

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@robertgshaw2-neuralmagic
Collaborator

Could you share the access pattern you are using?

E.g. the client script that generates the issue? This would really help us to reproduce and solve it.

@chenchunhui97
Author

> Could you share the access pattern you are using?
>
> E.g. the client script that generates the issue? This would really help us to reproduce and solve it.

I use the benchmark scripts from v0.4.0 and set request_rate=6 for this deployment. Do you mean the tokens I sent to the model?
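
For reference, the benchmark invocation described here could look roughly like this (a sketch only: the flags of benchmarks/benchmark_serving.py changed between vLLM versions, and the dataset path, prompt count, host, and port are placeholders; only request_rate=6 comes from the comment):

```bash
# Hypothetical reproduction command; only --request-rate 6 is taken from the report.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model Qwen/Qwen1.5-14B-Chat-AWQ \
    --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 6 \
    --num-prompts 1000 \
    --host localhost \
    --port 8000
```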

@zoltan-fedor

zoltan-fedor commented Sep 2, 2024

I get a similar error.
vLLM version 0.5.4, using the stock Docker image from your Docker Hub repo.

Parameters used:

```
      - "--model"
      - "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"
      - "--tensor-parallel-size"
      - "4"
      - "--gpu-memory-utilization"
      - "0.95"
      - "--enforce-eager"
      - "--trust-remote-code"
      - "--worker-use-ray"
      - "--enable-prefix-caching"
      - "--dtype"
      - "half"
      - "--max-model-len"
      - "32768"
```
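
For anyone reproducing this outside Kubernetes, the same arguments as a plain docker run would be roughly (a sketch: the image tag, port, and GPU mapping are assumptions; the engine flags are copied from the list above):

```bash
# Engine flags taken from the args list above; image tag and port are assumptions.
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:v0.5.4 \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager \
    --trust-remote-code \
    --worker-use-ray \
    --enable-prefix-caching \
    --dtype half \
    --max-model-len 32768
```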

It also crashes without the prefix caching.

│ await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app │
│ raise exc │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app │
│ await app(scope, receive, sender) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call
│ await self.middleware_stack(scope, receive, send) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app │
│ await route.handle(scope, receive, send) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle │
│ await self.app(scope, receive, send) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app │
│ await wrap_app_handling_exceptions(app, request)(scope, receive, send) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app │
│ raise exc │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app │
│ await app(scope, receive, sender) │
│ File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app │
│ response = await func(request) │
│ File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app │
│ raw_response = await run_endpoint_function( │
│ File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function │
│ return await dependant.call(**values) │
│ File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 204, in create_completion │
│ generator = await openai_serving_completion.create_completion( │
│ File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 170, in create_completion │
│ async for i, res in result_generator: │
│ File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 346, in consumer │
│ raise e │
│ File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 337, in consumer │
│ raise item │
│ File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 312, in producer │
│ async for item in iterator: │
│ File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate │
│ raise request_output │
│ RuntimeError: CUDA error: an illegal memory access was encountered │
│ CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. │
│ For debugging consider passing CUDA_LAUNCH_BLOCKING=1 │
│ Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. │
│ │
│ [2024-09-02 12:11:54,463 E 61 3464] logging.cc:115: Stack trace: │
│ /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10b96aa) [0x7f3da67f26aa] ray::operator<<() │
│ /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10bc932) [0x7f3da67f5932] ray::TerminateHandler() │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f3eecc6e37c] │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f3eecc6e3e7] │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f3eecc6e36f] │
│ /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe5ab35) [0x7f3e9f182b35] c10d::ProcessGroupNCCL::ncclCommWatchdog() │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f3eecc9adf4] │
│ /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f3eede5c609] start_thread │
│ /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f3eedf96353] __clone │
│ │
│ *** SIGABRT received at time=1725279114 on cpu 19 *** │
│ PC: @ 0x7f3eedeba00b (unknown) raise │
│ @ 0x7f3eedeba090 3216 (unknown) │
│ @ 0x7f3eecc6e37c (unknown) (unknown) │
│ @ 0x7f3eecc6e090 (unknown) (unknown) │
│ [2024-09-02 12:11:54,465 E 61 3464] logging.cc:440: *** SIGABRT received at time=1725279114 on cpu 19 *** │
│ [2024-09-02 12:11:54,465 E 61 3464] logging.cc:440: PC: @ 0x7f3eedeba00b (unknown) raise │
│ [2024-09-02 12:11:54,465 E 61 3464] logging.cc:440: @ 0x7f3eedeba090 3216 (unknown) │
│ [2024-09-02 12:11:54,465 E 61 3464] logging.cc:440: @ 0x7f3eecc6e37c (unknown) (unknown) │
│ [2024-09-02 12:11:54,466 E 61 3464] logging.cc:440: @ 0x7f3eecc6e090 (unknown) (unknown) │
│ Fatal Python error: Aborted │

@TangJiakai

same error in 0.6.1

@mkulariya

+1 0.6.2 GPU:L4

@Jeffrey-JDong

+1 0.6.2 GPU: A800 model: awq

@DaBossCoda

+1 2x4090 on awq

@nightflight-dk

nightflight-dk commented Nov 13, 2024

+1: 8xA100 on Azure, v0.6.3.post1, with Mistral Nemo, Codestral, and Small 22B.

@chenchunhui97
Author

Seems fixed in later versions, starting from v0.6.3.

@kldzj

kldzj commented Dec 18, 2024

Still facing this issue with Llama 3.3 70B awq_marlin at max-model-len=8192
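
A setup like the one described here might be launched roughly as follows (a sketch: the model path is a placeholder and the tensor-parallel size is an assumption; only the awq_marlin quantization and the 8192 context length come from the comment):

```bash
# Placeholder model path; only --quantization awq_marlin and --max-model-len 8192
# are taken from the comment above.
vllm serve <llama-3.3-70b-instruct-awq-model> \
    --quantization awq_marlin \
    --max-model-len 8192 \
    --tensor-parallel-size 2
```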
