
[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError #8194

Closed
NicolasDrapier opened this issue Sep 5, 2024 · 13 comments · Fixed by #7928
Labels
bug Something isn't working

Comments

@NicolasDrapier

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: openSUSE Tumbleweed (x86_64)
GCC version: (SUSE Linux) 13.2.1 20240206 [revision 67ac78caf31f7cb3202177e6428a46d829b70f23]
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.39

Python version: 3.11.9 (main, Apr 08 2024, 06:18:15) [GCC] (64-bit runtime)
Python platform: Linux-6.8.5-1-default-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S
GPU 2: NVIDIA L40S
GPU 3: NVIDIA L40S
GPU 4: NVIDIA L40S
GPU 5: NVIDIA L40S
GPU 6: NVIDIA L40S
GPU 7: NVIDIA L40S

Nvidia driver version: 550.67
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               48
On-line CPU(s) list:                  0-47
Vendor ID:                            AuthenticAMD
BIOS Vendor ID:                       Advanced Micro Devices, Inc.
Model name:                           AMD EPYC 9254 24-Core Processor
BIOS Model name:                      AMD EPYC 9254 24-Core Processor                 Unknown CPU @ 2.9GHz
BIOS CPU family:                      107
CPU family:                           25
Model:                                17
Thread(s) per core:                   1
Core(s) per socket:                   24
Socket(s):                            2
Stepping:                             1
Frequency boost:                      enabled
CPU(s) scaling MHz:                   52%
CPU max MHz:                          4151.7568
CPU min MHz:                          1500.0000
BogoMIPS:                             5793.37
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
Virtualization:                       AMD-V
L1d cache:                            1.5 MiB (48 instances)
L1i cache:                            1.5 MiB (48 instances)
L2 cache:                             48 MiB (48 instances)
L3 cache:                             256 MiB (8 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-23
NUMA node1 CPU(s):                    24-47
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; Safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] flashinfer==0.1.1+cu121torch2.3
[pip3] mypy-extensions==1.0.0
[pip3] mypy-protobuf==3.6.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.535.133
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.20
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pytorch-ranger==0.1.1
[pip3] pyzmq==26.0.0
[pip3] torch==2.4.0
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==2.4.0
[pip3] torchmetrics==1.3.2
[pip3] torchvision==0.19.0
[pip3] transformers==4.43.1
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4@4db5176d9758b720b05460c50ace3c01026eb158
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     SYS     SYS     SYS     SYS     SYS     SYS     0-23    0               N/A
GPU1    PIX      X      SYS     SYS     SYS     SYS     SYS     SYS     0-23    0               N/A
GPU2    SYS     SYS      X      PIX     SYS     SYS     SYS     SYS     0-23    0               N/A
GPU3    SYS     SYS     PIX      X      SYS     SYS     SYS     SYS     0-23    0               N/A
GPU4    SYS     SYS     SYS     SYS      X      PIX     SYS     SYS     24-47   1               N/A
GPU5    SYS     SYS     SYS     SYS     PIX      X      SYS     SYS     24-47   1               N/A
GPU6    SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     24-47   1               N/A
GPU7    SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      24-47   1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

Hi,

I am using vLLM v0.6.0 from commit 8685ba1, and I built the Docker image with this command:

DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai:v0.6.0-flashinfer --build-arg max_jobs=32 --build-arg nvcc_threads=8 --build-arg torch_cuda_arch_list=""

I built the image myself because of this, but that should not matter.

Here is the command I am trying to run:

docker run --rm --runtime nvidia --gpus all \
--name vllm-mistral \
-e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
-p 8090:8000 \
-v /data/vllm/huggingface:/root/vllm/huggingface \
-v /data/models/mistral/mistral-large-instruct-2407-awq:/root/data/mistral-large-instruct-2407-awq \
--ipc=host \
vllm/vllm-openai:v0.6.0-flashinfer \
--host 0.0.0.0 \
--model /root/data/mistral-large-instruct-2407-awq \
--disable-custom-all-reduce \
--distributed-executor-backend ray \
--tensor-parallel-size 4 \
--max-model-len $((1024*100)) \
--max-num-seqs 16 \
--num-scheduler-steps 8 \
--trust-remote-code \
--kv-cache-dtype fp8_e4m3 \
--use-v2-block-manager \
--enable-chunked-prefill=False \
--quantization awq_marlin

Here is the error I get when I send a request:

ERROR 09-05 05:00:27 worker_base.py:464] Error executing method execute_model. This might cause deadlock in distributed execution.
ERROR 09-05 05:00:27 worker_base.py:464] Traceback (most recent call last):
ERROR 09-05 05:00:27 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 456, in execute_method
ERROR 09-05 05:00:27 worker_base.py:464]     return executor(*args, **kwargs)
ERROR 09-05 05:00:27 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-05 05:00:27 worker_base.py:464]     output = self.model_runner.execute_model(
ERROR 09-05 05:00:27 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-05 05:00:27 worker_base.py:464]     return func(*args, **kwargs)
ERROR 09-05 05:00:27 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 390, in execute_model
ERROR 09-05 05:00:27 worker_base.py:464]     model_input = self._advance_step(
ERROR 09-05 05:00:27 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 499, in _advance_step
ERROR 09-05 05:00:27 worker_base.py:464]     assert isinstance(attn_metadata, FlashAttentionMetadata)
ERROR 09-05 05:00:27 worker_base.py:464] AssertionError
ERROR 09-05 05:00:27 async_llm_engine.py:63] Engine background task failed
ERROR 09-05 05:00:27 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 09-05 05:00:27 async_llm_engine.py:63]     return_value = task.result()
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
ERROR 09-05 05:00:27 async_llm_engine.py:63]     result = task.result()
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
ERROR 09-05 05:00:27 async_llm_engine.py:63]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
ERROR 09-05 05:00:27 async_llm_engine.py:63]     output = await self.model_executor.execute_model_async(
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 502, in execute_model_async
ERROR 09-05 05:00:27 async_llm_engine.py:63]     return await super().execute_model_async(execute_model_req)
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
ERROR 09-05 05:00:27 async_llm_engine.py:63]     return await self._driver_execute_model_async(execute_model_req)
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 519, in _driver_execute_model_async
ERROR 09-05 05:00:27 async_llm_engine.py:63]     return await self.driver_exec_method("execute_model",
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 09-05 05:00:27 async_llm_engine.py:63]     result = self.fn(*self.args, **self.kwargs)
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 465, in execute_method
ERROR 09-05 05:00:27 async_llm_engine.py:63]     raise e
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 456, in execute_method
ERROR 09-05 05:00:27 async_llm_engine.py:63]     return executor(*args, **kwargs)
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-05 05:00:27 async_llm_engine.py:63]     output = self.model_runner.execute_model(
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-05 05:00:27 async_llm_engine.py:63]     return func(*args, **kwargs)
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 390, in execute_model
ERROR 09-05 05:00:27 async_llm_engine.py:63]     model_input = self._advance_step(
ERROR 09-05 05:00:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 499, in _advance_step
ERROR 09-05 05:00:27 async_llm_engine.py:63]     assert isinstance(attn_metadata, FlashAttentionMetadata)
ERROR 09-05 05:00:27 async_llm_engine.py:63] AssertionError
Exception in callback functools.partial(<function _log_task_completion at 0x7f47d481f640>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f47d13b96f0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f47d481f640>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f47d13b96f0>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 502, in execute_model_async
    return await super().execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 519, in _driver_execute_model_async
    return await self.driver_exec_method("execute_model",
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 465, in execute_method
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 456, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 390, in execute_model
    model_input = self._advance_step(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 499, in _advance_step
    assert isinstance(attn_metadata, FlashAttentionMetadata)
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 65, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR 09-05 05:00:27 client.py:266] Got Unhealthy response from RPC Server
ERROR 09-05 05:00:27 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 09-05 05:00:27 client.py:412] Traceback (most recent call last):
ERROR 09-05 05:00:27 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 09-05 05:00:27 client.py:412]     await self.check_health(socket=socket)
ERROR 09-05 05:00:27 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 429, in check_health
ERROR 09-05 05:00:27 client.py:412]     await self._send_one_way_rpc_request(
ERROR 09-05 05:00:27 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 267, in _send_one_way_rpc_request
ERROR 09-05 05:00:27 client.py:412]     raise response
ERROR 09-05 05:00:27 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 257, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 253, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 230, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f5c1e58ebf0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 250, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)

I am opening this issue because of:

vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
NicolasDrapier added the bug label on Sep 5, 2024
@stefanobranco

Does this also happen without multi-step scheduling?

@ShangmingCai
Contributor

Try removing --num-scheduler-steps 8.

Flash attn is not supported on Volta and Turing GPUs, therefore this assertion will fail.
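
For reference, a minimal sketch of the adjusted launch command, taken from the invocation in the report above with only --num-scheduler-steps 8 removed (all other arguments are assumed to stay as they are):

docker run --rm --runtime nvidia --gpus all \
--name vllm-mistral \
-e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
-p 8090:8000 \
-v /data/vllm/huggingface:/root/vllm/huggingface \
-v /data/models/mistral/mistral-large-instruct-2407-awq:/root/data/mistral-large-instruct-2407-awq \
--ipc=host \
vllm/vllm-openai:v0.6.0-flashinfer \
--host 0.0.0.0 \
--model /root/data/mistral-large-instruct-2407-awq \
--disable-custom-all-reduce \
--distributed-executor-backend ray \
--tensor-parallel-size 4 \
--max-model-len $((1024*100)) \
--max-num-seqs 16 \
--trust-remote-code \
--kv-cache-dtype fp8_e4m3 \
--use-v2-block-manager \
--enable-chunked-prefill=False \
--quantization awq_marlin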

@NicolasDrapier
Author

Hi, thank you all for the answers. Here are a few points:

  • @stefanobranco It does not happen without multi-step scheduling, but I think it's a shame to be deprived of such an important optimization. Ideally, I'd like it to work with multi-step scheduling enabled.
  • @ShangmingCai My GPUs are L40S, so Ada Lovelace architecture.

@ShangmingCai
Contributor

> Hi, thank you all for the answers. Here are a few points:
>
>   • @stefanobranco It does not happen without multi-step scheduling, but I think it's a shame to be deprived of such an important optimization. Ideally, I'd like it to work with multi-step scheduling enabled.
>   • @ShangmingCai My GPUs are L40S, so Ada Lovelace architecture.

Right. What I really mean is that flash-attn only supports Ampere or newer GPUs; for earlier GPUs, vLLM uses the xformers backend. Since multi-step scheduling is currently implemented only with the flash-attn backend, it is not supported on those older GPUs for now. Maybe it will be supported in the future once all backends are considered.
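
To make the failure mode concrete, here is a minimal, self-contained Python sketch of the kind of type check that _advance_step performs. The two metadata classes below are local stand-ins for illustration only, not vLLM's real classes:

from dataclasses import dataclass

# Stand-ins for the per-backend attention metadata types (illustration only).
@dataclass
class FlashAttentionMetadata:
    seq_lens: list

@dataclass
class FlashInferMetadata:
    seq_lens: list

def advance_step(attn_metadata):
    # Multi-step scheduling (as of this report) only knows how to advance
    # flash-attn metadata in place, so it asserts on the concrete type.
    assert isinstance(attn_metadata, FlashAttentionMetadata)
    # ... update sequence lengths, slot mappings, etc. for the next step ...
    return attn_metadata

# Works when the flash-attn backend produced the metadata:
advance_step(FlashAttentionMetadata(seq_lens=[5]))

# The flashinfer backend produces a different metadata type, so the
# assertion fails, matching the AssertionError in the traceback above.
try:
    advance_step(FlashInferMetadata(seq_lens=[5]))
except AssertionError:
    print("AssertionError: only FlashAttentionMetadata is accepted here")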

@NicolasDrapier
Author

> Right. What I really mean is that flash-attn only supports Ampere or newer GPUs; for earlier GPUs, vLLM uses the xformers backend. Since multi-step scheduling is currently implemented only with the flash-attn backend, it is not supported on those older GPUs for now. Maybe it will be supported in the future once all backends are considered.

Ok, but Ada Lovelace is the generation that succeeds Ampere. It's a “consumer” architecture, even if the L40S are server GPUs. If someone can assure me that this architecture in particular is not supported, fine, but for the moment my impression is that my hardware meets every requirement for flash attention (and indeed it does) and for flashinfer.

@ShangmingCai
Contributor

> Right. What I really mean is that flash-attn only supports Ampere or newer GPUs; for earlier GPUs, vLLM uses the xformers backend. Since multi-step scheduling is currently implemented only with the flash-attn backend, it is not supported on those older GPUs for now. Maybe it will be supported in the future once all backends are considered.

> Ok, but Ada Lovelace is the generation that succeeds Ampere. It's a “consumer” architecture, even if the L40S are server GPUs. If someone can assure me that this architecture in particular is not supported, fine, but for the moment my impression is that my hardware meets every requirement for flash attention (and indeed it does) and for flashinfer.

My bad, I didn't know Ada Lovelace was newer. Currently, this feature only supports flash_attn and rocm_flash_attn. You can check which backend you are using and manually set it to FlashAttentionBackend or ROCmFlashAttention.

If you are using flashinfer, you will see this in your startup info:

INFO xx-xx xx:xx:xx selector.py:142] Using Flashinfer backend.

I don't know whether this workaround will help, since my own GPU is Volta architecture, but you can give it a try.
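
As a quick check (a sketch, assuming the container name from the command above), the selected backend can be read back from the server's startup log:

docker logs vllm-mistral 2>&1 | grep -i backend

It should print the selector.py line quoted above if flashinfer is in use.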

@NicolasDrapier
Author

Yes, I passed the environment variable VLLM_ATTENTION_BACKEND=FLASHINFER, as you can see in my command, and I can confirm that I get that log line. So I don't know.

@ShangmingCai
Contributor

> Yes, I passed the environment variable VLLM_ATTENTION_BACKEND=FLASHINFER, as you can see in my command, and I can confirm that I get that log line. So I don't know.

Try removing this environment variable; vLLM will then use FlashAttentionBackend by default. (To be clear, the flashinfer backend is not FlashAttentionBackend.)
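
Concretely, a sketch of the change in the docker run command above (FLASH_ATTN is assumed to be the value that selects FlashAttentionBackend in this vLLM version):

# before: forces the flashinfer backend
-e VLLM_ATTENTION_BACKEND=FLASHINFER \

# after: either drop this line entirely to get the default backend,
# or select flash-attn explicitly (assumed value name):
-e VLLM_ATTENTION_BACKEND=FLASH_ATTN \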

@NicolasDrapier
Author

I need the flashinfer backend together with --num-scheduler-steps > 1.

@NicolasDrapier
Author

Any news about this error?

@ShangmingCai
Contributor

> Any news about this error?

Maybe you should ask the contributor of this feature.
@SolitaryThinker Hello, sorry to bother you. Thank you for the great feature; is there any plan to support other backends?

@SolitaryThinker
Contributor

SolitaryThinker commented Sep 12, 2024

flashinfer+multi-step will be supported by this PR #7928

@SolitaryThinker
Contributor

The PR is merged now.
