[Bug]: vLLM v0.6.1 Instability issue under load. #8219

Closed · ashgold opened this issue Sep 6, 2024 · 23 comments
Labels: bug (Something isn't working)

ashgold commented Sep 6, 2024

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr  6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-25-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3

Nvidia driver version: 535.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   52 bits physical, 57 bits virtual
CPU(s):                          96
On-line CPU(s) list:             0-95
Thread(s) per core:              1
Core(s) per socket:              48
Socket(s):                       2
NUMA node(s):                    8
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           143
Model name:                      Intel(R) Xeon(R) Platinum 8468
Stepping:                        8
CPU MHz:                         2100.000
CPU max MHz:                     2100.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4200.00
L1d cache:                       4.5 MiB
L1i cache:                       3 MiB
L2 cache:                        192 MiB
L3 cache:                        210 MiB
NUMA node0 CPU(s):               0-11
NUMA node1 CPU(s):               12-23
NUMA node2 CPU(s):               24-35
NUMA node3 CPU(s):               36-47
NUMA node4 CPU(s):               48-59
NUMA node5 CPU(s):               60-71
NUMA node6 CPU(s):               72-83
NUMA node7 CPU(s):               84-95
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.0@
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-11    0               N/A
GPU1    NV18     X      NV18    NV18    SYS     PIX     PIX     PXB     SYS     SYS     SYS     SYS     24-35   2               N/A
GPU2    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB     72-83   6               N/A
GPU3    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     72-83   6               N/A
NIC0    PXB     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    SYS     PIX     SYS     SYS     SYS      X      PIX     PXB     SYS     SYS     SYS     SYS
NIC2    SYS     PIX     SYS     SYS     SYS     PIX      X      PXB     SYS     SYS     SYS     SYS
NIC3    SYS     PXB     SYS     SYS     SYS     PXB     PXB      X      SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     PXB     SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      PXB     SYS
NIC6    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB      X      SYS
NIC7    SYS     SYS     PXB     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7

🐛 Describe the bug

I ran a load test on vLLM v0.6.0 using our conversation-history data.
I have run the test about 3 times and hit this issue every time. I'm not certain the issue only appears after GPU cache usage reaches 100%, but so far it has always reproduced under load after GPU cache usage reached 100%.

ERROR 09-05 17:22:27 async_llm_engine.py:63] Engine background task failed
ERROR 09-05 17:22:27 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-05 17:22:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 09-05 17:22:27 async_llm_engine.py:63]     return_value = task.result()
ERROR 09-05 17:22:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
ERROR 09-05 17:22:27 async_llm_engine.py:63]     result = task.result()
ERROR 09-05 17:22:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
ERROR 09-05 17:22:27 async_llm_engine.py:63]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-05 17:22:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
ERROR 09-05 17:22:27 async_llm_engine.py:63]     output = await self.model_executor.execute_model_async(
ERROR 09-05 17:22:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
ERROR 09-05 17:22:27 async_llm_engine.py:63]     return await self._driver_execute_model_async(execute_model_req)
ERROR 09-05 17:22:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async
ERROR 09-05 17:22:27 async_llm_engine.py:63]     return await self.driver_exec_model(execute_model_req)
ERROR 09-05 17:22:27 async_llm_engine.py:63]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 09-05 17:22:27 async_llm_engine.py:63]     result = self.fn(*self.args, **self.kwargs)
ERROR 09-05 17:22:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 303, in execute_model
ERROR 09-05 17:22:27 async_llm_engine.py:63]     inputs = self.prepare_input(execute_model_req)
ERROR 09-05 17:22:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 291, in prepare_input
ERROR 09-05 17:22:27 async_llm_engine.py:63]     return self._get_driver_input_and_broadcast(execute_model_req)
ERROR 09-05 17:22:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast
ERROR 09-05 17:22:27 async_llm_engine.py:63]     self.model_runner.prepare_model_input(
ERROR 09-05 17:22:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1380, in prepare_model_input
ERROR 09-05 17:22:27 async_llm_engine.py:63]     model_input = self._prepare_model_input_tensors(
ERROR 09-05 17:22:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1042, in _prepare_model_input_tensors
ERROR 09-05 17:22:27 async_llm_engine.py:63]     return builder.build()  # type: ignore
ERROR 09-05 17:22:27 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 698, in build
ERROR 09-05 17:22:27 async_llm_engine.py:63]     max(inter_data.seq_lens))
ERROR 09-05 17:22:27 async_llm_engine.py:63] ValueError: max() arg is an empty sequence
ERROR:asyncio:Exception in callback functools.partial(<function _log_task_completion at 0x7f204842b640>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f20447a8970>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f204842b640>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f20447a8970>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 303, in execute_model
    inputs = self.prepare_input(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 291, in prepare_input
    return self._get_driver_input_and_broadcast(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast
    self.model_runner.prepare_model_input(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1380, in prepare_model_input
    model_input = self._prepare_model_input_tensors(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1042, in _prepare_model_input_tensors
    return builder.build()  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 698, in build
    max(inter_data.seq_lens))
ValueError: max() arg is an empty sequence

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 65, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.

(...)

ERROR 09-05 17:22:27 client.py:412] Traceback (most recent call last):
ERROR 09-05 17:22:27 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 09-05 17:22:27 client.py:412]     await self.check_health(socket=socket)
ERROR 09-05 17:22:27 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 429, in check_health
ERROR 09-05 17:22:27 client.py:412]     await self._send_one_way_rpc_request(
ERROR 09-05 17:22:27 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 267, in _send_one_way_rpc_request
ERROR 09-05 17:22:27 client.py:412]     raise response

(...)

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 257, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 253, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 230, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fb1397c3c10

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 250, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 257, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 253, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 230, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fb13b480a00


The average number of input prompt tokens in the conversation histories we used for testing was about 1,600, and the model's answers averaged about 150 tokens.

The following are the vLLM startup arguments.

    - args:
      - --model
      - /data/models/llama-65b-instruct/base
      - --tensor-parallel-size
      - "4"
      - --load-format
      - "auto"
      - --block-size
      - "32"
      - --max-seq-len-to-capture
      - "8192"
      - --max-model-len
      - "8192"
      - --disable-log-requests
      - --uvicorn-log-level
      - "warning"
      - --gpu-memory-utilization
      - "0.95"

This error did not occur in versions prior to v0.5.5. (I ran the load test 3 times with exactly the same arguments.)

During the load test, I also got the warning message below about 5 minutes before the engine died. Could this be related to the issue?

WARNING 09-05 17:10:02 scheduler.py:1355] Sequence group cmpl-0c49e59124fe4f9c8b8e6e0f4bae49d7-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1

ashgold added the bug label on Sep 6, 2024
ashgold (Author) commented Sep 6, 2024

I changed some arguments and ran the load test again.

    - args:
      - --model
      - /data/models/llama-65b-instruct/base
      - --tensor-parallel-size
      - "4"
      - --load-format
      - "auto"
      - --block-size
      - "16"
      - --max-model-len
      - "8192"
      - --disable-log-requests
      - --uvicorn-log-level
      - "warning"
      - --gpu-memory-utilization
      - "0.9"

The engine still dies, but the error message is different.

INFO 09-05 18:45:49 metrics.py:351] Avg prompt throughput: 4321.6 tokens/s, Avg generation throughput: 434.6 tokens/s, Running: 28 reqs, Swapped: 0 reqs, Pending: 71 reqs, GPU KV cache usage: 98.5%, CPU KV cache usage: 0.0%.
ERROR 09-05 18:45:50 async_llm_engine.py:63] Engine background task failed
ERROR 09-05 18:45:50 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-05 18:45:50 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 09-05 18:45:50 async_llm_engine.py:63]     return_value = task.result()
ERROR 09-05 18:45:50 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
ERROR 09-05 18:45:50 async_llm_engine.py:63]     result = task.result()
ERROR 09-05 18:45:50 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
ERROR 09-05 18:45:50 async_llm_engine.py:63]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-05 18:45:50 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 296, in step_async
ERROR 09-05 18:45:50 async_llm_engine.py:63]     ) = self.scheduler[virtual_engine].schedule()
ERROR 09-05 18:45:50 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 1109, in schedule
ERROR 09-05 18:45:50 async_llm_engine.py:63]     scheduler_outputs = self._schedule()
ERROR 09-05 18:45:50 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 1074, in _schedule
ERROR 09-05 18:45:50 async_llm_engine.py:63]     return self._schedule_default()
ERROR 09-05 18:45:50 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 921, in _schedule_default
ERROR 09-05 18:45:50 async_llm_engine.py:63]     running_scheduled = self._schedule_running(budget,
ERROR 09-05 18:45:50 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 551, in _schedule_running
ERROR 09-05 18:45:50 async_llm_engine.py:63]     num_running_tokens = self._get_num_new_tokens(
ERROR 09-05 18:45:50 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 1462, in _get_num_new_tokens
ERROR 09-05 18:45:50 async_llm_engine.py:63]     assert num_new_tokens > 0
ERROR 09-05 18:45:50 async_llm_engine.py:63] AssertionError
ERROR:asyncio:Exception in callback functools.partial(<function _log_task_completion at 0x7f2eaa977640>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f2ea6b20b50>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f2eaa977640>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f2ea6b20b50>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 296, in step_async
    ) = self.scheduler[virtual_engine].schedule()
  File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 1109, in schedule
    scheduler_outputs = self._schedule()
  File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 1074, in _schedule
    return self._schedule_default()
  File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 921, in _schedule_default
    running_scheduled = self._schedule_running(budget,
  File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 551, in _schedule_running
    num_running_tokens = self._get_num_new_tokens(
  File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 1462, in _get_num_new_tokens
    assert num_new_tokens > 0
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 65, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.

ashgold changed the title from "[Bug]: vLLM v0.6.0, "ValueError: max() arg is an empty sequence" under load" to "[Bug]: vLLM v0.6.0 Instability issues. "ValueError: max() arg is an empty sequence" under load." on Sep 6, 2024
ashgold changed the title from "[Bug]: vLLM v0.6.0 Instability issues. "ValueError: max() arg is an empty sequence" under load." to "[Bug]: vLLM v0.6.0 Instability issue. "ValueError: max() arg is an empty sequence" under load." on Sep 6, 2024
br3no (Contributor) commented Sep 6, 2024

I have observed the same issue while load testing 0.6.0.

I have also observed the error when GPU KV cache usage was close to 100%. I'm not sure there is a causal relation, though.

There is no graceful degradation from this issue; vLLM needs to be restarted.

br3no (Contributor) commented Sep 6, 2024

I have now observed another exception:

Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63] Engine background task failed
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63] Traceback (most recent call last):
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]     return_value = task.result()
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]     result = task.result()
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]     request_outputs = await self.engine.step_async(virtual_engine)
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 296, in step_async
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]     ) = self.scheduler[virtual_engine].schedule()
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 1109, in schedule
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]     scheduler_outputs = self._schedule()
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 1074, in _schedule
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]     return self._schedule_default()
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 921, in _schedule_default
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]     running_scheduled = self._schedule_running(budget,
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 551, in _schedule_running
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]     num_running_tokens = self._get_num_new_tokens(
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 1462, in _get_num_new_tokens
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63]     assert num_new_tokens > 0
Sep 06 10:53:54 hal9000 docker[931748]: ERROR 09-06 01:53:54 async_llm_engine.py:63] AssertionError
Sep 06 10:53:54 hal9000 docker[931748]: Exception in callback functools.partial(<function _log_task_completion at 0x7fdccb8aed40>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fdcc81c7bb0>>)
Sep 06 10:53:54 hal9000 docker[931748]: handle: <Handle functools.partial(<function _log_task_completion at 0x7fdccb8aed40>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fdcc81c7bb0>>)>
Sep 06 10:53:54 hal9000 docker[931748]: Traceback (most recent call last):
Sep 06 10:53:54 hal9000 docker[931748]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
Sep 06 10:53:54 hal9000 docker[931748]:     return_value = task.result()
Sep 06 10:53:54 hal9000 docker[931748]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
Sep 06 10:53:54 hal9000 docker[931748]:     result = task.result()
Sep 06 10:53:54 hal9000 docker[931748]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
Sep 06 10:53:54 hal9000 docker[931748]:     request_outputs = await self.engine.step_async(virtual_engine)
Sep 06 10:53:54 hal9000 docker[931748]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 296, in step_async
Sep 06 10:53:54 hal9000 docker[931748]:     ) = self.scheduler[virtual_engine].schedule()
Sep 06 10:53:54 hal9000 docker[931748]:   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 1109, in schedule
Sep 06 10:53:54 hal9000 docker[931748]:     scheduler_outputs = self._schedule()
Sep 06 10:53:54 hal9000 docker[931748]:   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 1074, in _schedule
Sep 06 10:53:54 hal9000 docker[931748]:     return self._schedule_default()
Sep 06 10:53:54 hal9000 docker[931748]:   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 921, in _schedule_default
Sep 06 10:53:54 hal9000 docker[931748]:     running_scheduled = self._schedule_running(budget,
Sep 06 10:53:54 hal9000 docker[931748]:   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 551, in _schedule_running
Sep 06 10:53:54 hal9000 docker[931748]:     num_running_tokens = self._get_num_new_tokens(
Sep 06 10:53:54 hal9000 docker[931748]:   File "/usr/local/lib/python3.10/dist-packages/vllm/core/scheduler.py", line 1462, in _get_num_new_tokens
Sep 06 10:53:54 hal9000 docker[931748]:     assert num_new_tokens > 0
Sep 06 10:53:54 hal9000 docker[931748]: AssertionError
Sep 06 10:53:54 hal9000 docker[931748]: The above exception was the direct cause of the following exception:
Sep 06 10:53:54 hal9000 docker[931748]: Traceback (most recent call last):
Sep 06 10:53:54 hal9000 docker[931748]:   File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
Sep 06 10:53:54 hal9000 docker[931748]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 65, in _log_task_completion
Sep 06 10:53:54 hal9000 docker[931748]:     raise AsyncEngineDeadError(
Sep 06 10:53:54 hal9000 docker[931748]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.

Both exceptions have in common that they expect content to be present where there is none. It seems that some sequences get lost under high load.
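
Illustrative only, not vLLM code: both crashes reduce to the same violated precondition, namely a sequence group that the scheduler still tracks ending up with nothing left to schedule.

```python
# Illustrative only; this is not vLLM code. It just shows the two failure
# shapes seen above when a tracked sequence group has nothing left to process.
seq_lens: list[int] = []        # inter_data.seq_lens came back empty
try:
    max(seq_lens)
except ValueError as e:
    print(e)                    # "max() arg is an empty sequence"

num_new_tokens = 0              # _get_num_new_tokens computed zero new tokens
try:
    assert num_new_tokens > 0
except AssertionError:
    print("AssertionError: zero new tokens for a running sequence group")
```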

br3no (Contributor) commented Sep 6, 2024

Also, adding --disable-frontend-multiprocessing does not work around this issue.

br3no (Contributor) commented Sep 6, 2024

Neither does increasing VLLM_RPC_GET_DATA_TIMEOUT_MS or VLLM_ENGINE_ITERATION_TIMEOUT_S.
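
For concreteness, this is the kind of change that was tried; the values below are arbitrary examples (a hypothetical sketch, not a recommended configuration), and as noted they did not avoid the crash.

```python
# Hypothetical sketch of the timeout bump described above; the values are
# arbitrary. The variables must be set in the environment of the vLLM server
# process before it starts (shown here via os.environ for illustration).
import os

os.environ["VLLM_RPC_GET_DATA_TIMEOUT_MS"] = "30000"
os.environ["VLLM_ENGINE_ITERATION_TIMEOUT_S"] = "120"
```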

youkaichao (Member) commented:
cc @SolitaryThinker is it caused by multi-step?

SolitaryThinker (Contributor) commented:
Let me try to reproduce on my end and take a look. Meanwhile, @ashgold, @br3no, could you please try --disable-async-output-proc and see if that changes anything?

SolitaryThinker (Contributor) commented:
I don't think I can reproduce it, but it's probably not caused by multi-step, as that is disabled by default.

okwinds commented Sep 7, 2024

I hit the same exception under load (running a 32B W4A16 model on an RTX 4090).

--dtype auto
--gpu-memory-utilization 0.95
--block-size 16
--max-model-len 5200
--kv-cache-dtype auto
--max-num-batched-tokens 5200
--max-seq-len-to-capture 5200
--enable-prefix-caching
--use-v2-block-manager
--num-lookahead-slots 1
--gpu-memory-utilization 0.95
--max-num-seqs 30
--num-scheduler-steps 2
--scheduler-delay-factor 0.3

vLLM version 0.6.0

CRITICAL 09-07 01:49:38 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: 127.0.0.1:50572 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [3071562]
INFO 09-07 01:49:38 server.py:228] vLLM ZMQ RPC Server was interrupted.
ERROR:asyncio:Future exception was never retrieved
future:
Traceback (most recent call last):
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
return_value = task.result()
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
result = task.result()
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
request_outputs = await self.engine.step_async(virtual_engine)
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 296, in step_async
) = self.scheduler[virtual_engine].schedule()
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1109, in schedule
scheduler_outputs = self._schedule()
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1074, in _schedule
return self._schedule_default()
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/core/scheduler.py", line 921, in _schedule_default
running_scheduled = self._schedule_running(budget,
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/core/scheduler.py", line 551, in _schedule_running
num_running_tokens = self._get_num_new_tokens(
File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1462, in _get_num_new_tokens
assert num_new_tokens > 0
AssertionError

alpayariyak (Contributor) commented Sep 7, 2024

Facing the same issue. It used to handle running out of KV cache space gracefully:
WARNING 08-28 10:13:51 scheduler.py:1242] Sequence group chat-be1eb78b9aa6437ea54a4490617b7286 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=4251

Now it throws the `assert num_new_tokens > 0` AssertionError as soon as this happens even once.

alpayariyak (Contributor) commented:
Update: --disable-async-output-proc does prevent this
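
For anyone applying the same workaround through the Python API, here is a hedged sketch; the keyword name is assumed to mirror the --disable-async-output-proc CLI flag, and the model path and tensor-parallel size are just the values from this thread.

```python
# Hedged workaround sketch: disable async output processing via an engine
# argument assumed to mirror the --disable-async-output-proc CLI flag.
from vllm import LLM

llm = LLM(
    model="/data/models/llama-65b-instruct/base",
    tensor_parallel_size=4,
    disable_async_output_proc=True,
)
```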

SolitaryThinker (Contributor) commented:
cc @alexm-neuralmagic @megha95

alexm-neuralmagic (Collaborator) commented:
I'm working on a fix; it should be ready soon.

alexm-neuralmagic (Collaborator) commented Sep 7, 2024

@ashgold here is the PR to fix it: #8267. It should be merged soon. Thanks for pointing this out; it was a real race condition.

CC: @robertgshaw2-neuralmagic

ashgold (Author) commented Sep 7, 2024

> @ashgold here is the PR to fix it: #8267. It should be merged soon. Thanks for pointing this out; it was a real race condition.
>
> CC: @robertgshaw2-neuralmagic

Thank you for the quick fix! There's also an issue related to metrics in v0.6.0. I'd appreciate it if you could take a look at #8178!

ashgold (Author) commented Sep 12, 2024

@alexm-neuralmagic @youkaichao @SolitaryThinker @robertgshaw2-neuralmagic
I ran a load test on v0.6.1 today, and the engine STILL DIES.

    - args:
      - --model
      - /data/models/llama-65b-instruct/base
      - --tensor-parallel-size
      - "4"
      - --load-format
      - "auto"
      - --block-size
      - "32"
      - --max-seq-len-to-capture
      - "8192"
      - --max-model-len
      - "8192"
      - --disable-log-requests
      - --uvicorn-log-level
      - "warning"
      - --gpu-memory-utilization
      - "0.95"
      env:
      - name: VLLM_WORKER_MULTIPROC_METHOD
        value: "spawn"
      image: aspcr01-queffmyz.scr.skr-west.scp-in.com/serving/vllm:v0.6.1

Now the timing of the error is a little different.
As soon as the warning message below is printed, the engine dies.

**Sequence group cmpl-71c6c2702c07450bb38a3a497140d0eb-0 is preempted in PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can impact end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1**
ERROR 09-11 17:21:30 async_llm_engine.py:63] Engine background task failed
ERROR 09-11 17:21:30 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 09-11 17:21:30 async_llm_engine.py:63]     return_value = task.result()
ERROR 09-11 17:21:30 async_llm_engine.py:63]                    ^^^^^^^^^^^^^
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
ERROR 09-11 17:21:30 async_llm_engine.py:63]     result = task.result()
ERROR 09-11 17:21:30 async_llm_engine.py:63]              ^^^^^^^^^^^^^
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
ERROR 09-11 17:21:30 async_llm_engine.py:63]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-11 17:21:30 async_llm_engine.py:63]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
ERROR 09-11 17:21:30 async_llm_engine.py:63]     outputs = await self.model_executor.execute_model_async(
ERROR 09-11 17:21:30 async_llm_engine.py:63]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
ERROR 09-11 17:21:30 async_llm_engine.py:63]     return await self._driver_execute_model_async(execute_model_req)
ERROR 09-11 17:21:30 async_llm_engine.py:63]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async
ERROR 09-11 17:21:30 async_llm_engine.py:63]     return await self.driver_exec_model(execute_model_req)
ERROR 09-11 17:21:30 async_llm_engine.py:63]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
ERROR 09-11 17:21:30 async_llm_engine.py:63]     result = self.fn(*self.args, **self.kwargs)
ERROR 09-11 17:21:30 async_llm_engine.py:63]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-11 17:21:30 async_llm_engine.py:63]     output = self.model_runner.execute_model(
ERROR 09-11 17:21:30 async_llm_engine.py:63]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-11 17:21:30 async_llm_engine.py:63]     return func(*args, **kwargs)
ERROR 09-11 17:21:30 async_llm_engine.py:63]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1583, in execute_model
ERROR 09-11 17:21:30 async_llm_engine.py:63]     model_input.async_callback()
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1438, in _process_model_outputs
ERROR 09-11 17:21:30 async_llm_engine.py:63]     self.do_log_stats(scheduler_outputs, outputs, finished_before)
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1748, in do_log_stats
ERROR 09-11 17:21:30 async_llm_engine.py:63]     stats = self._get_stats(scheduler_outputs, model_output,
ERROR 09-11 17:21:30 async_llm_engine.py:63]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1860, in _get_stats
ERROR 09-11 17:21:30 async_llm_engine.py:63]     latency = seq_group.get_last_latency(now)
ERROR 09-11 17:21:30 async_llm_engine.py:63]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 17:21:30 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/sequence.py", line 686, in get_last_latency
ERROR 09-11 17:21:30 async_llm_engine.py:63]     raise ValueError(
ERROR 09-11 17:21:30 async_llm_engine.py:63] ValueError: seq_group.get_last_latency() should not be called if the seq_group is in prefill phase.
ERROR:asyncio:Exception in callback functools.partial(<function _log_task_completion at 0x7f92380be5c0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9234571dc0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f92380be5c0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9234571dc0>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
    outputs = await self.model_executor.execute_model_async(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1583, in execute_model
    model_input.async_callback()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1438, in _process_model_outputs
    self.do_log_stats(scheduler_outputs, outputs, finished_before)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1748, in do_log_stats
    stats = self._get_stats(scheduler_outputs, model_output,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1860, in _get_stats
    latency = seq_group.get_last_latency(now)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/sequence.py", line 686, in get_last_latency
    raise ValueError(
ValueError: seq_group.get_last_latency() should not be called if the seq_group is in prefill phase.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 65, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
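To make the failure mode concrete: with async output processing enabled, the stats callback can run while a sequence group is still in its prefill phase, and get_last_latency() raises in that state. Below is a minimal, self-contained sketch of that behavior and of the kind of guard that avoids the crash; SeqGroupStub is a stand-in for vllm.sequence.SequenceGroup, not actual vLLM code, and the real fix in #8417 may take a different approach.

import time

class SeqGroupStub:
    """Stand-in for vllm.sequence.SequenceGroup, reduced to the fields relevant here."""

    def __init__(self, in_prefill: bool, last_token_time: float):
        self._in_prefill = in_prefill
        self._last_token_time = last_token_time

    def is_prefill(self) -> bool:
        return self._in_prefill

    def get_last_latency(self, now: float) -> float:
        # Mirrors the ValueError raised in vllm/sequence.py in the traceback above.
        if self.is_prefill():
            raise ValueError(
                "seq_group.get_last_latency() should not be called "
                "if the seq_group is in prefill phase.")
        return now - self._last_token_time

def collect_decode_latencies(seq_groups):
    # Guarded stats collection: skip groups still in prefill instead of crashing.
    now = time.time()
    return [g.get_last_latency(now) for g in seq_groups if not g.is_prefill()]

groups = [SeqGroupStub(True, 0.0), SeqGroupStub(False, time.time() - 0.05)]
print(collect_decode_latencies(groups))  # one latency; the prefill group is skipped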

@ashgold ashgold changed the title [Bug]: vLLM v0.6.0 Instability issue. "ValueError: max() arg is an empty sequence" under load. [Bug]: vLLM v0.6.1 Instability issue under load. Sep 12, 2024
alexm-neuralmagic added a commit to neuralmagic/vllm that referenced this issue Sep 12, 2024
@alexm-neuralmagic
Collaborator

alexm-neuralmagic commented Sep 12, 2024

@ashgold thanks for pointing this out, here is a PR to fix it: #8417
I also expanded the preemption test so it actually does the log stats (previously this was disabled).

@ashgold
Author

ashgold commented Sep 12, 2024

@ashgold thanks for pointing this out, here is a PR to fix it: #8417
I also expanded the preemption test so it actually does the log stats (previously this was disabled).

LGTM!

Can a hotfix for v0.6.1 come out right away? If not, when will the next release be?

@okwinds

okwinds commented Sep 12, 2024

@ashgold thanks for pointing this out, here is a PR to fix it: #8417
I also expanded the preemption test so it actually does the log stats (previously this was disabled).

I ran a load test on v0.6.1 today, and the engine crashed.
(running a 32B-W4A16 model)

vLLM version 0.6.1
args:
--gpu-memory-utilization 0.95
--max-model-len 5200
--kv-cache-dtype auto
--max-num-batched-tokens 5200
--max-seq-len-to-capture 5200
--enable-prefix-caching
--use-v2-block-manager
--num-lookahead-slots 1
--max-num-seqs 40
--num-scheduler-steps 4
--scheduler-delay-factor 0.4


ERROR 09-12 22:01:20 async_llm_engine.py:63] Engine background task failed
ERROR 09-12 22:01:20 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 09-12 22:01:20 async_llm_engine.py:63]     return_value = task.result()
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
ERROR 09-12 22:01:20 async_llm_engine.py:63]     result = task.result()
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
ERROR 09-12 22:01:20 async_llm_engine.py:63]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
ERROR 09-12 22:01:20 async_llm_engine.py:63]     outputs = await self.model_executor.execute_model_async(
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 185, in execute_model_async
ERROR 09-12 22:01:20 async_llm_engine.py:63]     output = await make_async(self.driver_worker.execute_model
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 09-12 22:01:20 async_llm_engine.py:63]     result = self.fn(*self.args, **self.kwargs)
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-12 22:01:20 async_llm_engine.py:63]     output = self.model_runner.execute_model(
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-12 22:01:20 async_llm_engine.py:63]     return func(*args, **kwargs)
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/multi_step_model_runner.py", line 409, in execute_model
ERROR 09-12 22:01:20 async_llm_engine.py:63]     output = self._base_model_runner.execute_model(frozen_model_input,
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-12 22:01:20 async_llm_engine.py:63]     return func(*args, **kwargs)
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1583, in execute_model
ERROR 09-12 22:01:20 async_llm_engine.py:63]     model_input.async_callback()
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/multi_step_model_runner.py", line 267, in _async_process_outputs
ERROR 09-12 22:01:20 async_llm_engine.py:63]     output_proc_callback()
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1438, in _process_model_outputs
ERROR 09-12 22:01:20 async_llm_engine.py:63]     self.do_log_stats(scheduler_outputs, outputs, finished_before)
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1748, in do_log_stats
ERROR 09-12 22:01:20 async_llm_engine.py:63]     stats = self._get_stats(scheduler_outputs, model_output,
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1860, in _get_stats
ERROR 09-12 22:01:20 async_llm_engine.py:63]     latency = seq_group.get_last_latency(now)
ERROR 09-12 22:01:20 async_llm_engine.py:63]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/sequence.py", line 686, in get_last_latency
ERROR 09-12 22:01:20 async_llm_engine.py:63]     raise ValueError(
ERROR 09-12 22:01:20 async_llm_engine.py:63] ValueError: seq_group.get_last_latency() should not be called if the seq_group is in prefill phase.
ERROR 09-12 22:01:20 client.py:266] Got Unhealthy response from RPC Server
ERROR 09-12 22:01:20 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 09-12 22:01:20 client.py:412] Traceback (most recent call last):
ERROR 09-12 22:01:20 client.py:412]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 09-12 22:01:20 client.py:412]     await self.check_health(socket=socket)
ERROR 09-12 22:01:20 client.py:412]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 429, in check_health
ERROR 09-12 22:01:20 client.py:412]     await self._send_one_way_rpc_request(
ERROR 09-12 22:01:20 client.py:412]   File "/home/gavin/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 267, in _send_one_way_rpc_request
ERROR 09-12 22:01:20 client.py:412]     raise response
ERROR 09-12 22:01:20 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
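For reference, the same configuration can be expressed through vLLM's Python API. The sketch below assumes the AsyncEngineArgs field names mirror the CLI flags one-to-one and uses a placeholder model path, since the exact 32B-W4A16 checkpoint is not named here:

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Placeholder checkpoint path; the thread only says "32B-W4A16".
MODEL = "path/to/32B-W4A16-checkpoint"

engine_args = AsyncEngineArgs(
    model=MODEL,
    gpu_memory_utilization=0.95,
    max_model_len=5200,
    kv_cache_dtype="auto",
    max_num_batched_tokens=5200,
    max_seq_len_to_capture=5200,
    enable_prefix_caching=True,
    use_v2_block_manager=True,
    num_lookahead_slots=1,
    max_num_seqs=40,
    num_scheduler_steps=4,
    scheduler_delay_factor=0.4,
)

# Building the engine this way reproduces the multi-step setup
# (num_scheduler_steps > 1) under which the crash above was observed.
engine = AsyncLLMEngine.from_engine_args(engine_args)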

@alexm-neuralmagic
Collaborator

@okwinds #8417 should fix it

@alexm-neuralmagic
Collaborator

@ashgold @okwinds a temporary workaround is to disable the async output processor by adding the flag --disable-async-output-proc
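A minimal sketch of applying that workaround through the Python API, assuming the CLI flag maps to a disable_async_output_proc engine argument (the model path is a placeholder):

from vllm.engine.arg_utils import AsyncEngineArgs

args_with_workaround = AsyncEngineArgs(
    model="path/to/32B-W4A16-checkpoint",  # placeholder checkpoint path
    num_scheduler_steps=4,                 # multi-step scheduling still enabled
    disable_async_output_proc=True,        # counterpart of --disable-async-output-proc
)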

@ashgold
Author

ashgold commented Sep 13, 2024

The issue disappeared in v0.6.1.post1.

Thank you guys!!
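One quick way to confirm that the patched release is actually the one being served (assuming the standard vllm package layout):

import vllm

# Expect "0.6.1.post1" (or later) if the patched release is installed.
print(vllm.__version__)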

@ashgold ashgold closed this as completed Sep 13, 2024
@alexm-neuralmagic
Collaborator

@ashgold Cool!
