Automatic Prefix Caching Bug #3193

Closed · 78 opened this issue Mar 5, 2024 · 2 comments · Fixed by #3239

78 commented Mar 5, 2024

If I enable automatic prefix caching, the server occasionally crashes with the following error:

Future exception was never retrieved
future: <Future finished exception=RuntimeError('step must be nonzero')>
Traceback (most recent call last):
File "/root/vllm/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
task.result()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 412, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 391, in engine_step
    request_outputs = await self.engine.step_async()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 189, in step_async
    all_outputs = await self._run_workers_async(
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 274, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
  File "/root/miniconda3/envs/vllm/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/vllm/vllm/worker/worker.py", line 223, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/vllm/vllm/worker/model_runner.py", line 575, in execute_model
    lora_mapping) = self.prepare_input_tensors(seq_group_metadata_list)
  File "/root/vllm/vllm/worker/model_runner.py", line 494, in prepare_input_tensors
    lora_requests) = self._prepare_prompt(seq_group_metadata_list)
File "/root/vllm/vllm/worker/model_runner.py", line 243, in _prepare_prompt
start_loc_tensor = torch.arange(0,
RuntimeError: step must be nonzero

Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f87f65c35b0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f87ec4e3fd0>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f87f65c35b0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f87ec4e3fd0>)>
Traceback (most recent call last):
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
    task.result()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 412, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 391, in engine_step
    request_outputs = await self.engine.step_async()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 189, in step_async
    all_outputs = await self._run_workers_async(
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 274, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
  File "/root/miniconda3/envs/vllm/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.RayTaskError(KeyError): ray::RayWorkerVllm.execute_method() (pid=1030270, ip=0.0.0.0, actor_id=be1ed7b0fca5fd6227e71c0101000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f5f2f9ad630>)
  File "/root/vllm/vllm/engine/ray_utils.py", line 37, in execute_method
    return executor(*args, **kwargs)
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/vllm/vllm/worker/worker.py", line 212, in execute_model
    num_seq_groups = data["num_seq_groups"]
KeyError: 'num_seq_groups'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    raise exc
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 03-04 20:37:48 async_llm_engine.py:133] Aborted request cmpl-7edf10b340a74b3e8c7c2e07325ae5c6.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 264, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 260, in wrap
    await func()
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 237, in listen_for_disconnect
    message = await receive()
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 580, in receive
    await self.message_event.wait()
  File "/root/miniconda3/envs/vllm/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f87bc0d52d0

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
  |     return await self.app(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
  |     raise exc
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
  |     await self.app(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
  |     raise exc
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 758, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 778, in app
  |     await route.handle(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 299, in handle
  |     await self.app(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 79, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
  |     raise exc
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
  |     await response(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 257, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
  |     raise BaseExceptionGroup(
  | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
    |     task.result()
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 412, in run_engine_loop
    |     has_requests_in_progress = await self.engine_step()
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 391, in engine_step
    |     request_outputs = await self.engine.step_async()
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 189, in step_async
    |     all_outputs = await self._run_workers_async(
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 274, in _run_workers_async
    |     all_outputs = await asyncio.gather(*coros)
    |   File "/root/miniconda3/envs/vllm/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    |     return (yield from awaitable.__await__())
    | ray.exceptions.RayTaskError(KeyError): ray::RayWorkerVllm.execute_method() (pid=1030270, ip=0.0.0.0, actor_id=be1ed7b0fca5fd6227e71c0101000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f5f2f9ad630>)
    |   File "/root/vllm/vllm/engine/ray_utils.py", line 37, in execute_method
    |     return executor(*args, **kwargs)
    |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    |     return func(*args, **kwargs)
    |   File "/root/vllm/vllm/worker/worker.py", line 212, in execute_model
    |     num_seq_groups = data["num_seq_groups"]
    | KeyError: 'num_seq_groups'
    |
    | The above exception was the direct cause of the following exception:
    |
    | Traceback (most recent call last):
    |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 260, in wrap
    |     await func()
    |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 249, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/root/vllm/vllm/entrypoints/openai/serving_chat.py", line 148, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 565, in generate
    |     raise e
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 559, in generate
    |     async for request_output in stream:
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 69, in __anext__
    |     raise result
    |   File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    |     raise exc
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
    |     raise AsyncEngineDeadError(
    | vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
    +------------------------------------

vLLM: main branch
Model: openbuddy-deepseek-67b-v18.1-4k-gptq (Marlin Kernel)
GPU: 4 x RTX3090
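
For context on the traceback: torch.arange raises exactly this "step must be nonzero" error whenever its step argument is 0. The snippet below is a minimal sketch of that failure mode, not the vLLM code path itself; max_prompt_len is a hypothetical stand-in for whatever length _prepare_prompt computes, which presumably collapses to zero when the cached prefix already covers the whole prompt.

import torch

# Hypothetical values standing in for what _prepare_prompt sees for one batch.
# Assumption: with automatic prefix caching, the remaining prompt length can
# end up as 0 when every prompt token is already covered by the cache.
num_prompts = 4
max_prompt_len = 0

try:
    # torch.arange(start, end, step) requires a nonzero step.
    start_loc = torch.arange(0, num_prompts * max_prompt_len, max_prompt_len)
except RuntimeError as err:
    print(err)  # prints: step must be nonzero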

ywang96 (Member) commented Mar 5, 2024

I can confirm that similar issues happen for me as well when automatic prefix caching is enabled.

Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f19b986c0d0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f19af5db4f0>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f19b986c0d0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f19af5db4f0>)>
Traceback (most recent call last):
  File "/workspace/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
    task.result()
  File "/workspace/vllm/engine/async_llm_engine.py", line 412, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/workspace/vllm/engine/async_llm_engine.py", line 391, in engine_step
    request_outputs = await self.engine.step_async()
  File "/workspace/vllm/engine/async_llm_engine.py", line 189, in step_async
    all_outputs = await self._run_workers_async(
  File "/workspace/vllm/engine/async_llm_engine.py", line 274, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/worker.py", line 223, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/model_runner.py", line 575, in execute_model
    lora_mapping) = self.prepare_input_tensors(seq_group_metadata_list)
File "/workspace/vllm/worker/model_runner.py", line 494, in prepare_input_tensors
INFO 03-05 09:16:34 async_llm_engine.py:133] Aborted request cmpl-49aa25f0dba24ec7b00d8ae6a0a102ad.
    lora_requests) = self._prepare_prompt(seq_group_metadata_list)
  File "/workspace/vllm/worker/model_runner.py", line 243, in _prepare_prompt
    start_loc_tensor = torch.arange(0,
RuntimeError: step must be nonzero

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/workspace/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    raise exc
  File "/workspace/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.

Model: mistralai/Mixtral-8x7B-Instruct-v0.1, 2x A100-80G, CUDA graphs enabled.
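
For anyone trying to reproduce either report, the common factor is automatic prefix caching being switched on. Below is a minimal repro sketch using the offline LLM entrypoint, assuming it exercises the same _prepare_prompt path as the API server and that the enable_prefix_caching engine argument matches the --enable-prefix-caching server flag; the model name and tensor parallel size are copied from the report above, and the prompts are illustrative placeholders.

from vllm import LLM, SamplingParams

# Assumed repro setup: automatic prefix caching enabled, two prompts that
# share a long common prefix so the cache actually gets used.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    enable_prefix_caching=True,
)
outputs = llm.generate(
    [
        "You are a helpful assistant. Question: what is automatic prefix caching?",
        "You are a helpful assistant. Question: why does torch.arange need a nonzero step?",
    ],
    SamplingParams(max_tokens=16),
)
for out in outputs:
    print(out.outputs[0].text)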

robertgshaw2-redhat (Collaborator) commented

Note: @78 wrote:

Model: openbuddy-deepseek-67b-v18.1-4k-gptq (Marlin Kernel)

This model is not using the Marlin kernel.

@SageMoore is going to take a look.

ElizaWszola added a commit to neuralmagic/nm-vllm that referenced this issue Mar 6, 2024