FEAT: Xavier: Share KV cache between VLLM replicas #2732
Conversation
LGTM
A Corner Case:
When transferring blocks, a block may be evicted or replaced by a new one mid-transfer. It's better to use a block hash during transfers: if the block has been evicted or the hash does not match, we can simply handle it as a cache miss.
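A minimal sketch of that validation idea (hypothetical names, not the actual Xavier/vllm API): the receiver checks the recorded content hash before accepting a transferred block and degrades to a cache miss on eviction or mismatch.

```python
# Hypothetical sketch: validate a block's content hash during transfer and
# treat eviction or hash mismatch as a cache miss. Names are illustrative,
# not the actual Xavier/vllm API.
import hashlib
from dataclasses import dataclass
from typing import Optional

def block_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

@dataclass
class Block:
    hash: str   # content hash recorded when the block was cached
    data: bytes

def fetch_block(store: dict[int, Block], block_id: int, expected_hash: str) -> Optional[bytes]:
    """Return the block's data, or None (cache miss) if it was evicted or replaced."""
    block = store.get(block_id)
    if block is None:                # evicted between lookup and transfer
        return None
    if block.hash != expected_hash:  # replaced by a new block -> hash mismatch
        return None
    return block.data
```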
How can we reproduce the corner case?
We can add mock logic to reproduce it. For example, call evict or modify (to simulate block replacement) on the model's block while the block is being queried; see the sketch below.
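Continuing the hypothetical sketch above, the mock could replace or evict the block between recording its hash and fetching it, then assert the fetch degrades to a cache miss.

```python
# Hypothetical mock of the race: the block is replaced or evicted while a
# transfer is in flight, so the fetch must degrade to a cache miss (None).
store = {7: Block(hash=block_hash(b"old"), data=b"old")}
expected = store[7].hash

store[7] = Block(hash=block_hash(b"new"), data=b"new")  # simulate replacement
assert fetch_block(store, 7, expected) is None          # handled as cache miss

del store[7]                                            # simulate eviction
assert fetch_block(store, 7, expected) is None          # handled as cache miss
```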
OK, how about opening a new issue to track this?
Let me open an issue.
Xavier: Share KV cache between VLLM replicas
Naming
The name is derived from Professor X (Charles Francis Xavier) in the Marvel Comics X-Men series. The project name starts with "X," and like Professor X, who possesses a powerful mind that controls information, the name is a metaphor for the project's control over data scheduling in vllm.
Purpose
When running vllm with multiple replicas, long prompts can incur a lengthy prefill. If another replica has already computed the corresponding results, they can be transferred and reused directly instead of being recomputed.
Usage
Simply add the parameter `enable_xavier=True` when starting the vllm model (see the sketch below).
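For example, launching through the xinference Python client might look like the following. This is a sketch: it assumes `enable_xavier=True` is forwarded to the vllm engine as an extra kwarg of `launch_model`, and the endpoint and model name are placeholders.

```python
# Sketch: launch a vllm-backed model with Xavier enabled via the xinference
# Python client. Assumes enable_xavier=True is passed through to the vllm
# engine; endpoint and model name are placeholders.
from xinference.client import Client

client = Client("http://192.168.xx.xx:9997")
model_uid = client.launch_model(
    model_name="qwen2.5-instruct",
    model_engine="vllm",
    replica=2,            # the KV cache is shared between these replicas
    enable_xavier=True,   # the parameter introduced by this PR
)
```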
Test
Use this script to generate a long prompt for the LLM (about 9k+ prompt tokens):
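A hypothetical stand-in for such a script (the paragraph text and question strings are placeholders): both queries share the same long prefix, so the second one can reuse the first one's KV cache.

```python
# Hypothetical stand-in for the prompt-generation script: build a long prompt
# (~9k+ tokens, tokenizer-dependent) by repeating a paragraph, plus two
# questions so both queries share the same long prefix.
PARAGRAPH = (
    "Xavier shares the KV cache between vllm replicas so that a prefill "
    "computed by one replica can be transferred to and reused by another. "
)
LONG_PROMPT = PARAGRAPH * 400  # roughly 9k+ prompt tokens
q1 = "\nQuestion 1: Summarize the text above in one sentence."
q2 = "\nQuestion 2: List the key ideas of the text above."
```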
Use `LONG_PROMPT+q1` and `LONG_PROMPT+q2` as prompts to interact with the model, one query for each (a timing sketch follows below).
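A sketch of timing the two queries end to end, assuming xinference's OpenAI-compatible endpoint; the endpoint, model UID, and the `LONG_PROMPT`/`q1`/`q2` names from the stand-in script above are placeholders.

```python
# Sketch: time each query end to end against xinference's OpenAI-compatible
# API. Endpoint and model UID are placeholders; LONG_PROMPT, q1, q2 come from
# the stand-in script above.
import time
from openai import OpenAI

client = OpenAI(base_url="http://192.168.xx.xx:9997/v1", api_key="not-needed")

def timed_query(prompt: str) -> float:
    start = time.perf_counter()
    client.chat.completions.create(
        model="qwen2.5-instruct",
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start

print(f"q1 E2E: {timed_query(LONG_PROMPT + q1):.2f} s")  # no cache yet
print(f"q2 E2E: {timed_query(LONG_PROMPT + q2):.2f} s")  # prefix transferred
```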
Test Results:
First query (without cache, just calculating) E2E time:
`LONG_PROMPT+q1`: ~2.96 s
Second query (with transferring) E2E time:
`LONG_PROMPT+q2`: ~1.33 s
Limitations
Xavier is built on vllm's prefix caching, so it requires `enable_prefix_caching` to be enabled. The vllm version needs to be >= 0.6.5.
Xavier cannot work with the 0.0.0.0 address, so when starting xinference, you need to use the actual IP address, for example: `xinference-local -H 192.168.xx.xx`.