[WIP] Hybrid allocator for full attention & sliding window attention interleaved models #12655
base: main
Conversation
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
This pull request has merge conflicts that must be resolved before it can be merged.
This PR is built on top of #12086; it will be rebased onto the main branch and the merge conflicts resolved after that PR is merged. The diff between this PR and #12086 is here: heheda12345/vllm@grouped_block_table...heheda12345:vllm:hybrid_allocator
This PR works toward the hybrid memory allocator (#11382) and does the following:
- Extends `KVCacheManager` to support multiple KV cache groups.
- Introduces `SpecializedManager`, an abstraction for expressing the allocation, free, and prefix-caching logic of the KV cache for different attention variants (a hypothetical sketch follows this list).
- Updates `KVCacheConfig` and `GPUModelRunner.initialize_kv_cache` to allow multiple layers to share the same KV cache memory pool.
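To make the list above concrete, here is a minimal, hypothetical sketch of what such a per-variant manager split could look like. All names here (`KVCacheBlock`, `get_num_blocks_to_allocate`, `get_freeable_blocks`, `FullAttentionManager`, `SlidingWindowManager`) are illustrative assumptions, not the API introduced by this PR:

```python
# Hypothetical sketch of a SpecializedManager-style abstraction; the real
# vLLM implementation in this PR will differ.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class KVCacheBlock:
    """A fixed-size block of KV cache memory drawn from the shared pool."""
    block_id: int


class SpecializedManager(ABC):
    """Per-attention-variant block lifetime policy for one KV cache group."""

    def __init__(self, block_size: int):
        self.block_size = block_size

    @abstractmethod
    def get_num_blocks_to_allocate(self, num_tokens: int) -> int:
        """How many blocks are needed to hold num_tokens tokens."""

    @abstractmethod
    def get_freeable_blocks(self, blocks: list[KVCacheBlock],
                            num_computed_tokens: int) -> list[KVCacheBlock]:
        """Blocks no longer needed, to be returned to the shared pool."""


class FullAttentionManager(SpecializedManager):
    """Standard attention: every token's KV stays live for the whole request."""

    def get_num_blocks_to_allocate(self, num_tokens: int) -> int:
        return -(-num_tokens // self.block_size)  # ceil division

    def get_freeable_blocks(self, blocks, num_computed_tokens):
        return []  # nothing is freeable while the request is running


class SlidingWindowManager(SpecializedManager):
    """Sliding window attention: only the most recent `window` tokens matter."""

    def __init__(self, block_size: int, window: int):
        super().__init__(block_size)
        self.window = window

    def get_num_blocks_to_allocate(self, num_tokens: int) -> int:
        # Live KV is capped by the window size (block alignment is
        # ignored here for simplicity).
        return -(-min(num_tokens, self.window) // self.block_size)

    def get_freeable_blocks(self, blocks, num_computed_tokens):
        # A block is dead once all of its tokens fall outside the window;
        # blocks are assumed to be ordered oldest-first.
        first_live_token = max(0, num_computed_tokens - self.window)
        return blocks[:first_live_token // self.block_size]
```

Under this reading, `KVCacheManager` can drive every KV cache group through the same interface while each variant decides its own block lifetime; a sliding-window group hands out-of-window blocks back to the shared pool, which is presumably where the memory savings and speedup on hybrid models such as Gemma2 come from.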
Benchmark results:
The following benchmarks were run on an H100 with Gemma2, a hybrid model that interleaves sliding window attention layers and standard attention layers, and Llama, a model with only standard attention layers. The hybrid allocator accelerates the hybrid model while introducing only very little overhead on standard full-attention models.
This PR (68fe2db):
Main branch (df450aa):
CC @comaniac @WoosukKwon