[WIP] Hybrid allocator for full attention & sliding window attention interleaved models #12655

heheda12345 · 2025-02-02T03:41:43Z

This PR is built on top of #12086. I will rebase and resolve merge conflicts with the main branch after that PR is merged. The diff between this PR and #12086 is here: heheda12345/vllm@grouped_block_table...heheda12345:vllm:hybrid_allocator

This PR works on the hybrid memory allocator (#11382) and does the following:

Extend KVCacheManager to support multiple KV cache groups.
Introduce SpecializedManager, an abstraction for expressing the allocation, freeing, and prefix caching logic of the KV cache for different attention variants.
Extend KVCacheConfig and GPUModelRunner.initialize_kv_cache to allow multiple layers to share the same KV cache memory pool.
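
To make the idea of a per-attention-type manager more concrete, here is a minimal sketch of how such an abstraction might decide which KV cache blocks can be freed. This is not the code in this PR: the class names (`SpecializedManagerSketch`, `FullAttentionSketch`, `SlidingWindowSketch`), the `useless_blocks` method, and the window convention (a token attends to itself and the previous `sliding_window - 1` tokens) are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import List


class SpecializedManagerSketch(ABC):
    """Hypothetical per-attention-type policy; not the PR's actual class."""

    def __init__(self, block_size: int) -> None:
        self.block_size = block_size

    @abstractmethod
    def useless_blocks(self, num_computed_tokens: int) -> List[int]:
        """Return logical block indices that no future token will read."""


class FullAttentionSketch(SpecializedManagerSketch):
    def useless_blocks(self, num_computed_tokens: int) -> List[int]:
        # Full attention reads every previous token, so nothing can be freed
        # while the request is still running.
        return []


class SlidingWindowSketch(SpecializedManagerSketch):
    def __init__(self, block_size: int, sliding_window: int) -> None:
        super().__init__(block_size)
        self.sliding_window = sliding_window

    def useless_blocks(self, num_computed_tokens: int) -> List[int]:
        # Assumed convention: the next token (position num_computed_tokens)
        # attends to itself and the previous sliding_window - 1 tokens.
        first_needed_token = max(
            0, num_computed_tokens - self.sliding_window + 1)
        # Blocks that lie entirely before the first needed token are unused.
        first_needed_block = first_needed_token // self.block_size
        return list(range(first_needed_block))


if __name__ == "__main__":
    full = FullAttentionSketch(block_size=16)
    window = SlidingWindowSketch(block_size=16, sliding_window=4096)
    print(len(full.useless_blocks(6144)))
    print(len(window.useless_blocks(6144)))
```

Under these assumptions, a sliding window layer can release the blocks that have fallen out of the window, while a full attention layer has to keep all of them. That difference is where the memory saving for hybrid models would come from.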

Benchmark results:
The following benchmarks were run on an H100 with Gemma2, a hybrid model that combines sliding window attention layers and standard attention layers, and Llama, a model with only standard attention layers. The hybrid allocator accelerates the hybrid model and introduces only very little overhead on standard full attention models.

This PR (68fe2db):

VLLM_USE_V1=1 python3 benchmark_throughput.py --model google/gemma-2-27b-it --input-len 6144 --output-len 1024 --num-prompts 50
Throughput: 0.17 requests/s, 1250.04 total tokens/s, 178.58 output tokens/s
VLLM_USE_V1=1 python3 benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --input-len 6144 --output-len 1024 --num-prompts 50
Throughput: 1.47 requests/s, 10541.44 total tokens/s, 1505.92 output tokens/s

Main branch (df450aa):

VLLM_USE_V1=1 python3 benchmark_throughput.py --model google/gemma-2-27b-it --input-len 6144 --output-len 1024 --num-prompts 50
Throughput: 0.15 requests/s, 1073.40 total tokens/s, 153.34 output tokens/s
VLLM_USE_V1=1 python3 benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --input-len 6144 --output-len 1024 --num-prompts 50
Throughput: 1.48 requests/s, 10629.64 total tokens/s, 1518.52 output tokens/s
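
For a quick comparison of the two runs, the relative change in total tokens/s can be computed from the numbers reported above. The helper name `relative_change` is only for illustration, and this is a single-run comparison, so small differences may be within noise.

```python
def relative_change(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


# (this PR, main branch) total tokens/s, copied from the benchmark output.
results = {
    "google/gemma-2-27b-it": (1250.04, 1073.40),
    "meta-llama/Llama-3.1-8B-Instruct": (10541.44, 10629.64),
}

if __name__ == "__main__":
    for model, (pr_tps, main_tps) in results.items():
        print(f"{model}: {relative_change(pr_tps, main_tps):+.2f}%")
```

This works out to roughly a 16.5% gain for Gemma2 and a slowdown of under 1% for Llama.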

CC @comaniac @WoosukKwon

Signed-off-by: Chen Zhang <zhangch99@outlook.com>


github-actions · 2025-02-02T03:41:55Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

mergify · 2025-02-02T03:42:22Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

heheda12345 added 29 commits January 15, 2025 01:10

can run

0f8a54c

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

fix tests

990d086

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

format

e46fff5

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

fix bug

36a649a

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

add comments

9c36e7d

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

format

da6b549

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

Merge branch 'main' of github.com:vllm-project/vllm into grouped_bloc…

4030199

…k_table

Merge branch 'main' of github.com:vllm-project/vllm into grouped_bloc…

2d8213e

…k_table

update code

41bc571

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

Merge branch 'main' of github.com:vllm-project/vllm into grouped_bloc…

a939b6d

…k_table

can run

34c9d74

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

update comments

cfcf2b4

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

init kv cache for group allocation

4898973

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

can run, result a little strange

ef9dc9d

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

fix small bug

5b71ccd

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

cleanup SpecializedManager

6a0eb69

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

add test and fix bug for sliding window manager

14ad04e

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

remove useless code

eb34a44

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

fix several bugs

f53e824

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

update sliding window test

0ecf3fa

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

small fix, can run gemma2

446e99d

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

add test for range_intersect

d97c1b0

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

clean up get_computed_blocks, append_slots, allocate_slots

5ebfeac

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

finish the clean up of kv cache manager

4e0dc48

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

clean up the code

cd4f8e2

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

fix some tests

2d7bbca

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

remove print kvcacheconfig

68fe2db

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

move files

30e9837

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

add docstrings

e6016e5

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

mergify bot added the v1 label Feb 2, 2025

mergify bot added the needs-rebase label Feb 2, 2025
