[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) #4837

Merged
Commits (68 total; diff shows changes from 47 commits)
7eb0e0d
added block manager tests
afeldman-nm May 15, 2024
6e41c39
passing block manager encoder/decoder test
afeldman-nm May 15, 2024
7bcc4ef
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 15, 2024
f04ee73
block manager v2 changes to pass test_can_allocate_seq_group_encoder_…
afeldman-nm May 15, 2024
07bbd8a
block manager v2 support for encoder/decoder
afeldman-nm May 15, 2024
85e602b
Merge branch 'upstream-main' into infra_enc_dec_block_manager_merge
afeldman-nm May 15, 2024
3e95602
renamed encoder to cross in block manager v2, regarding block tables
afeldman-nm May 15, 2024
04f38a8
renamed encoder to cross where appropriate
afeldman-nm May 15, 2024
2dcd663
formatting
afeldman-nm May 15, 2024
22d4c17
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 15, 2024
a6aba57
Merge branch 'upstream-main' into infra_enc_dec_block_manager_merge
afeldman-nm May 16, 2024
22d9dba
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 17, 2024
954cd54
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 17, 2024
2e245b3
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 21, 2024
63dd42d
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 21, 2024
2ced012
fix wording nits (ben->been, decoder->encoder/decoder)
afeldman-nm May 21, 2024
ed337e8
Merge branch 'upstream-main' into infra_enc_dec_block_manager_merge
afeldman-nm May 22, 2024
8286b4c
changed two block manager tests to construct fake prompts that are eq…
afeldman-nm May 22, 2024
eba551c
keyword args for dummy prompt construction in block manager encoder/d…
afeldman-nm May 22, 2024
a7c8b19
bugfix - decoder prompt kwarg repeated in lieu of encoder prompt kwarg
afeldman-nm May 22, 2024
9feb994
In block manager test which used with block to detect error - created…
afeldman-nm May 22, 2024
5eb0032
refactoring block manager v1/v2 swap in/swap out functions
afeldman-nm May 22, 2024
0644cde
formatting; changed blocktable type specifier from Dict to List[int]
afeldman-nm May 22, 2024
19ed741
prefixed internal method with _
afeldman-nm May 22, 2024
a557972
refactored self-/cross-attention allocation functions into a single h…
afeldman-nm May 22, 2024
e48bebf
Refactored block manager v2 self-/cross-block-table alloc functions t…
afeldman-nm May 22, 2024
18b415f
Merge branch 'upstream-main' into infra_enc_dec_block_manager_review
afeldman-nm May 22, 2024
c6842c8
Merge branch 'infra_enc_dec_block_manager_review' into infra_enc_dec_…
afeldman-nm May 22, 2024
ac2da97
formatting
afeldman-nm May 22, 2024
e985a2f
refactored out block manager v1 swap_n/swap_out helper functions
afeldman-nm May 22, 2024
98c5863
Help function avoids prefix caching code in encoder/decoder scenarios…
afeldman-nm May 22, 2024
defa279
Merge branch 'upstream-main' into infra_enc_dec_block_manager_merge
afeldman-nm May 23, 2024
f3b1b94
Merge branch 'upstream-main' into infra_enc_dec_block_manage_reviews
afeldman-nm May 23, 2024
84f5510
block manager v1 NotImplementError's for sliding window and automatic…
afeldman-nm May 23, 2024
cc61959
Fixes
afeldman-nm May 23, 2024
dcb9abe
formatting
afeldman-nm May 23, 2024
e8c40fc
explanatory comment
afeldman-nm May 23, 2024
5ccb70b
various fixes according to reviews
afeldman-nm May 23, 2024
dfcc28b
slight refactoring
afeldman-nm May 23, 2024
8d3ad05
small refactor
afeldman-nm May 23, 2024
5a76979
replaced all encoder_seq is not None with not decoder_only
afeldman-nm May 23, 2024
09ae4ad
added is_encoder_decoder() method to sequence group
afeldman-nm May 23, 2024
ecd1a99
tests for NotImplemented errors when encoder/decoder models are used …
afeldman-nm May 23, 2024
191a5b6
Merge branch 'upstream-main' into infra_enc_dec_block_manager_reviews
afeldman-nm May 23, 2024
d3935f7
rename tests
afeldman-nm May 23, 2024
e6a7125
spelling error
afeldman-nm May 23, 2024
68b4762
isort
afeldman-nm May 23, 2024
0c5fc61
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 24, 2024
845f040
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 24, 2024
849e49c
Merge branch 'upstream-main' into infra_enc_dec_block_manager_reviews
afeldman-nm May 26, 2024
a80325d
return output of SequenceGroup constructor
afeldman-nm May 26, 2024
8b38776
capitalize constants
afeldman-nm May 26, 2024
f39c313
refactored swap-block-table functionality
afeldman-nm May 26, 2024
90b5a0e
Refactored block manager + enc dec + unsupported feature checks into …
afeldman-nm May 26, 2024
9ee2582
removed circular import
afeldman-nm May 26, 2024
5d0ac23
apparently isort has to run last?
afeldman-nm May 26, 2024
1bcc949
slight name change
afeldman-nm May 26, 2024
5ae5969
merge
afeldman-nm May 28, 2024
1bece71
wip merge
afeldman-nm May 28, 2024
1d882ca
fixed utils to correctly handle encoder/decoder unsupported scenarios
afeldman-nm May 28, 2024
dfd9469
formatting
afeldman-nm May 28, 2024
611df43
yapf fix
afeldman-nm May 29, 2024
8ee49dd
yapf fix
afeldman-nm May 29, 2024
6f4b49e
Merge branch 'upstream-main' into infra_enc_dec_block_manager_reviews
afeldman-nm May 29, 2024
039c25e
upstream merge
afeldman-nm May 29, 2024
8e9ef5b
fix formatting issue
afeldman-nm May 29, 2024
2b59ddc
formatting
afeldman-nm May 29, 2024
471569f
Merge branch 'upstream-main' into infra_enc_dec_block_manager_reviews
afeldman-nm May 29, 2024
156 changes: 154 additions & 2 deletions tests/core/block/test_block_manager_v2.py
@@ -1,11 +1,13 @@
import pytest

from vllm.core.block_manager_v2 import BlockSpaceManagerV2
from vllm.core.block_manager_v2 import (BlockSpaceManagerV2,
                                        str_not_impl_enc_dec_prefix_cache,
                                        str_not_impl_enc_dec_swa)
from vllm.core.interfaces import AllocStatus
from vllm.sequence import Logprob, SequenceStatus
from vllm.utils import chunk_list

from ..utils import create_seq_group
from ..utils import create_seq_group, create_seq_group_encoder_decoder


@pytest.mark.parametrize("block_size", [16])
@@ -52,6 +54,156 @@ def test_can_allocate_seq_group(block_size: int, num_seqs_per_group: int,
assert can_allocate_result == AllocStatus.LATER


@pytest.mark.parametrize("block_size", [16])
@pytest.mark.parametrize("num_gpu_blocks", [16, 80, 160])
@pytest.mark.parametrize("num_seqs_per_group", [1, 4])
@pytest.mark.parametrize("watermark", [0.0, 0.5])
def test_can_allocate_seq_group_encoder_decoder(block_size: int,
                                                num_seqs_per_group: int,
                                                num_gpu_blocks: int,
                                                watermark: float):
    block_manager = BlockSpaceManagerV2(
        block_size=block_size,
        num_gpu_blocks=num_gpu_blocks,
        num_cpu_blocks=1024,
        watermark=watermark,
    )
    num_watermark_blocks = int(watermark * num_gpu_blocks)

    num_output_blocks_per_seq = 1

    # NOTE: This should be num_output_blocks_per_seq * num_seqs_per_group, but
    # the current implementation assumes all seqs are new prompts / don't have
    # different output lens.
    num_output_blocks = num_output_blocks_per_seq

    for bdx, num_prompt_blocks in enumerate(
            range(1, num_gpu_blocks - num_output_blocks)):
        num_cross_blocks_per_seq = num_prompt_blocks

        seq_group = create_seq_group_encoder_decoder(
            seq_prompt_len=block_size * num_prompt_blocks,
            seq_output_lens=[
                block_size * num_output_blocks_per_seq
                for _ in range(num_seqs_per_group)
            ],
            request_id=str(bdx))

        assert num_prompt_blocks + num_output_blocks <= num_gpu_blocks

        can_allocate_result = block_manager.can_allocate(seq_group)

        num_required_blocks = num_prompt_blocks + \
                              num_output_blocks + \
                              num_cross_blocks_per_seq

        if num_gpu_blocks - num_required_blocks < num_watermark_blocks:
            assert can_allocate_result == AllocStatus.NEVER
        elif num_gpu_blocks >= num_required_blocks:
            assert can_allocate_result == AllocStatus.OK
        else:
            assert can_allocate_result == AllocStatus.LATER
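
The block accounting that the test above checks can be sketched in isolation. This is only an illustration of the arithmetic in the test, not the block manager's implementation: the `AllocStatus` enum here is a local stand-in for `vllm.core.interfaces.AllocStatus`, and `expected_status` is a hypothetical helper name.

```python
from enum import Enum, auto


class AllocStatus(Enum):
    # Local stand-in for vllm.core.interfaces.AllocStatus (assumption).
    OK = auto()
    LATER = auto()
    NEVER = auto()


def expected_status(num_gpu_blocks: int, num_prompt_blocks: int,
                    num_output_blocks: int, num_cross_blocks: int,
                    watermark: float) -> AllocStatus:
    """Hypothetical helper mirroring the test's expectation logic."""
    num_watermark_blocks = int(watermark * num_gpu_blocks)
    # Encoder/decoder groups need decoder self-attention blocks
    # (prompt + output) plus cross-attention blocks for the encoder prompt.
    num_required_blocks = (num_prompt_blocks + num_output_blocks +
                           num_cross_blocks)
    if num_gpu_blocks - num_required_blocks < num_watermark_blocks:
        return AllocStatus.NEVER
    if num_gpu_blocks >= num_required_blocks:
        return AllocStatus.OK
    # With a non-negative watermark this branch is not reached by this
    # arithmetic alone; it is kept to match the shape of the test.
    return AllocStatus.LATER


# 16 GPU blocks, 4 prompt + 1 output + 4 cross = 9 required blocks.
print(expected_status(16, 4, 1, 4, watermark=0.0))
print(expected_status(16, 4, 1, 4, watermark=0.5))
```

In the test, `num_cross_blocks_per_seq` equals `num_prompt_blocks` because the dummy encoder and decoder prompts have equal length, so each iteration needs roughly twice the prompt blocks of the decoder-only case.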


@pytest.mark.parametrize("block_size", [16])
@pytest.mark.parametrize("num_gpu_blocks", [16])
@pytest.mark.parametrize("num_seqs_per_group", [1])
@pytest.mark.parametrize("watermark", [0.0, 0.5])
def test_can_allocate_encoder_decoder_fails_with_swa(block_size: int,
                                                     num_seqs_per_group: int,
                                                     num_gpu_blocks: int,
                                                     watermark: float):
    '''
    SWA is short for Sliding Window Attention.

    At time of writing, block manager v2 does not support SWA.

    However, even when SWA is implemented for block manager v2,
    there will still most likely be a separate workstream required
    to enable SWA for encoder/decoder models.

    Therefore this test enforces that one of the following cases
    holds true:
    1. Block manager v2 does not support SWA at all (true at time of writing)
    2. Block manager v2 fails with NotImplementedError when SWA is enabled
       AND a SequenceGroup with an encoder sequence (i.e. in support of an
       encoder/decoder model) is passed into can_allocate() as an argument

    The setup for this test is a stripped-down version of
    test_can_allocate_seq_group_encoder_decoder()
    '''

    with pytest.raises((NotImplementedError, AssertionError)) as exc_info:
        block_manager = BlockSpaceManagerV2(
            block_size=block_size,
            num_gpu_blocks=num_gpu_blocks,
            num_cpu_blocks=1024,
            watermark=watermark,
            sliding_window=5  # SWA
        )

        num_output_blocks_per_seq = 1
        num_prompt_blocks = 1
        num_output_blocks = num_output_blocks_per_seq
        seq_group = create_seq_group_encoder_decoder(
            seq_prompt_len=block_size * num_prompt_blocks,
            seq_output_lens=[
                block_size * num_output_blocks_per_seq
                for _ in range(num_seqs_per_group)
            ],
            request_id="0")

        assert num_prompt_blocks + num_output_blocks <= num_gpu_blocks
        block_manager.can_allocate(seq_group)

    # Assert that either
    # 1. Block manager v2 constructor fails with assertion that sliding window
    #    is not yet supported (most likely near-term outcome at time of
    #    writing), or
    # 2. can_allocate() fails with NotImplementedError due to combination of
    #    encoder/decoder and sliding window attention
    if isinstance(exc_info.value, NotImplementedError):
        assert str(exc_info.value) == str_not_impl_enc_dec_swa
    elif isinstance(exc_info.value, AssertionError):
        assert str(exc_info.value) == "Sliding window not yet supported"
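
The pattern used in the test above (accept either of two exception types, then branch on which one was raised) can be shown with a small standalone sketch. `build_manager` is a hypothetical stand-in for the `BlockSpaceManagerV2` constructor, and the checks on `exc_info` sit after the `with` block because statements following the raising call inside the block would never run.

```python
import pytest


def build_manager(sliding_window=None):
    # Hypothetical stand-in for a constructor that rejects sliding window.
    assert sliding_window is None, "Sliding window not yet supported"
    return object()


def check_swa_rejected() -> str:
    # pytest.raises accepts a tuple of acceptable exception types.
    with pytest.raises((NotImplementedError, AssertionError)) as exc_info:
        build_manager(sliding_window=5)

    # exc_info.value is only populated once the with block has exited.
    if isinstance(exc_info.value, AssertionError):
        assert str(exc_info.value) == "Sliding window not yet supported"
    return type(exc_info.value).__name__


print(check_swa_rejected())
```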


@pytest.mark.parametrize("block_size", [16])
@pytest.mark.parametrize("num_gpu_blocks", [16])
@pytest.mark.parametrize("num_seqs_per_group", [1])
@pytest.mark.parametrize("watermark", [0.0, 0.5])
def test_can_allocate_encoder_decoder_fails_with_prefix_cache(
        block_size: int, num_seqs_per_group: int, num_gpu_blocks: int,
        watermark: float):

    block_manager = BlockSpaceManagerV2(
        block_size=block_size,
        num_gpu_blocks=num_gpu_blocks,
        num_cpu_blocks=1024,
        watermark=watermark,
        enable_caching=True  # Prefix cache
    )

    num_output_blocks_per_seq = 1
    num_prompt_blocks = 1
    num_output_blocks = num_output_blocks_per_seq
    seq_group = create_seq_group_encoder_decoder(
        seq_prompt_len=block_size * num_prompt_blocks,
        seq_output_lens=[
            block_size * num_output_blocks_per_seq
            for _ in range(num_seqs_per_group)
        ],
        request_id="0")

    assert num_prompt_blocks + num_output_blocks <= num_gpu_blocks

    # Assert that can_allocate() fails with NotImplementedError
    # due to combination of encoder/decoder and prefix cache
    with pytest.raises(NotImplementedError) as exc_info:
        block_manager.can_allocate(seq_group)
    assert str(exc_info.value) == str_not_impl_enc_dec_prefix_cache
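
The two failure tests above rely on a guard that rejects unsupported feature combinations for encoder/decoder sequence groups. A minimal sketch of such a guard follows; the function name, its parameters, and both message strings are placeholders for illustration and are not the strings or signatures used in vLLM.

```python
from typing import Optional

# Placeholder messages (assumption); the real constants are defined in vLLM.
STR_NOT_IMPL_ENC_DEC_SWA = "placeholder: SWA unsupported for encoder/decoder"
STR_NOT_IMPL_ENC_DEC_PREFIX_CACHE = (
    "placeholder: prefix caching unsupported for encoder/decoder")


def check_enc_dec_supported(is_encoder_decoder: bool,
                            sliding_window: Optional[int],
                            enable_caching: bool) -> None:
    """Hypothetical guard: raise if an encoder/decoder request is combined
    with a feature that has no encoder/decoder support yet."""
    if not is_encoder_decoder:
        # Decoder-only requests are unaffected by this guard.
        return
    if sliding_window is not None:
        raise NotImplementedError(STR_NOT_IMPL_ENC_DEC_SWA)
    if enable_caching:
        raise NotImplementedError(STR_NOT_IMPL_ENC_DEC_PREFIX_CACHE)


# Decoder-only: passes silently even with both features enabled.
check_enc_dec_supported(False, sliding_window=5, enable_caching=True)
```

Comparing against shared message constants, as the tests do, keeps the expected text in one place so that a wording change does not silently break the assertion.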


@pytest.mark.parametrize("block_size", [1, 8])
@pytest.mark.parametrize("prompt_len", [1, 7, 8])
@pytest.mark.parametrize("num_slots_to_append", [1, 8, 129])