[Core] Enable prefix caching with block manager v2 enabled #4142

Merged: 7 commits merged into vllm-project:main on May 1, 2024

Conversation

@leiwen83 (Contributor) commented Apr 17, 2024

This PR enables the automatic prefix caching (APC) feature on top of block manager V2.

The core idea of APC is to maintain a table mapping content hashes to physical blocks. Once a block is full, it is promoted to immutable. When several immutable blocks arrive with the same content, we keep the first physical block and free the others.

Those freed physical blocks (i.e., blocks with no remaining references in the scheduler) are kept in an evictor. When we later try to allocate a block under memory pressure and, unlike in the initial state, find no free block, we can ask the evictor to pop an unreferenced computed block and reuse it to store new data coming from the upper layer.

The eviction policy is pluggable, but currently only an LRU policy is implemented.

Co-authored-by: Sage Moore <sagemoore@utexas.edu>
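
For illustration, a minimal sketch of the hash-to-block table idea described above (names and structure are hypothetical, not the PR's actual classes):

    from typing import Dict, List, Optional

    class PrefixCacheTable:
        """Maps the content hash of a full (immutable) block to the one
        physical block that stores that content."""

        def __init__(self) -> None:
            self._hash_to_block: Dict[int, int] = {}

        def promote(self, token_ids: List[int], prefix_hash: Optional[int],
                    physical_block: int) -> int:
            """Called when a block fills up and becomes immutable.

            Returns the canonical physical block for this content: the newly
            promoted block if the content is unseen, otherwise the previously
            cached block (the caller can then free the duplicate).
            """
            # Hash over the whole prefix, not just this block's tokens, so
            # identical blocks at different positions are not conflated.
            content_hash = hash((prefix_hash, tuple(token_ids)))
            if content_hash in self._hash_to_block:
                return self._hash_to_block[content_hash]
            self._hash_to_block[content_hash] = physical_block
            return physical_block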

@leiwen83 (Contributor, Author)

@cadedaniel
Would you mind taking a look at this patch to enable prefix caching with block manager V2?

@SageMoore (Contributor) commented Apr 17, 2024

Hi, @leiwen83 looks like we collided a bit. I just posted #4146 which also adds prefix caching support to V2 of the block manager. Would you be willing to work together to get that PR over the finish line? The main thing I like from this PR is that it stores references to the blocks in the PrefixCachingBlockAllocator. I did something similar with a new BlockMetaData structure, but I think we can make some minor adjustments to the existing LRU evictor such that it will work with generic block objects. That way neither one of us has to add a new, but basically the same, evictor :).

Occasional duplication of work is an unfortunate reality with popular open source projects, but I think we can work together and end up merging something pretty good.

@leiwen83 (Contributor, Author)

Hi @SageMoore

what a coincidence ;)

But I'd like to hear from @cadedaniel first before taking further action; he may give us some good advice.

Thx,
Lei

@cadedaniel (Collaborator)

Thanks @leiwen83 and @SageMoore for the work ~~

Priority-wise, whoever has a more complete reviewable PR should get review precedence. I'll look at this one first as it's no longer WIP / tests are passing. @SageMoore feel free to add review comments on missing requirements or improvements. I haven't thought through the Evictor design tradeoffs so especially clarifying which approach we should go with for that would be helpful!

In the future we should be more communicative in #3667 so we don't have mix-ups.

@SageMoore (Contributor) left a comment

I think moving incrementally here is good. Let's try to get this PR into a mergeable state and go from there.

There are four high level changes that I would suggest for this PR.

  1. Move the core/block/evictor.py file to core/evictor_v2.py, similar to what was done for block_manager. Rename core/evictor.py to core/evictor_v1.py.
  2. Let's remove unused_cached_blocks from the PrefixCachingBlockAllocator and just rely on the evictor.
  3. Assuming that we don't want to make the policy an argument to the BlockSpaceManagerV2, let's move the code that records last_accessed and computed into the PrefixCachingBlockAllocator (see the sketch after this list). The code to manage and record these two variables will likely change if a non-LRU policy is used. For example, we wouldn't want to spend time recording last_accessed time if we switched to a FIFO eviction policy.
  4. Evictor unit tests would be good. The eviction logic is somewhat cumbersome to exercise in integration tests because it only runs when we run out of blocks.
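
For illustration, a rough sketch of the policy-agnostic evictor interface point 3 is getting at (hypothetical names, not the PR's actual code):

    from abc import ABC, abstractmethod
    from typing import Dict, Tuple

    class Evictor(ABC):
        """Tracks unreferenced computed blocks and picks eviction victims."""

        @abstractmethod
        def add(self, block_id: int, num_hashed_tokens: int,
                last_accessed: float) -> None:
            """Register a freed (refcount == 0) computed block."""

        @abstractmethod
        def evict(self) -> int:
            """Return the block_id the policy chooses for reuse."""

    class LRUEvictor(Evictor):
        def __init__(self) -> None:
            # block_id -> (last_accessed, num_hashed_tokens)
            self._blocks: Dict[int, Tuple[float, int]] = {}

        def add(self, block_id: int, num_hashed_tokens: int,
                last_accessed: float) -> None:
            self._blocks[block_id] = (last_accessed, num_hashed_tokens)

        def evict(self) -> int:
            # Evict the least recently used block; break ties by evicting
            # the block holding the most hashed tokens (one plausible
            # tie-break, not necessarily the PR's).
            victim = min(self._blocks,
                         key=lambda b: (self._blocks[b][0], -self._blocks[b][1]))
            del self._blocks[victim]
            return victim

A FIFO policy would subclass Evictor without ever reading last_accessed, which is why point 3 suggests keeping the recording logic inside the allocator rather than the block manager.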

Inline comment on vllm/core/block/prefix_caching_block.py:

    pass


    class Block():
@SageMoore (Contributor):
There's a bit of a name collision here. Can you rename this since "Block" is already used extensively? I went with BlockMetaData. Feel free to use that or a new name :). EvictorBlockInfo?

@leiwen83 (Contributor, Author):
BlockMetaData is nice, I'll take it~
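
For illustration, the renamed structure could be as simple as (fields are hypothetical):

    from dataclasses import dataclass

    @dataclass
    class BlockMetaData:
        """Bookkeeping for a physical block while it sits in the evictor."""
        block_id: int
        content_hash: int
        num_hashed_tokens: int
        last_accessed: float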

Comment on the diff:

    @@ -351,6 +376,9 @@ def __init__(
            self._prev_block = prev_block
            self._cached_content_hash: Optional[int] = None
            self._prefix_caching_allocator = prefix_caching_allocator
            self.computed = False
            self.num_hashed_tokens = 0
@SageMoore (Contributor):
Nit: Instead of manually setting this when marking blocks as computed, you should be able to do something like:

    @property
    def num_tokens_total(self) -> int:
        return self._prev_block.num_tokens_total + len(self.token_ids)

You'll have to account for the _prev_block == None case.
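
For illustration, the base case could be handled like so (a sketch, assuming token_ids is available on the block):

    @property
    def num_tokens_total(self) -> int:
        # Base case: the first block of the sequence has no predecessor.
        if self._prev_block is None:
            return len(self.token_ids)
        return self._prev_block.num_tokens_total + len(self.token_ids)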

@leiwen83 (Contributor, Author):
I didn't get the num_tokens_total usage here...

@SageMoore (Contributor):

To be clear, I'm talking about using the "num_tokens_total" property that I posted above as a replacement for "num_hashed_tokens". The primary benefit is that you won't have to update it as part of "mark_blocks_as_computed".

@leiwen83 (Contributor, Author):

I see, I added this property in the latest commit.

@leiwen83 (Contributor, Author)

> I think moving incrementally here is good. Let's try to get this PR into a mergeable state and go from there.
>
> There are four high level changes that I would suggest for this PR.
>
>   1. Move the core/block/evictor.py file to core/evictor_v2.py, similar to what was done for block_manager. Rename core/evictor.py to core/evictor_v1.py.
>   2. Let's remove unused_cached_blocks from the PrefixCachingBlockAllocator and just rely on the evictor.
>   3. Assuming that we don't want to make the policy an argument to the BlockSpaceManagerV2, let's move the code that records last_accessed and computed into the PrefixCachingBlockAllocator. The code to manage and record these two variables will likely change if a non-LRU policy is used. For example, we wouldn't want to spend time recording last_accessed time if we switched to a FIFO eviction policy.
>   4. Evictor unit tests would be good. The eviction logic is somewhat cumbersome to exercise in integration tests because it only runs when we run out of blocks.

Great, thanks for those comments! I borrowed the evictor test case from your PR and made some changes to get it passing.
Let's see what else needs to be done for this PR.

@SageMoore (Contributor)

This looks reasonable to me. Thanks for working on it!

@cadedaniel (Collaborator) left a comment

Thanks for the PR. Some missing things:

  • Can you add a PR description? It helps with fast reviews -- you can detail your approach, the testing, what's missing, etc. Shouldn't take long!
  • To call automatic prefix caching in BlockManagerV2 "done", we need end-to-end tests.
    • See this example, which compares the decoding output of a sequence with lookahead scheduling against output without lookahead scheduling and verifies they are the same:

      def test_lookahead_greedy_equality_with_preemption(baseline_llm_generator,
                                                         test_llm_generator,
                                                         batch_size):
          """Verify vLLM produces the same output with greedy sampling, when lookahead
          scheduling is used vs. not.

          Lookahead scheduling is not expected to modify the output, as it simply
          allocates empty slots ahead of the known token ids in a sliding fashion.

          This test constrains the total number of blocks to force preemption. It also
          varies the block size so that the lookahead size is less than and greater
          than the block size.
          """
    • Off the top of my head, we'll want to test the following in APC (a sketch of such a test follows this list). Maybe @SageMoore can also help here, since it's a bit out of scope of the BlockManagerV2 itself.
      • Greedy output equality with APC and without APC
      • Greedy output equality with APC + preemption
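
For illustration, a hedged sketch of the first of those checks, using vLLM's public LLM API (model name and prompts are placeholders; this is not the PR's actual test):

    from vllm import LLM, SamplingParams

    def test_greedy_equality_with_and_without_apc():
        prompts = ["The capital of France is"] * 4
        greedy = SamplingParams(temperature=0.0, max_tokens=32)

        baseline = LLM(model="facebook/opt-125m", use_v2_block_manager=True)
        baseline_out = [o.outputs[0].text
                        for o in baseline.generate(prompts, greedy)]
        del baseline  # release GPU memory before starting the second engine

        apc = LLM(model="facebook/opt-125m",
                  use_v2_block_manager=True,
                  enable_prefix_caching=True)
        apc_out = [o.outputs[0].text for o in apc.generate(prompts, greedy)]

        # Greedy decoding should be unaffected by prefix caching.
        assert baseline_out == apc_out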

Inline review comments (since resolved) on: tests/core/block/test_prefix_caching_block.py, vllm/core/block/cpu_gpu_block_allocator.py, vllm/core/block_manager_v2.py.
@leiwen83 (Contributor, Author)

> Thanks for the PR. Some missing things:
>
>   • Can you add a PR description? It helps with fast reviews [...]
>   • To call automatic prefix caching in BlockManagerV2 "done", we need end-to-end tests (see the test_lookahead_greedy_equality_with_preemption example above): greedy output equality with APC and without APC, and with APC + preemption.

Got it, I refactored the PR and made the changes as suggested. For e2e tests, I added two: one with prefix caching enabled comparing v1 vs. v2, and one with v2 + preemption comparing w/ APC vs. w/o APC.

@cadedaniel (Collaborator) left a comment

Please add a TODO on the O(N^2) num_total_tokens. Otherwise LGTM.

Comment on lines 486 to 490:

    _block = self
    self._cached_num_tokens_total = 0
    while _block is not None:
        self._cached_num_tokens_total += len(_block.token_ids)
        _block = _block.prev_block
@cadedaniel (Collaborator):

This is still O(N^2) for the entire sequence, i.e. each block performs an O(N) scan. We should use the design I linked you for content_hash, which is O(N).

I will not block the PR (we can optimize later) but you should add a TODO!

@leiwen83 (Contributor, Author):

I see, I'll add this TODO note.
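
For reference, a hedged sketch of the O(N) approach @cadedaniel describes, memoizing per block the same way _cached_content_hash does (assuming _cached_num_tokens_total is initialized to None and is only read once the block is full and immutable):

    @property
    def num_tokens_total(self) -> int:
        # Each block adds only its own token count on top of its
        # predecessor's cached total, so computing totals for a whole
        # sequence of N blocks is O(N) overall instead of O(N^2).
        if self._cached_num_tokens_total is None:
            prev = (self._prev_block.num_tokens_total
                    if self._prev_block is not None else 0)
            self._cached_num_tokens_total = prev + len(self.token_ids)
        return self._cached_num_tokens_total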

@leiwen83 leiwen83 requested a review from cadedaniel May 1, 2024 13:59
@cadedaniel (Collaborator) left a comment

Thanks @leiwen83

@cadedaniel cadedaniel merged commit 24750f4 into vllm-project:main May 1, 2024
48 checks passed
@leiwen83 leiwen83 deleted the block_v2_prefix branch May 2, 2024 08:58
This pull request was later referenced by downstream commits (co-authored by Lei Wen <wenlei03@qiyi.com> and Sage Moore <sagemoore@utexas.edu>):

  • robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm (May 6, 2024)
  • z103cb pushed a commit to z103cb/opendatahub_vllm (May 7, 2024)
  • dtrifiro pushed a commit to opendatahub-io/vllm (May 7, 2024)
  • Temirulan pushed a commit to Temirulan/vllm-whisper (Sep 6, 2024)