
[Feature] [Spec decode]: Combine chunked prefill with speculative decoding #9291

Open
wants to merge 36 commits into base: main
Conversation

Contributor

@NickLucche NickLucche commented Oct 11, 2024

Hey, this PR implements #5016.

The main idea is to build on the current speculative decoding workflow and integrate it with mixed prefill-decode batches.
In particular, we can run the batched prefills and decodes together through the scorer (with the usual prefill|decode layout supported by the backend), while the proposer only needs to sync its KV cache on the prefills.

[image: diagram of the mixed prefill-decode workflow through proposer and scorer]
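To make the batch layout concrete, here is a minimal hypothetical sketch of the split described above; `Request` and `split_prefill_decode` are illustrative names, not the actual vLLM interfaces touched by this PR.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    request_id: int
    is_prefill: bool  # True for a (possibly chunked) prefill, False for a decode


def split_prefill_decode(batch: List[Request]) -> Tuple[List[Request], List[Request]]:
    """Split a batch laid out as prefill|decode into its two segments."""
    prefills = [r for r in batch if r.is_prefill]
    decodes = [r for r in batch if not r.is_prefill]
    return prefills, decodes


# The scorer (target model) consumes the whole mixed batch, while the proposer
# only syncs its KV cache on the prefill segment and proposes for the decodes.
batch = [Request(0, True), Request(1, True), Request(2, False), Request(3, False)]
prefills, decodes = split_prefill_decode(batch)
assert len(prefills) == 2 and len(decodes) == 2
```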

The current attention kernel implementation still doesn't make full use of the prefill|decode layout, but once the MQA integration is finalized we can get an easy speedup by running the whole batch in a single forward pass.

The current implementation on main is already (to some extent) prefill aware, so I was able to re-use a good chunk of the logic and the changes are (purposely) not drastic.
On the other hand, one could prioritize optimization more aggressively, and I am open to any suggestion on how best to implement the approach, even at the cost of re-writing more parts and making the PR more invasive (i.e. breaking some of the interfaces to avoid duplication).

TODO:

  • benchmark on A/H100
  • expand test coverage with prefill chunking enabled
  • test with the new mqa_scorer; the current implementation was rebased from v0.6.2
  • fix speculative methods requiring return_hidden_states. EDIT: on second thought, I believe this is better addressed in a separate PR
  • disable_logprobs_during_spec_decoding compatibility

Update:

We add support for chunked prefill and spec decoding with the workflow depicted above, unless the proposer requires the final hidden state from the target model (MLPSpeculator/Medusa): that would require supporting chunked hidden states too, since the input x is now split into blocks x1|x2|..|xn, so it definitely needs its own PR if we want to include it.
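As a rough illustration of the issue (made-up shapes, not the PR's code): with chunked prefill, the prompt's hidden states arrive chunk by chunk, and only the final chunk carries the last-prompt-token hidden state that proposers like MLPSpeculator/Medusa condition on.

```python
import torch

# The prompt hidden states are now produced chunk by chunk: x = x1 | x2 | x3.
chunk_hidden_states = [
    torch.randn(16, 4096),  # x1
    torch.randn(16, 4096),  # x2
    torch.randn(8, 4096),   # x3 (terminal chunk)
]

# Earlier chunks sample no token and yield no usable proposer input; only the
# terminal chunk contains the hidden state of the last prompt token.
last_prompt_hidden_state = chunk_hidden_states[-1][-1]
assert last_prompt_hidden_state.shape == (4096,)
```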

mqa_scorer is set to supersede BatchExpansion* thanks to the great work by @LiuXiaoxuanPKU, so we add support for that scorer directly in this PR!
Incidentally, this means enabling backends with flash_attn_varlen_func to take any "mixed prefill-decode batch" in a single kernel call (no more decoupled prefill and decode calls), which should also boost performance under the "vanilla" chunked prefill scheduling policy (no spec).
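For illustration, here is a standalone sketch of how one flash_attn_varlen_func call can cover a batch that mixes long prefill chunks with single-token decode queries. It assumes the flash-attn package and a CUDA device are available, and it omits the paged KV cache that the real backend plugs in.

```python
import torch
from flash_attn import flash_attn_varlen_func

num_heads, head_dim = 8, 64
# Two 16-token prefill chunks followed by three single-token decodes,
# packed back to back in one tensor (the prefill|decode layout).
query_lens = [16, 16, 1, 1, 1]
total_tokens = sum(query_lens)

q = torch.randn(total_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(total_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(total_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")

# cu_seqlens_* are the cumulative per-sequence length offsets, starting at 0.
cu_seqlens = torch.tensor([0, 16, 32, 33, 34, 35], dtype=torch.int32, device="cuda")

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(query_lens),
    max_seqlen_k=max(query_lens),
    causal=True,
)
assert out.shape == (total_tokens, num_heads, head_dim)
```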

Many thanks to @sroy745 for benchmarking the BatchExpansionTop1Scorer approach here (MQA to follow)!


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@NickLucche NickLucche marked this pull request as draft October 11, 2024 17:04
Contributor

@sroy745 sroy745 left a comment


Thanks for the pr. Left some comments. PTAL

# TODO skip this if chunking is not enabled
if len(non_spec_indices):
    all_hidden_states = proposal_scores.hidden_states
    # TODO fix `return_hidden_states`
Contributor

can you clarify more on this TODO about return_hidden_states?

Contributor Author

Here you have a hidden state entry even for non-terminal chunks, while the LogitsProcessor only selects and returns the indices that need sampling; hence we need to use the indices prior to the do_sample-based filtering to get the right hidden states.
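A tiny illustration of that indexing (made-up shapes and flags, not the PR's code):

```python
import torch

# One hidden-state row per sequence in the batch, including non-terminal
# prefill chunks that do not sample a token this step.
hidden_states = torch.randn(5, 4096)
do_sample = [True, False, True, True, False]

# Indices of the sampling sequences, taken w.r.t. the *unfiltered* batch
# layout, so they can pick out the matching hidden-state rows.
sample_indices = [i for i, flag in enumerate(do_sample) if flag]
sampled_hidden_states = hidden_states[torch.tensor(sample_indices)]
assert sampled_hidden_states.shape == (3, 4096)
```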

Contributor Author

btw I am planning to have that case covered too

(seq_id, seq_data) for sg in \
execute_model_req.seq_group_metadata_list \
for seq_id, seq_data in sg.seq_data.items()
)
if sg.do_sample # ignore empty token sequences
Contributor

Is this going to change the order of entries in seq_data_entries and seq_output_prompt_logprobs? In the loop at L542 and L543, can we use the same value of output_index to access seq_data_entries and seq_output_prompt_logprobs?

Contributor Author

The relative order won't change; the input is guaranteed to be prefill|decodes, so you have something like

seq1: chunk to sample | chunk no sample | chunk to sample | decode | ... | decode

filtered to

seq2: chunk to sample | chunk to sample | decode | ... | decode

so seq2 is a subset of seq1.

We used to iterate over the filtered sequences seq2 (we had no chunks); now we iterate over seq1 to account for empty outputs and keep the old index as output_index (only incrementing it on the sampled elements), so the order is maintained.
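As a minimal illustration of the alignment described above (a simplified sketch, not the PR's code): sampler outputs exist only for the sampling entries, so the index into them advances only when one is actually consumed, and non-sampling chunks get an empty placeholder.

```python
# Entries in prefill|decode order; the bool marks do_sample.
entries = [
    ("prefill chunk", True),
    ("prefill chunk", False),   # non-terminal chunk, no sampled token
    ("prefill chunk", True),
    ("decode", True),
    ("decode", True),
]
sampler_outputs = ["tok_a", "tok_b", "tok_c", "tok_d"]  # one per sampling entry

aligned_outputs = []
output_index = 0
for kind, do_sample in entries:
    if do_sample:
        aligned_outputs.append(sampler_outputs[output_index])
        output_index += 1
    else:
        aligned_outputs.append(None)  # empty slot so the ordering is preserved

print(aligned_outputs)  # ['tok_a', None, 'tok_b', 'tok_c', 'tok_d']
```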

Contributor

Could you please resolve the above comment if it's not applicable?

vllm/spec_decode/batch_expansion.py (resolved review thread)
vllm/spec_decode/batch_expansion.py (resolved review thread)
vllm/config.py (resolved review thread)
vllm/spec_decode/spec_decode_worker.py (resolved review thread)
vllm/worker/model_runner.py (resolved review thread)
@arashsadrieh

arashsadrieh commented Oct 15, 2024

@NickLucche Thanks for the great work; we understand this is WIP, just a small note while you are working on this piece.

We tried this PR with tensor parallelism and found that it throws the following exception when tensor parallelism is activated:

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /8b/  --speculative_model /1b/  --served-model-name SpeculativeLLM --tensor-parallel-size 4  --max-model-len 34336  --max-num-seqs 128  --enable-prefix-caching  --disable-log-requests --use-v2-block-manager --seed 42 --num_speculative_tokens 5  --spec-decoding-acceptance-method typical_acceptance_sampler  --enable_chunked_prefill

Here is the exception:

Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: 'num_seq_groups', Traceback (most recent call last):
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/executor/multiproc_worker_utils.py", line 224, in _run_worker_process
     output = executor(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/opt/conda/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
     return func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/spec_decode/spec_decode_worker.py", line 459, in start_worker_execution_loop
     while self._run_non_driver_rank():
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/spec_decode/spec_decode_worker.py", line 649, in _run_non_driver_rank
     self.proposer_worker.execute_model()
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/worker/worker_base.py", line 308, in execute_model
     inputs = self.prepare_input(execute_model_req)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/worker/worker_base.py", line 298, in prepare_input
     return self._get_worker_input_from_broadcast()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/worker/worker_base.py", line 240, in _get_worker_input_from_broadcast
     worker_input = WorkerInput.from_broadcasted_tensor_dict(broadcast_data)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/ec2-user/tengfei_workspace/vllm/vllm/worker/worker_base.py", line 151, in from_broadcasted_tensor_dict
     num_seq_groups=tensor_dict.pop("num_seq_groups"),
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 KeyError: 'num_seq_groups'

The following command works normally

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /home/ec2-user/tengfei_workspace/output/8b-aio-20240923-3/merged/ --speculative_model /home/ec2-user/tengfei_workspace/output/1b-aio-20240923-3/merged/ --served-model-name SpeculativeLLM --tensor-parallel-size 1 --max-model-len 34336 --max-num-seqs 128 --enable-prefix-caching --disable-log-requests --use-v2-block-manager --seed 42 --num_speculative_tokens 5  --spec-decoding-acceptance-method typical_acceptance_sampler --enable_chunked_prefill --tensor-parallel-size 1

Thanks again; we appreciate your work and the vLLM community.

@NickLucche
Contributor Author

NickLucche commented Oct 15, 2024

Thanks for testing that, will look right into it!
Might actually be related to prefix_caching, which I haven't taken into account yet (I know there's been some recent work on that too).

@NickLucche
Contributor Author

Update on the mqa_scorer integration: enable_chunked_prefill with the changes in this PR appears to work fine with the flash_attn kernel used prior to the optimized one introduced in #9298 (i.e. flash_attn_with_kvcache instead of flash_attn_varlen_func). I will sync with @LiuXiaoxuanPKU on this.

@NickLucche NickLucche marked this pull request as ready for review October 17, 2024 15:45
vllm/config.py (resolved review thread)
if (decode_meta and prefill_meta
        and (pq := prefill_meta.query_start_loc)
        and (dq := decode_meta.query_start_loc)):
    combined_loc = torch.cat([pq, dq[1:]], axis=0)
Collaborator

Why 1 here?
Also curious: is attention_meta.query_start_loc == combined_loc?

Contributor Author

Yeah, you're right, these methods are useless; I will remove them, thanks!

Contributor

@sroy745 sroy745 left a comment

Thanks for the pr. Left a few comments. PTAL.

vllm/attention/backends/flash_attn.py (resolved review thread)
vllm/config.py (outdated diff)
if disable_logprobs is not None and enable_chunked_prefill:
    raise ValueError("Chunked prefill and "
                     "`disable-logprobs-during-spec-decoding` are "
                     "not yet compatible.")
Contributor

nit - not yet compatible -> not compatible. Same comment for L1285

vllm/config.py (outdated diff)
"Speculative decoding and chunked prefill are "
f"currently mutually exclusive ({enable_chunked_prefill=}).")

if disable_logprobs is not None and enable_chunked_prefill:
Contributor

Do you need to check whether disable_logprobs evaluates to true or false? My understanding is that if we just specify --disable-logprobs-during-spec-decoding, this variable will be set to True. In that case, do we want to check the value of disable_logprobs in addition to it not being None?
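To spell out the difference with a hypothetical helper (not the vLLM code): with a nullable boolean flag, an `is not None` check also trips on an explicit False, whereas a truthiness check only trips on True.

```python
def rejects_is_not_none(disable_logprobs, enable_chunked_prefill):
    # Rejects the combination even when the flag was explicitly set to False.
    return disable_logprobs is not None and enable_chunked_prefill


def rejects_truthy(disable_logprobs, enable_chunked_prefill):
    # Only rejects the combination when the flag is actually True.
    return bool(disable_logprobs) and enable_chunked_prefill


print(rejects_is_not_none(False, True))  # True
print(rejects_truthy(False, True))       # False
print(rejects_truthy(True, True))        # True
```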

Contributor

Why are these not compatible?

prompt_logprobs = [
create_logprobs_output(
token_id=p_token_id,
output_index = 0
Contributor

Aren't these changes needed only when disable_logprobs is True and enable_chunked_prefill is True? Are we currently allowing both to be set to True? If not, are these changes needed?


create_logprobs_output(
token_id=p_token_id,
output_index = 0
# Make sure the even prefill chunks are still aligned with their own
Contributor

nit - consider rewording to Make sure the even prefill chunks -> Make sure the non-terminal prefill chunks are still aligned with ...

token_id=p_token_id,
output_index = 0
# Make sure the even prefill chunks are still aligned with their own
# empty output. One single samplerout to avoid
Contributor

Could you please elaborate on the "One single samplerout to avoid" comment?

tests/spec_decode/test_spec_decode_worker.py (resolved review thread)
@@ -21,6 +21,11 @@ def score_proposals(
        all_proposal_lengths = proposals.proposal_lens.tolist()
        for i, seq_group_metadata in enumerate(
                execute_model_req.seq_group_metadata_list):
            if all_proposal_lengths[i] == 0:
Contributor

Please consider adding a test similar to https://sourcegraph.com/github.com/vllm-project/vllm/-/blob/tests/spec_decode/test_scorer.py?L49 with the request containing both prefills and decodes.

tests/spec_decode/test_spec_decode_worker.py (resolved review thread)