
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. #6485

Merged: 45 commits into vllm-project:main, Jul 21, 2024

Conversation

@sroy745 (Contributor) commented Jul 16, 2024

In this PR we disable serialization of the log probabilities to CPU for both the draft and target models. To that end, we make the following changes:

  1. Add top-level flags that let users configure whether they want token log probabilities during speculative decoding (a usage sketch follows this list).
  2. Introduce a lightweight TargetModelRunner that sets SamplingMetadata.skip_sampler_cpu_output as needed (sketched after this list).
  3. Change SpecDecodeWorker to skip serialization of log probability tensors when they are not needed.
  4. Clear the correct tensors in SamplerOutput; previously we cleared sampler_output.probs, which is not a valid tensor (see the SamplerOutput sketch after this list).
  5. Fix failing tests as needed.
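
A minimal sketch of change 2 follows. The stub classes below stand in for vLLM's real ModelRunner, ModelInput, and SamplingMetadata, which carry far more state; only the names TargetModelRunner and skip_sampler_cpu_output come from the PR itself.

```python
from dataclasses import dataclass, field


@dataclass
class SamplingMetadata:
    # When True, the sampler keeps its outputs on the GPU and skips
    # serializing logprob tensors into CPU-side Python objects.
    skip_sampler_cpu_output: bool = False


@dataclass
class ModelInput:
    sampling_metadata: SamplingMetadata = field(
        default_factory=SamplingMetadata)


class ModelRunner:
    def prepare_model_input(self, seq_group_metadata_list) -> ModelInput:
        return ModelInput()


class TargetModelRunner(ModelRunner):
    """Model runner for the target model in speculative decoding."""

    def __init__(self, disable_logprobs: bool = True) -> None:
        super().__init__()
        self.disable_logprobs = disable_logprobs

    def prepare_model_input(self, seq_group_metadata_list) -> ModelInput:
        model_input = super().prepare_model_input(seq_group_metadata_list)
        # If the user did not request logprobs, tell the sampler to skip
        # the CPU serialization of its output for this (target) model.
        model_input.sampling_metadata.skip_sampler_cpu_output = (
            self.disable_logprobs)
        return model_input
```

On the user side (change 1), the new top-level switch might be used as below; the keyword name and model names are assumptions for illustration, not confirmed by the text here.

```python
from vllm import LLM

# Assumed flag name illustrating the PR's top-level switch; the model
# names are placeholders.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    speculative_model="JackFram/llama-68m",
    num_speculative_tokens=4,
    disable_logprobs_during_spec_decoding=True,
)
```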

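And a sketch of change 4. The SamplerOutput stub and the choice of which fields to clear are assumptions; the PR only states that the previously cleared sampler_output.probs is not a valid tensor.

```python
from dataclasses import dataclass
from typing import Optional

import torch


# Stub mirroring the relevant part of vLLM's SamplerOutput; the field
# names below are assumed, not quoted from the PR.
@dataclass
class SamplerOutput:
    sampled_token_ids: Optional[torch.Tensor] = None
    sampled_token_probs: Optional[torch.Tensor] = None
    logprobs: Optional[torch.Tensor] = None


def clear_sampler_tensors(output: SamplerOutput) -> None:
    # The old code cleared `output.probs`, which is not a field of
    # SamplerOutput; clear tensors that actually exist so their GPU
    # memory can be released when logprobs are disabled.
    output.sampled_token_probs = None
    output.logprobs = None
```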

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs will not trigger a full CI run by default. Instead, they only trigger the fastcheck CI, which runs a small and essential subset of tests to quickly catch errors, with the flexibility to run extra individual tests on top (you can do this by unblocking test steps in the Buildkite run).

A full CI run is still required to merge this PR, so once the PR is ready to go, please make sure to run it. If you need all test signals between PR commits, you can trigger a full CI run as well.

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge

🚀

@sroy745 sroy745 marked this pull request as draft July 16, 2024 20:45
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jul 19, 2024
@sroy745 changed the title from "[WIP] [Spec Decode] Disable Log Probs computation for spec decoding for both draft and target models." to "[Spec Decode] Disable Log Probs for spec decoding for both draft and target models." Jul 19, 2024
@sroy745 changed the title from "[Spec Decode] Disable Log Probs for spec decoding for both draft and target models." to "[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models." Jul 19, 2024
@sroy745 sroy745 marked this pull request as ready for review July 19, 2024 19:49
@sroy745 (Contributor Author) commented Jul 19, 2024

@cadedaniel this PR is now ready for review. PTAL.

@cadedaniel (Collaborator) left a comment:


Great stuff, thanks! All feedback is on code cleanliness/comments.

Review threads (outdated, resolved): vllm/spec_decode/spec_decode_worker.py (x2), vllm/spec_decode/target_model_runner.py


class TargetModelRunner(ModelRunner):
"""Specialized model runner for speculative decoding target model.
@cadedaniel (Collaborator) commented:

nit: add comment explaining why we do this

In speculative decoding, the logprobs selected may not be the same ones as
selected by the target model sampling. This means that the time spent in the
logprob calculation of the target model is time wasted, since we calculate
logprobs after deciding which tokens are accepted. For this reason we disable
logprobs in the target model so scoring is faster.

@sroy745 (Contributor Author) replied:

Done.
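
For reference, the suggested explanation could land as the class docstring roughly like this (wording from the review comment above; the exact merged docstring may differ, and ModelRunner is as in the earlier sketch):

```python
class TargetModelRunner(ModelRunner):
    """Specialized model runner for the speculative decoding target model.

    In speculative decoding, the logprobs selected may not be the same ones
    as selected by the target model sampling. This means that the time spent
    in the logprob calculation of the target model is wasted, since we
    calculate logprobs after deciding which tokens are accepted. For this
    reason we disable logprobs in the target model so scoring is faster.
    """
```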

Review threads (outdated, resolved): vllm/spec_decode/spec_decode_worker.py, vllm/config.py
@sroy745 (Contributor Author) commented Jul 21, 2024

Thanks for the review. Addressed all comments. PTAL.

@cadedaniel (Collaborator) commented:

Enabling auto merge

@cadedaniel cadedaniel merged commit 14f91fe into vllm-project:main Jul 21, 2024
72 checks passed
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
gnpinkert pushed a commit to gnpinkert/vllm that referenced this pull request Jul 26, 2024
cduk pushed a commit to cduk/vllm-pascal that referenced this pull request Aug 6, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)
Projects: none yet
2 participants