
[1/n] Triton sampling kernel #3186

Merged: 20 commits into vllm-project:main on Mar 20, 2024

Conversation

@Yard1 (Collaborator) commented Mar 4, 2024

This PR is the first in a series.

It adds a custom Triton sampling kernel, giving us the following benefits (a simplified sketch follows the list below):

  • sampling from both greedy and random sequences in the same kernel
  • batched deterministic sampling with per-sequence seeds
  • potentially fusing other operations like logprob gather
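For intuition, a minimal sketch of how greedy and seeded random sampling can share one kernel follows; this is not the kernel added by this PR, and the names (`_sample_token_kernel`, `sample_tokens`) plus the single-block-per-vocabulary assumption are illustrative only. Each program handles one sequence, turns its per-sequence seed into deterministic Exp(1) noise, and reduces to a token with an argmax (Gumbel-max trick); greedy rows skip the noise.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _sample_token_kernel(probs_ptr, out_ptr, seeds_ptr, is_greedy_ptr,
                         vocab_size, row_stride, BLOCK: tl.constexpr):
    row = tl.program_id(0)                    # one program per sequence
    offs = tl.arange(0, BLOCK)
    mask = offs < vocab_size
    probs = tl.load(probs_ptr + row * row_stride + offs, mask=mask, other=0.0)

    seed = tl.load(seeds_ptr + row)           # per-sequence seed
    greedy = tl.load(is_greedy_ptr + row)

    # Deterministic uniform noise given (seed, position), mapped to Exp(1).
    u = tl.rand(seed, offs)
    exp_noise = -tl.log(tl.maximum(u, 1e-10))
    # argmax of probs / Exp(1) samples from the categorical distribution;
    # greedy rows take a plain argmax over probs instead.
    scores = tl.where(greedy != 0, probs, probs / exp_noise)
    scores = tl.where(mask, scores, float("-inf"))
    token = tl.argmax(scores, axis=0)
    tl.store(out_ptr + row, token.to(tl.int32))


def sample_tokens(probs: torch.Tensor, seeds: torch.Tensor,
                  is_greedy: torch.Tensor) -> torch.Tensor:
    """Sample one token per row of `probs` (sketch; assumes the vocab fits one block)."""
    n, vocab = probs.shape
    out = torch.empty(n, dtype=torch.int32, device=probs.device)
    BLOCK = triton.next_power_of_2(vocab)
    _sample_token_kernel[(n,)](probs, out,
                               seeds.to(torch.int32), is_greedy.to(torch.int32),
                               vocab, probs.stride(0), BLOCK=BLOCK)
    return out
```

Because the noise depends only on (seed, position), rerunning a request with the same per-sequence seed reproduces the same tokens, which is what makes batched deterministic sampling possible; a logprob gather could in principle be fused by also storing the probability at the chosen index.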

Currently, the code path using the Triton kernel is disabled due to the following issues:

  • Triton JIT has a large kernel-launch overhead, which is noticeable for small models. A potential solution is to compile the kernels ahead of time; we have a pipeline for that internally.
  • We need to call the kernel multiple times for models with a very large vocabulary (e.g., Gemma). This should be solvable in the kernel itself; a host-side illustration of the chunked approach follows this list.
  • The sampling code in general is unoptimized and adds overhead on top of the kernel. It is non-trivial to simplify because of the beam-search code; the next PR will separate the beam-search sampling code from the rest of the sampling.
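To make the large-vocabulary point concrete, here is a hedged host-side illustration in plain PyTorch (not this PR's code; `chunked_sample`, the chunk size, and the single shared seed are assumptions) of the reduction that multiple kernel calls amount to: each pass covers one vocabulary slice and keeps a running argmax of probs / Exp(1), which is equivalent to sampling from the full distribution.

```python
import torch


def chunked_sample(probs: torch.Tensor, seed: int, chunk: int = 32_768) -> torch.Tensor:
    """Sample one token per row by scanning the vocabulary in chunks (illustrative only)."""
    gen = torch.Generator(device=probs.device).manual_seed(seed)
    n = probs.shape[0]
    best_score = torch.full((n,), float("-inf"), dtype=probs.dtype, device=probs.device)
    best_token = torch.zeros(n, dtype=torch.long, device=probs.device)
    for start in range(0, probs.shape[1], chunk):
        p = probs[:, start:start + chunk]
        # Exp(1) noise; the running argmax of p / noise across all chunks draws
        # a token from the full categorical distribution (Gumbel-max trick).
        noise = torch.empty_like(p).exponential_(generator=gen)
        score, idx = (p / noise).max(dim=1)
        better = score > best_score
        best_score = torch.where(better, score, best_score)
        best_token = torch.where(better, idx + start, best_token)
    return best_token
```

The same running-max structure can also live inside the kernel, which is presumably what "solve in the kernel itself" refers to above.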

@Yard1 Yard1 changed the title [WIP] Triton sampling kernel [1/n] Triton sampling kernel Mar 5, 2024
@Yard1 Yard1 marked this pull request as ready for review March 5, 2024 22:44
@Yard1 Yard1 requested a review from simon-mo March 5, 2024 22:56
@ywang96 (Member) left a comment

Left a few comments & questions and hope you don't mind them!

Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
@njhill (Member) left a comment

Thanks @Yard1!


_SAMPLING_EPS = 1e-5
_SEED_0_REPLACEMENT = 3403598558


class SamplingMetadata:
Member commented:

Could we skip any of the new operations in this class when no seeds are in use? (I expect that would be a very common case.)

@Yard1 (Collaborator, Author) replied:

Good question. I think there are three considerations:

  • Skipping seeds could bring slightly better performance.
  • Skipping seeds introduces more special cases (undesirable).
  • Not skipping seeds allows for request-level reproducibility on the server side, which could be useful for debugging model behavior.

Aside from that, Triton random operations require a seed of some sort, so generating one would be necessary regardless.

"""Get `seeds_to_generate` child seeds from `seed` and extra entropy."""
if not is_greedy:
if seed is None:
randint_fn = random.randint
Member commented:

If there's effectively no overhead of seeded vs. non-seeded random sampling, a nice feature would be to treat the random.randint value here as equivalent to a passed-in seed, and then always return this seed in the API response.

This would let users use the returned seed to reproduce the same output, for example if it happened to be one they particularly liked, without having to provide a seed explicitly up front.
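A minimal sketch of that suggestion (not code from this PR; the function name and seed range are assumptions): if the request carries no seed, draw one server-side, sample with it, and echo it back so the client can replay the request.

```python
import random


def resolve_seed(request_seed: int | None) -> int:
    """Use the client's seed if provided; otherwise draw one so it can be returned."""
    # Upper bound is an assumption; any range the sampler accepts would do.
    return request_seed if request_seed is not None else random.randint(0, 2**31 - 1)


seed = resolve_seed(None)                  # no seed supplied, so one is generated
# ... run sampling with `seed` ...
response = {"text": "<generated text>", "seed": seed}   # resend `seed` to reproduce
```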

@Yard1 (Collaborator, Author) replied:

Yes, I agree! That's one of the advantages of always generating the seed. I think it would be good to include it in a follow-up (ideally once we are using just the kernel, so the logic is consistent).

Member replied:

Yeah, it would be costly to do this in the non-kernel case.

@njhill (Member) commented Mar 19, 2024

  • sampling from both greedy and random sequences in the same kernel
  • batched deterministic sampling with per-sequence seeds
  • potentially fusing other operations like logprob gather

These are only benefits if they translate to non-negligible end-to-end performance improvements, right? I'm curious what the speedup looks like as a proportion of total TPOT (time per output token). I guess it depends on the mix of parameters, in particular whether there are many seeded requests (presumably uncommon) and/or a mix of greedy, random, and seeded random in the same batch (presumably more common).

I guess this question might be more important here given the nontrivial amount of new code introduced for this specific optimization.

  • The sampling code in general is unoptimized and adds overhead on top of the kernel. It is non-trivial to simplify because of the beam-search code; the next PR will separate the beam-search sampling code from the rest of the sampling.

Would these optimizations be applicable whether or not the dedicated kernel is used?

@simon-mo (Collaborator) left a comment

Stamp. Please address @njhill's comment before merge.

@Yard1 (Collaborator, Author) commented Mar 19, 2024

@njhill We are seeing a ~10% reduction in sampler time in our fork, but achieving that will require more work (the next two PRs needed are Triton AOT compilation for these kernels and a refactor of the sampler code to avoid unnecessary operations). This PR only adds the kernel, to streamline the review process. Furthermore, once we can fully move to the kernel, we'll be able to remove the existing torch-based sampling code (not including the logit-processing code).

Would these optimizations be applicable whether or not the dedicated kernel is used?

I think they would make the sampler code easier to work with, though they would be tailored for the kernel. In general, the introduction of this kernel will allow us to push code complexity away from the sampler and into the kernel.

@njhill (Member) commented Mar 19, 2024

We are seeing ~10% reduction in sampler time in our fork,

@Yard1 do you have a rough sense of what percentage of TPOT the sampler time accounts for? (I know the proportion would vary based on model size.) E.g., if that is <10%, then I guess this would translate to <1% end to end?

@Yard1 (Collaborator, Author) commented Mar 19, 2024

@njhill You are correct that it's not that noticeable in normal usage, but we are seeing large gains with draft-model speculative decoding, where the draft model is CPU-bound. It can reduce ITL (inter-token latency) by several milliseconds in that case.

@Yard1 Yard1 merged commit 426ec4e into vllm-project:main Mar 20, 2024
30 checks passed
tjohnson31415 added a commit to tjohnson31415/vllm that referenced this pull request Mar 21, 2024
* upstream/main:
  [Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config (vllm-project#3551)
  [Misc][Log] Add log for tokenizer length not equal to vocabulary size (vllm-project#3500)
  [🚀 Ready to be merged] Added support for Jais models (vllm-project#3183)
  Fix 1D query issue from `_prune_hidden_states` (vllm-project#3539)
  [PREFIX CACHING FOLLOW UP] OrderedDict-based evictor (vllm-project#3431)
  [BugFix] Hot fix in setup.py for neuron build (vllm-project#3537)
  Migrate `logits` computation and gather to `model_runner` (vllm-project#3233)
  [1/n][Chunked Prefill] Refactor input query shapes (vllm-project#3236)
  [1/n] Triton sampling kernel (vllm-project#3186)
  [Bugfix] Fix ROCm support in CMakeLists.txt (vllm-project#3534)
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>