
Improve the mixed chunk prefill by launching two kernels #2811

Draft · wants to merge 16 commits into base: main

Conversation

@libratiger (Contributor) commented Jan 9, 2025:

Motivation

Improve the mixed chunk prefill performance; see #2273.

Launch two kernels per mixed batch: one prefill attention kernel for the prefill requests and one decode attention kernel for the decode requests.
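In rough pseudocode, the split looks like the sketch below (the helper names and batch fields are illustrative placeholders, not the actual code in this PR):

```python
import torch

def mixed_chunk_attention(q, kv_cache, batch, prefill_attn, decode_attn):
    # Illustrative only: `batch.num_prefill_tokens` marks the boundary between
    # the flattened prefill tokens and the single-token decode requests that
    # were packed into the same mixed chunk.
    split = batch.num_prefill_tokens
    q_prefill, q_decode = q[:split], q[split:]

    # Kernel 1: varlen prefill attention over the extend/prefill tokens.
    out_prefill = prefill_attn(q_prefill, kv_cache, batch.prefill_metadata)
    # Kernel 2: paged decode attention over the decode tokens.
    out_decode = decode_attn(q_decode, kv_cache, batch.decode_metadata)

    return torch.cat([out_prefill, out_decode], dim=0)
```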

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@libratiger changed the title from "Improve the mixed chunk prefill" to "WIP: Improve the mixed chunk prefill" on Jan 9, 2025
@libratiger changed the title from "WIP: Improve the mixed chunk prefill" to "Improve the mixed chunk prefill by lanuch two lernels" on Jan 10, 2025
@libratiger changed the title from "Improve the mixed chunk prefill by lanuch two lernels" to "Improve the mixed chunk prefill by lanuch two kernels" on Jan 10, 2025
@libratiger (Contributor Author) commented:

related: #2273

@libratiger (Contributor Author) commented:

cc @merrymercy for review.
I will share more performance details later.

@libratiger requested a review from @ByronHsu as a code owner on January 13, 2025
@merrymercy (Contributor) commented:

@libratiger Could you share some perf numbers?
@hnyls2002 Could you review this?

@libratiger (Contributor Author) commented:

> @libratiger Could you share some perf numbers? @hnyls2002 Could you review this?

Yes, I will add this as soon as possible.
But I want to ensure the implementation is correct and reliable first. Any feedback from the review would be greatly appreciated.

I added a new property, propagated from ScheduleBatch to ForwardBatch, to distinguish prefill requests from decode requests.
With that, ForwardBatch now carries enough information to do the split. If you have any optimization suggestions, I will improve it. @merrymercy @hnyls2002
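For illustration, the propagated information might look roughly like this (a sketch with made-up field names; the real ScheduleBatch/ForwardBatch attributes differ):

```python
from dataclasses import dataclass
from enum import Enum, auto

class BatchKind(Enum):
    PREFILL_ONLY = auto()
    DECODE_ONLY = auto()
    MIXED = auto()

@dataclass
class ForwardBatchSketch:
    # Illustrative stand-in for the real ForwardBatch: the scheduler records
    # how many flattened tokens belong to prefill requests so the attention
    # backend can split the mixed batch before launching kernels.
    kind: BatchKind
    num_prefill_tokens: int
    num_decode_tokens: int

def make_forward_batch(schedule_batch):
    # Hypothetical propagation from ScheduleBatch -> ForwardBatch.
    num_prefill = sum(len(r.input_ids) for r in schedule_batch.prefill_reqs)
    num_decode = len(schedule_batch.decode_reqs)
    if num_prefill and num_decode:
        kind = BatchKind.MIXED
    elif num_prefill:
        kind = BatchKind.PREFILL_ONLY
    else:
        kind = BatchKind.DECODE_ONLY
    return ForwardBatchSketch(kind, num_prefill, num_decode)
```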

@xiezhq-hermann (Collaborator) commented:

Thank you for your contribution. I'm also curious about the performance. From my understanding, the advantage of mixing prefill and decode lies in their complementary resource usage: prefill is compute-bound, while decoding can utilize some remaining memory bandwidth. In my opinion, a specialized kernel for mixed computation might be an ideal solution. Alternatively, have you tried launching the two kernels concurrently?

@libratiger (Contributor Author) commented:

> Thank you for your contribution. I'm also curious about the performance. From my understanding, the advantage of mixing prefill and decode lies in their complementary resource usage: prefill is compute-bound, while decoding can utilize some remaining memory bandwidth. In my opinion, a specialized kernel for mixed computation might be an ideal solution. Alternatively, have you tried launching the two kernels concurrently?

Your comment shows deep insight!
Theoretically, using a single kernel would have lower overhead. This could be implemented similarly to flash_attn_varlen_func in FlashAttention; however, FlashInfer doesn't currently seem to expose such an interface.
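For reference, this is roughly what the varlen interface in the flash-attn package looks like; the example is simplified in that K/V are fresh tensors with the same lengths as Q, whereas in the mixed prefill/decode case the K/V side would be read from the KV cache:

```python
import torch
from flash_attn import flash_attn_varlen_func

# Two prefill requests (lengths 5 and 3) and two decode requests (length 1 each)
# packed into a single varlen kernel call; cu_seqlens marks the boundaries.
seqlens = torch.tensor([5, 3, 1, 1], dtype=torch.int32, device="cuda")
cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))

total_tokens = int(seqlens.sum())
nheads, head_dim = 8, 128
q = torch.randn(total_tokens, nheads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(total_tokens, nheads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(total_tokens, nheads, head_dim, device="cuda", dtype=torch.float16)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=int(seqlens.max()), max_seqlen_k=int(seqlens.max()),
    causal=True,
)
```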

The current implementation uses two kernels, primarily due to two considerations:

  1. To maintain consistency with the previous PR, making it easier to review and capture partial performance benefits
  2. Other attention backends can be optimized similarly by reusing this logic, without needing to dive into low-level kernel modifications

Regarding the performance benefits of launching two kernels, I don't have sufficient performance data yet.

However, after completing my implementation, I looked into other frameworks' implementations to address this concern and found they also use two kernels, suggesting that this approach should still yield some performance benefits.

We can further optimize it to use a single kernel once we have more performance data, which theoretically would reduce some upper-layer overhead.

@xiezhq-hermann (Collaborator) commented Jan 14, 2025:

Thanks for the reply. Keep me posted with the performance numbers, and we can merge the two-kernel version first if the benefit justifies it. Consider benchmarking the concurrent-launch version as well, which might also be interesting.
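For what it's worth, the concurrent-launch variant discussed here would amount to issuing the two attention calls on separate CUDA streams; a minimal PyTorch sketch (the kernel wrappers are placeholders, not the PR's actual code):

```python
import torch

prefill_stream = torch.cuda.Stream()
decode_stream = torch.cuda.Stream()

def concurrent_mixed_forward(run_prefill_attn, run_decode_attn):
    # Hypothetical wrappers around the two attention kernels; launching them on
    # separate streams lets the memory-bound decode kernel overlap with the
    # compute-bound prefill kernel when the GPU has spare resources.
    current = torch.cuda.current_stream()
    prefill_stream.wait_stream(current)
    decode_stream.wait_stream(current)

    with torch.cuda.stream(prefill_stream):
        out_prefill = run_prefill_attn()
    with torch.cuda.stream(decode_stream):
        out_decode = run_decode_attn()

    # Rejoin before anything downstream consumes the outputs.
    current.wait_stream(prefill_stream)
    current.wait_stream(decode_stream)
    return out_prefill, out_decode
```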

@libratiger marked this pull request as draft on January 17, 2025