
Improve the mixed chunk prefill by launching two kernels #2811

Draft · wants to merge 16 commits into base: main

Conversation

@libratiger (Contributor) commented Jan 9, 2025:

Motivation

Improve the mixed chunk prefill performance; see #2273.

Launch two kernels per mixed batch: one prefill attention kernel for the prefill requests and one decode attention kernel for the decode requests.
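In rough pseudocode, the split looks like the sketch below (the helper names and batch fields are illustrative placeholders, not the actual code in this PR):

```python
import torch

def mixed_chunk_attention(q, kv_cache, batch, prefill_attn, decode_attn):
    # Illustrative only: `batch.num_prefill_tokens` marks the boundary between
    # the flattened prefill tokens and the single-token decode requests that
    # were packed into the same mixed chunk.
    split = batch.num_prefill_tokens
    q_prefill, q_decode = q[:split], q[split:]

    # Kernel 1: varlen prefill attention over the extend/prefill tokens.
    out_prefill = prefill_attn(q_prefill, kv_cache, batch.prefill_metadata)
    # Kernel 2: paged decode attention over the decode tokens.
    out_decode = decode_attn(q_decode, kv_cache, batch.decode_metadata)

    return torch.cat([out_prefill, out_decode], dim=0)
```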

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@libratiger changed the title from "Improve the mixed chunk prefill" to "WIP: Improve the mixed chunk prefill" on Jan 9, 2025
@libratiger changed the title from "WIP: Improve the mixed chunk prefill" to "Improve the mixed chunk prefill by lanuch two lernels" on Jan 10, 2025
@libratiger changed the title from "Improve the mixed chunk prefill by lanuch two lernels" to "Improve the mixed chunk prefill by lanuch two kernels" on Jan 10, 2025
@libratiger (Contributor Author) commented:

related: #2273

@libratiger (Contributor Author) commented:

cc @merrymercy for review.
I will share more performance details later.

@libratiger requested a review from @ByronHsu as a code owner on January 13, 2025
@merrymercy (Contributor) commented:

@libratiger Could you share some perf numbers?
@hnyls2002 Could you review this?

@libratiger (Contributor Author) commented:

> @libratiger Could you share some perf numbers? @hnyls2002 Could you review this?

Yes, I will add this as soon as possible.
But I want to ensure the implementation is correct and reliable first. Any feedback from the review would be greatly appreciated.

I added a new property, propagated from ScheduleBatch to ForwardBatch, to distinguish prefill requests from decode requests.
With that, ForwardBatch now carries enough information to do the split. If you have any optimization suggestions, I will improve it. @merrymercy @hnyls2002
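For illustration, the propagated information might look roughly like this (a sketch with made-up field names; the real ScheduleBatch/ForwardBatch attributes differ):

```python
from dataclasses import dataclass
from enum import Enum, auto

class BatchKind(Enum):
    PREFILL_ONLY = auto()
    DECODE_ONLY = auto()
    MIXED = auto()

@dataclass
class ForwardBatchSketch:
    # Illustrative stand-in for the real ForwardBatch: the scheduler records
    # how many flattened tokens belong to prefill requests so the attention
    # backend can split the mixed batch before launching kernels.
    kind: BatchKind
    num_prefill_tokens: int
    num_decode_tokens: int

def make_forward_batch(schedule_batch):
    # Hypothetical propagation from ScheduleBatch -> ForwardBatch.
    num_prefill = sum(len(r.input_ids) for r in schedule_batch.prefill_reqs)
    num_decode = len(schedule_batch.decode_reqs)
    if num_prefill and num_decode:
        kind = BatchKind.MIXED
    elif num_prefill:
        kind = BatchKind.PREFILL_ONLY
    else:
        kind = BatchKind.DECODE_ONLY
    return ForwardBatchSketch(kind, num_prefill, num_decode)
```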

@xiezhq-hermann (Collaborator) commented:

Thank you for your contribution. I'm also curious about the performance. From my understanding, the advantage of mixing prefill and decode lies in their complementary resource usage: prefill is compute-bound, while decoding can utilize some remaining memory bandwidth. In my opinion, a specialized kernel for mixed computation might be an ideal solution. Alternatively, have you tried launching the two kernels concurrently?

@libratiger (Contributor Author) commented:

> Thank you for your contribution. I'm also curious about the performance. From my understanding, the advantage of mixing prefill and decode lies in their complementary resource usage: prefill is compute-bound, while decoding can utilize some remaining memory bandwidth. In my opinion, a specialized kernel for mixed computation might be an ideal solution. Alternatively, have you tried launching the two kernels concurrently?

Your comment shows deep insight!
Theoretically, using a single kernel would have lower overhead. This could be implemented similarly to flash_attn_varlen_func in FlashAttention; however, FlashInfer doesn't currently seem to expose such an interface.
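For reference, this is roughly what the varlen interface in the flash-attn package looks like; the example is simplified in that K/V are fresh tensors with the same lengths as Q, whereas in the mixed prefill/decode case the K/V side would be read from the KV cache:

```python
import torch
from flash_attn import flash_attn_varlen_func

# Two prefill requests (lengths 5 and 3) and two decode requests (length 1 each)
# packed into a single varlen kernel call; cu_seqlens marks the boundaries.
seqlens = torch.tensor([5, 3, 1, 1], dtype=torch.int32, device="cuda")
cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))

total_tokens = int(seqlens.sum())
nheads, head_dim = 8, 128
q = torch.randn(total_tokens, nheads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(total_tokens, nheads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(total_tokens, nheads, head_dim, device="cuda", dtype=torch.float16)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=int(seqlens.max()), max_seqlen_k=int(seqlens.max()),
    causal=True,
)
```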

The current implementation uses two kernels, primarily due to two considerations:

  1. To maintain consistency with the previous PR, making it easier to review and capture partial performance benefits
  2. Other attention backends can be optimized similarly by reusing this logic, without needing to dive into low-level kernel modifications

Regarding the performance benefits of launching two kernels, I don't have sufficient performance data yet.

However, after completing my implementation, I looked into other frameworks' implementations to address this concern and found they also use two kernels, suggesting that this approach should still yield some performance benefits.

We can further optimize it to use a single kernel once we have more performance data, which theoretically would reduce some upper-layer overhead.

@xiezhq-hermann (Collaborator) commented Jan 14, 2025:

Thanks for the reply. Keep me posted with the performance numbers, and we can merge the two-kernel version first if the benefit justifies it. Consider benchmarking the concurrent-launch version as well, which might also be interesting.
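For what it's worth, the concurrent-launch variant discussed here would amount to issuing the two attention calls on separate CUDA streams; a minimal PyTorch sketch (the kernel wrappers are placeholders, not the PR's actual code):

```python
import torch

prefill_stream = torch.cuda.Stream()
decode_stream = torch.cuda.Stream()

def concurrent_mixed_forward(run_prefill_attn, run_decode_attn):
    # Hypothetical wrappers around the two attention kernels; launching them on
    # separate streams lets the memory-bound decode kernel overlap with the
    # compute-bound prefill kernel when the GPU has spare resources.
    current = torch.cuda.current_stream()
    prefill_stream.wait_stream(current)
    decode_stream.wait_stream(current)

    with torch.cuda.stream(prefill_stream):
        out_prefill = run_prefill_attn()
    with torch.cuda.stream(decode_stream):
        out_decode = run_decode_attn()

    # Rejoin before anything downstream consumes the outputs.
    current.wait_stream(prefill_stream)
    current.wait_stream(decode_stream)
    return out_prefill, out_decode
```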

@libratiger marked this pull request as draft on January 17, 2025