Improve the mixed chunk prefill by launching two kernels #2811
base: main
launch two kernels for one batch,
e9c826d
to
4bdd901
Related: #2273
cc @merrymercy for review.
@libratiger Could you share some perf numbers?
Yes, I will add this as soon as possible. I added a new property from
Thank you for your contribution. I'm also curious about the performance. From my understanding, the advantage of mixing prefill and decode lies in their complementary resource usage: prefill is compute-bound, while decoding can utilize some remaining memory bandwidth. In my opinion, a specialized kernel for mixed computation might be an ideal solution. Alternatively, have you tried launching the two kernels concurrently?
Your comment shows deep insight! The current implementation uses two kernels, primarily due to two considerations:
Regarding the performance benefits of launching two kernels, I don't have sufficient performance data yet. However, after completing my implementation, I looked into other frameworks' implementations to address this concern and found they also use two kernels, suggesting that this approach should still yield some performance benefits. We can further optimize it to use a single kernel once we have more performance data, which theoretically would reduce some upper-layer overhead.
Thanks for the reply. Keep me posted on the performance numbers; we can merge the two-kernel version first if the benefit justifies it. Consider benchmarking a concurrent-launch version as well, which might also be interesting.
Motivation
Improve mixed chunk prefill performance. See #2273.
Launch two kernels for a mixed batch: one prefill attention kernel for the prefill requests and one decode attention kernel for the decode requests.
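To illustrate the idea, here is a minimal pure-Python sketch of dispatching a mixed batch to two separate attention routines. It is not the code in this PR: `mixed_forward`, `prefill_attention`, `decode_attention`, and the request dictionary layout are hypothetical names chosen for the example, and plain Python functions stand in for the actual GPU kernels.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def prefill_attention(qs, keys, values):
    """Stand-in for the prefill kernel (hypothetical): causal attention.

    The last len(qs) entries of keys/values belong to the new tokens;
    anything before them is an already cached prefix.
    """
    prefix = len(keys) - len(qs)
    return [
        attend(q, keys[: prefix + i + 1], values[: prefix + i + 1])
        for i, q in enumerate(qs)
    ]


def decode_attention(q, keys, values):
    """Stand-in for the decode kernel (hypothetical): one query, full cache."""
    return attend(q, keys, values)


def mixed_forward(requests):
    """Split a mixed batch and run one routine per request kind.

    Each request is a dict with 'kind' ('prefill' or 'decode'), 'q', 'k', 'v'.
    Outputs are returned in the original request order.
    """
    out = [None] * len(requests)
    prefill_ids = [i for i, r in enumerate(requests) if r["kind"] == "prefill"]
    decode_ids = [i for i, r in enumerate(requests) if r["kind"] == "decode"]

    # First "launch": all prefill requests.
    for i in prefill_ids:
        r = requests[i]
        out[i] = prefill_attention(r["q"], r["k"], r["v"])

    # Second "launch": all decode requests (exactly one query token each).
    for i in decode_ids:
        r = requests[i]
        out[i] = [decode_attention(r["q"][0], r["k"], r["v"])]

    return out
```

In this sketch the two groups run one after the other. A concurrent variant, as suggested in the review comments, would issue the two launches on separate streams; that is not shown here.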
Modifications
Checklist