Support CUDA Graph in the Triton attention backend #1401
Conversation
This brings a significant improvement, especially in small-batch latency; accuracy is on par with before. ref #1285 (comment)
The impact on max throughput is not significant: with CUDA Graph enabled, TP 1 needs --mem-frac lowered to 0.85, otherwise it OOMs.
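For reference, a hypothetical launch line. The model path is a placeholder and the flag name assumes the `--mem-fraction-static` option that SGLang's server exposes; treat both as assumptions, not part of this PR:

```bash
# Hypothetical launch; model path is a placeholder, flag names may differ by version.
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --mem-fraction-static 0.85  # leave headroom for CUDA Graph memory pools
```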
Hi @zhyncs @merrymercy, does this support sm_89 (L40)? I see that CUDA Graph relies on vLLM's fused_moe, which, as far as I can tell, does not support sm_89.
@fengyang95 It should support L40, but I haven't tested it. I think CUDA Graph does not depend on specific ops; it just captures the existing ones.
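To illustrate what "capturing existing ops" means, here is a minimal sketch using PyTorch's public CUDA Graph API. The toy model and shapes are placeholders, not SGLang code; the point is that whichever kernels the ops launch during capture, Triton or otherwise, get recorded and replayed as-is:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_input = torch.randn(8, 1024, device="cuda")

# Warm up on a side stream so lazy initialization (cuBLAS handles,
# autotuning) happens outside of capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: the graph records whatever kernels the ops launch, so it is
# agnostic to which backend or architecture produced those kernels.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the captured input buffer, then relaunch the
# entire recorded kernel sequence with a single call.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()
print(static_output.sum().item())
```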
Benchmark results:
- Llama 3 8B: 1.3x faster
- DeepSeek-Coder-V2-Lite: 4x faster