Support double sparsity #1459
Great work. Some tips for rebasing:
Quick question @andy-yang-1: does this PR support only Double Sparsity, or DS-Offload as well?
@vnkc1 Hi, this PR doesn't support DS-Offload for now. DS-Offload may be integrated in a separate PR if needed.
Force-pushed from 9798dc2 to 57c998b.
Is there a plan to merge this PR?
Yes. It should be merged within one week.
Force-pushed from 6b07a3d to 5f71afa.
Please fix the lint error and add an end-to-end accuracy test.
Give two example commands and paste their results in the description of this PR. This is for tracking the progress. It should be something like this:
@andy-yang-1 Can you also paste the latency results?
@andy-yang-1 Thanks for the contribution. It is merged.
How does one generate the ds-channel-config to be able to use this?
I noticed that CUDA graph is not currently supported. Are there any plans to support it? @andy-yang-1
@max99x You can use this link to generate the channel config file. @fengyang95 We may support it in the next PR.
Hi @andy-yang-1, does this support the deepseek-v2 architecture? How can I obtain the config for this structure? I see that the example here https://github.com/andy-yang-1/DoubleSparse/blob/main/evaluation/group_channel_config.py only supports the llama/mixtral architectures.
@andy-yang-1 I tried running the deepseek-v2 model, but encountered the following issue:
File "/opt/tiger/custome_sglang/python/sglang/srt/layers/attention/__init__.py", line 49, in forward
return self.forward_extend(q, k, v, layer, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/custome_sglang/python/sglang/srt/layers/attention/double_sparsity_backend.py", line 162, in forward_extend
k_label = torch.gather(
^^^^^^^^^^^^^
RuntimeError: Size does not match at dimension 1 expected index [7, 128, 16] to be smaller than self [7, 1, 576] apart from dimension 2
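For readers puzzling over the message above: torch.gather requires that `index` is no larger than the input in every dimension except the one being gathered along. The plain-Python sketch below only restates that shape rule for the shapes quoted in the traceback. It is not sglang code; the helper name `first_gather_mismatch` is hypothetical, `dim=2` is inferred from "apart from dimension 2", and `dim` is assumed to be non-negative.

```python
from typing import Optional, Sequence


def first_gather_mismatch(
    input_shape: Sequence[int], index_shape: Sequence[int], dim: int
) -> Optional[int]:
    """Return the first dimension that breaks the gather shape rule, or None.

    Rule being checked: index.size(d) <= input.size(d) for every d != dim.
    """
    if len(input_shape) != len(index_shape):
        raise ValueError("input and index must have the same number of dimensions")
    for d, (in_size, idx_size) in enumerate(zip(input_shape, index_shape)):
        if d != dim and idx_size > in_size:
            return d
    return None


# Shapes copied from the traceback: self is [7, 1, 576], index is [7, 128, 16].
print(first_gather_mismatch((7, 1, 576), (7, 128, 16), dim=2))  # 1

# Hypothetical layout where dimension 1 of the input matches the index.
print(first_gather_mismatch((7, 128, 576), (7, 128, 16), dim=2))  # None
```

Dimension 1 is the offending one: the index has size 128 there while the key tensor has size 1. Reading dimension 1 as the head dimension is an assumption, but it would be consistent with the backend expecting one key slice per head.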
@fengyang95 I haven't added support for the deepseek-v2 model yet. I may add it later.
@andy-yang-1 Thank you very much! Looking forward to support for deepseek-v2 and CUDA graph.
@andy-yang-1 - Loved the paper! I was trying this out and I am facing a few issues generating the config file using the mentioned script.
I replaced it with
Any help on how to run this would be appreciated.
@shreyansh26 The first problem is caused by an older version of transformers, and I will update the base repo to fix it this week.
Thank you.
But in the Llama-3.1-8B-Instruct config file,
@shreyansh26 Hi, I have updated the main repo. Can you try with this code?
Thank you @andy-yang-1!! This is working perfectly now.
I found that prefill throughput is lower when DS attention is enabled (from 6543.55 to 4189.16). A possible reason is that triton is used as the attention backend. Is it possible to use flashinfer attention in prefill to increase prefill throughput?
Motivation
Modifications
sglang/python/sglang/srt/layers/sparse_decode_attention.py
Speedup Evaluation
Run double sparsity with:
Original triton implementation:
Original flashinfer implementation:
With Llama-3.1-8B:
Checklist