[Model][Speculative Decoding] Add EAGLE-style MTP module reference code for DeepSeek-R1 #12915
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Hi @benchislett, does that mean your implementation will be like the EAGLE head and predict multiple tokens with k > 1 by reusing MTP module 1? Trying to understand your statement.
Hi @Neo9061, that is correct. In the existing EAGLE implementation (limited to single-GPU TP=1), the hidden states from the output of the draft model are reused as inputs for multi-token drafting. I extended this functionality to the model_runner path (which supports multi-GPU TP=N) to unlock multi-token prediction from a single MTP module. This is not entirely future-proof, as a future release of additional MTP module weights could not trivially integrate with this strategy, but as of right now there is only the k=1 module available, and the current style of drafting multiple tokens allows for much more effective speculative decoding.
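To make the mechanics concrete, here is a minimal, self-contained sketch of that drafting loop: a toy MTP-style module whose output hidden states are fed back in as the `previous_hidden_states` for the next draft step. The `DraftMTP` class, shapes, and greedy sampling are illustrative stand-ins, not the actual vLLM model-runner code.

```python
# Illustrative sketch of EAGLE-style multi-token drafting from a single MTP
# module; all names and shapes here are toy stand-ins, not vLLM internals.
import torch
import torch.nn as nn


class DraftMTP(nn.Module):
    """Toy stand-in for the single released DeepSeek MTP module."""

    def __init__(self, vocab_size: int = 128, hidden_size: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # The MTP module conditions on [token embedding ; previous hidden state].
        self.proj = nn.Linear(2 * hidden_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids, previous_hidden_states):
        x = torch.cat([self.embed(token_ids), previous_hidden_states], dim=-1)
        hidden = self.proj(x)
        logits = self.lm_head(hidden)
        return hidden, logits


def draft_k_tokens(mtp: DraftMTP, last_token: torch.Tensor,
                   target_hidden: torch.Tensor, k: int = 2):
    """Draft k tokens from ONE MTP module by feeding its own output hidden
    states back in as `previous_hidden_states` (EAGLE-style reuse)."""
    drafted = []
    token, hidden = last_token, target_hidden
    for _ in range(k):
        hidden, logits = mtp(token, hidden)
        token = logits.argmax(dim=-1)  # greedy draft, for illustration only
        drafted.append(token)
    return drafted


mtp = DraftMTP()
last_token = torch.tensor([3])        # last accepted token id
target_hidden = torch.randn(1, 64)    # hidden state from the target model
print(draft_k_tokens(mtp, last_token, target_hidden, k=2))
```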
Hi @benchislett, have you tested this PR? How does the speed look?
Hi @LiuXiaoxuanPKU, the performance for this implementation in practice is quite good: approximately a 2x speedup for single-request inference of DeepSeek-R1 on 8xH200, and a significant improvement across nearly all batch sizes. The acceptance rate for k=2 using this implementation is about 73%, with <2ms for drafting each token and ~30ms for scoring (single-request). This gives a theoretical speedup of 1.997x, which we do see in practice.
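As a sanity check, the quoted figures can be reproduced with a back-of-the-envelope estimate (my own arithmetic, assuming i.i.d. per-token acceptance at 73%, ~2 ms per drafted token, and ~30 ms per scoring pass; this is not taken from the PR's benchmarking code):

```python
# Back-of-the-envelope speedup estimate from the numbers quoted above.
# Assumes acceptance is i.i.d. per drafted token (a standard
# speculative-decoding approximation), not the PR's actual benchmark code.
k = 2          # drafted tokens per step
alpha = 0.73   # per-token acceptance rate
t_draft = 2e-3   # seconds per drafted token
t_score = 30e-3  # seconds per target-model scoring pass

# Expected tokens emitted per step: one token from the target pass, plus each
# drafted token, which survives only if all earlier drafts were accepted too.
expected_tokens = sum(alpha ** i for i in range(k + 1))  # 1 + a + a^2 ≈ 2.26

step_time = t_score + k * t_draft   # ~34 ms per speculative step
baseline_time_per_token = t_score   # ~30 ms per token without drafting

speedup = (expected_tokens / step_time) * baseline_time_per_token
print(f"theoretical speedup ≈ {speedup:.3f}x")  # ≈ 1.997x
```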
@benchislett I want to know whether it is possible to run the R1 model with pp=2 and tp=8, while running this draft model with tp=8. Since I don't have an 8xH200 node, I am using 2 nodes with 8xH800.
It is my understanding that PP and speculative decoding are incompatible, so for this PR I didn't worry about removing the supports-PP flag. If this has changed, I can make sure PP is functional on this branch.
This pull request has merge conflicts that must be resolved before it can be merged.
# (~80% for token 1, ~60% for token 2 due to accuracy decay)
python3 \
    -m vllm.entrypoints.openai.api_server \
    --disable-log-requests \
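For readers following along, a rough offline-API equivalent of such a launch might look like the sketch below, assuming the v0 engine arguments `speculative_model` and `num_speculative_tokens` apply on this branch; the prompt and k=2 are illustrative, not taken from the benchmark setup.

```python
# Sketch of an offline-API launch matching the server command fragment above.
# Assumes the v0 engine args `speculative_model` / `num_speculative_tokens`
# work on this branch; model path and prompt are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,
    speculative_model="DeepSeekV3MTP",  # alternate model name this PR registers
    num_speculative_tokens=2,           # k = 2, as benchmarked above
    trust_remote_code=True,
)

out = llm.generate(["What is speculative decoding?"],
                   SamplingParams(max_tokens=64, temperature=0.0))
print(out[0].outputs[0].text)
```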
I wonder which dataset is used in this testing?
Sample requests from ShareGPT are used.
What is the specific range of "all batch sizes"? Does the speed drop rapidly as the batch size grows?
Hi @benchislett, I am using your code but hitting an error loading the MTP head. The error is shown below. Do you have any insight into where the problem might be? I am using two nodes of 8xH100 with a Ray cluster as the backend. The error seems to be related to the implementation.
This is my invocation command.
Hi @Neo9061, this PR serves as a reference implementation, so a number of things are hacked together. One such thing is that there are several manual checks for "model == 'deepseek-ai/DeepSeek-R1'". Could you set the model name to this, and use an environment variable like "HF_HUB_CACHE" to redirect the model loading to your desired path? If that doesn't change things, perhaps inspect the model loading/weight files. Is this a standard download of the DeepSeek-R1 weights?
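A minimal sketch of that workaround, with a placeholder cache path (the exact path and any surrounding launch code are assumptions, not from this thread):

```python
# Sketch of the suggested workaround: keep the canonical model name so the
# PR's hard-coded "deepseek-ai/DeepSeek-R1" checks match, and redirect weight
# loading through the Hugging Face cache env var. The path is a placeholder.
import os

# Must be set before the model is loaded (e.g. before launching the server or
# constructing the engine) so huggingface_hub resolves this cache directory.
os.environ["HF_HUB_CACHE"] = "/path/to/local/deepseek-r1-cache"

MODEL_NAME = "deepseek-ai/DeepSeek-R1"  # exact string the manual checks compare against
```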
Thanks @benchislett for sharing the insights. I have done what you suggested: 1/ changed the model name to the strict one. The key error term is … Update: the error happens at this line …, and this line is never effective, as there is no layer name ending with …
Thanks @benchislett. I haven't tried that recently merged PR, as my top priority is to be able to use more than one speculative token. I am also checking whether that is feasible in #12755.
@Neo9061 please prioritize testing with the existing merged PR. I will assist them with enabling k>1 similarly to this PR going forward. If you find that there are no issues with the other branch, please let me know. Otherwise, it may just be an issue with the multi-node configuration.
Thanks @benchislett. Will prioritize testing it. I also left a comment in their PR asking this question. Feel free to chime in if you have any thoughts - even a hacky solution is appreciated.
In essence, this PR is an example of one such hacky solution. For a simpler modification, you could try to force the spec_step_idx to always be 0 during inference (see here). However, you may find that MLA attention is currently (to my knowledge) incompatible with multi-step scheduling, so running k>1 through the TP==1 code path in that PR will likely fail. You will probably also need to modify the code there to forward the hidden states, as I have done here: https://github.com/vllm-project/vllm/pull/12915/files#diff-4b4a724a124ddb123bcb688b15678ad862815bb769880f74affbefed82d4354bR99
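For illustration only, the "force spec_step_idx to be 0" idea amounts to something like the toy sketch below; the class and attribute names are hypothetical, and this is not the code from #12755.

```python
# Toy illustration of pinning `spec_step_idx` to 0 so a single MTP module
# serves every draft step; not the actual #12755 implementation.
import torch
import torch.nn as nn


class ToyMTPStack(nn.Module):
    def __init__(self, hidden_size: int = 64, num_modules: int = 1):
        super().__init__()
        # DeepSeek-R1 currently ships weights for only one MTP module.
        self.modules_list = nn.ModuleList(
            [nn.Linear(hidden_size, hidden_size) for _ in range(num_modules)]
        )

    def forward(self, hidden: torch.Tensor, spec_step_idx: int) -> torch.Tensor:
        # Force module 0 regardless of which speculative step this is, so that
        # k > 1 drafting reuses the single available MTP head.
        spec_step_idx = 0
        return self.modules_list[spec_step_idx](hidden)


stack = ToyMTPStack()
h = torch.randn(1, 64)
for step in range(2):  # k = 2 draft steps, both hit module 0
    h = stack(h, spec_step_idx=step)
```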
Hey @benchislett, I finally got your solution working with distributed inference and speculation length k > 1. Because I have to use Docker for the distributed Ray cluster setup, the vLLM installed there is not exactly the base your PR is comparing against :) Thank you so much for the patient replies.
Thanks @benchislett! Will look into your new PR. Just wondering, did you see improved performance compared to your current PR?
The performance is the same.
This is @CentML's implementation of DeepSeek MTP modules that enable speculative decoding for DeepSeek-R1.
The changes in this branch enable `--speculative-model DeepSeekV3MTP` to register itself as an alternate implementation of DeepSeek-R1 that loads only the MTP weights and uses modified model code to invoke only the MTP layer. This model code resides in `deepseek_mtp.py`.

An additional change moves the RMSNorm application of the final hidden states into the `compute_logits` function, so that the `previous_hidden_states` taken from the output of `model(...)` are un-normalized. Our experimental results show a small increase in predictive accuracy from this change. This makes sense intuitively because the hidden states are treated as they would be for an additional layer of the transformer model, while the output norm is treated separately as part of the output head.

This code also enables the EAGLE code path for reusing `previous_hidden_states` across TP workers in the base model runner code, enabling (single-step) multi-GPU draft model execution. While it is possible that the hidden states could be reused by each worker and would not need to be broadcast to the TP workers between iterations, we choose to follow the EAGLE syntax and respect the abstraction boundary of draft worker / target worker for speculative decoding.
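Conceptually, the norm move can be summarized by the condensed sketch below. The class and attribute names are illustrative rather than the actual `deepseek_mtp.py` code, and `torch.nn.RMSNorm` requires PyTorch ≥ 2.4.

```python
# Condensed sketch of moving the final RMSNorm into compute_logits so that the
# hidden states returned by forward() stay un-normalized. Names are
# illustrative, not the real vLLM module structure.
import torch
import torch.nn as nn


class MTPHeadSketch(nn.Module):
    def __init__(self, hidden_size: int = 64, vocab_size: int = 128):
        super().__init__()
        self.final_norm = nn.RMSNorm(hidden_size)  # PyTorch >= 2.4
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Return UN-normalized hidden states: these become the
        # `previous_hidden_states` fed to the next draft step.
        return hidden_states

    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The output norm is applied here instead, as part of the output head.
        return self.lm_head(self.final_norm(hidden_states))
```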
Notably, this implementation does not mask the zero-th position of the input embeddings into the MTP module, though the existing EAGLE models do. This is because of an unknown issue with CUDA graphs and MLA attention that causes accuracy issues when this is performed. Our testing shows much improved acceptance rates by omitting this problematic masking.
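For contrast, a minimal sketch of the zero-th-position masking that the existing EAGLE path applies and that this implementation omits; the tensor names are illustrative, not the exact vLLM code.

```python
# Sketch of the EAGLE-style masking that this PR deliberately skips:
# zero out the input embedding at position 0 of the sequence.
import torch

hidden_size = 64
positions = torch.tensor([0, 1, 2, 3])  # position ids within the sequence
inputs_embeds = torch.randn(positions.shape[0], hidden_size)

# This PR omits the line below because, combined with CUDA graphs and MLA
# attention, it caused accuracy issues in testing.
inputs_embeds[positions == 0] = 0
```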
Note that this is not the first vLLM PR to implement MTP module support for DeepSeekV3 models, and serves primarily as a reference implementation for validation purposes. See vllm#12755