🚀 The feature, motivation and pitch
Speculative decoding allows emitting multiple tokens per sequence by speculating future tokens, scoring their likelihood with the target LLM, and then accepting each speculative token based on its likelihood. This process is laid out in the following diagram:

[Diagram: speculative decoding flow — draft proposal, target-model scoring, token acceptance]
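A minimal sketch of that loop, for orientation. Here `draft_propose` and `target_probs` are hypothetical stand-ins for the proposer and the target model, not vLLM APIs:

```python
import random

def speculative_step(prefix, draft_propose, target_probs, k=4):
    """One speculative decoding step (sketch).

    draft_propose(prefix, k) -> list of (token, q_prob): the cheap
        proposer's k speculative tokens with its probability for each.
    target_probs(tokens) -> mapping from token id to probability under
        the target LLM (in practice all k positions are scored in one
        batched forward pass, not one call per token).
    """
    proposals = draft_propose(prefix, k)
    accepted = []
    for token, q in proposals:
        p = target_probs(prefix + accepted)[token]
        # Accept based on how likely the target model finds the token.
        if random.random() < min(1.0, p / q):
            accepted.append(token)
        else:
            break  # the first rejection ends the speculative run
    return accepted
```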
The problem with rejection sampling is that it holds a very high bar for quality: it is lossless, guaranteeing that the output distribution matches the target model's, even if that means rejecting plausible speculative tokens.
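For reference, this is the per-token rule rejection sampling applies (a self-contained sketch; `p` and `q` are the target- and draft-model distributions over the vocabulary):

```python
import random

def rejection_sample(token, p, q):
    """Lossless acceptance rule from standard speculative decoding.

    token: the draft-proposed token id
    p: target-model probabilities over the vocab (list of floats)
    q: draft-model probabilities over the vocab (list of floats)

    Returns (accepted, emitted_token). The emitted token is distributed
    exactly according to p, which is why plausible tokens can still be
    rejected.
    """
    # Accept with probability min(1, p/q); small guard against q == 0.
    if random.random() < min(1.0, p[token] / max(q[token], 1e-12)):
        return True, token
    # On rejection, resample from the residual max(0, p - q), normalized,
    # so the overall output distribution still equals p.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    weights = [r / total for r in residual]
    emitted = random.choices(range(len(p)), weights=weights)[0]
    return False, emitted
```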
This issue is a request to implement Medusa's typical acceptance routine in vLLM. Typical acceptance trades off output quality to increase the acceptance rate. See "Choice of threshold in typical acceptance" in the Medusa blogpost for more information.
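Typical acceptance replaces the lossless rule with an entropy-aware threshold: a token is accepted when its target-model probability exceeds min(ε, δ·exp(−H(p))), where H(p) is the entropy of the target distribution. A sketch of the criterion as described by Medusa (the ε and δ defaults below are illustrative, not the paper's; see the blogpost for tuning guidance):

```python
import math

def typical_accept(token, p, epsilon=0.09, delta=0.3):
    """Medusa-style typical acceptance (sketch).

    Accepts the draft token if its target-model probability exceeds
    min(epsilon, delta * exp(-H(p))). In high-entropy (uncertain)
    contexts the bar drops, so more plausible tokens get through; the
    trade-off is that the output distribution no longer exactly matches
    the target model's.
    """
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0.0)
    threshold = min(epsilon, delta * math.exp(-entropy))
    return p[token] > threshold
```

Note that this rule only needs the target-model probabilities, not the draft distribution, which is part of what makes it cheap to apply to heads like Medusa's that don't produce a full proposal distribution.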
vLLM users should be able to toggle between different acceptance routines; they can use rejection sampling for tasks that require higher quality, or typical acceptance when speedup is more important.
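As an illustration of what that toggle might look like (assuming vLLM's speculative decoding arguments `speculative_model` and `num_speculative_tokens`; the `acceptance_method` argument is hypothetical, not an existing vLLM parameter):

```python
from vllm import LLM, SamplingParams

# Hypothetical toggle: the acceptance_method name and values are
# illustrative only, not part of vLLM at the time of this issue.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    num_speculative_tokens=4,
    acceptance_method="typical",  # or "rejection" for lossless output
)
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.7))
```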
NOTE: This acceptance routine should work with other proposal types (Eagle, draft model, ngram, and others), not just Medusa. The speculative decoding framework in vLLM may need improvements to the rejection sampling interface to support this.
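One way the interface improvement could go: factor the accept/reject decision behind a common base class so any proposer can pair with either routine. A hypothetical sketch, not vLLM's actual class hierarchy:

```python
from abc import ABC, abstractmethod

import torch

class AcceptanceSampler(ABC):
    """Hypothetical common interface for acceptance routines."""

    @abstractmethod
    def accept(self, target_probs: torch.Tensor,
               draft_probs: torch.Tensor,
               draft_tokens: torch.Tensor) -> torch.Tensor:
        """Return a boolean mask of accepted speculative tokens.

        target_probs: [batch, k, vocab] target-model distributions
        draft_probs:  [batch, k, vocab] proposer distributions (may be
                      unused by routines such as typical acceptance)
        draft_tokens: [batch, k] proposed token ids
        """

class TypicalAcceptanceSampler(AcceptanceSampler):
    def __init__(self, epsilon: float = 0.09, delta: float = 0.3):
        self.epsilon = epsilon  # illustrative defaults only
        self.delta = delta

    def accept(self, target_probs, draft_probs, draft_tokens):
        # Entropy of the target distribution at each position.
        entropy = -(target_probs * target_probs.clamp_min(1e-9).log()).sum(-1)
        threshold = torch.minimum(
            torch.full_like(entropy, self.epsilon),
            self.delta * torch.exp(-entropy),
        )
        token_probs = target_probs.gather(
            -1, draft_tokens.unsqueeze(-1)).squeeze(-1)
        return token_probs > threshold
```

With an interface like this, the proposer (Medusa heads, Eagle, a draft model, or ngram lookup) only has to supply token ids, and a full draft distribution becomes optional rather than required.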
Alternatives
No response
Additional context
vLLM's rejection sampler is implemented here: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rejection_sampler.py