
Implement speculative decoding #5245

Closed
CjhHa1 opened this issue Jan 9, 2024 · 0 comments
Labels
enhancement New feature or request


CjhHa1 commented Jan 9, 2024

Development branch: https://github.com/hpcaitech/ColossalAI/tree/feat/speculative-decoding

In speculative decoding (also called assisted decoding), both a drafter model (a small model) and a main model (a large model) are used. The drafter model generates a few tokens sequentially, and the main model then validates those candidate tokens in parallel and accepts the validated ones. This speeds up decoding, because the latency of speculating multiple tokens with the drafter model is lower than generating them one by one with the main model.
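The draft-then-verify loop described above can be sketched roughly as follows. This is a simplified illustration, not the actual engine code: `speculative_step`, `draft_model`, and `main_model` are hypothetical names, and both models are reduced to greedy next-token functions over a token list.

```python
# Hypothetical sketch of one speculative decoding step (greedy variant).
# `draft_model` and `main_model` stand in for real models: each maps a
# token sequence to its next token.

def speculative_step(draft_model, main_model, tokens, n_spec=4):
    """Drafter proposes n_spec tokens; main model verifies them.

    Returns the input tokens extended by the accepted tokens plus one
    corrected (or bonus) token from the main model.
    """
    # 1. Drafter generates n_spec candidate tokens sequentially (cheap).
    candidates = []
    ctx = list(tokens)
    for _ in range(n_spec):
        t = draft_model(ctx)
        candidates.append(t)
        ctx.append(t)

    # 2. Main model scores every candidate position. A real engine does
    #    this in one batched forward pass; here it is simulated by a loop.
    verified = [main_model(tokens + candidates[:i]) for i in range(n_spec + 1)]

    # 3. Accept the longest prefix on which both models agree, then emit
    #    the main model's token at the first mismatch, so every step
    #    produces at least one token.
    accepted = []
    for i in range(n_spec):
        if candidates[i] == verified[i]:
            accepted.append(candidates[i])
        else:
            accepted.append(verified[i])
            return tokens + accepted
    accepted.append(verified[n_spec])  # bonus token when all are accepted
    return tokens + accepted
```

When the drafter agrees with the main model on all `n_spec` positions, a single step emits `n_spec + 1` tokens for roughly the cost of one main-model forward pass.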

We're going to support speculative decoding in the inference engine, with optimized kernels and cache management for the main model.

Additionally, GLIDE, a modified draft model architecture that reuses key and value caches from the main model, is expected to be supported. It improves the acceptance rate and increases the speed-up ratio. Details can be found in the paper GLIDE with a CAPE: A Low-Hassle Method to Accelerate Speculative Decoding on arXiv.
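The core idea behind reusing the main model's caches can be sketched as a drafter layer that cross-attends to the main model's stored keys and values instead of computing its own. This is a minimal single-head NumPy illustration under that assumption; `cross_attend` and its arguments are hypothetical names, not GLIDE or ColossalAI APIs.

```python
import numpy as np

# Hypothetical sketch: a drafter layer attends over the main model's
# cached keys/values, so the drafter's predictions are conditioned on the
# large model's representations (which tends to raise the acceptance rate).

def cross_attend(query, main_k_cache, main_v_cache):
    """Single-head scaled dot-product attention over the main model's KV cache.

    query:        (n_q, d) drafter hidden states
    main_k_cache: (n_ctx, d) keys cached by the main model
    main_v_cache: (n_ctx, d) values cached by the main model
    """
    d = query.shape[-1]
    scores = query @ main_k_cache.T / np.sqrt(d)
    # numerically stable softmax over the cached positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ main_v_cache
```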
