In speculative decoding (also called assisted decoding), two models work together: a small drafter model and a large main model. The drafter generates a few candidate tokens sequentially, and the main model then validates those candidates in parallel, accepting the ones it agrees with. Decoding is sped up because the drafter can speculate multiple tokens at a lower latency than the main model would need to generate them one by one.
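The draft-then-verify loop described above can be sketched in a toy form. The two "models" below are stand-in greedy next-token functions over integer token ids, purely hypothetical; a real engine would run small and large transformers and verify the drafted tokens in a single batched forward pass:

```python
def drafter_next(ctx):
    # Hypothetical small model: predicts the next token as last + 1 (mod 10).
    return (ctx[-1] + 1) % 10

def main_next(ctx):
    # Hypothetical main model: same rule, except it maps 7 -> 0,
    # so the two models occasionally disagree.
    t = (ctx[-1] + 1) % 10
    return 0 if t == 7 else t

def speculative_step(ctx, k=4):
    """Draft k tokens sequentially, then verify them with the main model.

    Accepts the longest prefix the main model agrees with; on the first
    mismatch it substitutes the main model's token, and if all k drafts
    are accepted it appends one bonus token from the main model.
    """
    draft = []
    for _ in range(k):
        draft.append(drafter_next(ctx + draft))

    accepted = []
    for t in draft:
        # In a real engine these k verifications run in parallel.
        target = main_next(ctx + accepted)
        if t == target:
            accepted.append(t)
        else:
            accepted.append(target)  # main model's correction replaces the draft
            return accepted
    accepted.append(main_next(ctx + accepted))  # bonus token
    return accepted
```

With the toy rules above, `speculative_step([0], 4)` accepts all four drafts plus a bonus token, while `speculative_step([5], 4)` rejects the second draft (the drafter proposes 7 where the main model emits 0), illustrating why the acceptance rate of the drafter drives the overall speed-up.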
We are going to support speculative decoding in the inference engine, with optimized kernels and cache management for the main model.
Additionally, we plan to support GLIDE, a modified draft-model architecture that reuses the key and value caches of the main model. This reuse improves the acceptance rate and increases the speed-up ratio. Details can be found in the paper "GLIDE with a CAPE: A Low-Hassle Method to Accelerate Speculative Decoding" on arXiv.
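The core of the cache-reuse idea is that a drafter layer attends over the main model's already-computed KV cache rather than maintaining its own. The single-head NumPy sketch below is an illustrative assumption about the mechanism, not the paper's exact architecture; all shapes and names (`cross_attend`, `main_k`, `main_v`) are made up for this sketch:

```python
import numpy as np

def cross_attend(q, main_k, main_v):
    """Single-head attention: drafter queries over the main model's KV cache.

    q:      (T_q, d)  drafter queries for the draft tokens
    main_k: (T_kv, d) keys cached by the main model
    main_v: (T_kv, d) values cached by the main model
    """
    d = q.shape[-1]
    scores = q @ main_k.T / np.sqrt(d)                          # (T_q, T_kv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax rows
    return weights @ main_v                                     # (T_q, d)

rng = np.random.default_rng(0)
main_k = rng.standard_normal((8, 16))   # 8 cached positions, head dim 16
main_v = rng.standard_normal((8, 16))
q = rng.standard_normal((2, 16))        # queries for 2 draft tokens
out = cross_attend(q, main_k, main_v)   # (2, 16), no drafter-side KV cache needed
```

Because the drafter reads the main model's cache directly, it sees the same context representation the main model will verify against, which is what lifts the acceptance rate.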
Development branch: https://github.com/hpcaitech/ColossalAI/tree/feat/speculative-decoding