In speculative decoding (also called assisted decoding), two models work together: a small drafter model and a large main model. The drafter generates a few candidate tokens sequentially, and the main model then validates those candidates in parallel, accepting the ones it agrees with. Decoding is sped up because the drafter can speculate multiple tokens at a lower latency than the main model would need to generate them one by one.
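The draft-then-verify loop described above can be sketched in a toy form. The two "models" below are stand-in greedy next-token functions over integer token ids, purely hypothetical; a real engine would run small and large transformers and verify the drafted tokens in a single batched forward pass:

```python
def drafter_next(ctx):
    # Hypothetical small model: predicts the next token as last + 1 (mod 10).
    return (ctx[-1] + 1) % 10

def main_next(ctx):
    # Hypothetical main model: same rule, except it maps 7 -> 0,
    # so the two models occasionally disagree.
    t = (ctx[-1] + 1) % 10
    return 0 if t == 7 else t

def speculative_step(ctx, k=4):
    """Draft k tokens sequentially, then verify them with the main model.

    Accepts the longest prefix the main model agrees with; on the first
    mismatch it substitutes the main model's token, and if all k drafts
    are accepted it appends one bonus token from the main model.
    """
    draft = []
    for _ in range(k):
        draft.append(drafter_next(ctx + draft))

    accepted = []
    for t in draft:
        # In a real engine these k verifications run in parallel.
        target = main_next(ctx + accepted)
        if t == target:
            accepted.append(t)
        else:
            accepted.append(target)  # main model's correction replaces the draft
            return accepted
    accepted.append(main_next(ctx + accepted))  # bonus token
    return accepted
```

With the toy rules above, `speculative_step([0], 4)` accepts all four drafts plus a bonus token, while `speculative_step([5], 4)` rejects the second draft (the drafter proposes 7 where the main model emits 0), illustrating why the acceptance rate of the drafter drives the overall speed-up.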
We are going to support speculative decoding in the inference engine, with optimized kernels and cache management for the main model.
Additionally, we plan to support GLIDE, a modified draft-model architecture that reuses the key and value caches of the main model. This reuse improves the acceptance rate and increases the speed-up ratio. Details can be found in the paper "GLIDE with a CAPE: A Low-Hassle Method to Accelerate Speculative Decoding" on arXiv.
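The core of the cache-reuse idea is that a drafter layer attends over the main model's already-computed KV cache rather than maintaining its own. The single-head NumPy sketch below is an illustrative assumption about the mechanism, not the paper's exact architecture; all shapes and names (`cross_attend`, `main_k`, `main_v`) are made up for this sketch:

```python
import numpy as np

def cross_attend(q, main_k, main_v):
    """Single-head attention: drafter queries over the main model's KV cache.

    q:      (T_q, d)  drafter queries for the draft tokens
    main_k: (T_kv, d) keys cached by the main model
    main_v: (T_kv, d) values cached by the main model
    """
    d = q.shape[-1]
    scores = q @ main_k.T / np.sqrt(d)                          # (T_q, T_kv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax rows
    return weights @ main_v                                     # (T_q, d)

rng = np.random.default_rng(0)
main_k = rng.standard_normal((8, 16))   # 8 cached positions, head dim 16
main_v = rng.standard_normal((8, 16))
q = rng.standard_normal((2, 16))        # queries for 2 draft tokens
out = cross_attend(q, main_k, main_v)   # (2, 16), no drafter-side KV cache needed
```

Because the drafter reads the main model's cache directly, it sees the same context representation the main model will verify against, which is what lifts the acceptance rate.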
Development branch: https://github.com/hpcaitech/ColossalAI/tree/feat/speculative-decoding