Implement 4-bit quantized KV Cache for faster performance and to enable longer context #6863
Comments
Currently, it does have an IQ4_NL-quantized KV cache, but it's pretty slow.
fwiw @henk717 #5021 has been merged now :)
I saw it has been merged. It would be nice if @ggerganov could fill us in on the current plans regarding a 4-bit cache. Demand for this feature has been high among users who currently resort to EXL2 for its memory efficiency but would prefer to use llama.cpp-based solutions. For certain VRAM sizes, it can decide whether a full offload fits or not.
I would also like to see 4-bit KV cache quantization / compression, especially now that Flash Attention has landed. I'm not sure what is needed to build this without hurting inference performance, but it seems like something lots of us could use.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
A recent paper from UC Berkeley investigated 4-bit quantization of the KV cache for better performance and longer context. Given llama.cpp's emphasis on efficient inference through quantization, particularly on CPU platforms, this seems right up llama.cpp's alley.
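To make the idea concrete, here is a minimal sketch of block-wise 4-bit quantization with a single absmax scale per block. It is only an illustration of the general technique: it is not the method from the paper, and it is not llama.cpp's actual kernel or memory layout. The function names `quantize_block_4bit` and `dequantize_block_4bit` are hypothetical.

```python
def quantize_block_4bit(values):
    """Quantize a block of floats to packed 4-bit codes plus one float scale.

    Illustrative only: symmetric absmax scaling, two codes per byte.
    """
    amax = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to +/-7; an all-zero block gets a dummy scale.
    scale = amax / 7.0 if amax > 0 else 1.0
    # Signed levels -8..7 are stored with an offset of 8, giving codes 0..15.
    codes = [max(-8, min(7, round(v / scale))) + 8 for v in values]
    if len(codes) % 2:
        codes.append(8)  # pad odd-length blocks with the code for 0.0
    # Low nibble holds the even-indexed code, high nibble the odd-indexed one.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return scale, packed


def dequantize_block_4bit(scale, packed, n):
    """Recover n approximate floats from a packed 4-bit block."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return [(c - 8) * scale for c in codes[:n]]


if __name__ == "__main__":
    block = [(-1) ** i * i / 10.0 for i in range(32)]
    scale, packed = quantize_block_4bit(block)
    restored = dequantize_block_4bit(scale, packed, len(block))
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"{len(block)} values -> {len(packed)} bytes, worst error {worst:.4f}")
```

Each value costs 4 bits plus its share of the per-block scale, and the rounding error per value is at most half the scale. A real implementation would also have to dequantize inside the attention kernel, which is where the performance concerns come from.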
Motivation
Better performance (it's possible to write custom CUDA kernels for 40% faster inference) and longer context are always beneficial to LLM users!
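For a rough sense of the memory side, the KV cache size can be estimated from the model shape. The sketch below assumes a hypothetical LLaMA-7B-like shape (32 layers, 32 KV heads, head dimension 128) and a 4096-token context; the helper `kv_cache_bytes` is made up for illustration, and the 4.5 bits-per-value figure assumes 32-value blocks that each carry one 16-bit scale.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Estimate KV cache size in bytes for a given storage width."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * bits_per_value / 8


GIB = 1024 ** 3

# Assumed model shape and context length, for illustration only.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128, n_ctx=4096)

if __name__ == "__main__":
    for label, bits in [("f16", 16), ("4-bit + block scales", 4.5), ("4-bit, no overhead", 4)]:
        size = kv_cache_bytes(**shape, bits_per_value=bits)
        print(f"{label}: {size / GIB:.4f} GiB")
```

Under these assumptions the cache shrinks from 2 GiB at f16 to about 0.56 GiB, so the same memory budget would hold roughly 3.5 times as many tokens of context.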
Possible Implementation
If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.
https://arxiv.org/abs/2401.18079