Implement 4-bit quantized KV Cache for faster performance and to enable longer context #6863

K-Mistele · 2024-04-24T05:04:46Z

Prerequisites

Please answer the following questions for yourself before submitting an issue.

I am running the latest code. Development is very rapid so there are no tagged versions as of now.
I carefully followed the README.md.
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

A recent paper from UC Berkeley investigated 4-bit quantization of the KV cache for better performance and longer context. Given llama.cpp's emphasis on efficient inference through quantization, particularly on CPU platforms, this seems right up llama.cpp's alley.

Motivation

Better performance (it's possible to write custom CUDA kernels for 40% faster inference) and longer context are always beneficial to LLM users!

Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

https://arxiv.org/abs/2401.18079
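
To make the idea concrete, here is a minimal sketch of block-wise 4-bit quantization applied to one slice of cached K or V values. It is plain symmetric round-to-nearest with one scale per block, not the method from the linked paper and not llama.cpp's actual kernels. The block size of 32 and all function names are assumptions chosen for illustration.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed block length, for illustration only


def quantize_block_4bit(values: List[float]) -> Tuple[float, List[int]]:
    """Map one block of floats to signed 4-bit integers plus a single scale."""
    amax = max(abs(v) for v in values)
    # Use the symmetric range [-7, 7]; fall back to 1.0 for an all-zero block.
    scale = amax / 7.0 if amax > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, q


def pack_nibbles(q: List[int]) -> bytes:
    """Store two 4-bit values per byte (offset by 8 so each nibble is unsigned)."""
    assert len(q) % 2 == 0
    return bytes((q[i] + 8) | ((q[i + 1] + 8) << 4) for i in range(0, len(q), 2))


def unpack_nibbles(packed: bytes) -> List[int]:
    """Inverse of pack_nibbles: recover the signed 4-bit integers."""
    out: List[int] = []
    for b in packed:
        out.append((b & 0x0F) - 8)
        out.append((b >> 4) - 8)
    return out


def dequantize_block_4bit(scale: float, q: List[int]) -> List[float]:
    """Reconstruct approximate floats from the quantized block."""
    return [scale * x for x in q]


if __name__ == "__main__":
    block = [(i - 16) / 16.0 for i in range(BLOCK_SIZE)]
    scale, q = quantize_block_4bit(block)
    packed = pack_nibbles(q)
    restored = dequantize_block_4bit(scale, unpack_nibbles(packed))
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"{len(packed)} bytes for {BLOCK_SIZE} values, max abs error {worst:.4f}")
```

Each block of 32 values shrinks to 16 bytes of packed nibbles plus the scale, and the reconstruction error per value is bounded by half the scale. The hard part in practice is keeping attention quality and speed acceptable, which this sketch does not address.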

sorasoras · 2024-04-25T15:34:33Z

Currently, llama.cpp does support an IQ4_NL-quantized KV cache, but it's pretty slow.

henk717 · 2024-04-27T14:20:25Z

Relevant discussion: #5932
There they mention that #5021 has to finish first.

K-Mistele · 2024-05-01T15:34:19Z

fwiw @henk717 #5021 has been merged now :)
I did see that the server now supports setting the K and V cache quant types with -ctk TYPE and -ctv TYPE, but the implementation seems off. As #5932 mentions, the efficiencies observed in exllama v2 are much better than what we observed in #4312. It seems like some more relevant work is being done in #4801 to optimize the matmuls for int8 quants.

henk717 · 2024-05-02T23:01:32Z

I saw that it has been merged. It would be nice if @ggerganov could fill us in on the current plans regarding a 4-bit cache. Demand for this feature has been high among users who currently resort to EXL2 for its memory efficiency but would prefer llama.cpp-based solutions. For certain VRAM sizes, it can make the difference between a full offload fitting or not.
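
As a rough illustration of why this matters for VRAM, the sketch below estimates KV cache size for an f16 cache versus a 4-bit block format. The model dimensions (32 layers, 32 KV heads, head size 128, 4096 context) and the 18-bytes-per-32-values block layout are assumptions for the sake of the arithmetic, and the function name is hypothetical.

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_element: float) -> float:
    """Estimate KV cache size: one K and one V vector per layer, token, and KV head."""
    n_elements = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return n_elements * bytes_per_element


F16_BYTES = 2.0
# Assumed 4-bit block layout: 32 values in 16 bytes plus one 2-byte scale.
Q4_BLOCK_BYTES = 18 / 32

if __name__ == "__main__":
    # Illustrative dimensions only, not taken from this thread.
    dims = dict(n_layers=32, n_ctx=4096, n_kv_heads=32, head_dim=128)
    f16 = kv_cache_bytes(**dims, bytes_per_element=F16_BYTES)
    q4 = kv_cache_bytes(**dims, bytes_per_element=Q4_BLOCK_BYTES)
    print(f"f16 cache:   {f16 / 1024**2:.0f} MiB")
    print(f"4-bit cache: {q4 / 1024**2:.0f} MiB")
```

Under these assumptions the cache drops from 2048 MiB to 576 MiB, which is the kind of margin that can decide whether all layers fit on the GPU.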

WiseFarAI · 2024-05-03T16:16:33Z

I would also like to see 4-bit KV cache quantization / compression, especially now that Flash Attention has landed. I'm not sure what is needed to create this without hurting inference performance, but it seems like something lots of us could use.

github-actions · 2024-06-17T01:07:07Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

K-Mistele added the enhancement New feature or request label Apr 24, 2024

K-Mistele changed the title ~~4-bit quantized KV Cache~~ Implement 4-bit quantized KV Cache for faster performance and to enable longer context Apr 24, 2024

github-actions bot added the stale label Jun 3, 2024

github-actions bot closed this as completed Jun 17, 2024
