CUDA: quantized KV cache demo #7412
Conversation
Some first results:
Results are sorted by KL divergence. The quantization format is meant to be read as

The K cache seems to be much more sensitive to quantization than the V cache. However, the weights still seem to be the most sensitive: using q4_0 for the V cache and FP16 for everything else is more precise than using q6_K weights with an FP16 KV cache. A 6.5 bit per value KV cache with q8_0 for the K cache and q4_0 for the V cache also seems to be more precise than q6_K weights. There seems to be no significant quality loss from using q8_0 instead of FP16 for the KV cache.
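For context, the KL divergence used to rank these results is presumably computed between the token probability distribution of the FP16 baseline $P$ and that of the quantized run $Q$, averaged over the evaluated positions, i.e. roughly

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$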
Is there any measurable drop in tokens/s from this?
With the implementation in this PR the performance is much worse, but my goal was not to get good performance; it was to determine how the quality would be affected. For this PR the performance only needs to be good enough to do perplexity calculations in a reasonable time frame. In principle, given enough optimization, a quantized KV cache should be faster than an FP16 KV cache because you need less I/O and because int8 operations are faster than FP16 operations. However, in the medium term quantized KV caches will be slower than FP16 on GPUs with tensor cores. I will need to first read up on how to utilize tensor cores with PTX (instead of
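As a rough illustration of the int8 vs. FP16 point (a minimal sketch, not the kernels in this PR): `__dp4a` performs four int8 multiply-accumulates per instruction, while packed FP16 math via `__hfma2` handles two values per instruction.

```cuda
#include <cuda_fp16.h>

// Minimal illustrative sketch, not code from this PR.

// Four int8 multiply-accumulates in a single instruction (sm_61+);
// a_packed and b_packed each hold 4 signed 8-bit values.
__device__ int dot_q8(int a_packed, int b_packed, int acc) {
    return __dp4a(a_packed, b_packed, acc);
}

// The packed FP16 equivalent only processes 2 values per instruction.
__device__ __half2 fma_f16(__half2 a, __half2 b, __half2 acc) {
    return __hfma2(a, b, acc);
}
```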
If someone's looking to contribute to the research, a NIHS or NINS before/after would be an interesting test. Much like PPL, the results of the test itself are only somewhat useful on their own, but a comparison between quantized and unquantized would be really useful as a metric.
What do you mean by NIHS and NINS?
Sorry lol, contextless acronyms are always a bad call. Needle in haystack and needle in needlestack.
This PR has become obsolete. |
This PR adds a simple implementation of a quantized KV cache for research purposes only. The goal is not to provide an implementation that could be merged or that is suitable for regular use, but rather to provide a minimal implementation for doing perplexity calculations with CUDA. This is to investigate the impact of a quantized KV cache on generation quality vs. the impact of quantized weights. Presumably not all quantization formats/combinations make sense to actually use, which is relevant information for cutting down on the significant compilation time you would incur if you compiled 36 different kernel versions to accommodate all of the current quantization combinations.
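To make the "36 different kernel versions" point concrete, a hypothetical sketch (names are illustrative, not from this PR): with 6 candidate cache types for each of K and V, a kernel templated on both types gets 6 × 6 = 36 instantiations.

```cuda
// Hypothetical illustration of the combinatorial blow-up, not this PR's code.
enum class cache_type { Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, F16 };

// One kernel body is compiled per (type_K, type_V) pair.
template <cache_type type_K, cache_type type_V>
__global__ void attn_with_quantized_kv(const void * K, const void * V /* ... */) {
    // dequantize K/V blocks according to type_K/type_V, then do attention
}

// Instantiating every combination yields 6 * 6 = 36 kernels, which is what
// drives the compilation time mentioned above.
```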
Edit: this PR needs to be compiled with `LLAMA_CUDA_F16=1`.
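For reference, with the Makefile build that would be an invocation along the lines of the following (the exact CUDA enable flag is an assumption and may differ depending on the checkout):

```sh
# Assumed invocation; only LLAMA_CUDA_F16=1 is stated in the PR,
# the CUDA enable flag name depends on the tree.
make LLAMA_CUDA=1 LLAMA_CUDA_F16=1
```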