CUDA: quantized KV cache demo #7412
Conversation
Some first results:
Results are sorted by KL divergence. The quantization format is meant to be read as

The K cache seems to be much more sensitive to quantization than the V cache. However, the weights still seem to be the most sensitive: using q4_0 for the V cache and FP16 for everything else is more precise than using q6_K weights with an FP16 KV cache. A 6.5 bit per value KV cache with q8_0 for the K cache and q4_0 for the V cache also seems to be more precise than q6_K weights. There seems to be no significant quality loss from using q8_0 instead of FP16 for the KV cache.
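For context, the KL divergence used to rank these results is presumably computed between the token probability distribution of the FP16 baseline $P$ and that of the quantized run $Q$, averaged over the evaluated positions, i.e. roughly

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$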
Is there any measurable drop in tokens/s from this?
With the implementation in this PR the performance is much worse, but my goal was not to get good performance; it was to determine how the quality would be affected. For this PR the performance only needs to be good enough to do perplexity calculations in a reasonable time frame. In principle, given enough optimization, a quantized KV cache should be faster than an FP16 KV cache because you need less I/O and because int8 operations are faster than FP16 operations. However, in the medium term quantized KV caches will be slower than FP16 on GPUs with tensor cores. I will need to first read up on how to utilize tensor cores with PTX (instead of
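As a rough illustration of the int8 vs. FP16 point (a minimal sketch, not the kernels in this PR): `__dp4a` performs four int8 multiply-accumulates per instruction, while packed FP16 math via `__hfma2` handles two values per instruction.

```cuda
#include <cuda_fp16.h>

// Minimal illustrative sketch, not code from this PR.

// Four int8 multiply-accumulates in a single instruction (sm_61+);
// a_packed and b_packed each hold 4 signed 8-bit values.
__device__ int dot_q8(int a_packed, int b_packed, int acc) {
    return __dp4a(a_packed, b_packed, acc);
}

// The packed FP16 equivalent only processes 2 values per instruction.
__device__ __half2 fma_f16(__half2 a, __half2 b, __half2 acc) {
    return __hfma2(a, b, acc);
}
```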
If someone's looking to contribute to the research, a NIHS or NINS before/after would be an interesting test. Much like PPL, the results of the test itself are only somewhat useful on their own, but a comparison between quantized and unquantized would be really useful as a metric.
What do you mean by NIHS and NINS?
Sorry lol, contextless acronyms are always a bad call. Needle in haystack and needle in needlestack.
This PR has become obsolete. |
This PR adds a simple implementation of a quantized KV cache for research purposes only. The goal is not to provide an implementation that could be merged or that is suitable for regular use, but rather to provide a minimal implementation for doing perplexity calculations with CUDA. This is to investigate the impact of a quantized KV cache on generation quality vs. the impact of quantized weights. Presumably not all quantization formats/combinations make sense to actually use, which is relevant information for cutting down on the significant compilation time you would incur if you compiled 36 different kernel versions to accommodate all of the current quantization combinations.
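To make the "36 different kernel versions" point concrete, a hypothetical sketch (names are illustrative, not from this PR): with 6 candidate cache types for each of K and V, a kernel templated on both types gets 6 × 6 = 36 instantiations.

```cuda
// Hypothetical illustration of the combinatorial blow-up, not this PR's code.
enum class cache_type { Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, F16 };

// One kernel body is compiled per (type_K, type_V) pair.
template <cache_type type_K, cache_type type_V>
__global__ void attn_with_quantized_kv(const void * K, const void * V /* ... */) {
    // dequantize K/V blocks according to type_K/type_V, then do attention
}

// Instantiating every combination yields 6 * 6 = 36 kernels, which is what
// drives the compilation time mentioned above.
```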
Edit: this PR needs to be compiled with `LLAMA_CUDA_F16=1`.
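For reference, with the Makefile build that would be an invocation along the lines of the following (the exact CUDA enable flag is an assumption and may differ depending on the checkout):

```sh
# Assumed invocation; only LLAMA_CUDA_F16=1 is stated in the PR,
# the CUDA enable flag name depends on the tree.
make LLAMA_CUDA=1 LLAMA_CUDA_F16=1
```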