New optimized kernels #365

casper-hansen · 2024-02-24T22:34:08Z

New kernels from casper-hansen/AutoAWQ_kernels#12 / mit-han-lab/llm-awq#142 that scale better. They are as fast as the ExLlamaV2 kernels at batch size 1 and decode 64% faster at larger batch sizes (measured at batch size 32 with 512-token prefill and decode lengths). They are also more memory efficient, saving up to roughly 0.9 GB of VRAM in the benchmarks below. The new kernels decode much faster than the previous GEMM kernels and should be the new preferred format, although models must be requantized to use them.

Note: The AutoAWQ_kernels PR slightly modified the kernels for Windows compatibility and to fix some small issues, such as float-to-half and half-to-float conversions.

Benchmarks

GPU: NVIDIA GeForce RTX 4090
Model: Mistral 7B Instruct

New kernel

Batch Size	Prefill Length	Decode Length	Prefill tokens/s	Decode tokens/s	Memory (VRAM)
1	32	32	153.108	181.792	4.47 GB (18.90%)
1	64	64	4986.08	184.787	4.48 GB (18.96%)
1	128	128	6445.57	183.334	4.49 GB (19.00%)
1	256	256	7903.41	178.064	4.51 GB (19.08%)
1	512	512	8918.64	168.751	4.55 GB (19.24%)
1	1024	1024	7338.22	152.792	4.63 GB (19.56%)
1	2048	2048	5690.87	128.262	5.63 GB (23.80%)
1	4096	4096	3843.33	129.923	9.82 GB (41.54%)
8	32	32	990	1324	4.50 GB (19.02%)
8	64	64	9628	1323	4.54 GB (19.18%)
8	128	128	9656	1309	4.60 GB (19.45%)
8	256	256	9036	1265	4.72 GB (19.97%)
8	512	512	7992	1191	5.31 GB (22.45%)
8	1024	1024	6922	1070	7.92 GB (33.49%)
8	2048	2048	5520	892	16.90 GB (71.44%)
16	32	32	1997.92	2591.88	4.53 GB (19.14%)
16	64	64	9916.44	2583.3	4.60 GB (19.44%)
16	128	128	9446.14	2547.31	4.72 GB (19.96%)
16	256	256	8663.37	2426.64	5.22 GB (22.08%)
16	512	512	7967.01	2221.12	6.65 GB (28.13%)
16	1024	1024	6924.81	1897.71	11.86 GB (50.14%)
32	32	32	3353.53	4248.61	4.59 GB (19.40%)
32	64	64	9677.23	4154.19	4.72 GB (19.95%)
32	128	128	9109.03	4044.9	5.22 GB (22.06%)
32	256	256	8595.1	3770.58	6.49 GB (27.42%)
32	512	512	7937.57	3275.92	9.34 GB (39.50%)
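
As a rough check of how decode throughput scales with batch size, the sketch below divides the decode tokens/s at each batch size by the batch-size-1 figure, using the 32-token rows of the table above. The name `scaling_factors` is illustrative only and is not part of AutoAWQ.

```python
# Decode tokens/s for the new kernel at prefill/decode length 32,
# copied from the table above (batch size -> tokens/s).
NEW_KERNEL_DECODE_32 = {1: 181.792, 8: 1324.0, 16: 2591.88, 32: 4248.61}


def scaling_factors(decode_tps):
    """Return {batch: (speedup over batch 1, fraction of ideal linear scaling)}."""
    base = decode_tps[1]
    return {
        batch: (tps / base, tps / (base * batch))
        for batch, tps in sorted(decode_tps.items())
    }


if __name__ == "__main__":
    for batch, (speedup, efficiency) in scaling_factors(NEW_KERNEL_DECODE_32).items():
        print(f"batch {batch:>2}: {speedup:5.2f}x batch-1 throughput, "
              f"{efficiency:.0%} of linear scaling")
```

On these rows, batch size 32 reaches roughly 23x the batch-size-1 decode throughput, or about 73% of ideal linear scaling.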

ExLlamaV2

Batch Size	Prefill Length	Decode Length	Prefill tokens/s	Decode tokens/s	Memory (VRAM)
1	32	32	151.36	186.248	4.53 GB (19.15%)
1	64	64	1005.16	186.704	4.54 GB (19.21%)
1	128	128	3447	185.016	4.56 GB (19.27%)
1	256	256	6128.1	179.998	4.58 GB (19.38%)
1	512	512	8355.09	170.445	4.64 GB (19.60%)
1	1024	1024	8192.38	154.214	4.74 GB (20.05%)
1	2048	2048	6539.28	129.506	5.77 GB (24.41%)
1	4096	4096	4280.31	130.978	10.08 GB (42.61%)
8	32	32	1037	1154	4.57 GB (19.32%)
8	64	64	8909	1150	4.62 GB (19.54%)
8	128	128	11235	1141	4.71 GB (19.93%)
8	256	256	11426	1109	4.90 GB (20.71%)
8	512	512	10215	1050	5.56 GB (23.52%)
8	1024	1024	8716	953	8.39 GB (35.48%)
8	2048	2048	6627	808	17.80 GB (75.28%)
16	32	32	1960.73	2048.13	4.61 GB (19.51%)
16	64	64	11457.77	2023.67	4.71 GB (19.92%)
16	128	128	12099.81	1992.66	4.89 GB (20.69%)
16	256	256	11357.77	1917.56	5.47 GB (23.14%)
16	512	512	10438.64	1785.0	7.12 GB (30.12%)
16	1024	1024	8807.95	1565.22	12.77 GB (53.98%)
32	32	32	3370.3	2278.66	4.70 GB (19.89%)
32	64	64	12461.55	2259.29	4.89 GB (20.68%)
32	128	128	12169.56	2226.39	5.47 GB (23.13%)
32	256	256	11562.06	2150.82	6.96 GB (29.41%)
32	512	512	10565.37	1995.27	10.25 GB (43.34%)
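
To make the comparison between the two tables concrete, the sketch below computes the decode speedup and the VRAM saved by the new kernel relative to ExLlamaV2 for the longest context benchmarked at each batch size. The figures are copied from the tables above; the names `NEW_KERNEL`, `EXLLAMA_V2`, and `compare` are illustrative only.

```python
# (batch size, context length) -> (decode tokens/s, VRAM in GB),
# copied from the two benchmark tables above.
NEW_KERNEL = {
    (1, 4096): (129.923, 9.82),
    (8, 2048): (892.0, 16.90),
    (16, 1024): (1897.71, 11.86),
    (32, 512): (3275.92, 9.34),
}
EXLLAMA_V2 = {
    (1, 4096): (130.978, 10.08),
    (8, 2048): (808.0, 17.80),
    (16, 1024): (1565.22, 12.77),
    (32, 512): (1995.27, 10.25),
}


def compare(new, baseline):
    """Return {key: (decode speedup in percent, VRAM saved in GB)}."""
    result = {}
    for key, (new_tps, new_gb) in new.items():
        base_tps, base_gb = baseline[key]
        speedup_pct = (new_tps / base_tps - 1.0) * 100.0
        result[key] = (speedup_pct, base_gb - new_gb)
    return result


if __name__ == "__main__":
    for (batch, length), (pct, saved) in compare(NEW_KERNEL, EXLLAMA_V2).items():
        print(f"batch {batch:>2}, length {length:>4}: "
              f"{pct:+6.1f}% decode, {saved:.2f} GB VRAM saved")
```

On these rows the decode speedup at batch size 32 works out to about 64%, consistent with the figure quoted in the description, while batch size 1 is within about 1% of ExLlamaV2.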

casper-hansen added 8 commits February 24, 2024 22:15

Fix quantization

56120a4

Add gemv fast (new)

0d97169

Add GEMVFast in quantizer and loader

2ef380c

scaled_zeros -> qzeros (standardize name)

5fc7c0e

Enable fused qkv

a8c7a6a

Use correct extension

396786c

Minor fix

31d9266

Remove triton requirement

f0ed32c

casper-hansen merged commit 68c727a into main Feb 24, 2024

casper-hansen mentioned this pull request Feb 24, 2024

AWQ: Implement new kernels (64% faster decoding) vllm-project/vllm#3025

Closed

nivibilla mentioned this pull request Feb 24, 2024

64% faster awq sgl-project/sglang#229

Closed

casper-hansen mentioned this pull request Mar 8, 2024

[WIP] AWQ Faster Kernels vllm-project/vllm#3289

Draft

3 tasks

casper-hansen deleted the new_kernels branch March 11, 2024 14:16

orendar mentioned this pull request Mar 18, 2024

[Feature] Add new AWQ kernels InternLM/lmdeploy#1301

Closed

garycaokai mentioned this pull request Mar 29, 2024

Fix gemv_fast model loading casper-hansen/vllm#1

Merged

thincal mentioned this pull request Apr 8, 2024

Upgrade to AWQ kernels v0.0.6 predibase/lorax#394

Open
