vulkan: matmul dequantization improvements #12015

Merged: 5 commits from netrunnereve:vulkan_mm into ggml-org:master on Feb 28, 2025

Conversation

netrunnereve
Collaborator

This basically makes the mul_mm shaders load and dequantize 4 or 8 values at a time, the same way it's done in mat_vec (old quants only).
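As a rough illustration of the idea, here is a minimal C sketch (not the actual GLSL shader code; the struct and helper names are made up for this example): a q4_0 block stores one scale plus 32 packed 4-bit quants, so a single 32-bit load yields 8 nibbles that can all be unpacked and scaled together instead of being fetched one value at a time.

```c
#include <stdint.h>
#include <string.h>

// Illustrative q4_0-style block: one scale plus 32 packed 4-bit quants.
// (ggml stores the scale as fp16; a float is used here to keep the sketch self-contained.)
typedef struct {
    float   d;       // block scale
    uint8_t qs[16];  // byte i holds quant i in the low nibble and quant i+16 in the high nibble
} block_q4_0_demo;

// Dequantize 8 quants from a single 32-bit load instead of 8 separate byte loads.
// vals_lo gets quants [4*word .. 4*word+3], vals_hi the matching quants 16 positions later.
static void dequant8_q4_0(const block_q4_0_demo *b, int word, float vals_lo[4], float vals_hi[4]) {
    uint32_t packed;
    memcpy(&packed, b->qs + 4 * word, sizeof packed); // one load covers 8 nibbles

    for (int i = 0; i < 4; ++i) {
        const uint32_t byte = (packed >> (8 * i)) & 0xFFu;
        vals_lo[i] = b->d * (float)((int)(byte & 0x0Fu) - 8); // low nibble
        vals_hi[i] = b->d * (float)((int)(byte >> 4)    - 8); // high nibble (offset +16 in the block)
    }
}
```

The shaders do the equivalent with packed loads and unpack8/bit shifts; the win is simply that the index math and memory traffic are amortized over 4 or 8 values per iteration.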

Results on my RX 470:

PR:

| model | size | params | backend | ngl | threads | main_gpu | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | pp512 | 158.37 ± 0.80 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | pp512 | 153.76 ± 0.52 |
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   38 runs - 26996.37 us/run -  60.13 GFLOP/run -   2.23 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   38 runs - 26764.32 us/run -  60.13 GFLOP/run -   2.25 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   34 runs - 30210.91 us/run -  60.13 GFLOP/run -   1.99 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   36 runs - 29015.64 us/run -  60.13 GFLOP/run -   2.07 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   36 runs - 27984.17 us/run -  60.13 GFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 36 runs - 28179.08 us/run -  60.13 GFLOP/run -   2.13 TFLOPS

Master:

| model | size | params | backend | ngl | threads | main_gpu | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | pp512 | 151.66 ± 0.86 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | pp512 | 149.71 ± 0.14 |
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   36 runs - 28187.53 us/run -  60.13 GFLOP/run -   2.13 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   36 runs - 28343.00 us/run -  60.13 GFLOP/run -   2.12 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   32 runs - 31629.72 us/run -  60.13 GFLOP/run -   1.90 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   34 runs - 30898.97 us/run -  60.13 GFLOP/run -   1.95 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   36 runs - 28930.81 us/run -  60.13 GFLOP/run -   2.08 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 36 runs - 28959.25 us/run -  60.13 GFLOP/run -   2.08 TFLOPS

I'm only seeing a small improvement as most of the GPU time is spent doing the actual multiplication, and I think we'll see better results on something that supports coopmat.

@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 21, 2025
@jeffbolznv
Collaborator

I did a quick run on RTX 4070 using the KHR_coopmat path (GGML_VK_DISABLE_COOPMAT2=1). Perf is about neutral on average, maybe down a tiny bit?

before
  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   332 runs -  3023.55 us/run -  60.13 GFLOP/run -  19.89 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   322 runs -  3114.34 us/run -  60.13 GFLOP/run -  19.31 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  776 runs -  1289.13 us/run -  60.13 GFLOP/run -  46.64 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  748 runs -  1338.91 us/run -  60.13 GFLOP/run -  44.91 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  674 runs -  1485.07 us/run -  60.13 GFLOP/run -  40.49 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  670 runs -  1493.24 us/run -  60.13 GFLOP/run -  40.27 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  632 runs -  1585.79 us/run -  60.13 GFLOP/run -  37.92 TFLOPS
  
after
  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   322 runs -  3118.21 us/run -  60.13 GFLOP/run -  19.28 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   320 runs -  3138.63 us/run -  60.13 GFLOP/run -  19.16 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  734 runs -  1365.62 us/run -  60.13 GFLOP/run -  44.03 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  660 runs -  1515.89 us/run -  60.13 GFLOP/run -  39.67 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  710 runs -  1409.35 us/run -  60.13 GFLOP/run -  42.66 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  708 runs -  1414.56 us/run -  60.13 GFLOP/run -  42.51 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  650 runs -  1542.48 us/run -  60.13 GFLOP/run -  38.98 TFLOPS

The backend tests all passed.

@netrunnereve
Collaborator Author

> Perf is about neutral on average, maybe down a tiny bit?

Interesting. Let's wait for some more results.

@0cc4m
Collaborator

0cc4m commented Feb 25, 2025

Here are my results (columns: master TFLOPS, PR TFLOPS, difference):

Nvidia RTX 3090
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 3090
  Device memory: 24576 MB (24576 MB free)


MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     23.12 TFLOPS       23.37 TFLOPS        0.25 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     23.37 TFLOPS       23.68 TFLOPS        0.31 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    46.43 TFLOPS       46.16 TFLOPS       -0.27 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    44.80 TFLOPS       44.17 TFLOPS       -0.63 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    41.93 TFLOPS       46.39 TFLOPS        4.46 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    42.65 TFLOPS       46.68 TFLOPS        4.03 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    40.53 TFLOPS       46.34 TFLOPS        5.81 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    42.55 TFLOPS       42.92 TFLOPS        0.37 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    33.67 TFLOPS       34.42 TFLOPS        0.75 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    34.69 TFLOPS       35.68 TFLOPS        0.99 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    28.03 TFLOPS       28.10 TFLOPS        0.07 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    34.40 TFLOPS       34.75 TFLOPS        0.35 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 26.45 TFLOPS       28.34 TFLOPS        1.89 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  29.01 TFLOPS       27.52 TFLOPS       -1.49 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   26.05 TFLOPS       25.80 TFLOPS       -0.25 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 25.85 TFLOPS       28.14 TFLOPS        2.29 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   32.20 TFLOPS       27.11 TFLOPS       -5.09 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   22.22 TFLOPS       23.41 TFLOPS        1.19 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  41.81 TFLOPS       29.14 TFLOPS      -12.67 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   27.23 TFLOPS       26.45 TFLOPS       -0.78 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  35.02 TFLOPS       30.61 TFLOPS       -4.41 TFLOPS
AMD Radeon Pro VII
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: AMD Radeon (TM) Pro VII (RADV VEGA20)
  Device memory: 16368 MB (16368 MB free)


MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      4.96 TFLOPS        4.93 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      4.21 TFLOPS        4.19 TFLOPS       -0.02 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.15 TFLOPS        4.32 TFLOPS        0.17 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.16 TFLOPS        4.32 TFLOPS        0.16 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.95 TFLOPS        4.05 TFLOPS        0.10 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.01 TFLOPS        4.10 TFLOPS        0.09 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.11 TFLOPS        4.18 TFLOPS        0.07 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.92 TFLOPS        3.91 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.61 TFLOPS        3.60 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.59 TFLOPS        3.60 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.44 TFLOPS        3.43 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.62 TFLOPS        3.60 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  3.67 TFLOPS        3.65 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3.74 TFLOPS        3.71 TFLOPS       -0.03 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.07 TFLOPS        3.02 TFLOPS       -0.05 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  3.65 TFLOPS        3.67 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.85 TFLOPS        3.87 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.66 TFLOPS        3.66 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4.14 TFLOPS        4.29 TFLOPS        0.15 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.71 TFLOPS        3.69 TFLOPS       -0.02 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3.79 TFLOPS        3.79 TFLOPS        0.00 TFLOPS
Intel A770
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(tm) A770 Graphics (DG2)
  Device memory: 16032 MB (16032 MB free)


MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.52 TFLOPS        1.52 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.44 TFLOPS        1.44 TFLOPS        0.00 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.22 TFLOPS        1.41 TFLOPS        0.19 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.18 TFLOPS        1.39 TFLOPS        0.21 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.14 TFLOPS        1.32 TFLOPS        0.18 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.14 TFLOPS        1.38 TFLOPS        0.24 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.14 TFLOPS        1.27 TFLOPS        0.13 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.34 TFLOPS        1.33 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.31 TFLOPS        1.32 TFLOPS        0.01 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.33 TFLOPS        1.33 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.32 TFLOPS        1.32 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.30 TFLOPS        1.30 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  1.13 TFLOPS        1.12 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1.32 TFLOPS        1.31 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.17 TFLOPS        1.16 TFLOPS       -0.01 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  1.16 TFLOPS        1.18 TFLOPS        0.02 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.26 TFLOPS        1.26 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.33 TFLOPS        1.33 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1.19 TFLOPS        1.35 TFLOPS        0.16 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.30 TFLOPS        1.30 TFLOPS        0.00 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1.32 TFLOPS        1.32 TFLOPS        0.00 TFLOPS

Looks like it's mostly perf-neutral on AMD and Intel, probably since they are compute-limited. Some minor improvements. But on RTX 3090 it makes a much larger difference. Looks good for legacy and k-quants, but iq quants seem to be negative. Any ideas?

@netrunnereve
Collaborator Author

> But on RTX 3090 it makes a much larger difference. Looks good for legacy and k-quants, but iq quants seem to be negative. Any ideas?

I mean I didn't touch iq1_s and iq4_xs at all, so I don't know what's going on there. My change only affects the legacy quants and iq4_nl.

Anyway, I pushed a new update for iq4_nl which might help a bit, though it's only giving me a 1% improvement on my end. On AMD, unpack8 is normally faster since it compiles to an instruction that converts an unsigned byte into a float in a single cycle. Since iq4_nl has no float conversion, the unpack compiles to a regular bitfieldExtract and we also add the overhead of all the bit fiddling. Then again I'll be rolling my eyes if this causes a 30% difference, and I don't know if this is even relevant for Nvidia...
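For context, here is a minimal C sketch of the difference between the two formats (not the shader code; the function names are invented for this example): q4_0 converts each nibble directly to a float via a scale and offset, which maps onto the fast byte-to-float conversion, while iq4_nl only uses the nibble as an index into a non-linear codebook, so the vectorized unpack reduces to a shift/mask before a table lookup. The codebook values below are as defined in ggml (kvalues_iq4nl).

```c
#include <stdint.h>

// q4_0: the nibble itself is the (offset) quant, so dequantization is a direct
// integer -> float conversion plus a multiply.
static inline float dequant_q4_0_nibble(float d, uint8_t nibble) {
    return d * (float)((int)(nibble & 0x0F) - 8);
}

// iq4_nl: the nibble is only an index into a non-linear codebook, so there is no
// integer -> float conversion to vectorize; unpacking is just a bitfieldExtract-style
// shift/mask followed by a table read. Values as in ggml's kvalues_iq4nl table.
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113,
};

static inline float dequant_iq4_nl_nibble(float d, uint8_t nibble) {
    return d * (float)kvalues_iq4nl[nibble & 0x0F];
}
```

This lines up with the follow-up commit "dont use unpack for iq4_nl": the legacy quants keep the vectorized path, while iq4_nl sticks with plain nibble extraction.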

@0cc4m
Collaborator

0cc4m commented Feb 28, 2025

I reran the benchmarks and it looks good now. The RTX 3090 coopmat results seem to vary a lot between runs, so the negative delta was maybe just variance.

@0cc4m 0cc4m merged commit fbeda90 into ggml-org:master Feb 28, 2025
47 checks passed
@netrunnereve netrunnereve deleted the vulkan_mm branch February 28, 2025 21:26
@stduhpf
Contributor

stduhpf commented Mar 4, 2025

> On AMD unpack8 is normally faster

It might be faster, but it's also not working reliably on RDNA1 right now.

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Mar 4, 2025
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
* faster dequant for old quants

* dont use unpack for iq4_nl

* vec2 unpack for q8