
≈65% speedup of the AVX-512 implementation of ggml_vec_dot_q4_0() #933

Merged
1 commit merged into ggerganov:master on Apr 17, 2023

Conversation

dfyz
Collaborator

@dfyz dfyz commented Apr 13, 2023

Apologies for the slightly clickbaity title: it is technically true, but, as mentioned in this comment, the current AVX-512 implementation is slower than the regular AVX2 implementation. Compared to the current AVX2 implementation, the new AVX-512 implementation only gets a ≈35% speedup on modern hardware.

The measurements below were made on a laptop with a Tiger Lake CPU (i7-1165G7). 8 threads were used.

LLaMA 7B (./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 256)

Implementation | Time per token, ms (average over 3 runs)
---------------|-----------------------------------------
AVX512, old    | 184
AVX2           | 149
AVX512, new    | 110

LLaMA 13B (./main -m ./models/13B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 256)

Implementation | Time per token, ms (average over 3 runs)
---------------|-----------------------------------------
AVX512, old    | 354
AVX2           | 282
AVX512, new    | 205

For clarity, I only include the per-token generation time, but the prompt-processing improvements are very similar.

These speedup percentages are, of course, an oversimplification:

  • I report end-to-end token generation timings here. While regular BLAS-less inference is bottlenecked on ggml_vec_dot_q4_0(), it also does quite a lot of other stuff.
  • This is gcc 12.2.1; the results with other compilers can vary (but not too much). clang, in particular, does some "optimizations" that only hurt performance (fortunately, not in a major way).
  • AVX-512 is a fragmented ecosystem of extensions. My code makes use of the VBMI and VNNI extensions when available. If they are not present, the code runs slightly slower.

However, my microbenchmark (./build.sh && ./main on a Linux system with libbenchmark and gcc installed) suggests solid improvements over both the AVX2 and old AVX-512 implementations across a variety of CPUs. Unless I screwed up something major, the improvements in the microbenchmark should directly lead to improvements in generation. I would appreciate any performance measurements with this PR applied.

Implementation-wise, the basic idea is that built-in masking and register-wide shuffles in AVX-512 allow us to operate on two Q4_0 blocks at once. I tried to comment the code extensively.
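To make the idea concrete, here is a minimal, hedged sketch (not the code from this PR; the helper name is made up, and it assumes AVX512F + AVX512BW + AVX512VBMI and the 20-byte Q4_0 block layout of the time) of how a masked load plus a single VPERMB shuffle can pull the quant bytes of two blocks into one ZMM register:

#include <immintrin.h>
#include <stdint.h>

// Q4_0 block layout at the time: a 4-byte float scale followed by
// 16 bytes of packed nibbles, 20 bytes per block.
typedef struct {
    float   d;       // scale factor
    uint8_t qs[16];  // 32 4-bit quants, two per byte
} block_q4_0;

static inline __m512i gather_two_blocks_qs(const block_q4_0 *b) {
    // Masked load: pull the 40 bytes covering b[0] and b[1] into one ZMM
    // register; the remaining 24 byte lanes are zeroed.
    const __mmask64 two_blocks = (1ULL << 40) - 1;
    __m512i raw = _mm512_maskz_loadu_epi8(two_blocks, b);

    // Register-wide byte shuffle (VPERMB): gather the 32 quant bytes of both
    // blocks (byte offsets 4..19 and 24..39) into the low 32 lanes.
    static const uint8_t quant_idx[64] = {
         4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
        // the upper 32 lanes are unused in this sketch
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    };
    return _mm512_permutexvar_epi8(_mm512_loadu_si512(quant_idx), raw);
}

In the same spirit, separating low/high nibbles and broadcasting the two scales can also be done with masks and register-wide shuffles, which is what lets a single loop iteration cover two blocks.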

@dfyz
Collaborator Author

dfyz commented Apr 13, 2023

Today, an "official" microbenchmark for ggml_vec_dot_q4_0() was introduced in 95ea26f. It seems to confirm the speed-up (measured on the same Tiger Lake laptop).

`make benchmark` output for 0e07e6a (the "old" AVX-512 version)
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            380422;           30341.90
        1;       1; 11008;  4096;   128;    11542724608;            377708;           30559.92
        2;       1; 11008;  4096;   128;    11542724608;            374219;           30844.84
        3;       1; 11008;  4096;   128;    11542724608;            376902;           30625.27
        4;       1; 11008;  4096;   128;    11542724608;            376930;           30622.99
        5;       1; 11008;  4096;   128;    11542724608;            376601;           30649.74
        6;       1; 11008;  4096;   128;    11542724608;            377351;           30588.83
        7;       1; 11008;  4096;   128;    11542724608;            378604;           30487.59
        8;       1; 11008;  4096;   128;    11542724608;            377307;           30592.39
        9;       1; 11008;  4096;   128;    11542724608;            377734;           30557.81
`make benchmark` output for 0e07e6a with AVX-512 disabled (the AVX2 version)
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            274538;           42044.18
        1;       1; 11008;  4096;   128;    11542724608;            273030;           42276.40
        2;       1; 11008;  4096;   128;    11542724608;            275449;           41905.12
        3;       1; 11008;  4096;   128;    11542724608;            273007;           42279.96
        4;       1; 11008;  4096;   128;    11542724608;            274230;           42091.40
        5;       1; 11008;  4096;   128;    11542724608;            273188;           42251.95
        6;       1; 11008;  4096;   128;    11542724608;            273488;           42205.60
        7;       1; 11008;  4096;   128;    11542724608;            273402;           42218.88
        8;       1; 11008;  4096;   128;    11542724608;            273236;           42244.52
        9;       1; 11008;  4096;   128;    11542724608;            274742;           42012.96
`make benchmark` output for d787348 (the "new" AVX-512 version)
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            193742;           59577.81
        1;       1; 11008;  4096;   128;    11542724608;            189840;           60802.38
        2;       1; 11008;  4096;   128;    11542724608;            192014;           60113.97
        3;       1; 11008;  4096;   128;    11542724608;            191863;           60161.29
        4;       1; 11008;  4096;   128;    11542724608;            191087;           60405.60
        5;       1; 11008;  4096;   128;    11542724608;            191952;           60133.39
        6;       1; 11008;  4096;   128;    11542724608;            192595;           59932.63
        7;       1; 11008;  4096;   128;    11542724608;            190172;           60696.23
        8;       1; 11008;  4096;   128;    11542724608;            190912;           60460.97
        9;       1; 11008;  4096;   128;    11542724608;            190907;           60462.55

I used the following patch for 0e07e6a to force AVX2 instead of AVX-512:

diff --git a/ggml.c b/ggml.c
index 42e3ee3..9c72456 100644
--- a/ggml.c
+++ b/ggml.c
@@ -1967,7 +1967,7 @@ static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void * rest
     }

     sumf = sum0 + sum1;
-#elif defined(__AVX512F__)
+#elif 0 && defined(__AVX512F__)
     // Initialize accumulator with zeros
     __m512 acc0 = _mm512_setzero_ps();
     __m512 acc1 = _mm512_setzero_ps();

I would appreciate if someone with AVX-512-enabled hardware could also run the official microbenchmark to determine if this is worth merging.

@dfyz dfyz reopened this Apr 13, 2023
@ggerganov ggerganov added the performance (Speed related topics) and high priority (Very important issue) labels Apr 13, 2023
@dfyz
Collaborator Author

dfyz commented Apr 13, 2023

FWIW, I didn't intend to close the PR. I merged the latest upstream commits to run the microbenchmark and accidentally force-pushed 0e07e6a (the latest upstream master I used for microbenchmarking) into my branch. GitHub apparently interpreted this as "this PR has already been merged, let's close it". I fixed this now (by force-pushing d787348 and re-opening the PR); sorry for the noise.

@KASR
Contributor

KASR commented Apr 14, 2023

I would appreciate if someone with AVX-512-enabled hardware could also run the official microbenchmark to determine if this is worth merging.

According to the datasheet, my CPU has 2 "AVX-512 FMA Units", so I could give it a try. However, I noticed that the benchmark folder has no CMakeLists.txt, i.e. I can't build it.

I tried to create the CMake file, but this resulted in a bunch of errors when building on Windows/CMake/VS... do you have any tips on how to add the CMake file for the benchmark?

I think I was able to build benchmark-q4_0-matmult: I added the following CMakeLists.txt and added the folder to the build directories.

set(TARGET benchmark-q4_0-matmult)
add_executable(${TARGET} benchmark-q4_0-matmult.cpp)
target_link_libraries(${TARGET} PRIVATE ggml ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)

Below you can find the outputs for both the up-to-date branch and your commit. However, I think I'm doing something wrong (or something stupid 😅): I do not obtain any output from the benchmark when I build with AVX512-VBMI enabled.

benchmark when using 43ddef, all settings default
PS C:\DATA\TestLLama\llama.cpp-master\build\bin\Release> .\benchmark-q4_0-matmult.exe
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            390763;           29538.94
        1;       1; 11008;  4096;   128;    11542724608;            392531;           29405.89
        2;       1; 11008;  4096;   128;    11542724608;            389934;           29601.74
        3;       1; 11008;  4096;   128;    11542724608;            379255;           30435.26
        4;       1; 11008;  4096;   128;    11542724608;            398684;           28952.06
        5;       1; 11008;  4096;   128;    11542724608;            383994;           30059.65
        6;       1; 11008;  4096;   128;    11542724608;            384464;           30022.90
        7;       1; 11008;  4096;   128;    11542724608;            377635;           30565.82
        8;       1; 11008;  4096;   128;    11542724608;            387380;           29796.90
        9;       1; 11008;  4096;   128;    11542724608;            384728;           30002.30

benchmark when using d787348 with AVX512 = ON | AVX512-VBMI = OFF | AVX512-VNNI = OFF
PS C:\DATA\TestLLama\llama.cpp-master-dfyz\build\bin\Release> .\benchmark-q4_0-matmult.exe
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            302760;           38125.00
        1;       1; 11008;  4096;   128;    11542724608;            289438;           39879.78
        2;       1; 11008;  4096;   128;    11542724608;            298912;           38615.80
        3;       1; 11008;  4096;   128;    11542724608;            303842;           37989.23
        4;       1; 11008;  4096;   128;    11542724608;            315717;           36560.35
        5;       1; 11008;  4096;   128;    11542724608;            298674;           38646.57
        6;       1; 11008;  4096;   128;    11542724608;            298313;           38693.34
        7;       1; 11008;  4096;   128;    11542724608;            301219;           38320.04
        8;       1; 11008;  4096;   128;    11542724608;            301670;           38262.75
        9;       1; 11008;  4096;   128;    11542724608;            299335;           38561.23
benchmark when using d787348 with AVX512 = ON | AVX512-VBMI = OFF | AVX512-VNNI = ON
PS C:\DATA\TestLLama\llama.cpp-master-dfyz\build\bin\Release> .\benchmark-q4_0-matmult.exe
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            272608;           42341.84
        1;       1; 11008;  4096;   128;    11542724608;            272676;           42331.28
        2;       1; 11008;  4096;   128;    11542724608;            265465;           43481.15
        3;       1; 11008;  4096;   128;    11542724608;            271353;           42537.67
        4;       1; 11008;  4096;   128;    11542724608;            267103;           43214.51
        5;       1; 11008;  4096;   128;    11542724608;            269835;           42776.97
        6;       1; 11008;  4096;   128;    11542724608;            268411;           43003.92
        7;       1; 11008;  4096;   128;    11542724608;            271905;           42451.31
        8;       1; 11008;  4096;   128;    11542724608;            269273;           42866.25
        9;       1; 11008;  4096;   128;    11542724608;            263927;           43734.54
benchmark when using d787348 with AVX512 = ON | AVX512-VBMI = ON | AVX512-VNNI = OFF
PS C:\DATA\TestLLama\llama.cpp-master-dfyz\build\bin\Release> .\benchmark-q4_0-matmult.exe
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
PS C:\DATA\TestLLama\llama.cpp-master-dfyz\build\bin\Release>

@unbounded
Collaborator

unbounded commented Apr 14, 2023

I can confirm that this is a big speedup for me, well done!

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

master branch:

(  187.62 ms per token)

...

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            491766;           23471.99
        1;       1; 11008;  4096;   128;    11542724608;            492068;           23457.58
        2;       1; 11008;  4096;   128;    11542724608;            492413;           23441.14
        3;       1; 11008;  4096;   128;    11542724608;            493280;           23399.95
        4;       1; 11008;  4096;   128;    11542724608;            494672;           23334.10
        5;       1; 11008;  4096;   128;    11542724608;            496268;           23259.05
        6;       1; 11008;  4096;   128;    11542724608;            495874;           23277.54
        7;       1; 11008;  4096;   128;    11542724608;            496583;           23244.30
        8;       1; 11008;  4096;   128;    11542724608;            497316;           23210.04
        9;       1; 11008;  4096;   128;    11542724608;            497237;           23213.73

d787348:

(  102.19 ms per token)

...

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            250094;           46153.54
        1;       1; 11008;  4096;   128;    11542724608;            249826;           46203.05
        2;       1; 11008;  4096;   128;    11542724608;            251181;           45953.81
        3;       1; 11008;  4096;   128;    11542724608;            252217;           45765.05
        4;       1; 11008;  4096;   128;    11542724608;            252019;           45801.01
        5;       1; 11008;  4096;   128;    11542724608;            252224;           45763.79
        6;       1; 11008;  4096;   128;    11542724608;            252719;           45674.14
        7;       1; 11008;  4096;   128;    11542724608;            252265;           45756.35
        8;       1; 11008;  4096;   128;    11542724608;            252170;           45773.58
        9;       1; 11008;  4096;   128;    11542724608;            252864;           45647.96

It does make me think we are spending a lot of extra effort shuffling values around because of the memory layout.
How fast could this be if we only did aligned loads of contiguous data, e.g. if each row were all the nibbles followed by all the scales?
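For concreteness, a hypothetical "planar" row layout along these lines might look like the sketch below (illustrative only; ROW_N and the type name are made up, and this is not what ggml actually uses):

#include <stdint.h>

#define QK    32      // values per Q4_0 block (as in ggml at the time)
#define ROW_N 4096    // example row length, hypothetical

// All nibbles of a row stored contiguously, followed by all the scales, so
// that wide loads of quant data never straddle a scale field and the row as
// a whole can be padded to the desired alignment.
typedef struct {
    uint8_t qs[ROW_N / 2];   // packed nibbles, two values per byte
    float   d[ROW_N / QK];   // one scale per block of 32 values
} row_q4_0_planar;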

@dfyz
Collaborator Author

dfyz commented Apr 14, 2023

@KASR
Regarding AVX512-VBMI: I don't think you're doing anything wrong or stupid; it's probably just that your CPU is Cascade Lake, which doesn't support VBMI (according to the "Microarchitecture support" table here), so the benchmark traps on the first VBMI instruction the CPU tries to execute. I don't have a non-VBMI machine with Windows installed, so I wrote a simple program that tries to use AVX512-BF16 instructions (which are not supported by my CPU) to see how Windows behaves:

#include <immintrin.h>
#include <iostream>

int main() {
	__m512 x = _mm512_set1_ps(1.0f);
	__m512bh y = _mm512_set1_epi8(2);
	__m512bh z = _mm512_set1_epi8(3);	
	char out[64]{};
	_mm512_storeu_ps(out, _mm512_dpbf16_ps(x, y, z));
	std::cout << static_cast<int>(out[1]) << std::endl;	
}

I compiled it with cl /O2 /EHsc /arch:AVX512 /Zi test.cpp, and it indeed exited silently without printing anything:

PS D:\dfyz\tmp> .\test.exe
PS D:\dfyz\tmp>

When I run the program under a debugger, I can see that it traps:

0:000> g
(3794.2cdc): Illegal instruction - code c000001d (first chance)
(3794.2cdc): Illegal instruction - code c000001d (!!! second chance !!!)
test!main+0x2f:
00007ff7`45d4776f 62f2764852d0    vdpbf16ps zmm2,zmm1,zmm0

So I believe this is the reason you "do not obtain any output from the benchmark when I build with AVX512-VBMI enabled". There's a LLAMA_NATIVE option in CMakeLists.txt which should take care of setting the compiler flags automatically, but it doesn't work on MSVC right now. So I think that, unfortunately, you still have to manually select the proper CMake options when building llama.cpp (-DAVX512=ON + -DAVX512-VNNI=ON in your case).

@dfyz
Collaborator Author

dfyz commented Apr 14, 2023

It does make me think we are spending a lot of extra effort shuffling values around because of the memory layout. How fast could this be if we only did aligned loads of contiguous data, e.g. if each row were all the nibbles followed by all the scales?

That sounds like a neat idea. I tried implementing a quick proof-of-concept to see how fast it can potentially be (for AVX-512 only), but it appears to be harder than I thought:

  • if we only load 32 bytes of contiguous nibbles per iteration (2 blocks, as I currently do), we can't use aligned loads, since every other load will not be aligned to 64 bytes (the size of an AVX-512 register)
  • if we load 64 bytes of nibbles per iteration (4 blocks), we need to use two registers for the unpacked nibbles, and we also need to account for the fact that the number of blocks might not be divisible by 4 (the current code only assumes it is divisible by 2)
  • the number of bytes in a row might not be divisible by 64 (e.g., a row consisting of just two blocks takes 40 bytes), so we'd also need to introduce additional padding between rows to make the loads aligned (see the byte-math sketch after this list)
  • I still can't think of anything better than VPERMB and VPERMPS to shuffle the nibbles/scales across the register, even when we load contiguous data
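To make the byte math behind the first and third points concrete, here is a tiny illustrative program (assumed numbers only, matching the 20-byte Q4_0 block of the time):

#include <stdio.h>

int main(void) {
    // First point: with contiguously stored nibbles, 32 bytes cover 2 blocks,
    // so consecutive loads start at offsets 0, 32, 64, 96, ... and every
    // other one misses 64-byte alignment.
    for (int i = 0; i < 4; i++) {
        printf("load %d at offset %3d -> %s\n", i, 32 * i,
               (32 * i) % 64 == 0 ? "aligned" : "unaligned");
    }
    // Third point: a row of just two blocks is 2 * 20 = 40 bytes, which is
    // not a multiple of 64, so rows would need padding to stay aligned.
    int row_bytes = 2 * 20;
    printf("2-block row = %d bytes, padding to 64 = %d bytes\n",
           row_bytes, (64 - row_bytes % 64) % 64);
    return 0;
}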

At this point I gave up. If anyone wants to take a stab at this, that would be great.

@dfyz
Collaborator Author

dfyz commented Apr 14, 2023

Here are the benchmark results so far, summarized (the value is the average of FLOPS_per_u_Second from 10 iterations of test 2 in benchmark-q4_0-matmult):

Who        | CPU                  | CPU Family               | AVX-512, old | AVX2     | AVX-512, new | Speedups
-----------|----------------------|--------------------------|--------------|----------|--------------|-------------------------------------------
@dfyz      | Intel Core i7-1165G7 | Tiger Lake (VBMI + VNNI) | 30587.13     | 42153.10 | 60274.68     | 97% against old AVX-512, 43% against AVX2
@KASR      | Intel Xeon W-2295    | Cascade Lake (VNNI)      | ???          | 29838.15 | 42873.94     | ??? against old AVX-512, 44% against AVX2
@unbounded | ???                  | ??? (VBMI + VNNI)        | 23330.94     | ???      | 45849.23     | 97% against old AVX-512, ??? against AVX2

I think that at this point it is only worth benchmarking the new AVX-512 implementation against the current AVX2 implementation, since the current AVX-512 implementation is consistently worse than the current AVX2 one.

I know of two ways to force AVX2 for matrix multiplication:

  • Apply the patch from this comment before running make benchmark. This is for Linux/macOS users.
  • Add CMakeLists.txt from this comment to the benchmark folder and build the benchmark with cmake (the default CMake options do not include AVX-512). This is for Windows users.

Owner

@ggerganov ggerganov left a comment


I cannot test this as I don't have the hardware.
The reported results are encouraging, so I think it is OK to merge this and continue to support AVX512 for now.

@ultoris

ultoris commented Apr 15, 2023

The test on Raptor Lake fails, as Intel has dropped support for AVX-512 in the latest generations of consumer CPUs. It supports AVX-VNNI though (256-bit instructions instead of the 512-bit ones in AVX-512); would it be possible to use it?

@dfyz
Collaborator Author

dfyz commented Apr 15, 2023

Fixed a trivial merge conflict after 0ad9646. Otherwise, nothing has changed.

For posterity's sake: one thing that really bothers me about this PR is that I can't use the _mm*_sign_epi8() trick from the AVX implementation, because apparently VPSIGNB doesn't have an AVX-512 version for some reason. So I have to do two multiplications instead of one.
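For context, here is a hedged sketch of the AVX2-era trick in question (illustrative only; the helper name is made up). VPMADDUBSW wants one unsigned and one signed operand, and VPSIGNB makes that work with a single multiply per register pair:

#include <immintrin.h>

// |x| * (y with the sign of x) gives the same products as x * y, so a single
// VPMADDUBSW (unsigned x signed) suffices.
static inline __m256i signed_dot_products_avx2(__m256i x, __m256i y) {
    __m256i ax = _mm256_sign_epi8(x, x);   // |x|
    __m256i sy = _mm256_sign_epi8(y, x);   // y with the sign of x
    return _mm256_maddubs_epi16(ax, sy);   // 16 signed 16-bit partial sums
}

Since VPSIGNB has no EVEX (AVX-512) encoding, this one-multiply formulation has no direct 512-bit counterpart.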

@Ameobea As the author of the original AVX-512 implementation (in #320), would you be willing to take a look at this PR before I merge it? I would appreciate any feedback (will make separate PRs if necessary). Also, you mentioned here that you have an AMD CPU supporting AVX-512. It would be interesting to run the benchmarks on an AMD CPU, because so far we only have measurements for Intel.

@dfyz
Collaborator Author

dfyz commented Apr 15, 2023

@ultoris

The test on Raptor Lake fails, as Intel has dropped support for AVX-512 in the latest generations of consumer CPUs. It supports AVX-VNNI though (256-bit instructions instead of the 512-bit ones in AVX-512); would it be possible to use it?

I think it's possible. The current AVX2 implementation (where AVX-VNNI would belong) multiplies 16-bit integers (split across two different registers) instead of 8-bit integers, so we would only save one instruction instead of two. However, the patch should be just a couple of lines (i.e., replace 2 * _mm256_madd_epi16() + _mm256_add_epi32() with 2 * _mm256_dpwssd_avx_epi32() if AVX-VNNI is available).
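As a hedged sketch of that couple-of-lines change (illustrative only; the function names are made up, and the AVX-VNNI path assumes compiling with -mavxvnni):

#include <immintrin.h>

// Current AVX2 path mentioned above: two VPMADDWD multiplies plus one add.
static inline __m256i dot_pairs_avx2(__m256i x0, __m256i y0,
                                     __m256i x1, __m256i y1) {
    return _mm256_add_epi32(_mm256_madd_epi16(x0, y0),
                            _mm256_madd_epi16(x1, y1));
}

// Possible AVX-VNNI replacement: two VPDPWSSD instructions, no separate add.
static inline __m256i dot_pairs_avx_vnni(__m256i x0, __m256i y0,
                                         __m256i x1, __m256i y1) {
    __m256i acc = _mm256_dpwssd_avx_epi32(_mm256_setzero_si256(), x0, y0);
    return _mm256_dpwssd_avx_epi32(acc, x1, y1);
}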

I don't have access to any AVX-VNNI-enabled hardware, though, so I can't test/benchmark it. I can make a separate PR if you're willing to help test it.

@ultoris

ultoris commented Apr 15, 2023

I don't have access to any AVX-VNNI-enabled hardware, though, so I can't test/benchmark it. I can make a separate PR if you're willing to help test it.

Sure. I can run it on an i7-13700.

@0x131315

0x131315 commented Apr 17, 2023

It would be interesting to run the benchmarks on an AMD CPU, because so far we only have measurements for Intel.

I have an AMD 7950X, but I need instructions to run the benchmark: "make benchmark" on dfyz:master is not working; there is no such make target.

@dfyz
Collaborator Author

dfyz commented Apr 17, 2023

"make benchmark" on dfyz:master not working, its no make target

Huh, this is strange. The head of dfyz:master was 4f46a13 when I was writing that comment, and it did have the benchmark target, so make benchmark should just work.

Anyway, I just rebased onto the latest master and it appears that ggml_vec_dot_q4_0_q8_0() is now the default function that is used for Q4_0 models (both for benchmarking and inference) after e95b655 (I thought it would require an explicit option to enable). This means that ggml_vec_dot_q4_0() is not worth optimizing/benchmarking anymore.

I think I'm going to merge this as is, and then work on optimizing ggml_vec_dot_q4_0_q8_0() in a separate PR. The AVX-512 tricks used in this implementation should be easily portable to ggml_vec_dot_q4_0_q8_0().

@dfyz dfyz merged commit f266259 into ggerganov:master Apr 17, 2023