
≈65% speedup of the AVX-512 implementation of ggml_vec_dot_q4_0() #933

Merged
1 commit merged into ggerganov:master on Apr 17, 2023

Conversation

dfyz
Collaborator

@dfyz dfyz commented Apr 13, 2023

Apologies for the slightly clickbaity title: it is technically true, but, as mentioned in this comment, the current AVX-512 implementation is slower than the regular AVX2 implementation. Compared to the current AVX2 implementation, the new AVX-512 implementation only gets a ≈35% speedup on modern hardware.

The measurements below were made on a laptop with a Tiger Lake CPU (i7-1165G7). 8 threads were used.

LLaMA 7B (./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 256)

Implementation | Time per token, ms (average over 3 runs)
---------------|-----------------------------------------
AVX512, old    | 184
AVX2           | 149
AVX512, new    | 110

LLaMA 13B (./main -m ./models/13B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 256)

Implementation | Time per token, ms (average over 3 runs)
---------------|-----------------------------------------
AVX512, old    | 354
AVX2           | 282
AVX512, new    | 205

For clarity, I only include the per-token generation time, but the prompt-processing improvements are very similar.

These speedup percentages are, of course, an oversimplification:

  • I report end-to-end token generation timings here. While regular BLAS-less inference is bottlenecked on ggml_vec_dot_q4_0(), it also does quite a lot of other stuff.
  • This is gcc 12.2.1; the results with other compilers can vary (but not too much). clang, in particular, does some "optimizations" that only hurt performance (fortunately, not in a major way).
  • AVX-512 is a fragmented ecosystem of extensions. My code makes use of the VBMI and VNNI extensions when available. If they are not present, the code runs slightly slower.

However, my microbenchmark (./build.sh && ./main on a Linux system with libbenchmark and gcc installed) suggests solid improvements over both the AVX2 and old AVX-512 implementations across a variety of CPUs. Unless I screwed up something major, the improvements in the microbenchmark should directly lead to improvements in generation. I would appreciate any performance measurements with this PR applied.

Implementation-wise, the basic idea is that built-in masking and register-wide shuffles in AVX-512 allow us to operate on two Q4_0 blocks at once. I tried to comment the code extensively.
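To make the idea concrete, here is a minimal, hedged sketch (not the code from this PR; the helper name is made up, and it assumes AVX512F + AVX512BW + AVX512VBMI and the 20-byte Q4_0 block layout of the time) of how a masked load plus a single VPERMB shuffle can pull the quant bytes of two blocks into one ZMM register:

#include <immintrin.h>
#include <stdint.h>

// Q4_0 block layout at the time: a 4-byte float scale followed by
// 16 bytes of packed nibbles, 20 bytes per block.
typedef struct {
    float   d;       // scale factor
    uint8_t qs[16];  // 32 4-bit quants, two per byte
} block_q4_0;

static inline __m512i gather_two_blocks_qs(const block_q4_0 *b) {
    // Masked load: pull the 40 bytes covering b[0] and b[1] into one ZMM
    // register; the remaining 24 byte lanes are zeroed.
    const __mmask64 two_blocks = (1ULL << 40) - 1;
    __m512i raw = _mm512_maskz_loadu_epi8(two_blocks, b);

    // Register-wide byte shuffle (VPERMB): gather the 32 quant bytes of both
    // blocks (byte offsets 4..19 and 24..39) into the low 32 lanes.
    static const uint8_t quant_idx[64] = {
         4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
        // the upper 32 lanes are unused in this sketch
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    };
    return _mm512_permutexvar_epi8(_mm512_loadu_si512(quant_idx), raw);
}

In the same spirit, separating low/high nibbles and broadcasting the two scales can also be done with masks and register-wide shuffles, which is what lets a single loop iteration cover two blocks.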

@dfyz
Collaborator Author

dfyz commented Apr 13, 2023

Today, an "official" microbenchmark for ggml_vec_dot_q4_0() was introduced in 95ea26f. It seems to confirm the speed-up (measured on the same Tiger Lake laptop).

`make benchmark` output for 0e07e6a (the "old" AVX-512 version)
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            380422;           30341.90
        1;       1; 11008;  4096;   128;    11542724608;            377708;           30559.92
        2;       1; 11008;  4096;   128;    11542724608;            374219;           30844.84
        3;       1; 11008;  4096;   128;    11542724608;            376902;           30625.27
        4;       1; 11008;  4096;   128;    11542724608;            376930;           30622.99
        5;       1; 11008;  4096;   128;    11542724608;            376601;           30649.74
        6;       1; 11008;  4096;   128;    11542724608;            377351;           30588.83
        7;       1; 11008;  4096;   128;    11542724608;            378604;           30487.59
        8;       1; 11008;  4096;   128;    11542724608;            377307;           30592.39
        9;       1; 11008;  4096;   128;    11542724608;            377734;           30557.81
`make benchmark` output for 0e07e6a with AVX-512 disabled (the AVX2 version)
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            274538;           42044.18
        1;       1; 11008;  4096;   128;    11542724608;            273030;           42276.40
        2;       1; 11008;  4096;   128;    11542724608;            275449;           41905.12
        3;       1; 11008;  4096;   128;    11542724608;            273007;           42279.96
        4;       1; 11008;  4096;   128;    11542724608;            274230;           42091.40
        5;       1; 11008;  4096;   128;    11542724608;            273188;           42251.95
        6;       1; 11008;  4096;   128;    11542724608;            273488;           42205.60
        7;       1; 11008;  4096;   128;    11542724608;            273402;           42218.88
        8;       1; 11008;  4096;   128;    11542724608;            273236;           42244.52
        9;       1; 11008;  4096;   128;    11542724608;            274742;           42012.96
`make benchmark` output for d787348 (the "new" AVX-512 version)
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            193742;           59577.81
        1;       1; 11008;  4096;   128;    11542724608;            189840;           60802.38
        2;       1; 11008;  4096;   128;    11542724608;            192014;           60113.97
        3;       1; 11008;  4096;   128;    11542724608;            191863;           60161.29
        4;       1; 11008;  4096;   128;    11542724608;            191087;           60405.60
        5;       1; 11008;  4096;   128;    11542724608;            191952;           60133.39
        6;       1; 11008;  4096;   128;    11542724608;            192595;           59932.63
        7;       1; 11008;  4096;   128;    11542724608;            190172;           60696.23
        8;       1; 11008;  4096;   128;    11542724608;            190912;           60460.97
        9;       1; 11008;  4096;   128;    11542724608;            190907;           60462.55

I used the following patch for 0e07e6a to force AVX2 instead of AVX-512:

diff --git a/ggml.c b/ggml.c
index 42e3ee3..9c72456 100644
--- a/ggml.c
+++ b/ggml.c
@@ -1967,7 +1967,7 @@ static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void * rest
     }

     sumf = sum0 + sum1;
-#elif defined(__AVX512F__)
+#elif 0 && defined(__AVX512F__)
     // Initialize accumulator with zeros
     __m512 acc0 = _mm512_setzero_ps();
     __m512 acc1 = _mm512_setzero_ps();

I would appreciate if someone with AVX-512-enabled hardware could also run the official microbenchmark to determine if this is worth merging.

@dfyz dfyz reopened this Apr 13, 2023
@ggerganov ggerganov added the performance (Speed related topics) and high priority (Very important issue) labels Apr 13, 2023
@dfyz
Collaborator Author

dfyz commented Apr 13, 2023

FWIW, I didn't intend to close the PR. I merged the latest upstream commits to run the microbenchmark and accidentally force-pushed 0e07e6a (the latest upstream master I used for microbenchmarking) into my branch. GitHub apparently interpreted this as "this PR has already been merged, let's close it". I fixed this now (by force-pushing d787348 and re-opening the PR); sorry for the noise.

@KASR
Contributor

KASR commented Apr 14, 2023

I would appreciate if someone with AVX-512-enabled hardware could also run the official microbenchmark to determine if this is worth merging.

According to the datasheet, my CPU has 2 "AVX-512 FMA Units", so I could give it a try. However, I noticed that the benchmark folder has no CMakeLists.txt, i.e. I can't build it.

I tried to create the CMake file, but this resulted in a bunch of errors when building on Windows/CMake/VS... do you have any tips on how to add the CMake file for the benchmark?

I think I was able to build benchmark-q4_0-matmult: I added the following CMakeLists.txt and added the folder to the build directories.

set(TARGET benchmark-q4_0-matmult)
add_executable(${TARGET} benchmark-q4_0-matmult.cpp)
target_link_libraries(${TARGET} PRIVATE ggml ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)

Below you can find the outputs for both the up-to-date branch and your commit. However, I think I'm doing something wrong (or something stupid 😅): I do not obtain any output from the benchmark when I build with AVX512-VBMI enabled.

benchmark when using 43ddef, all settings default
PS C:\DATA\TestLLama\llama.cpp-master\build\bin\Release> .\benchmark-q4_0-matmult.exe
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            390763;           29538.94
        1;       1; 11008;  4096;   128;    11542724608;            392531;           29405.89
        2;       1; 11008;  4096;   128;    11542724608;            389934;           29601.74
        3;       1; 11008;  4096;   128;    11542724608;            379255;           30435.26
        4;       1; 11008;  4096;   128;    11542724608;            398684;           28952.06
        5;       1; 11008;  4096;   128;    11542724608;            383994;           30059.65
        6;       1; 11008;  4096;   128;    11542724608;            384464;           30022.90
        7;       1; 11008;  4096;   128;    11542724608;            377635;           30565.82
        8;       1; 11008;  4096;   128;    11542724608;            387380;           29796.90
        9;       1; 11008;  4096;   128;    11542724608;            384728;           30002.30

benchmark when using d787348 with AVX512 = ON | AVX512-VBMI = OFF | AVX512-VNNI = OFF
PS C:\DATA\TestLLama\llama.cpp-master-dfyz\build\bin\Release> .\benchmark-q4_0-matmult.exe
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            302760;           38125.00
        1;       1; 11008;  4096;   128;    11542724608;            289438;           39879.78
        2;       1; 11008;  4096;   128;    11542724608;            298912;           38615.80
        3;       1; 11008;  4096;   128;    11542724608;            303842;           37989.23
        4;       1; 11008;  4096;   128;    11542724608;            315717;           36560.35
        5;       1; 11008;  4096;   128;    11542724608;            298674;           38646.57
        6;       1; 11008;  4096;   128;    11542724608;            298313;           38693.34
        7;       1; 11008;  4096;   128;    11542724608;            301219;           38320.04
        8;       1; 11008;  4096;   128;    11542724608;            301670;           38262.75
        9;       1; 11008;  4096;   128;    11542724608;            299335;           38561.23
benchmark when using d787348 with AVX512 = ON | AVX512-VBMI = OFF | AVX512-VNNI = ON
PS C:\DATA\TestLLama\llama.cpp-master-dfyz\build\bin\Release> .\benchmark-q4_0-matmult.exe
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            272608;           42341.84
        1;       1; 11008;  4096;   128;    11542724608;            272676;           42331.28
        2;       1; 11008;  4096;   128;    11542724608;            265465;           43481.15
        3;       1; 11008;  4096;   128;    11542724608;            271353;           42537.67
        4;       1; 11008;  4096;   128;    11542724608;            267103;           43214.51
        5;       1; 11008;  4096;   128;    11542724608;            269835;           42776.97
        6;       1; 11008;  4096;   128;    11542724608;            268411;           43003.92
        7;       1; 11008;  4096;   128;    11542724608;            271905;           42451.31
        8;       1; 11008;  4096;   128;    11542724608;            269273;           42866.25
        9;       1; 11008;  4096;   128;    11542724608;            263927;           43734.54
benchmark when using d787348 with AVX512 = ON | AVX512-VBMI = ON | AVX512-VNNI = OFF
PS C:\DATA\TestLLama\llama.cpp-master-dfyz\build\bin\Release> .\benchmark-q4_0-matmult.exe
Starting Test
Allocating Memory of size 645966848 byes, 616 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code ------------------------------------------------------------------------------
cgraph->n_threads=1
            m11: type = 0 ( FP32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is   0.00
             m2: type = 0 ( FP32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is   0.00
    gf.nodes[0]: type = 0 ( FP32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf.nodes[0] is   0.00

------ Test 2 - Matrix Mult via Q4_0 code ------------------------------------------------------------------------------
cgraph->n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - aboout  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
PS C:\DATA\TestLLama\llama.cpp-master-dfyz\build\bin\Release>

@unbounded
Collaborator

unbounded commented Apr 14, 2023

I can confirm that this is a big speedup for me, well done!

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

master branch:

(  187.62 ms per token)

...

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            491766;           23471.99
        1;       1; 11008;  4096;   128;    11542724608;            492068;           23457.58
        2;       1; 11008;  4096;   128;    11542724608;            492413;           23441.14
        3;       1; 11008;  4096;   128;    11542724608;            493280;           23399.95
        4;       1; 11008;  4096;   128;    11542724608;            494672;           23334.10
        5;       1; 11008;  4096;   128;    11542724608;            496268;           23259.05
        6;       1; 11008;  4096;   128;    11542724608;            495874;           23277.54
        7;       1; 11008;  4096;   128;    11542724608;            496583;           23244.30
        8;       1; 11008;  4096;   128;    11542724608;            497316;           23210.04
        9;       1; 11008;  4096;   128;    11542724608;            497237;           23213.73

d787348:

(  102.19 ms per token)

...

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; FLOPS_per_u_Second
==============================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            250094;           46153.54
        1;       1; 11008;  4096;   128;    11542724608;            249826;           46203.05
        2;       1; 11008;  4096;   128;    11542724608;            251181;           45953.81
        3;       1; 11008;  4096;   128;    11542724608;            252217;           45765.05
        4;       1; 11008;  4096;   128;    11542724608;            252019;           45801.01
        5;       1; 11008;  4096;   128;    11542724608;            252224;           45763.79
        6;       1; 11008;  4096;   128;    11542724608;            252719;           45674.14
        7;       1; 11008;  4096;   128;    11542724608;            252265;           45756.35
        8;       1; 11008;  4096;   128;    11542724608;            252170;           45773.58
        9;       1; 11008;  4096;   128;    11542724608;            252864;           45647.96

It does make me think we are spending a lot of extra effort shuffling values around because of the memory layout.
How fast could this be if we only did aligned loads of contiguous data, e.g. if each row were all the nibbles followed by all the scales?
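For concreteness, a hypothetical "planar" row layout along these lines might look like the sketch below (illustrative only; ROW_N and the type name are made up, and this is not what ggml actually uses):

#include <stdint.h>

#define QK    32      // values per Q4_0 block (as in ggml at the time)
#define ROW_N 4096    // example row length, hypothetical

// All nibbles of a row stored contiguously, followed by all the scales, so
// that wide loads of quant data never straddle a scale field and the row as
// a whole can be padded to the desired alignment.
typedef struct {
    uint8_t qs[ROW_N / 2];   // packed nibbles, two values per byte
    float   d[ROW_N / QK];   // one scale per block of 32 values
} row_q4_0_planar;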

@dfyz
Collaborator Author

dfyz commented Apr 14, 2023

@KASR
Regarding AVX512-VBMI: I don't think you're doing anything wrong or stupid; it's probably just that your CPU is Cascade Lake, which doesn't support VBMI (according to the "Microarchitecture support" table here), so the benchmark traps on the first VBMI instruction the CPU tries to execute. I don't have a non-VBMI machine with Windows installed, so I wrote a simple program that tries to use AVX512-BF16 instructions (which are not supported by my CPU) to see how Windows behaves:

#include <immintrin.h>
#include <iostream>

int main() {
	__m512 x = _mm512_set1_ps(1.0f);
	__m512bh y = _mm512_set1_epi8(2);
	__m512bh z = _mm512_set1_epi8(3);	
	char out[64]{};
	_mm512_storeu_ps(out, _mm512_dpbf16_ps(x, y, z));
	std::cout << static_cast<int>(out[1]) << std::endl;	
}

I compiled it with cl /O2 /EHsc /arch:AVX512 /Zi test.cpp, and it indeed exited silently without printing anything:

PS D:\dfyz\tmp> .\test.exe
PS D:\dfyz\tmp>

When I run the program under a debugger, I can see that it traps:

0:000> g
(3794.2cdc): Illegal instruction - code c000001d (first chance)
(3794.2cdc): Illegal instruction - code c000001d (!!! second chance !!!)
test!main+0x2f:
00007ff7`45d4776f 62f2764852d0    vdpbf16ps zmm2,zmm1,zmm0

So I believe this is the reason you "do not obtain any output from the benchmark when I build with AVX512-VBMI enabled". There's a LLAMA_NATIVE option in CMakeLists.txt which should take care of setting the compiler flags automatically, but it doesn't work on MSVC right now. So I think that, unfortunately, you still have to manually select the proper CMake options when building llama.cpp (-DAVX512=ON + -DAVX512-VNNI=ON in your case).

@dfyz
Collaborator Author

dfyz commented Apr 14, 2023

It does make me think we are spending a lot of extra effort shuffling values around because of the memory layout. How fast could this be if we only did aligned loads of contiguous data, e.g. if each row were all the nibbles followed by all the scales?

That sounds like a neat idea. I tried implementing a quick proof-of-concept to see how fast it can potentially be (for AVX-512 only), but it appears to be harder than I thought:

  • if we only load 32 bytes of contiguous nibbles per iteration (2 blocks, as I currently do), we can't use aligned loads, since every other load will not be aligned to 64 bytes (the size of an AVX-512 register)
  • if we load 64 bytes of nibbles per iteration (4 blocks), we need to use two registers for the unpacked nibbles, and we also need to account for the fact that the number of blocks might not be divisible by 4 (the current code only assumes it is divisible by 2)
  • the number of bytes in a row might not be divisible by 64 (e.g., a row consisting of just two blocks takes 40 bytes), so we'd also need to introduce additional padding between rows to make the loads aligned (see the byte-math sketch after this list)
  • I still can't think of anything better than VPERMB and VPERMPS to shuffle the nibbles/scales across the register, even when we load contiguous data
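To make the byte math behind the first and third points concrete, here is a tiny illustrative program (assumed numbers only, matching the 20-byte Q4_0 block of the time):

#include <stdio.h>

int main(void) {
    // First point: with contiguously stored nibbles, 32 bytes cover 2 blocks,
    // so consecutive loads start at offsets 0, 32, 64, 96, ... and every
    // other one misses 64-byte alignment.
    for (int i = 0; i < 4; i++) {
        printf("load %d at offset %3d -> %s\n", i, 32 * i,
               (32 * i) % 64 == 0 ? "aligned" : "unaligned");
    }
    // Third point: a row of just two blocks is 2 * 20 = 40 bytes, which is
    // not a multiple of 64, so rows would need padding to stay aligned.
    int row_bytes = 2 * 20;
    printf("2-block row = %d bytes, padding to 64 = %d bytes\n",
           row_bytes, (64 - row_bytes % 64) % 64);
    return 0;
}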

At this point I gave up. If anyone wants to take a stab at this, that would be great.

@dfyz
Collaborator Author

dfyz commented Apr 14, 2023

Here are the benchmark results so far, summarized (the value is the average of FLOPS_per_u_Second from 10 iterations of test 2 in benchmark-q4_0-matmult):

Who        | CPU                  | CPU Family               | AVX-512, old | AVX2     | AVX-512, new | Speedups
-----------|----------------------|--------------------------|--------------|----------|--------------|-------------------------------------------
@dfyz      | Intel Core i7-1165G7 | Tiger Lake (VBMI + VNNI) | 30587.13     | 42153.10 | 60274.68     | 97% against old AVX-512, 43% against AVX2
@KASR      | Intel Xeon W-2295    | Cascade Lake (VNNI)      | ???          | 29838.15 | 42873.94     | ??? against old AVX-512, 44% against AVX2
@unbounded | ???                  | ??? (VBMI + VNNI)        | 23330.94     | ???      | 45849.23     | 97% against old AVX-512, ??? against AVX2

I think that at this point it is only worth benchmarking the new AVX-512 implementation against the current AVX2 implementation, since the current AVX-512 implementation is consistently worse than the current AVX2 one.

I know of two ways to force AVX2 for matrix multiplication:

  • Apply the patch from this comment before running make benchmark. This is for Linux/macOS users.
  • Add CMakeLists.txt from this comment to the benchmark folder and build the benchmark with cmake (the default CMake options do not include AVX-512). This is for Windows users.

Owner

@ggerganov ggerganov left a comment


I cannot test this as I don't have the hardware.
The reported results are encouraging, so I think it is OK to merge this and continue to support AVX512 for now.

@ultoris

ultoris commented Apr 15, 2023

The test on Raptor Lake fails, as Intel has dropped support for AVX-512 in the latest generations of consumer CPUs. It supports AVX-VNNI though (256-bit instructions instead of the 512-bit ones in AVX-512); would it be possible to use it?

@dfyz
Collaborator Author

dfyz commented Apr 15, 2023

Fixed a trivial merge conflict after 0ad9646. Otherwise, nothing has changed.

For posterity's sake: one thing that really bothers me about this PR is that I can't use the _mm*_sign_epi8() trick from the AVX implementation, because apparently VPSIGNB doesn't have an AVX-512 version for some reason. So I have to do two multiplications instead of one.
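For context, here is a hedged sketch of the AVX2-era trick in question (illustrative only; the helper name is made up). VPMADDUBSW wants one unsigned and one signed operand, and VPSIGNB makes that work with a single multiply per register pair:

#include <immintrin.h>

// |x| * (y with the sign of x) gives the same products as x * y, so a single
// VPMADDUBSW (unsigned x signed) suffices.
static inline __m256i signed_dot_products_avx2(__m256i x, __m256i y) {
    __m256i ax = _mm256_sign_epi8(x, x);   // |x|
    __m256i sy = _mm256_sign_epi8(y, x);   // y with the sign of x
    return _mm256_maddubs_epi16(ax, sy);   // 16 signed 16-bit partial sums
}

Since VPSIGNB has no EVEX (AVX-512) encoding, this one-multiply formulation has no direct 512-bit counterpart.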

@Ameobea As the author of the original AVX-512 implementation (in #320), would you be willing to take a look at this PR before I merge it? I would appreciate any feedback (will make separate PRs if necessary). Also, you mentioned here that you have an AMD CPU supporting AVX-512. It would be interesting to run the benchmarks on an AMD CPU, because so far we only have measurements for Intel.

@dfyz
Collaborator Author

dfyz commented Apr 15, 2023

@ultoris

The test on Raptor Lake fails, as Intel has dropped support for AVX-512 in the latest generations of consumer CPUs. It supports AVX-VNNI though (256-bit instructions instead of the 512-bit ones in AVX-512); would it be possible to use it?

I think it's possible. The current AVX2 implementation (where AVX-VNNI would belong) multiplies 16-bit integers (split across two different registers) instead of 8-bit integers, so we would only save one instruction instead of two. However, the patch should be just a couple of lines (i.e., replace 2 * _mm256_madd_epi16() + _mm256_add_epi32() with 2 * _mm256_dpwssd_avx_epi32() if AVX-VNNI is available).
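As a hedged sketch of that couple-of-lines change (illustrative only; the function names are made up, and the AVX-VNNI path assumes compiling with -mavxvnni):

#include <immintrin.h>

// Current AVX2 path mentioned above: two VPMADDWD multiplies plus one add.
static inline __m256i dot_pairs_avx2(__m256i x0, __m256i y0,
                                     __m256i x1, __m256i y1) {
    return _mm256_add_epi32(_mm256_madd_epi16(x0, y0),
                            _mm256_madd_epi16(x1, y1));
}

// Possible AVX-VNNI replacement: two VPDPWSSD instructions, no separate add.
static inline __m256i dot_pairs_avx_vnni(__m256i x0, __m256i y0,
                                         __m256i x1, __m256i y1) {
    __m256i acc = _mm256_dpwssd_avx_epi32(_mm256_setzero_si256(), x0, y0);
    return _mm256_dpwssd_avx_epi32(acc, x1, y1);
}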

I don't have access to any AVX-VNNI-enabled hardware, though, so I can't test/benchmark it. I can make a separate PR if you're willing to help test it.

@ultoris

ultoris commented Apr 15, 2023

I don't have access to any AVX-VNNI-enabled hardware, though, so I can't test/benchmark it. I can make a separate PR if you're willing to help test it.

Sure. I can run it on an i7-13700.

@0x131315

0x131315 commented Apr 17, 2023

It would be interesting to run the benchmarks on an AMD CPU, because so far we only have measurements for Intel.

I have an AMD 7950X, but I need instructions to run the benchmark: "make benchmark" on dfyz:master is not working; there is no such make target.

@dfyz
Collaborator Author

dfyz commented Apr 17, 2023

"make benchmark" on dfyz:master not working, its no make target

Huh, this is strange. The head of dfyz:master was 4f46a13 when I was writing that comment, and it did have the benchmark target, so make benchmark should just work.

Anyway, I just rebased onto the latest master and it appears that ggml_vec_dot_q4_0_q8_0() is now the default function that is used for Q4_0 models (both for benchmarking and inference) after e95b655 (I thought it would require an explicit option to enable). This means that ggml_vec_dot_q4_0() is not worth optimizing/benchmarking anymore.

I think I'm going to merge this as is, and then work on optimizing ggml_vec_dot_q4_0_q8_0() in a separate PR. The AVX-512 tricks used in this implementation should be easily portable to ggml_vec_dot_q4_0_q8_0().

@dfyz dfyz merged commit f266259 into ggerganov:master Apr 17, 2023