
perf: AVX2/AVX routines for tall and skinny matmul - up to 15X speedup #996

Closed

Conversation

@jon-chuang (Contributor) commented Apr 15, 2023

Benchmark results:

sizey=sizez=N,sizex=K,n_threads=8

K=8,N=8192,AVX2,FLOPS/us=27148.97
K=8,N=8192,AVX,FLOPS/us=15193.96
K=8,N=8192,default,FLOPS/us=1781.05

K=16,N=8192,AVX2,FLOPS/us=20128.26
K=16,N=8192,AVX,FLOPS/us=8224.13
K=16,N=8192,default,FLOPS/us=3540.52

K=32,N=8192,AVX2,FLOPS/us=13127.55
K=32,N=8192,AVX,FLOPS/us=9397.48
K=32,N=8192,default,FLOPS/us=6386.55

K=48,N=8192,AVX2,FLOPS/us=13206.16
K=48,N=8192,AVX,FLOPS/us=5801.21
K=48,N=8192,default,FLOPS/us=8199.44

K=64,N=8192,AVX2,FLOPS/us=10505.51
K=64,N=8192,AVX,FLOPS/us=6353.32
K=64,N=8192,default,FLOPS/us=13024.33

We choose the K cutoff point to be 32 for AVX and 48 for AVX2; beyond these points, the default routine is faster.
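For illustration, here is a minimal sketch of how such a cutoff-based dispatch could look (hypothetical names and signatures, not the exact code in this diff):

// Hypothetical dispatch sketch: use the tall-and-skinny kernel only when the
// shared dimension K is at or below the empirically chosen cutoff, otherwise
// fall back to the existing default path.
#if defined(__AVX2__)
#define TALL_SKINNY_K_CUTOFF 48
#elif defined(__AVX__)
#define TALL_SKINNY_K_CUTOFF 32
#else
#define TALL_SKINNY_K_CUTOFF 0   // no specialized kernel available
#endif

static void mul_mat_f32(const float * A, const float * B, float * C,
                        int M, int N, int K) {
    if (K > 0 && K <= TALL_SKINNY_K_CUTOFF) {
        mul_mat_f32_tall_skinny(A, B, C, M, N, K); // hypothetical specialized routine
    } else {
        mul_mat_f32_default(A, B, C, M, N, K);     // existing general-purpose path
    }
}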

Partial fix to: #956

Will stop the investigation here for now. Time taken for applying LoRA is quite tolerable with these changes.

We are still quite far from optimal; for instance, I see 250K FLOPS/us on matmuls with high K (K=10000).

LoRA application informal benchmarks:

K=16
AVX2 - 5141.57 ms
AVX - 9831.28 ms
default - 22611.96 ms

@jon-chuang changed the title from "perf: AVX2/AVX routines for tall and skinny matmul" to "perf: AVX2/AVX routines for tall and skinny matmul - up to 15X speedup" on Apr 15, 2023
@KerfuffleV2 (Collaborator)

This might be a dumb question, but would something like matrix to vector multiplications count as tall and skinny? In other words, something like a 2d tensor with a 1d tensor.

@tyzoid commented Apr 17, 2023

This branch breaks the CMake build:

[ 95%] Building C object examples/benchmark/CMakeFiles/benchmark.dir/benchmark-q4_0-matmult.c.o
In file included from llama.cpp/examples/benchmark/benchmark-q4_0-matmult.c:11:
llama.cpp/./llama.h:77:22: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
   77 |     LLAMA_API struct llama_context_params llama_context_default_params();
      |                      ^~~~~~~~~~~~~~~~~~~~
llama.cpp/./llama.h:79:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
   79 |     LLAMA_API bool llama_mmap_supported();
      |     ^~~~~~~~~
llama.cpp/./llama.h:80:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
   80 |     LLAMA_API bool llama_mlock_supported();
      |     ^~~~~~~~~
llama.cpp/./llama.h:158:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
  158 |     LLAMA_API llama_token llama_token_bos();
      |     ^~~~~~~~~
llama.cpp/./llama.h:159:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
  159 |     LLAMA_API llama_token llama_token_eos();
      |     ^~~~~~~~~
llama.cpp/examples/benchmark/benchmark-q4_0-matmult.c:14:10: fatal error: cstring: No such file or directory
   14 | #include <cstring>
      |          ^~~~~~~~~
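For reference, <cstring> is a C++-only header, so it cannot be included from a C translation unit; assuming benchmark-q4_0-matmult.c is meant to stay a C file, the equivalent include would be (a guess at the fix, not necessarily what the PR ends up doing):

#include <string.h>  // C equivalent of the C++ <cstring> header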

@Titaniumtown

I'm able to reproduce the performance improvements, impressive work!

@e271828-

Nice work. This will be very relevant for https://github.com/saharNooby/rwkv.cpp as well.

@acheong08

conflicts and failing CI

@jon-chuang (Contributor, Author)

This might be a dumb question, but would something like matrix to vector multiplications count as tall and skinny? In other words, something like a 2d tensor with a 1d tensor.

No, tall and skinny looks like:

 __
|  |     ___________
|  |  X  |__________| 
|__|

Matrix-to-vector multiplication looks like:

______________        __
|             |      |  |
|             |  X   |  |
| ____________|      |__|

@KerfuffleV2 (Collaborator)

No, tall and skinny looks like:

Thanks, I don't understand how it's tall and skinny though. :)

                I'm not fat, I'm just big boned.
 __            /
|  |     ,____O_____,
|  |  X =|__________|= 
|__|       /      \

@jon-chuang (Contributor, Author) commented Apr 26, 2023

Thanks, I don't understand how it's tall and skinny though. :)

🤣

Well, in our specific context, the matmul is $B A^T$, so both $B$ and $A$ are tall and skinny ($A^T$ being short and wide).

A matmul is tall and skinny as long as the shared (inner) dimension along which the matrices are multiplied is small compared to the adjacent outer dimension of one of the matrices, so the specific orientation does not matter.
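To make that concrete (my notation, using the shapes from the benchmarks above): with $B \in \mathbb{R}^{N \times K}$ and $A \in \mathbb{R}^{N \times K}$, the product $B A^T$ is $N \times N$ with inner dimension $K$; for $N = 8192$ and $K = 16$ the inner dimension is tiny relative to both outer dimensions, which is exactly the case these kernels target.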

@syl-00110111 commented Apr 28, 2023

// I can only get it working with SSE1 as follows, because my machine has no FMA.
// c_vec = _mm256_fmadd_ps(a, b_vec, c_vec); // FMA: c_vec += a * b_vec
c_vec = _mm_add_ps(c_vec, _mm_mul_ps(b_vec, a)); // SSE multiply + add; I suppose this defeats the purpose if one considers AVX2
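For comparison, here is a generic sketch (not this PR's kernel, names are mine) of how a single multiply-accumulate step is usually written so it degrades gracefully on AVX machines that lack FMA, while keeping the 256-bit vector width; on SSE-only machines the width would have to drop to __m128 as in the snippet above.

#include <immintrin.h>

// One step of a dot-product accumulation: acc += a * b.
// With -mfma this uses a fused multiply-add (one instruction, one rounding);
// without it, a separate multiply and add (two roundings).
static inline __m256 madd_f32x8(__m256 acc, __m256 a, __m256 b) {
#if defined(__FMA__)
    return _mm256_fmadd_ps(a, b, acc);
#else
    return _mm256_add_ps(acc, _mm256_mul_ps(a, b));
#endif
}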

@jon-chuang (Contributor, Author) commented Apr 30, 2023

PTAL @slaren @ggerganov. Please refer to the informal LoRA benchmarks above for e2e validation.

@syl-00110111 Thanks for the info. I will simply not support machines without FMA. You are welcome to expand on this PR (e.g. in a follow-up PR) to add non-FMA support, together with appropriate benchmarks.

@ggerganov (Owner)

@e271828-

Nice work. This will be very relevant for https://github.com/saharNooby/rwkv.cpp as well.

Can you clarify: does RWKV inference benefit from this change, and if so, can you provide some rough numbers?

@jon-chuang

Are we confident that the computation is correct?
Maybe we should add an accuracy test comparing the results against the default matrix multiplication.
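A minimal sketch of what such a test could look like (hypothetical names throughout; mul_mat_f32_tall_skinny stands in for whichever specialized routine this PR adds):

#include <math.h>
#include <stdlib.h>

// Hypothetical kernel under test (the specialized routine from this PR),
// assumed to be declared elsewhere with the same row-major convention.
void mul_mat_f32_tall_skinny(const float *A, const float *B, float *C,
                             int M, int N, int K);

// Reference matmul: C[m][n] = sum_k A[m][k] * B[k][n], row-major.
static void mul_mat_f32_ref(const float *A, const float *B, float *C,
                            int M, int N, int K) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[m*K + k] * B[k*N + n];
            }
            C[m*N + n] = acc;
        }
    }
}

// Returns 1 if the specialized kernel matches the reference within a tolerance
// that allows for a different summation order, 0 otherwise.
static int check_tall_skinny(int M, int N, int K) {
    float *A  = malloc(sizeof(float) * M * K);
    float *B  = malloc(sizeof(float) * K * N);
    float *C0 = malloc(sizeof(float) * M * N);
    float *C1 = malloc(sizeof(float) * M * N);

    for (int i = 0; i < M*K; ++i) A[i] = (float)rand() / RAND_MAX - 0.5f;
    for (int i = 0; i < K*N; ++i) B[i] = (float)rand() / RAND_MAX - 0.5f;

    mul_mat_f32_ref(A, B, C0, M, N, K);
    mul_mat_f32_tall_skinny(A, B, C1, M, N, K);

    int ok = 1;
    for (int i = 0; i < M*N; ++i) {
        if (fabsf(C0[i] - C1[i]) > 1e-4f * (float)K) { ok = 0; break; }
    }

    free(A); free(B); free(C0); free(C1);
    return ok;
}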

@acheong08

No updates on this?

@jon-chuang (Contributor, Author)

Hello, I've been on holiday. I wrote a test and found some bugs, so I'm fixing them.

@jon-chuang (Contributor, Author) commented Jul 7, 2023

Apologies, I'm no longer motivated to fix this. Anyone who is interested, please take a look and continue.

@jon-chuang closed this on Jul 7, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request Dec 19, 2023
@ltoniazzi mentioned this pull request on Jul 6, 2024