
perf: AVX2/AVX routines for tall and skinny matmul - up to 15X speedup #996

Closed

Conversation

@jon-chuang (Contributor) commented Apr 15, 2023

Benchmark results:

sizey=sizez=N,sizex=K,n_threads=8

K=8,N=8192,AVX2,FLOPS/us=27148.97
K=8,N=8192,AVX,FLOPS/us=15193.96
K=8,N=8192,default,FLOPS/us=1781.05

K=16,N=8192,AVX2,FLOPS/us=20128.26
K=16,N=8192,AVX,FLOPS/us=8224.13
K=16,N=8192,default,FLOPS/us=3540.52

K=32,N=8192,AVX2,FLOPS/us=13127.55
K=32,N=8192,AVX,FLOPS/us=9397.48
K=32,N=8192,default,FLOPS/us=6386.55

K=48,N=8192,AVX2,FLOPS/us=13206.16
K=48,N=8192,AVX,FLOPS/us=5801.21
K=48,N=8192,default,FLOPS/us=8199.44

K=64,N=8192,AVX2,FLOPS/us=10505.51
K=64,N=8192,AVX,FLOPS/us=6353.32
K=64,N=8192,default,FLOPS/us=13024.33

We choose the K cutoff point to be 32 for AVX and 48 for AVX2; beyond these points, the default routine is faster.
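For illustration, here is a minimal sketch of how such a cutoff-based dispatch could look (hypothetical names and signatures, not the exact code in this diff):

// Hypothetical dispatch sketch: use the tall-and-skinny kernel only when the
// shared dimension K is at or below the empirically chosen cutoff, otherwise
// fall back to the existing default path.
#if defined(__AVX2__)
#define TALL_SKINNY_K_CUTOFF 48
#elif defined(__AVX__)
#define TALL_SKINNY_K_CUTOFF 32
#else
#define TALL_SKINNY_K_CUTOFF 0   // no specialized kernel available
#endif

static void mul_mat_f32(const float * A, const float * B, float * C,
                        int M, int N, int K) {
    if (K > 0 && K <= TALL_SKINNY_K_CUTOFF) {
        mul_mat_f32_tall_skinny(A, B, C, M, N, K); // hypothetical specialized routine
    } else {
        mul_mat_f32_default(A, B, C, M, N, K);     // existing general-purpose path
    }
}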

Partial fix to: #956

Will stop the investigation here for now. Time taken for applying LoRA is quite tolerable with these changes.

We are still quite far from optimal; for instance, I see 250K FLOPS/us on matmuls with high K (K=10000).

LoRA application informal benchmarks:

K=16
AVX2 - 5141.57 ms
AVX - 9831.28 ms
default - 22611.96 ms

@jon-chuang changed the title from "perf: AVX2/AVX routines for tall and skinny matmul" to "perf: AVX2/AVX routines for tall and skinny matmul - up to 15X speedup" on Apr 15, 2023
@KerfuffleV2 (Collaborator)

This might be a dumb question, but would something like matrix to vector multiplications count as tall and skinny? In other words, something like a 2d tensor with a 1d tensor.

@tyzoid commented Apr 17, 2023

This branch breaks the CMake build:

[ 95%] Building C object examples/benchmark/CMakeFiles/benchmark.dir/benchmark-q4_0-matmult.c.o
In file included from llama.cpp/examples/benchmark/benchmark-q4_0-matmult.c:11:
llama.cpp/./llama.h:77:22: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
   77 |     LLAMA_API struct llama_context_params llama_context_default_params();
      |                      ^~~~~~~~~~~~~~~~~~~~
llama.cpp/./llama.h:79:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
   79 |     LLAMA_API bool llama_mmap_supported();
      |     ^~~~~~~~~
llama.cpp/./llama.h:80:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
   80 |     LLAMA_API bool llama_mlock_supported();
      |     ^~~~~~~~~
llama.cpp/./llama.h:158:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
  158 |     LLAMA_API llama_token llama_token_bos();
      |     ^~~~~~~~~
llama.cpp/./llama.h:159:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
  159 |     LLAMA_API llama_token llama_token_eos();
      |     ^~~~~~~~~
llama.cpp/examples/benchmark/benchmark-q4_0-matmult.c:14:10: fatal error: cstring: No such file or directory
   14 | #include <cstring>
      |          ^~~~~~~~~
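For reference, <cstring> is a C++-only header, so it cannot be included from a C translation unit; assuming benchmark-q4_0-matmult.c is meant to stay a C file, the equivalent include would be (a guess at the fix, not necessarily what the PR ends up doing):

#include <string.h>  // C equivalent of the C++ <cstring> header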

@Titaniumtown

I'm able to reproduce the performance improvements, impressive work!

@e271828-

Nice work. This will be very relevant for https://github.com/saharNooby/rwkv.cpp as well.

@acheong08

conflicts and failing CI

@jon-chuang (Contributor, Author)

This might be a dumb question, but would something like matrix to vector multiplications count as tall and skinny? In other words, something like a 2d tensor with a 1d tensor.

No, tall and skinny looks like:

 __
|  |     ___________
|  |  X  |__________| 
|__|

Matrix-to-vector multiplication looks like:

______________        __
|             |      |  |
|             |  X   |  |
| ____________|      |__|

@KerfuffleV2 (Collaborator)

No, tall and skinny looks like:

Thanks, I don't understand how it's tall and skinny though. :)

                I'm not fat, I'm just big boned.
 __            /
|  |     ,____O_____,
|  |  X =|__________|= 
|__|       /      \

@jon-chuang (Contributor, Author) commented Apr 26, 2023

Thanks, I don't understand how it's tall and skinny though. :)

🤣

Well, in our specific context, the matmul is $B A^T$, so both $B$ and $A$ are tall and skinny ($A^T$ being short and wide).

A matmul is tall and skinny as long as the shared (inner) dimension along which the matrices are multiplied is small compared to the adjacent outer dimension of one of the matrices, so the specific orientation does not matter.
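To make that concrete (my notation, using the shapes from the benchmarks above): with $B \in \mathbb{R}^{N \times K}$ and $A \in \mathbb{R}^{N \times K}$, the product $B A^T$ is $N \times N$ with inner dimension $K$; for $N = 8192$ and $K = 16$ the inner dimension is tiny relative to both outer dimensions, which is exactly the case these kernels target.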

@syl-00110111 commented Apr 28, 2023

// I can only get it working with SSE1 as follows, because my machine has no FMA.
// c_vec = _mm256_fmadd_ps(a, b_vec, c_vec); // FMA: c_vec += a * b_vec
c_vec = _mm_add_ps(c_vec, _mm_mul_ps(b_vec, a)); // SSE multiply + add; I suppose this defeats the purpose if one considers AVX2
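For comparison, here is a generic sketch (not this PR's kernel, names are mine) of how a single multiply-accumulate step is usually written so it degrades gracefully on AVX machines that lack FMA, while keeping the 256-bit vector width; on SSE-only machines the width would have to drop to __m128 as in the snippet above.

#include <immintrin.h>

// One step of a dot-product accumulation: acc += a * b.
// With -mfma this uses a fused multiply-add (one instruction, one rounding);
// without it, a separate multiply and add (two roundings).
static inline __m256 madd_f32x8(__m256 acc, __m256 a, __m256 b) {
#if defined(__FMA__)
    return _mm256_fmadd_ps(a, b, acc);
#else
    return _mm256_add_ps(acc, _mm256_mul_ps(a, b));
#endif
}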

@jon-chuang (Contributor, Author) commented Apr 30, 2023

PTAL @slaren @ggerganov. Please refer to the informal LoRA benchmarks above for e2e validation.

@syl-00110111 Thanks for the info. I will simply not support machines without FMA. You are welcome to expand on this PR (e.g. in a follow-up PR) to add non-FMA support, together with appropriate benchmarks.

@ggerganov (Owner)

@e271828-

Nice work. This will be very relevant for https://github.com/saharNooby/rwkv.cpp as well.

Can you clarify: does RWKV inference benefit from this change, and if so, can you provide some rough numbers?

@jon-chuang

Are we confident that the computation is correct?
Maybe we should add an accuracy test comparing the results against the default matrix multiplication.
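A minimal sketch of what such a test could look like (hypothetical names throughout; mul_mat_f32_tall_skinny stands in for whichever specialized routine this PR adds):

#include <math.h>
#include <stdlib.h>

// Hypothetical kernel under test (the specialized routine from this PR),
// assumed to be declared elsewhere with the same row-major convention.
void mul_mat_f32_tall_skinny(const float *A, const float *B, float *C,
                             int M, int N, int K);

// Reference matmul: C[m][n] = sum_k A[m][k] * B[k][n], row-major.
static void mul_mat_f32_ref(const float *A, const float *B, float *C,
                            int M, int N, int K) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[m*K + k] * B[k*N + n];
            }
            C[m*N + n] = acc;
        }
    }
}

// Returns 1 if the specialized kernel matches the reference within a tolerance
// that allows for a different summation order, 0 otherwise.
static int check_tall_skinny(int M, int N, int K) {
    float *A  = malloc(sizeof(float) * M * K);
    float *B  = malloc(sizeof(float) * K * N);
    float *C0 = malloc(sizeof(float) * M * N);
    float *C1 = malloc(sizeof(float) * M * N);

    for (int i = 0; i < M*K; ++i) A[i] = (float)rand() / RAND_MAX - 0.5f;
    for (int i = 0; i < K*N; ++i) B[i] = (float)rand() / RAND_MAX - 0.5f;

    mul_mat_f32_ref(A, B, C0, M, N, K);
    mul_mat_f32_tall_skinny(A, B, C1, M, N, K);

    int ok = 1;
    for (int i = 0; i < M*N; ++i) {
        if (fabsf(C0[i] - C1[i]) > 1e-4f * (float)K) { ok = 0; break; }
    }

    free(A); free(B); free(C0); free(C1);
    return ok;
}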

@acheong08

No updates on this?

@jon-chuang (Contributor, Author)

Hello, I've been on holiday. I wrote a test and found some bugs, so I'm fixing them.

@jon-chuang (Contributor, Author) commented Jul 7, 2023

Apologies, I'm no longer motivated to fix this. Anyone who is interested, please take a look and continue.

@jon-chuang closed this on Jul 7, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request Dec 19, 2023
@ltoniazzi mentioned this pull request on Jul 6, 2024