perf: AVX2/AVX routines for tall and skinny matmul - up to 15X speedup #996
Conversation
This might be a dumb question, but would something like matrix-to-vector multiplication count as tall and skinny? In other words, something like a 2D tensor with a 1D tensor.
Branch breaks.
I'm able to reproduce the performance improvements. Impressive work!
Nice work. This will be very relevant for https://github.com/saharNooby/rwkv.cpp as well.
Conflicts and failing CI.
No, tall and skinny looks like: [diagram not reproduced]
Matrix to vector is: [diagram not reproduced]
Thanks, I don't understand how it's tall and skinny though. :)
🤣 Well, in our specific context, the matrices in the matmul are shaped as in the diagrams above. A matmul is tall and skinny as long as the dimension along which the matrices are multiplied (the shared inner dimension K) is small compared to the adjacent dimension of one of the matrices, so the specific orientation does not matter.
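To make that concrete, a hedged illustration (the symbols and example sizes below are mine, not from the PR): for

$$C_{M \times N} = A_{M \times K} \, B_{K \times N},$$

the matmul is tall and skinny when the shared dimension $K$ is small relative to $M$ or $N$, e.g. $M = N = 4096$ with $K = 16$. Whether the small dimension sits on the left or the right factor does not matter.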
I can only get it working with SSE1, like the following, because I have no FMA on my machine.
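The snippet referred to is not shown here; as a rough stand-in, a minimal sketch of a no-FMA multiply-accumulate using only SSE1 intrinsics (the function and variable names are illustrative, not the commenter's actual code):

```c
#include <xmmintrin.h>  // SSE1 intrinsics

// Sketch: y += a * x without FMA. On machines lacking FMA, the fused
// _mm256_fmadd_ps is unavailable, so the accumulate is a separate
// _mm_mul_ps followed by _mm_add_ps.
static void saxpy_sse1(int n, float a, const float *x, float *y) {
    const __m128 va = _mm_set1_ps(a);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));  // mul, then add
        _mm_storeu_ps(y + i, vy);
    }
    for (; i < n; i++) {  // scalar tail
        y[i] += a * x[i];
    }
}
```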
PTAL @slaren @ggerganov. Please refer to the informal LoRA benchmarks for e2e validation. @syl-00110111 Thanks for that info. I will simply not support the no-FMA case. You may expand on this (e.g. in a follow-up PR) if you want non-FMA support, by providing appropriate benchmarks.
Can you clarify: does the RWKV inference benefit from this change, and if so, can you provide some rough numbers? Are we confident that the computation is correct?
No updates on this?
Hello, I've been on holiday. I wrote a test; there are some bugs, so I'm fixing them.
Apologies, I'm no longer motivated to fix this. Anyone who is interested, please take a look and continue.
Benchmark results:
We choose the K cutoff point to be 32 for AVX and 48 for AVX2.
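As a rough illustration of how such a cutoff can be wired up, a minimal sketch (the constants, macros, and function name are illustrative, not the PR's actual code):

```c
// Hypothetical cutoff selection per ISA, mirroring the values above.
#if defined(__AVX2__)
enum { TS_K_CUTOFF = 48 };   // cutoff chosen for AVX2
#elif defined(__AVX__)
enum { TS_K_CUTOFF = 32 };   // cutoff chosen for AVX
#else
enum { TS_K_CUTOFF = 0 };    // no specialized path without AVX
#endif

// Dispatch predicate: take the tall-and-skinny routine only while the
// shared dimension K stays at or below the measured cutoff.
static int use_tall_and_skinny_kernel(int K) {
    return K > 0 && K <= TS_K_CUTOFF;
}
```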
Partial fix for #956.
I will stop the investigation here for now. The time taken to apply LoRA is quite tolerable with these changes.
We are still quite far from optimal; for instance, I see 250 kFLOPs/µs (i.e. 250 GFLOP/s) on matmuls with high K (K = 10000).
LoRA application informal benchmarks: