Faster AVX2 matrix multiplications for legacy quants #405
Conversation
Looks good to me. Approved. Could you sync to head please? I needed to change the way your earlier contribution is compiled, in an effort to make room in the binary for flash attention. I basically just renamed a file and added an if statement that uses `X86_HAVE(AVX2)` to do dispatching at runtime. That helped me get your first iteration into a release, and I can cut another once this is merged too. Thanks!
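For readers unfamiliar with the pattern, here is a minimal sketch of that kind of runtime dispatch. The kernel names are hypothetical, and `X86_HAVE()` is assumed to come from Cosmopolitan's `libc/nexgen32e/x86feature.h`:

```c++
// Sketch of runtime dispatch on CPU features; function names are
// illustrative, not the actual llamafile symbols.
#include <libc/nexgen32e/x86feature.h>  // assumed location of X86_HAVE()

void mul_mat_avx2(const float *A, const float *B, float *C, int n);    // AVX2-tuned kernel
void mul_mat_generic(const float *A, const float *B, float *C, int n); // portable fallback

void mul_mat(const float *A, const float *B, float *C, int n) {
    if (X86_HAVE(AVX2))              // CPUID-based check, evaluated at runtime
        mul_mat_avx2(A, B, C, n);
    else
        mul_mat_generic(A, B, C, n);
}
```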
Somehow memcpy is kind of slow, so when fetching 4 bytes from 2-byte-aligned data it is faster to just OR together two consecutive 16-bit entries.
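A minimal sketch of that trick, assuming a little-endian target (the helper name is illustrative):

```c++
// Build a 32-bit value from two consecutive 16-bit entries instead of
// calling memcpy() on 2-byte-aligned data; compiles to two loads, a
// shift, and an OR. Assumes little-endian byte order.
#include <cstdint>

static inline uint32_t load_u32(const uint16_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 16);
}
```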
However, as things currently stand, we have lost the Zen4-tuned version.
1baed32 to a3bca82 (compare)
@jart I adapted to head. But to get back the Ryzen-7950X performance I had to make two separate microarchitecture targets.
You can have all the microarchitecture targets you need. LGTM. Thanks!
I'd encourage you to work your magic on Cosmopolitan's memcpy() function. https://github.com/jart/cosmopolitan/blob/master/libc/intrin/memmove.c You can run the tests by either running …
Also, did you notice this? https://www.phoronix.com/news/Llamafile-0.8.2-More-AVX2 Congrats!
It seems some people still use the ggml legacy quants `Q4_0`, `Q4_1`, `Q5_0` and `Q5_1`, so here is a PR that improves matrix multiplication performance for these quants on AVX2. The gains for `Q4_1`, `Q5_0` and `Q5_1`, which do not have a tinyBLAS implementation, are very significant, but even `Q4_0` is faster than tinyBLAS (see table below).

I have gone for a templated implementation. This costs 2-3% in performance but reduces the code size by at least a factor of 2. The implementation requires at least a C++14 compiler because I have used `auto` for the return type of two functions. Is this a problem?
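To illustrate the templated design, here is a minimal, self-contained sketch under assumed names; the actual PR code differs and works 32 values at a time with AVX2 intrinsics:

```c++
// Illustrative sketch of the templated structure: each legacy quant
// supplies a small Dequantizer, and a single kernel template serves
// Q4_0/Q4_1/Q5_0/Q5_1, which is what keeps the code size down.
// All names here are hypothetical, not the PR's actual code.
#include <cstdint>

struct BlockQ4_0 {               // simplified stand-in for ggml's block_q4_0
    float d;                     // block scale (ggml stores this as fp16)
    uint8_t qs[16];              // 32 packed 4-bit quants
};

struct DequantizerQ4_0 {
    enum { kBlockSize = 32 };
    // Scalar reference for the i-th value of a block: low nibbles hold
    // the first 16 quants, high nibbles the next 16, offset by -8.
    static float value(const BlockQ4_0 &b, int i) {
        int q = (i < 16 ? b.qs[i] & 0xf : b.qs[i - 16] >> 4) - 8;
        return b.d * q;
    }
};

// One dot-product template shared by all quant types.
template <typename Dequantizer, typename Block>
float dot(const Block *x, const float *y, int n) {
    float sum = 0;
    for (int ib = 0; ib < n / Dequantizer::kBlockSize; ++ib)
        for (int i = 0; i < Dequantizer::kBlockSize; ++i)
            sum += Dequantizer::value(x[ib], i)
                 * y[ib * Dequantizer::kBlockSize + i];
    return sum;
}
```

Each quant type then instantiates the template once, e.g. `dot<DequantizerQ4_0, BlockQ4_0>(x, y, n)`, instead of maintaining four hand-written kernels.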
Prompt processing speed for a 512-token prompt (PP-512) for a 7B LLaMA model:
The PR can also help with token generation (TG) speed. On my system TG is fully memory bound with more than 4-8 threads (depending on quantization type), so, to better illustrate the performance differences, here are TG-128 results with just 2 threads on a Ryzen-7950X for a 7B LLaMA model: