Adding IQ2_TN for use with ternary models #13
Conversation
Quantize/dequantize/scale dot product. I get 46 t/s for the TriLM-3.9B without any SIMD! Finally a compiler doing a decent job auto-vectorizing the scalar implementation.
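For illustration, here is a minimal sketch of what such a scalar kernel can look like, assuming a hypothetical packing of four ternary values (stored as 0/1/2) per byte with one float scale per block of 256 weights; the actual IQ2_TN layout and the kernel in this PR may differ.

```cpp
#include <cstdint>

// Hypothetical block: 256 ternary weights packed 4 per byte (values 0/1/2,
// i.e. -1/0/+1 shifted by +1), plus one float scale.
struct block_ternary {
    float   d;       // block scale
    uint8_t qs[64];  // 256 x 2-bit values
};

// Dot product of one quantized row with int8 activations. Plain loops like
// these are what the compiler auto-vectorizes in the scalar path.
static float vec_dot_ternary_scalar(int nblocks, const block_ternary * x, const int8_t * y) {
    float sum = 0.0f;
    for (int ib = 0; ib < nblocks; ++ib) {
        const uint8_t * qs = x[ib].qs;
        int isum = 0, ysum = 0;
        for (int j = 0; j < 64; ++j) {
            isum += ((qs[j] >> 0) & 3) * y[4*j+0]
                  + ((qs[j] >> 2) & 3) * y[4*j+1]
                  + ((qs[j] >> 4) & 3) * y[4*j+2]
                  + ((qs[j] >> 6) & 3) * y[4*j+3];
        }
        // the +1 shift in the stored values is undone by subtracting the sum of y once
        for (int j = 0; j < 256; ++j) ysum += y[j];
        sum += x[ib].d * (isum - ysum);
        y += 256;
    }
    return sum;
}
```

Because the stored values are 0/1/2, the -1 offset can be folded into a single subtraction of the activation sum per block, which keeps the inner loop a pure multiply-add that vectorizes well.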
Just reusing the k-quants template gets us to PP-512 = 376 t/s, TG-128 = 47.6 t/s for TriLM-3.9B.
With this tweak we get to PP-512 = 431 t/s.
With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads. At 4 threads we saturate at 48.41 t/s, and then performance slowly degrades with increasing number of threads.
PP-512 = 440 t/s on the Ryzen-5975WX. We should be able to do better.
For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s, TG-128 = 75.5 t/s. This is in line with what we have for iq2_bn and the 3.3B Bitnet.
For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s, TG-128 = 98.5 t/s.
For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s, TG-128 = 299.2 t/s.
We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX. We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn. Bitnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something that is not quite optimal in iq2_tn.
For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.
This is great!
Yeah, I did not particularly optimize the ARM_NEON implementation for recent ARM CPUs (yet), especially since I did not use [...]. I see [...]. Do you have plans for [...]? Either way, I really appreciate your work on this. This was a pleasant surprise to see in my notifications.
I'm not planning on contributing any of this to the official [...]. The Metal implementation should be just a copy/paste operation (and replace [...]). For CUDA I did factor out the dot products of the new quants into a separate file to avoid having to have a coffee each time I touch something there and [...].
Mostly adapted from the IQ2_TN kernels from ikawrakow/ik_llama.cpp#13 which were themselves adapted from the Q2_K kernels.
They have abandoned the `Q1_3` and `Q2_2` quants in PR-8151 in `llama.cpp`, and have moved on to `TQ1_0` and `TQ2_0`. Like k-quants, these use blocks of 256 weights and utilize `Q8_K` for quantized dot products on the CPU. This removes support for Bitnet b1.58 (unless one adds padding to a multiple of 256), so they are now focusing on the TriLM models. Unlike the previous `Q1_3` and `Q2_2`, where the quantized data only holds the ternary `-1/0/+1` values and the tensor scale is added via a separate `ggml_scale` operation, the new `TQ1_0` and `TQ2_0` include a scale in each block of 256. This basically wastes 0.0625 bpw, but has the advantage that one can simply reuse the standard `llama.cpp` computation graphs.

Based on the `PP-512` and `TG-128` figures posted in PR-8151, `TQ2_0` performance is much better than the earlier `Q2_2` attempt, so I became curious to see how @compilade's implementation compares to what we can do with `iqk_mul_mat` in this repo, and here we are.

The PR adds `IQ2_TN` (`TN` as `TriNet`). Implementation for `Zen4`, `AVX2`, `ARM_NEON`, `CUDA` and `Metal` is provided.

Let's look at some performance comparisons. We will focus on the largest TriLM model, which has ~4B parameters. Quantized with 2.0625 bpw the model size is 1.08 GiB.
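To see where the 2.0625 bpw (and the 0.0625 bpw spent on the block scale) come from, here is a quick back-of-the-envelope check with a hypothetical block struct; the actual `IQ2_TN`/`TQ2_0` structs may be laid out differently.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical block of 256 ternary weights: 2 bits per weight plus an fp16 scale.
struct block_ternary_256 {
    uint16_t d;       // fp16 block scale (16 bits)
    uint8_t  qs[64];  // 256 x 2-bit values (512 bits)
};

int main() {
    const double weights    = 256.0;
    const double quant_bits = 512.0;  // 64 bytes of packed 2-bit values
    const double scale_bits = 16.0;   // one fp16 scale per block
    printf("quants: %.4f bpw\n", quant_bits / weights);                 // 2.0000
    printf("scale : %.4f bpw\n", scale_bits / weights);                 // 0.0625
    printf("total : %.4f bpw\n", (quant_bits + scale_bits) / weights);  // 2.0625
    // 3.99e9 weights at 2.0625 bpw is about 0.96 GiB; the reported 1.08 GiB model
    // size presumably also includes higher-precision tensors (e.g. output/embeddings).
    return 0;
}
```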
AVX2

`AVX2` was tested on a 32-core Ryzen-5975WX CPU. Not everybody has a 32-core CPU handy, so I have added performance values for fewer threads.

I would say @compilade has done remarkably well here, coming to within ~14% in PP performance. Although, for fewer than 32 threads, the performance gap increases to about ~23%. My guess is that the 23% is a more realistic value for the performance difference, and as the number of threads increases we see more of the effect of `ggml` inefficiencies (thread synchronization, operations that do not scale with the number of threads, etc.), which then narrows the gap. Nevertheless, even 23% is remarkable considering the performance differences for other quants (see main page). For TG the performance is the same for 1 thread (not much one can do there, the bit arrangement is so simple that there aren't many different ways to implement it effectively with `AVX2`). The implementation in this PR then becomes faster, I guess due to better cache utilization. But this better per-thread performance leads to too much memory bandwidth contention above 8 threads, so `TQ2_0` is able to arrive at a slightly better performance at 16 threads.

Zen4
I have also tested on a `Zen4` CPU (16-core Ryzen-7950X). `Zen4` implements some of the `AVX512` instruction set, and there is a dedicated implementation for that for `IQ2_TN`. The `TQ2_0` quants are implemented in pure `AVX2`, so one might think the performance comparison is unfair. But, at least as far as I know, the `Zen4` core implements 512-bit instructions as two separate 256-bit instructions in hardware, so one does not gain much by operating on 512-bit wide vectors. The main advantage comes from having more vector registers (32 vs 16 on `AVX2`), but the way matrix multiplications are done in `ggml` (a series of vector x vector dot products), one cannot really take advantage of that. Anyway, here is the performance comparison on the Ryzen-7950X CPU.

Here the PP performance gap is more significant at around 66%, reducing to 55% at 16 threads. If we look at TG performance for 1 thread, the ~7% performance difference comes from using `_mm512_dpbusd_epi32`, which is a fused multiply-add operation, whereas on `AVX2` one needs to use `_mm256_maddubs_epi16` followed by `_mm256_add_epi16` to accumulate the result. The TG performance gap then widens due to better cache utilization, and then decreases towards zero with increasing numbers of threads as the memory bandwidth is saturated. The 66% PP performance gap is hence the combination of the ~7% due to the use of a fused multiply-add, and ~60% due to better utilization of vector registers while performing a multiplication of a row in the left matrix with several columns in the right matrix, where the unpacked quants for a block are held in vector registers.
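To make the difference concrete, here is a rough sketch (not the actual `iqk_mul_mat` code) of the two accumulation patterns for one chunk of unsigned quants times signed int8 activations, assuming the 2-bit values have already been unpacked to bytes:

```cpp
#include <immintrin.h>
#include <cstdint>

// Zen4 path: AVX512-VNNI does the multiply and 32-bit accumulation in one instruction.
static inline __m512i dot_chunk_zen4(__m512i acc, const uint8_t * q, const int8_t * y) {
    __m512i vq = _mm512_loadu_si512(q);       // 64 unsigned bytes (quants in 0..2)
    __m512i vy = _mm512_loadu_si512(y);       // 64 signed activation bytes
    return _mm512_dpbusd_epi32(acc, vq, vy);  // acc += groups-of-4 dot products
}

// AVX2 path: multiply-add into 16-bit lanes, then a separate add to accumulate.
static inline __m256i dot_chunk_avx2(__m256i acc16, const uint8_t * q, const int8_t * y) {
    __m256i vq = _mm256_loadu_si256((const __m256i *)q);  // 32 unsigned bytes
    __m256i vy = _mm256_loadu_si256((const __m256i *)y);  // 32 signed bytes
    __m256i p  = _mm256_maddubs_epi16(vq, vy);            // 16-bit pairwise products
    return _mm256_add_epi16(acc16, p);                    // extra instruction to accumulate
}
```

Because the ternary quants are at most 2, the 16-bit accumulator in the AVX2 sketch cannot overflow within a block of 256 and only needs to be widened to 32 bits once per block (e.g. with `_mm256_madd_epi16` against a vector of ones).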
ARM_NEON

Here @compilade's implementation does not do very well, at least not on the M2-Max laptop where I have tested. But perhaps this is just due to the fact that @compilade used a Cortex-A72 CPU in their development, and that CPU may well behave very differently from the M2-Max.
CUDA and Metal
There is no GPU implementation in PR-8151, so here are just the performance values for this PR. `CUDA` is tested on an RTX-4080, `Metal` on a 30-core M2-Max GPU.

I have not bothered implementing the MMQ stuff, so CUDA PP performance is via dequantize and cuBLAS gemm.
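For reference, that dequantize + cuBLAS route boils down to something like the following sketch; the block layout and kernel are hypothetical and the real code in this PR is organized differently.

```cpp
#include <cuda_fp16.h>
#include <cublas_v2.h>
#include <cstdint>

// Hypothetical block: fp16 scale + 256 x 2-bit ternary values.
struct block_ternary_256 {
    half    d;
    uint8_t qs[64];
};

// Each thread unpacks one byte (4 weights) to fp16 so cuBLAS can do the GEMM.
__global__ void dequantize_ternary_f16(const block_ternary_256 * x, half * y, int nbytes) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global byte index
    if (i >= nbytes) return;
    int ib = i / 64, j = i % 64;
    float   d = __half2float(x[ib].d);
    uint8_t q = x[ib].qs[j];
    half * dst = y + 256*ib + 4*j;
    for (int k = 0; k < 4; ++k) {
        dst[k] = __float2half(d * (((q >> 2*k) & 3) - 1));  // stored 0/1/2 -> -1/0/+1
    }
}

// The dequantized fp16 weights then go through a standard cuBLAS GEMM, e.g.
//   cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, nrows, nbatch, ncols,
//                &alpha, W_f16, CUDA_R_16F, ncols, X_f16, CUDA_R_16F, ncols,
//                &beta,  Y_f32, CUDA_R_32F, nrows,
//                CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
```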