
More accurate Q4_0 and Q4_1 quantizations #896

Closed
wants to merge 12 commits

Commits on Apr 11, 2023

  1. Use better conversion to ints

    in quantize_row_q4_0_reference and quantize_row_q4_1_reference.
    This narrows the gap to the vectorized versions to ~10% for
    quantize_row_q4_0 and <15% for quantize_row_q4_1 on the two
    CPUs I have tried (Ryzen 7950X and M2 Max). (A sketch of the
    rounding trick follows this commit list.)
    Kawrakow committed Apr 11, 2023 · 126b984
  2. 0c9a967
  3. 8b3d1f9
  4. 92408cd
  5. 709d235
  6. Reverting the round() change so we can pass tests

    But we should eventually switch back to nearestInt() and adapt the test.
    Kawrakow committed Apr 11, 2023 · b6df974
  7. Improve Q4_0 MSE

    Somehow I had it hard-wired in my brain that the quants need to
    lie in -7...7 to be comparable to the original Q4_0.
    
    But this is clearly not the case, and if we relax this requirement,
    this simple change brings the rmse down to 0.001966 at the expense
    of a somewhat longer computation (~67 seconds vs 49 seconds for the
    7B model on M2 Max). (A sketch of such a scale search follows this
    commit list.)
    
    The perplexity test is still running, but it looks like the
    improvement over the previous version will be quite modest (~0.03)
    despite the significant improvement in MSE.
    
    The change does not affect Q4_1, as there we already use the full
    range of 16 possible int values.
    Kawrakow committed Apr 11, 2023 · 931ae36
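For the "Use better conversion to ints" commit (and the nearestInt() mentioned in the revert commit): a minimal sketch of the kind of round-to-nearest conversion involved, assuming the classic magic-number trick; this is an illustrative reconstruction, not the exact code from the diff.

```c
#include <assert.h>
#include <string.h>

// Round a float to the nearest int without calling roundf()/lrintf().
// Adding 1.5 * 2^23 forces the rounded integer into the mantissa bits,
// which we then extract. Valid for |fval| < ~2^22, which comfortably
// covers the x/scale values produced during quantization.
static inline int nearest_int(float fval) {
    assert(fval <= 4194303.f);
    float val = fval + 12582912.f;         // 1.5 * 2^23
    int i;
    memcpy(&i, &val, sizeof(int));         // reinterpret the bits
    return (i & 0x007fffff) - 0x00400000;  // mantissa minus the bias
}
```

Unlike a plain (int)(x + 0.5f), this rounds negative values correctly and avoids a libm call in the inner loop. Note that the float addition rounds halves to even, one plausible reason why exact-match tests against round() would need adapting, as the revert commit notes.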
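And for "Improve Q4_0 MSE": a minimal sketch of a scale search that exploits the full signed 4-bit range -8...7 rather than the symmetric -7...7. The helper name best_scale_q4_0 and the candidate grid are hypothetical, chosen only to illustrate why such a search costs extra time (~67 s vs 49 s above).

```c
#include <math.h>

// Hypothetical helper: pick the scale d minimizing the reconstruction
// error sum((x[i] - d*q[i])^2), with q[i] the nearest int clamped to
// the full 4-bit range [-8, 7] instead of the symmetric [-7, 7].
static float best_scale_q4_0(const float *x, int n) {
    float amax = 0.f;
    for (int i = 0; i < n; ++i) amax = fmaxf(amax, fabsf(x[i]));
    if (amax == 0.f) return 0.f;

    float best_d   = amax / 7.f;           // the original Q4_0 scale
    float best_err = INFINITY;
    for (int is = 0; is <= 20; ++is) {     // small grid of candidate scales
        const float d = amax / (6.f + 0.1f * is);
        float err = 0.f;
        for (int i = 0; i < n; ++i) {
            int q = (int)lroundf(x[i] / d);
            if (q < -8) q = -8; else if (q > 7) q = 7;
            const float diff = x[i] - d * (float)q;
            err += diff * diff;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}
```

For Q4_0 this would be called once per block of n = 32 weights, so the inner loop runs ~20x more often than the single pass of the naive max-based scale.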

Commits on Apr 12, 2023

  1. Further improve Q4_0 MSE

    The RMSE of the 7B model becomes 0.00185228.
    It looks like the perplexity will end up being around 6.27-6.28.
    Kawrakow committed Apr 12, 2023 · 6bfb00a
  2. 29b83e5
  3. POC: Even lower rmse 4-bit Q4_0 quantization

    Basically, we use two Q4_0 quantizations, each covering 16 weights,
    to quantize a set of 32 weights. We get two separate scaling
    factors, which we store as fp16, ending up with the exact same
    5 bits per weight as the current Q4_0. (The storage arithmetic is
    sketched after this commit list.)
    
    We end up with an rmse of ~0.00159, so basically the same as the
    improved Q4_1. But this should run faster than `Q4_1`
    (unless the fp16 -> fp32 conversion is somehow very slow).
    Kawrakow committed Apr 12, 2023 · 679e1cb
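The storage arithmetic behind the "same 5 bits per weight" claim in the POC above: stock Q4_0 spends one fp32 scale plus 32 4-bit quants per block, i.e. (32 + 32×4)/32 = 5 bits per weight; two fp16 scales for two groups of 16 cost (2×16 + 32×4)/32 = 5 bits per weight as well. A minimal layout sketch, assuming ggml's ggml_fp16_t half-precision storage type; the struct name block_q4_0_2x16 is hypothetical:

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t;  // half-precision storage type, as in ggml

// Hypothetical layout: two independent Q4_0 groups of 16 weights sharing
// one 32-weight block. 2*16 bits of scales + 32*4 bits of quants
// = 160 bits per 32 weights = 5 bits per weight, same as stock Q4_0.
typedef struct {
    ggml_fp16_t d[2];    // one fp16 scale per group of 16 weights
    uint8_t     qs[16];  // 32 x 4-bit quants, two per byte
} block_q4_0_2x16;

_Static_assert(sizeof(block_q4_0_2x16) == 20, "160 bits per 32 weights");
```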

Commits on Apr 13, 2023

  1. POC: Q4_1 for groups of 16 weights

    As in the last commit, but with the Q4_1 type, using the same
    memory as the existing Q4_1 via fp16.
    
    We end up with
    rmse 0.00125125, maxerr 0.11657715, 95pct<0.0024, median<0.0010
    after a quantize-dequantize roundtrip.
    
    This is quite a bit better than Q4_1 with groups of 32 weights,
    but still far from the 5-bit quantization that uses the same
    amount of memory, where we had
    rmse 0.00076131, maxerr 0.05273438, 95pct<0.0016, median<0.0006.
    (The memory accounting is worked out after this commit list.)
    Kawrakow committed Apr 13, 2023 · 6f34961
  2. POC: Measure rmse of 8-bit quantization

    q8_0 : rmse 0.00010729, maxerr 0.01030385, 95pct<0.0002, median<0.0002
    (A sketch of such a measurement follows this commit list.)
    Kawrakow committed Apr 13, 2023 · 97d7ac7
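The memory accounting for the Q4_1 group-of-16 POC works out the same way as for the Q4_0 one above. Stock Q4_1 stores an fp32 scale and an fp32 min plus 32 4-bit quants: (2×32 + 32×4)/32 = 6 bits per weight. The POC stores an fp16 scale and an fp16 min per group of 16: (2×16 + 16×4)/16 = 6 bits per weight, so the footprint is unchanged, exactly as the commit message claims.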
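The rmse/maxerr/95pct/median figures quoted throughout come from a quantize-dequantize roundtrip over the model weights. A minimal sketch of how the q8_0 measurement could look (one fp32 scale per block of 32 weights, int8 quants); this is an illustrative harness, not the PR's actual statistics code:

```c
#include <math.h>

#define QK 32  // quantization block size, as in ggml

// Quantize one block to 8 bits, dequantize, and accumulate the squared
// and maximum roundtrip errors; rmse = sqrt(total sqerr / total weights).
static void q8_0_roundtrip_err(const float *x, double *sqerr, float *maxerr) {
    float amax = 0.f;
    for (int i = 0; i < QK; ++i) amax = fmaxf(amax, fabsf(x[i]));
    const float d  = amax / 127.f;             // fp32 scale for this block
    const float id = d > 0.f ? 1.f / d : 0.f;
    for (int i = 0; i < QK; ++i) {
        const int   q   = (int)lroundf(x[i] * id);  // int8 quant in [-127, 127]
        const float err = fabsf(x[i] - d * (float)q);
        *sqerr += (double)err * err;
        if (err > *maxerr) *maxerr = err;
    }
}
```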