Regressions on IQ3_XXS over time #5856
Out of curiosity, did the resulting gguf sizes also change? |
Not enough to justify the difference:
|
Can you try just before #5829? |
Sure, that would be b2314:
Almost indistinguishable from b2329. |
When I try to quantize "TinyLlama/TinyLlama-1.1B-Chat-v1.0" on b2281 - b2356, it reports an error: [ 1/ 201] output.weight - [ 2048, 32000, 1, 1], type = f16, quantizing to q5_K .. size = 125.00 MiB -> 42.97 MiB ============================================================
|
Likely your imatrix is messed up - generate a new one: ./imatrix -m models/tinyllama-1b/ggml-model-f16.gguf -f some-data.txt -ngl 99 |
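For reference, the refreshed imatrix is then passed to the quantize tool when producing the IQ3_XXS gguf; a typical invocation (the file names below are placeholders, not taken from this thread) looks like:
./quantize --imatrix imatrix.dat models/tinyllama-1b/ggml-model-f16.gguf models/tinyllama-1b/ggml-model-IQ3_XXS.gguf IQ3_XXS |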
Thank you, you are right. I switched to a different txt file, reused b2281 - b2356 to create a new imatrix.dat, and then successfully created TinyLlama-1.1B-IQ3_XXS.gguf. But the problem that all i-series gguf models (such as IQ3_XXS) produced by the new versions cannot work properly in LM Studio v0.2.16 still remains. Maybe I should wait for an LM Studio update. |
I noticed that there are no fluctuations for other quantization types (such as Q3_K_M, Q4_K_S or Q4_0), but there are some variations on smaller non-Mixtral models, so I tested a large number of llama.cpp releases since b2015 (when IQ3_XXS was introduced) on llama-2-7b.Q8_0.gguf:
The result for llama-2-7b.Q8_0 is sane (the only regression, at b2275, coincides with a decrease in model size), so unfortunately I'll have to test specifically on Mixtral, which is going to take a while. |
I managed to finish the Mixtral test after a week of effort. Here's the result using the same imatrix for all versions (the same one as in my OP). The only parameter change from my OP is the number of chunks for perplexity, which I reduced to 200, since I noticed that you don't need that many chunks to see the variations:
Here's the result from recreating the imatrix ex novo for every version:
The ellipsis signifies omission for brevity (i.e. no changes in that range). Conclusion: unlike llama-2-7b, all the variations found for Mixtral are anomalous (they can't be explained by a suitably large change in model size). The variations happened at b2137, b2253, b2275 and b2287, plus an extra one for the recalculated imatrix at b2316. Quantizing Mixtral to IQ3_XXS with llama.cpp b2015 instead of recent versions results in a gguf that performs 0.137 ppl better. |
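A minimal sketch of this kind of per-release comparison, assuming each release is unpacked into its own directory, the stock quantize and perplexity tools, a wikitext evaluation file, and a fixed imatrix.dat (all paths and names below are illustrative):
# quantize with the build under test, then measure ppl on 200 chunks
for b in b2015 b2137 b2253 b2275 b2287 b2316; do
  ./$b/quantize --imatrix imatrix.dat mixtral-f16.gguf mixtral-IQ3_XXS-$b.gguf IQ3_XXS
  ./$b/perplexity -m mixtral-IQ3_XXS-$b.gguf -f wiki.test.raw --chunks 200 -ngl 99
done |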
I tried b2699 hoping the regressions were fixed along the way: ikawrakow imatrix:
Recalculated imatrix:
Alas, quite the contrary: the biggest regression yet. We are now a whopping +0.5436 ppl behind b2015 for the recreated imatrix. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
The problem is pretty much still present. I tried on b3334 today: ikawrakow imatrix:
Recalculated imatrix:
Worse yet again. Also, I noticed a lot of "llama_model_quantize_internal: did not find weights for" log lines. I suspect that at some point since b2436 imatrix generation stopped working completely. |
I guess I'll bump this issue every two weeks to prevent the bot from autoclosing it; this is my life now. ikawrakow imatrix:
Recalculated imatrix:
The fixed imatrix is worse yet again; the imatrix recreated ex novo is still broken. |
I have added the bug tag, which will prevent the bot from closing the issue. Pointing at the specific PRs that introduced a regression would dramatically improve the chances of this being fixed. If there are multiple regressions, it might be better to create a separate issue for each one, with instructions to reproduce it with the smallest model possible. |
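One way to pin down the offending PRs, assuming the regression reproduces with a small model, is to bisect between the known-good and known-bad release tags and let a script rebuild and re-test at each step (the test script below is hypothetical, not something from this thread):
git bisect start b2329 b2015     # bad tag first, then good tag
git bisect run ./test-iq3xxs.sh  # hypothetical script: build, quantize, check ppl, exit non-zero on regression |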
Thanks! A few posts above I tested every release of llama.cpp since b2015 and pointed out the specific releases that introduced regressions. I have since stopped testing every release, as the sheer quantity of llama.cpp releases outpaced the free time I have. I may open a separate issue for the broken imatrix creation, though. In other news, today I tried b3599: ikawrakow imatrix:
Recalculated imatrix:
No change to note since b3484. |
In order to fix the imatrix creation, I had to recreate the base Q8 from the original repo using a new llama.cpp version:
Alas, this only puts the ppl at the level of b2436, if not slightly worse. |
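A rough sketch of what recreating the base Q8 and the imatrix with a current llama.cpp looks like, assuming the present-day convert_hf_to_gguf.py script and the renamed llama-imatrix binary (script and binary names have changed across versions, and all paths below are placeholders):
python convert_hf_to_gguf.py /path/to/original-hf-repo --outtype q8_0 --outfile mixtral-Q8_0.gguf
./llama-imatrix -m mixtral-Q8_0.gguf -f some-data.txt -ngl 99 -o imatrix.dat |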
If I quantize this gguf with this imatrix using this command:
and I calculate perplexity with this command:
I get three very different PPL values on three different versions of quantize.exe, everything else being equal:
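The exact commands were in the blocks omitted above; purely as an illustration of this kind of setup (the file names are placeholders, with the Mixtral base gguf referenced throughout the thread), the two invocations would look roughly like:
quantize.exe --imatrix imatrix.dat mixtral-f16.gguf mixtral-IQ3_XXS.gguf IQ3_XXS
perplexity.exe -m mixtral-IQ3_XXS.gguf -f wiki.test.raw -ngl 99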
I suspect that there have been multiple cumulative regressions in the IQ3_XXS quantization implementation between b2037 and b2329.
CUDA (cu12.2.0) on Windows 10.