Regressions on IQ3_XXS over time #5856
Out of curiosity, did the resulting gguf sizes also change? |
Not enough to justify the difference:
|
Can you try just before #5829? |
Sure, that would be b2314:
Almost indistinguishable from b2329. |
When I try to quantize "TinyLlama/TinyLlama-1.1B-Chat-v1.0" on b2281 - b2356, it reports an error: [ 1/ 201] output.weight - [ 2048, 32000, 1, 1], type = f16, quantizing to q5_K .. size = 125.00 MiB -> 42.97 MiB ============================================================
|
Likely your imatrix is messed up - generate a new one: ./imatrix -m models/tinyllama-1b/ggml-model-f16.gguf -f some-data.txt -ngl 99 |
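For reference, the refreshed imatrix is then passed to the quantize tool when producing the IQ3_XXS gguf; a typical invocation (the file names below are placeholders, not taken from this thread) looks like:
./quantize --imatrix imatrix.dat models/tinyllama-1b/ggml-model-f16.gguf models/tinyllama-1b/ggml-model-IQ3_XXS.gguf IQ3_XXS |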
Thank you, you are right. I switched to a different txt file, reused b2281 - b2356 to create a new imatrix.dat, and then successfully created TinyLlama-1.1B-IQ3_XXS.gguf. But the problem that all i-series gguf models (such as IQ3_XXS) produced by the new versions cannot work properly in LM Studio v0.2.16 still remains. Maybe I should wait for an LM Studio update. |
I noticed that there are no fluctuations for other quantization types (such as Q3_K_M, Q4_K_S or Q4_0), but there are some variations on smaller non-Mixtral models, so I tested a large number of llama.cpp releases since b2015 (when IQ3_XXS was introduced) on llama-2-7b.Q8_0.gguf:
The result for llama-2-7b.Q8_0 is sane (the only regression, at b2275, coincides with a decrease in model size), so unfortunately I'll have to test specifically on Mixtral, which is going to take a while. |
I managed to finish the Mixtral test after a week of effort. Here's the result using the same imatrix for all versions (the same one as in my OP). The only parameter change from my OP is the number of chunks for perplexity, which I reduced to 200, since I noticed that you don't need that many chunks to see the variations:
Here's the result from recreating the imatrix ex novo for every version:
The ellipsis signifies omission for brevity (i.e. no changes in that range). Conclusion: unlike llama-2-7b, all the variations found for Mixtral are anomalous (they can't be explained by a suitably large change in model size). The variations happened at b2137, b2253, b2275 and b2287, plus an extra one for the recalculated imatrix at b2316. Quantizing Mixtral to IQ3_XXS with llama.cpp b2015 instead of recent versions results in a gguf that performs 0.137 ppl better. |
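A minimal sketch of this kind of per-release comparison, assuming each release is unpacked into its own directory, the stock quantize and perplexity tools, a wikitext evaluation file, and a fixed imatrix.dat (all paths and names below are illustrative):
# quantize with the build under test, then measure ppl on 200 chunks
for b in b2015 b2137 b2253 b2275 b2287 b2316; do
  ./$b/quantize --imatrix imatrix.dat mixtral-f16.gguf mixtral-IQ3_XXS-$b.gguf IQ3_XXS
  ./$b/perplexity -m mixtral-IQ3_XXS-$b.gguf -f wiki.test.raw --chunks 200 -ngl 99
done |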
I tried b2699 hoping the regressions were fixed along the way: ikawrakow imatrix:
Recalculated imatrix:
Alas, quite the contrary: the biggest regression yet. We are now a whopping +0.5436 ppl behind b2015 for the recreated imatrix. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
The problem is pretty much still present. I tried on b3334 today: ikawrakow imatrix:
Recalculated imatrix:
Worse yet again. Also, I noticed a lot of "llama_model_quantize_internal: did not find weights for" log lines. I suspect that at some point since b2436 imatrix generation stopped working completely. |
I guess I'll bump this issue every two weeks to prevent the bot from autoclosing it; this is my life now. ikawrakow imatrix:
Recalculated imatrix:
The fixed imatrix is worse yet again; the imatrix recreated ex novo is still broken. |
I have added the bug tag, which will prevent the bot from closing the issue. Pointing at the specific PRs that introduced a regression would dramatically improve the chances of this being fixed. If there are multiple regressions, it might be better to create a separate issue for each one, with instructions to reproduce it with the smallest model possible. |
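One way to pin down the offending PRs, assuming the regression reproduces with a small model, is to bisect between the known-good and known-bad release tags and let a script rebuild and re-test at each step (the test script below is hypothetical, not something from this thread):
git bisect start b2329 b2015     # bad tag first, then good tag
git bisect run ./test-iq3xxs.sh  # hypothetical script: build, quantize, check ppl, exit non-zero on regression |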
Thanks! A few posts above I tested every release of llama.cpp since b2015 and pointed out the specific releases that introduced regressions. I have since stopped testing every release, as the sheer quantity of llama.cpp releases outpaced the free time I have. I may open a separate issue for the broken imatrix creation, though. In other news, today I tried b3599: ikawrakow imatrix:
Recalculated imatrix:
No change to note since b3484. |
In order to fix the imatrix creation, I had to recreate the base Q8 from the original repo using a new llama.cpp version:
Alas, this only puts the ppl at the level of b2436, if not slightly worse. |
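A rough sketch of what recreating the base Q8 and the imatrix with a current llama.cpp looks like, assuming the present-day convert_hf_to_gguf.py script and the renamed llama-imatrix binary (script and binary names have changed across versions, and all paths below are placeholders):
python convert_hf_to_gguf.py /path/to/original-hf-repo --outtype q8_0 --outfile mixtral-Q8_0.gguf
./llama-imatrix -m mixtral-Q8_0.gguf -f some-data.txt -ngl 99 -o imatrix.dat |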
If I quantize this gguf with this imatrix using this command:
and I calculate perplexity with this command:
I get three very different PPL values on three different versions of quantize.exe, everything else being equal:
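The exact commands were in the blocks omitted above; purely as an illustration of this kind of setup (the file names are placeholders, with the Mixtral base gguf referenced throughout the thread), the two invocations would look roughly like:
quantize.exe --imatrix imatrix.dat mixtral-f16.gguf mixtral-IQ3_XXS.gguf IQ3_XXS
perplexity.exe -m mixtral-IQ3_XXS.gguf -f wiki.test.raw -ngl 99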
I suspect that there have been multiple cumulative regressions in the IQ3_XXS quantization implementation between b2037 and b2329.
CUDA (cu12.2.0) on Windows 10.