Importance matrix support for legacy quants #4969
I notice that with the importance matrix calculations, if you go past the original model's supported context length (it starts at 32k tokens; your recommendation is to use 50k tokens), the PPL of the collected batches seems to start rising. This is odd because I specify a short context length that matches my batch size exactly [-c 2048 -b 2048], so each chunk should be roughly the same, and there should be no slow regression of the average PPL over time. Do you know why this happens?
I don't really understand the question (or rather, I don't understand what is being done and what the observation is).
No. It splits into chunks of [...]. You gain nothing by calculating the importance matrix with a large context. My experience is that a context of 512 works best, at least according to perplexity. That is, if I prepare an importance matrix with a context of 512, use it to quantize, and then run perplexity for a context of 8192, the PPL is slightly lower than when I use a context of 8192 for both the importance matrix and the perplexity run.
Then what explains the observable trend of PPL declining as if the context size were the size of the whole dataset across all the batches? I saw this for a 32k context model as well, after the ~32k mark.
Perplexity goes up and down, no? It depends on the text being processed. Some part of the test set is predicted better and the perplexity goes down. Some other part is predicted worse and the perplexity goes up. That's why we run all ~330k tokens from
Hmm, I think it was probably just a coincidence then, that it happened to look like the average kept going down consistently on both models at around the context length point. |
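To illustrate the point about perplexity going up and down with the text, here is a minimal sketch of how a running perplexity estimate is formed from per-chunk results. It assumes the usual definition, PPL = exp(mean negative log-likelihood per token). The function name `running_perplexity` and the synthetic data are hypothetical and are not taken from llama.cpp.

```python
import math
import random


def running_perplexity(chunk_nlls):
    """Return (per_chunk_ppl, running_ppl).

    chunk_nlls: list of chunks, each a list of per-token negative
    log-likelihoods (natural log). Perplexity is exp(mean NLL).
    """
    per_chunk, running = [], []
    total_nll, total_tokens = 0.0, 0
    for nlls in chunk_nlls:
        # Perplexity of this chunk alone.
        per_chunk.append(math.exp(sum(nlls) / len(nlls)))
        # Perplexity over everything processed so far.
        total_nll += sum(nlls)
        total_tokens += len(nlls)
        running.append(math.exp(total_nll / total_tokens))
    return per_chunk, running


# Synthetic data (an assumption, not real model output): each chunk gets
# its own "difficulty", mimicking easier and harder passages of text.
rng = random.Random(0)
chunks = []
for _ in range(8):
    difficulty = rng.uniform(1.5, 2.5)
    chunks.append([rng.expovariate(1.0 / difficulty) for _ in range(512)])

per_chunk, running = running_perplexity(chunks)
for i, (p, r) in enumerate(zip(per_chunk, running), 1):
    print(f"chunk {i}: chunk PPL = {p:.2f}, running PPL = {r:.2f}")
```

The per-chunk values jump around with the difficulty of each passage, while the running value moves more slowly. A run of easier chunks can therefore look like a steady downward trend even though every chunk is evaluated with the same context size.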
@ikawrakow did you recently test the hellaswag scores? I have been getting very low scores (even for fp16 models). |
FP16 = Final estimate: PPL = 548.0413 +/- 11.80300. Old without legacy quants. New with legacy quants. This is unusual, so this is just for fun, but legacy imatrix does help a lot for those fallback quants.
* imatrix: adding support for legacy quants
* imatrix: guard Q4_0/Q5_0 against ffn_down craziness

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
TL;DR: See the title and PRs #4861 and #4930 for more details.
Opinions on adding importance matrix support for legacy quants were divided (see #4932), but given @ggerganov's comment there I decided to go ahead and prepare this PR.
I observe quite significant improvement in perplexity for all models I have tested. In addition, Q4_1 and Q5_1 no longer have the erratic behavior of having a higher perplexity than Q4_0/Q5_0 for some models despite using more bits.

The following tables give a few representative perplexity examples. The QError columns are defined as PPL(Q)/PPL(fp16)-1. Perplexity is for a context of 512 tokens.

Q4_0
Q4_1
Q5_0
Q5_1
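
For reference, the QError definition given above, PPL(Q)/PPL(fp16)-1, can be sketched as a small helper. The function name `qerror` and the input numbers are hypothetical, chosen only to illustrate the arithmetic; they are not results from this PR.

```python
def qerror(ppl_quant: float, ppl_fp16: float) -> float:
    """Relative perplexity increase of a quantized model over fp16.

    Implements QError = PPL(Q) / PPL(fp16) - 1.
    """
    if ppl_fp16 <= 0:
        raise ValueError("fp16 perplexity must be positive")
    return ppl_quant / ppl_fp16 - 1.0


# Made-up illustrative perplexities, NOT measurements from the tables.
example = qerror(ppl_quant=6.0, ppl_fp16=5.0)
print(f"QError = {example:.1%}")
```

A QError of 0 means the quantized model matches fp16 perplexity, and larger values mean a larger relative loss from quantization.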