
QWEN int4 bad performance #360

Closed
chunniunai220ml opened this issue Feb 23, 2024 · 7 comments

Comments

@chunniunai220ml

When I test the Qwen model, I get normal results:

[screenshot: FP16 baseline evaluation results]

But after AutoAWQ quantization, the performance degrades badly:

wikitext-ppl=13.560

[screenshot: post-quantization evaluation results]

Is this normal? When I tested the official AWQ repo on Llama 2, the ppl did not degrade much.
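For anyone trying to reproduce these numbers, here is a generic sliding-window wikitext-2 perplexity sketch (my reconstruction with Hugging Face `transformers`/`datasets`, not necessarily the reporter's exact script; the model id is illustrative):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-7B"  # illustrative; substitute the checkpoint under test
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Concatenate the raw test split and tokenize it once.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

# Non-overlapping 2048-token windows; each window's mean NLL is re-weighted
# by its token count so the final average is per-token.
window = 2048
nll_sum, n_tokens = 0.0, 0
for begin in range(0, ids.size(1) - 1, window):
    chunk = ids[:, begin : begin + window].to(model.device)
    if chunk.size(1) < 2:  # a 1-token tail yields no shifted labels
        break
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss  # mean NLL over this window
    n = chunk.size(1) - 1  # labels are shifted by one
    nll_sum += loss.item() * n
    n_tokens += n

print(f"wikitext-2 ppl = {torch.exp(torch.tensor(nll_sum / n_tokens)).item():.3f}")
```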

By the way, I modified the code to support zero=True (symmetric mode), as in the screenshot (see the sketch after the results table for what that change amounts to):

[screenshot: code modification for symmetric mode]

With that change I got a much worse ppl = 1056944.250:

| stem | humanities | other | social | avg |
|------|------------|-------|--------|-----|
| 26.26 | 27.12 | 24.01 | 23.79 | 25.51 |
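For context on the symmetric-mode change, here is my own minimal sketch of the two rounding schemes (not the reporter's patch and not AutoAWQ's internal code). AWQ's default path is asymmetric with a per-group zero point; a symmetric path drops the offset, and if the rest of the pipeline (scale search, packing) still assumes asymmetric ranges, a perplexity blow-up like the one above is plausible:

```python
import torch

def fake_quantize_group(w: torch.Tensor, n_bit: int = 4, zero_point: bool = True) -> torch.Tensor:
    """Quantize-dequantize one weight group; illustrates the two schemes."""
    if zero_point:
        # Asymmetric: map [min, max] onto [0, 2^n - 1] with a zero offset.
        scale = (w.max() - w.min()).clamp(min=1e-5) / (2**n_bit - 1)
        zero = (-w.min() / scale).round()
        q = (w / scale + zero).round().clamp(0, 2**n_bit - 1)
        return (q - zero) * scale
    # Symmetric: map [-|w|max, |w|max] onto [-(2^(n-1)-1), 2^(n-1)-1], no offset.
    qmax = 2 ** (n_bit - 1) - 1
    scale = w.abs().max().clamp(min=1e-5) / qmax
    return (w / scale).round().clamp(-qmax, qmax) * scale

w = torch.randn(128)  # one q_group_size=128 group
for zp in (True, False):
    err = (w - fake_quantize_group(w, zero_point=zp)).abs().mean()
    print(f"zero_point={zp}: mean abs error {err:.4f}")
```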

Finally, are there any differences between the three kernel versions (GEMM, GEMV, Marlin) in theory and in use?

@casper-hansen
Owner

Hi @chunniunai220ml, this is not normal performance. Quantization error measured by perplexity is usually 1-2%. Did you use a custom dataset?
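(As a rough worked example with a hypothetical baseline, since the FP16 number above is only in a screenshot: if the FP16 model scored ppl ≈ 8.0 on wikitext, a 1-2% quantization error would put the AWQ model around 8.08-8.16, nowhere near the 13.560 reported above.)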

@benjamin-marie

I have also observed a similar performance drop. For instance, on winogrande, arc challenge, and hellaswag, Qwen-1.5 7B quantized with AWQ scores 10 or more accuracy points lower than Qwen-1.5 quantized with GPTQ 4-bit.

Here is my config:
```python
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
```

I use the latest version of AutoAWQ.
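For reproducibility, this is roughly the flow that config goes through, following the standard AutoAWQ README example (paths are illustrative):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen1.5-7B"  # illustrative; any Qwen checkpoint
quant_path = "qwen1.5-7b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize with the default calibration set, then save.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```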

@chunniunai220ml
Author

> Hi @chunniunai220ml, this is not normal performance. Quantization error measured by perplexity is usually 1-2%. Did you use a custom dataset?

Not a custom dataset, no code changes, and I followed the examples.

@bratao

bratao commented Feb 26, 2024

+1 here. Qwen/Qwen1.5-14B-Chat-GPTQ-Int4 produces much better results than Qwen/Qwen1.5-14B-Chat-AWQ. Way closer to the original model.

@chunniunai220ml
Author

@casper-hansen I disabled the clipping step and got reasonable results, but how do you explain this?

@Relissc

Relissc commented Apr 4, 2024

Hello @chunniunai220ml, when I use AutoAWQ to quantize Qwen, I get an error: "RuntimeError: Failed to import transformers.generation.utils because of the following error (look up to see its traceback):". What is your environment (especially the version of transformers)? Can you share it?

@casper-hansen
Owner

You can now use apply_clip=False on the quantize() method. I didn't find that it improved the model much, but the option is there now.
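A one-line sketch, reusing the names from the quantization example earlier in the thread:

```python
# Same flow as the README example above, but skipping the clipping search.
model.quantize(tokenizer, quant_config=quant_config, apply_clip=False)
```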

For reference, I am not able to reproduce the bad performance of Qwen in my testing:

[screenshot: perplexity comparison]

That's roughly a 1% quantization error.
