Incorrect/Garbage Responses for Llama-2-7b-hf with INT4 GPTQ/RTN Asymmetric Quantization #19450
Comments
Any update on this? @yufenglee / @kunal-vaishnavi
Hi @VishalX, could you please try quantizing the model directly with a command like:
Hey @yufenglee,
python -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b
Let me try quantizing it using the command line. I hope the command below is good enough:
python -m onnxruntime.quantization.matmul_4bits_quantizer --input_model <path-to-fp32-model> --output_model <path-to-int4-model> --block_size 32 --symmetric False --accuracy_level 0 --verbose
Yes, please try this command line.
@yufenglee
Namespace(input_model='llama2-7b-fp32/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32_opt.onnx', output_model='bw_asym/model.onnx', block_size=32, symmetric=True, accuracy_level=0, verbose=True, nodes_to_exclude=[])
I'll fix this locally and try again.
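One possible explanation for `symmetric=True` showing up in the Namespace even though `--symmetric False` was passed, sketched under the assumption that the flag is declared with `type=bool` (the thread does not show the parser definition): argparse hands the raw string to `bool()`, and any non-empty string, including `"False"`, is truthy. The `str_to_bool` and `parse_symmetric` helpers below are hypothetical and only illustrate the pitfall.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Hypothetical converter: map common true/false spellings to a bool."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_symmetric(argv, converter):
    """Parse only a --symmetric flag using the given type converter."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--symmetric", type=converter, default=True)
    return parser.parse_args(argv).symmetric


# Assumed declaration: bool("False") is True because the string is non-empty.
naive = parse_symmetric(["--symmetric", "False"], bool)
# Explicit string parsing yields the intended value.
fixed = parse_symmetric(["--symmetric", "False"], str_to_bool)
print(naive, fixed)
```

If this is what happens here, checking the printed Namespace (as done above) is a cheap way to confirm which mode the quantizer actually ran in.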
@yufenglee I get garbage outputs with Asymmetric.
📌 Running on Windows
With Asymmetric Quantization (block size = 32, accuracy level 0)
With Symmetric Quantization (block size = 32, accuracy level 0)
Interestingly, if I run the same model on Linux (Ubuntu 18.04), I get somewhat better results with the Asymmetric model, but I still see non-English sentences/words within the responses. However, the responses with the Symmetric quantized model match on Windows and Linux.
With Asymmetric Quantization (block size = 32, accuracy level 0)
📌 Running on Linux
I can repro the issue locally.
And for the issue in the original post, did you run on Windows or Linux?
@yufenglee I ran on Windows.
@VishalX, just FYI: it turns out something is wrong with mmap on Windows. If I turn off mmap, Asymmetric works on Windows. You can try it out with this branch if you want to: https://github.com/microsoft/onnxruntime/tree/yufeng/hot_fix. I will investigate more.
I tried this fix; I get exactly the same response as I get on Linux for Asym.
@yufenglee, the earlier published numbers from #17390 suggest GPTQ ( With responses like these, I'm not so sure that the above can be reproduced. I'll generate the score and see what I get.
@VishalX, it would be great if you could try reproducing it and get the score.
@yufenglee I can reproduce the published numbers with minor differences.
Accuracy for lambada_openai is: 0.7314185911119736
I'll check for Wikitext as well.
For Wikitext
This also looks close to the published numbers.
### Description
The Windows memory-mapping code casts mapped_offset to DWORD directly, so the offset is truncated if it is larger than 2^32-1. We need to set dwFileOffsetHigh for this case.
### Motivation and Context
The bug was found from #19450
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
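To make the truncation described in the fix concrete, here is a small Python sketch of the arithmetic only. It is not the ONNX Runtime code: the function names and the example offset are hypothetical. It shows what is lost when a 64-bit file offset is cast to a 32-bit DWORD, and how the offset splits into the high and low halves that the Windows `MapViewOfFile` call takes as `dwFileOffsetHigh` and `dwFileOffsetLow`.

```python
from typing import Tuple

DWORD_MASK = 0xFFFFFFFF  # a Windows DWORD is an unsigned 32-bit integer


def truncate_to_dword(offset: int) -> int:
    """Mimic casting a 64-bit offset straight to DWORD: the high bits are lost."""
    return offset & DWORD_MASK


def split_offset(offset: int) -> Tuple[int, int]:
    """Split a 64-bit offset into (high, low) 32-bit halves."""
    return (offset >> 32) & DWORD_MASK, offset & DWORD_MASK


# Hypothetical offset a little past 4 GiB, as could occur in a large weights file.
mapped_offset = (1 << 32) + 4096

truncated = truncate_to_dword(mapped_offset)  # points near the start of the file
high, low = split_offset(mapped_offset)
print(truncated, high, low)

# Recombining both halves restores the original offset exactly.
assert (high << 32) | low == mapped_offset
```

With only the low half, any data stored beyond the 4 GiB mark would be read from the wrong position, which is consistent with a model loading without errors yet producing garbage.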
Describe the issue
I am trying to quantize and run the Llama-2-7b-hf model using the example here. I was able to successfully generate the int4 model with GPTQ quantization by running the command below.
Settings:
However, when I try to run on CPU, I get garbage results for any prompt.
Similar output is observed with the RTN Asymmetric INT4 model as well.
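For readers unfamiliar with the two modes compared in this issue, the following is a generic sketch of block-wise round-to-nearest (RTN) 4-bit quantization in symmetric and asymmetric form. It is an illustration under simplifying assumptions, not ONNX Runtime's kernel: the function names, the fixed zero point of 8 in symmetric mode, and the sample block are all hypothetical.

```python
def quantize_block(block, symmetric):
    """Quantize one block of floats to unsigned 4-bit codes (0..15)."""
    lo, hi = min(block), max(block)
    if symmetric:
        # One scale per block; the zero point is fixed at the middle code.
        max_abs = max(abs(lo), abs(hi))
        scale = max_abs / 7.0 or 1.0
        zero_point = 8
    else:
        # Scale and zero point per block; the range is widened to include 0.
        lo, hi = min(lo, 0.0), max(hi, 0.0)
        scale = (hi - lo) / 15.0 or 1.0
        zero_point = round(-lo / scale)
    codes = [min(15, max(0, round(v / scale) + zero_point)) for v in block]
    return codes, scale, zero_point


def dequantize_block(codes, scale, zero_point):
    """Map 4-bit codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical block of 32 weights with a skewed range (block size 32 as in the issue).
block = [(i - 10) / 8.0 for i in range(32)]

sym_codes, sym_scale, sym_zp = quantize_block(block, symmetric=True)
asym_codes, asym_scale, asym_zp = quantize_block(block, symmetric=False)

sym_err = max(abs(a - b) for a, b in zip(block, dequantize_block(sym_codes, sym_scale, sym_zp)))
asym_err = max(abs(a - b) for a, b in zip(block, dequantize_block(asym_codes, asym_scale, asym_zp)))
print(sym_scale, asym_scale, sym_err, asym_err)
```

The asymmetric form stores an extra zero point for each block, so an asymmetric model carries additional per-block data that the symmetric model does not need.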
To reproduce
I followed the onnxruntime-inference-examples WOQ README.
I used the inference code from here, with some changes mentioned below.
Urgency
No response
Platform
Windows
OS Version
Windows 11
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
v1.17.0
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response