Add GPTQ quantization kernels for 2, 3, 8-bit use cases #2223

Conversation

@JasonZhu1313 (Contributor) commented Dec 20, 2023

Earlier, there was an awesome PR #916 supporting the GPTQ ExLlama kernel in the 4-bit quantization setup. This PR introduces additional kernels for other bit widths (2, 3, and 8 bits), sourced from the AutoGPTQ repository, which HF uses for GPTQ quantization.

The same kernels can also be leveraged by our recent post-training quantization work, QuantEase (https://arxiv.org/abs/2309.01885; we'll release the QuantEase algorithm repo soon), where we achieved better zero-shot accuracy for 3-bit quantization.

We are adding two additional flags to GPTQConfig, aligned with the AutoGPTQ & HF convention (see the sketch after this list for the kernel selection they imply):

  • use_triton: whether to use the Triton kernel in the 2-, 4-, and 8-bit setups; it is slower than the ExLlama and CUDA kernels
  • disable_exllama: whether to disable the ExLlama kernel in the 4-bit setup; when disabled, the CUDA or Triton kernel is used based on the use_triton flag
  • In the 3-bit setup, the default CUDA kernel is used
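
For illustration only, here is a minimal Python sketch of the kernel selection these flags imply; the function and return values are hypothetical and do not mirror the actual vLLM implementation:

# Hypothetical sketch of the kernel dispatch implied by the flags above;
# names are illustrative only.
def select_gptq_kernel(bits: int, use_triton: bool, disable_exllama: bool) -> str:
    if bits == 3:
        # Only the CUDA kernel serves the 3-bit setup.
        return "cuda"
    if bits == 4 and not disable_exllama:
        # ExLlama remains the fastest option for 4-bit.
        return "exllama"
    # 2/4/8-bit with ExLlama unavailable or disabled: Triton if requested,
    # otherwise the CUDA kernel (Triton is the slower of the two).
    return "triton" if use_triton else "cuda"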

Test:

Tested on a Llama 7B model.

You need to add the additional args to the saved quantize_config.json after GPTQ quantization. An example:

{
  "bits": 3,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": true,
  "static_groups": false,
  "sym": true,
  "true_sequential": true,
  "model_name_or_path": null,
  "model_file_base_name": null,
  "use_triton": false,
  "disable_exllama": true
}
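
If it helps, here is a minimal Python sketch (not part of this PR) for adding the two flags to an existing quantize_config.json; the path is an assumption and should point at your quantized model directory:

import json

# Assumed location of the AutoGPTQ-generated config; adjust to your model dir.
config_path = "quantize_config.json"
with open(config_path) as f:
    cfg = json.load(f)
cfg["use_triton"] = False      # set True to use the Triton kernel
cfg["disable_exllama"] = True  # when True, the CUDA or Triton kernel is used instead of ExLlama
with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)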

Test script

prompt = "What is large language model?"
sampling_params = SamplingParams(temperature=0.8, top_p=0.5, max_tokens=100)
model_path = "..."
llm = LLM(model=model_path, trust_remote_code=True, tensor_parallel_size=2, quantization="gptq", tokenizer_mode="slow")
outputs = llm.generate(prompt, sampling_params)

Output from exllama kernel under 4-bit quantization

total time 1.5789778232574463
average time 1.5789778232574463
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model', token_ids=[13, 3833, 880, 313, 29896, 29929, 29929, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29896, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29941, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29955, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29929, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29896, 29900, 29897, 13, 5618, 338, 2919, 4086, 1904], cumulative_logprob=-24.236552625894547, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model'

Output from triton kernel under 4-bit quantization

total time 6.523277759552002
average time 6.523277759552002
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model', token_ids=[13, 3833, 880, 313, 29896, 29929, 29929, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29896, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29941, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29955, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29929, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29896, 29900, 29897, 13, 5618, 338, 2919, 4086, 1904], cumulative_logprob=-24.21131780743599, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model'

Output from CUDA kernel under 4-bit quantization

total time 2.3482797145843506
average time 2.3482797145843506
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model', token_ids=[13, 3833, 880, 313, 29896, 29929, 29929, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29896, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29941, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29955, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29929, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29896, 29900, 29897, 13, 5618, 338, 2919, 4086, 1904], cumulative_logprob=-24.14222851395607, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model'

Output from CUDA kernel under 3-bit quantization


total time 3.6984071731567383
average time 3.6984071731567383
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n2018-01-25 00:31:12\nWhat is large language model?\nA language model is a statistical model that describes the relationship between a set of variables and a set of observations. The variables are called the predictors, and the observations are called the outcomes.\nThe model is used to predict the outcome of interest, given the predictors.\nA large language model is a statistical model that describes the relationship between a set', token_ids=[13, 29906, 29900, 29896, 29947, 29899, 29900, 29896, 29899, 29906, 29945, 29871, 29900, 29900, 29901, 29941, 29896, 29901, 29896, 29906, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 29909, 4086, 1904, 338, 263, 24148, 1904, 393, 16612, 278, 9443, 1546, 263, 731, 310, 3651, 322, 263, 731, 310, 13917, 29889, 450, 3651, 526, 2000, 278, 8500, 943, 29892, 322, 278, 13917, 526, 2000, 278, 714, 26807, 29889, 13, 1576, 1904, 338, 1304, 304, 8500, 278, 21957, 310, 4066, 29892, 2183, 278, 8500, 943, 29889, 13, 29909, 2919, 4086, 1904, 338, 263, 24148, 1904, 393, 16612, 278, 9443, 1546, 263, 731], cumulative_logprob=-51.450629502534866, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n2018-01-25 00:31:12\nWhat is large language model?\nA language model is a statistical model that describes the relationship between a set of variables and a set of observations. The variables are called the predictors, and the observations are called the outcomes.\nThe model is used to predict the outcome of interest, given the predictors.\nA large language model is a statistical model that describes the relationship between a set'

@WoosukKwon (Collaborator)

Hi @JasonZhu1313, is this PR ready for review? If so, could you fix the formatting issue? You can simply run the following in the root dir of the repo:

pip install -r requirements-dev.txt
./format.sh

@JasonZhu1313 (Contributor, Author)

> Hi @JasonZhu1313, is this PR ready for review? If so, could you fix the formatting issue? You can simply run the following in the root dir of the repo:
>
> pip install -r requirements-dev.txt
> ./format.sh

Hey @WoosukKwon, the PR is ready for review. Thanks for the reminder; all checks are passing now.

@JasonZhu1313 (Contributor, Author)

@WoosukKwon Could you help review this PR? Thanks a lot!

@chu-tianxiang (Contributor)

I'm also working on similar things in the gptq_8bit branch; it's not quite ready yet. The code is adapted from exllamav2, which actually contains the main components necessary to accelerate GPTQ models at other bit widths. 3-bit is almost able to match the speed of the 4-bit model.

@JasonZhu1313 (Contributor, Author) commented Dec 28, 2023

> I'm also working on similar things in the gptq_8bit branch; it's not quite ready yet. The code is adapted from exllamav2, which actually contains the main components necessary to accelerate GPTQ models at other bit widths. 3-bit is almost able to match the speed of the 4-bit model.

@chu-tianxiang Thanks for chiming in and sharing your work; I am happy to collaborate on this. Right now my PR is ready for review and tested, so we could work together on getting the ExLlama version in as a follow-up enhancement after this. Or you could merge your changes into this PR and we can co-author it. Either works.

@JasonZhu1313 (Contributor, Author) commented Jan 2, 2024

@chu-tianxiang @WoosukKwon @zhuohan123 Could you help review this PR? Thanks!

@chu-tianxiang (Contributor)

I am not a maintainer of vLLM, but I would suggest moving the AutoGPTQ kernels under vllm_extension if there's no special reason not to. Besides, I think the Triton kernels and some CUDA kernels should work for all precisions, so maybe this could address this issue as well?

Beyond that, it's really @WoosukKwon's decision regarding the direction of future quantization development: whether to incorporate a range of different kernel implementations, whether Triton will be preferred, and whether to adopt unified packing and kernel usage across different quantization methods like AWQ and GPTQ.

@lapp0 commented Jan 9, 2024

In docker build,

193.3 ERROR: Cannot install -r requirements.txt (line 8) and triton==2.0.0 because these package versions have conflicting dependencies.


193.3 The conflict is caused by:
193.3     The user requested triton==2.0.0
193.3     torch 2.1.2 depends on triton==2.1.0; platform_system == "Linux" and platform_machine == "x86_64"

Is there a reason the triton version can't be 2.1.0? I upgraded the dependency on my end, but I've only tested the CUDA kernel.

@JasonZhu1313 (Contributor, Author)

> In docker build,
>
> 193.3 ERROR: Cannot install -r requirements.txt (line 8) and triton==2.0.0 because these package versions have conflicting dependencies.
>
> 193.3 The conflict is caused by:
> 193.3     The user requested triton==2.0.0
> 193.3     torch 2.1.2 depends on triton==2.1.0; platform_system == "Linux" and platform_machine == "x86_64"
>
> Is there a reason the triton version can't be 2.1.0? I upgraded the dependency on my end, but I've only tested the CUDA kernel.

Hey @lapp0, I think 2.1.0 works; you can change the dependency in requirements.txt to

triton

instead of using a pinned version.
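
For reference, the change amounts to unpinning the line in requirements.txt, shown here as an illustrative diff (the exact line position in the file may differ):

-triton==2.0.0
+triton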

@lapp0 commented Jan 18, 2024

Smoke tested CUDA and ExLlama kernels on A100. Saw a substantial memory reduction. Worked without problems.

@JasonZhu1313 (Contributor, Author)

> Smoke tested CUDA and ExLlama kernels on A100. Saw a substantial memory reduction. Worked without problems.

cc @WoosukKwon @simon-mo

@hmellor (Member) commented Mar 6, 2024

Should this be closed as this functionality was added by #2330?

@simon-mo (Collaborator) commented Mar 8, 2024

Closing as #2330 (and Marlin) is merged. However, we look forward to a separate PR if this PR does have better kernels!

@simon-mo closed this Mar 8, 2024