Add GPTQ quantization kernels for 2, 3, 8-bit use cases #2223

Conversation

@JasonZhu1313 (Contributor) commented Dec 20, 2023

Earlier, there was an awesome PR #916 supporting the GPTQ ExLlama kernel in the 4-bit quantization setup. This PR introduces additional kernels for other bit widths (2, 3, and 8 bits), sourced from the AutoGPTQ repository, which HF uses for GPTQ quantization.

The same kernels can also be leveraged by our recent post-training quantization work, QuantEase (https://arxiv.org/abs/2309.01885; we'll release the QuantEase algorithm repo soon), where we achieved better zero-shot accuracy for 3-bit quantization.

We are adding two additional flags to GPTQConfig, aligned with the AutoGPTQ & HF convention (see the sketch after this list for the kernel selection they imply):

  • use_triton: whether to use the Triton kernel in the 2-, 4-, and 8-bit setups; it is slower than the ExLlama and CUDA kernels
  • disable_exllama: whether to disable the ExLlama kernel in the 4-bit setup; when disabled, the CUDA or Triton kernel is used based on the use_triton flag
  • In the 3-bit setup, the default CUDA kernel is used
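
For illustration only, here is a minimal Python sketch of the kernel selection these flags imply; the function and return values are hypothetical and do not mirror the actual vLLM implementation:

# Hypothetical sketch of the kernel dispatch implied by the flags above;
# names are illustrative only.
def select_gptq_kernel(bits: int, use_triton: bool, disable_exllama: bool) -> str:
    if bits == 3:
        # Only the CUDA kernel serves the 3-bit setup.
        return "cuda"
    if bits == 4 and not disable_exllama:
        # ExLlama remains the fastest option for 4-bit.
        return "exllama"
    # 2/4/8-bit with ExLlama unavailable or disabled: Triton if requested,
    # otherwise the CUDA kernel (Triton is the slower of the two).
    return "triton" if use_triton else "cuda"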

Test:

Tested on a Llama 7B model.

You need to add the additional args to the saved quantize_config.json after GPTQ quantization. An example:

{
  "bits": 3,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": true,
  "static_groups": false,
  "sym": true,
  "true_sequential": true,
  "model_name_or_path": null,
  "model_file_base_name": null,
  "use_triton": false,
  "disable_exllama": true
}
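
If it helps, here is a minimal Python sketch (not part of this PR) for adding the two flags to an existing quantize_config.json; the path is an assumption and should point at your quantized model directory:

import json

# Assumed location of the AutoGPTQ-generated config; adjust to your model dir.
config_path = "quantize_config.json"
with open(config_path) as f:
    cfg = json.load(f)
cfg["use_triton"] = False      # set True to use the Triton kernel
cfg["disable_exllama"] = True  # when True, the CUDA or Triton kernel is used instead of ExLlama
with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)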

Test script

prompt = "What is large language model?"
sampling_params = SamplingParams(temperature=0.8, top_p=0.5, max_tokens=100)
model_path = "..."
llm = LLM(model=model_path, trust_remote_code=True, tensor_parallel_size=2, quantization="gptq", tokenizer_mode="slow")
outputs = llm.generate(prompt, sampling_params)

Output from exllama kernel under 4-bit quantization

total time 1.5789778232574463
average time 1.5789778232574463
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model', token_ids=[13, 3833, 880, 313, 29896, 29929, 29929, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29896, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29941, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29955, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29929, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29896, 29900, 29897, 13, 5618, 338, 2919, 4086, 1904], cumulative_logprob=-24.236552625894547, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model'

Output from triton kernel under 4-bit quantization

total time 6.523277759552002
average time 6.523277759552002
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model', token_ids=[13, 3833, 880, 313, 29896, 29929, 29929, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29896, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29941, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29955, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29929, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29896, 29900, 29897, 13, 5618, 338, 2919, 4086, 1904], cumulative_logprob=-24.21131780743599, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model'

Output from CUDA kernel under 4-bit quantization

total time 2.3482797145843506
average time 2.3482797145843506
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model', token_ids=[13, 3833, 880, 313, 29896, 29929, 29929, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29896, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29941, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29945, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29955, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29900, 29929, 29897, 13, 5618, 338, 2919, 4086, 1904, 29973, 313, 29906, 29900, 29896, 29900, 29897, 13, 5618, 338, 2919, 4086, 1904], cumulative_logprob=-24.14222851395607, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n Milton (1995)\nWhat is large language model?\nWhat is large language model? (2001)\nWhat is large language model? (2003)\nWhat is large language model? (2005)\nWhat is large language model? (2007)\nWhat is large language model? (2009)\nWhat is large language model? (2010)\nWhat is large language model'

Output from CUDA kernel under 3-bit quantization


total time 3.6984071731567383
average time 3.6984071731567383
RequestOutput(request_id=0, prompt='What is large language model?', prompt_token_ids=[2, 1724, 338, 2919, 4086, 1904, 29973], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n2018-01-25 00:31:12\nWhat is large language model?\nA language model is a statistical model that describes the relationship between a set of variables and a set of observations. The variables are called the predictors, and the observations are called the outcomes.\nThe model is used to predict the outcome of interest, given the predictors.\nA large language model is a statistical model that describes the relationship between a set', token_ids=[13, 29906, 29900, 29896, 29947, 29899, 29900, 29896, 29899, 29906, 29945, 29871, 29900, 29900, 29901, 29941, 29896, 29901, 29896, 29906, 13, 5618, 338, 2919, 4086, 1904, 29973, 13, 29909, 4086, 1904, 338, 263, 24148, 1904, 393, 16612, 278, 9443, 1546, 263, 731, 310, 3651, 322, 263, 731, 310, 13917, 29889, 450, 3651, 526, 2000, 278, 8500, 943, 29892, 322, 278, 13917, 526, 2000, 278, 714, 26807, 29889, 13, 1576, 1904, 338, 1304, 304, 8500, 278, 21957, 310, 4066, 29892, 2183, 278, 8500, 943, 29889, 13, 29909, 2919, 4086, 1904, 338, 263, 24148, 1904, 393, 16612, 278, 9443, 1546, 263, 731], cumulative_logprob=-51.450629502534866, logprobs=None, finish_reason=length)], finished=True)
Generated text from vllm: '\n2018-01-25 00:31:12\nWhat is large language model?\nA language model is a statistical model that describes the relationship between a set of variables and a set of observations. The variables are called the predictors, and the observations are called the outcomes.\nThe model is used to predict the outcome of interest, given the predictors.\nA large language model is a statistical model that describes the relationship between a set'

@WoosukKwon (Collaborator)

Hi @JasonZhu1313, is this PR ready for review? If so, could you fix the formatting issue? You can simply run the following in the root dir of the repo:

pip install -r requirements-dev.txt
./format.sh

@JasonZhu1313 (Contributor, Author)

> Hi @JasonZhu1313, is this PR ready for review? If so, could you fix the formatting issue? You can simply run the following in the root dir of the repo:
>
> pip install -r requirements-dev.txt
> ./format.sh

Hey @WoosukKwon, the PR is ready for review. Thanks for the reminder; all checks are passing now.

@JasonZhu1313 (Contributor, Author)

@WoosukKwon Could you help review this PR? Thanks a lot!

@chu-tianxiang (Contributor)

I'm also working on similar things in the gptq_8bit branch; it's not quite ready yet. The code is adapted from exllamav2, which actually contains the main components necessary to accelerate GPTQ models at other bit widths. 3-bit is almost able to match the speed of the 4-bit model.

@JasonZhu1313 (Contributor, Author) commented Dec 28, 2023

> I'm also working on similar things in the gptq_8bit branch; it's not quite ready yet. The code is adapted from exllamav2, which actually contains the main components necessary to accelerate GPTQ models at other bit widths. 3-bit is almost able to match the speed of the 4-bit model.

@chu-tianxiang Thanks for chiming in and sharing your work; I am happy to collaborate on this. Right now my PR is ready for review and tested, so we could work together on getting the ExLlama version in as a follow-up enhancement after this. Or you could merge your changes into this PR and we can co-author it. Either works.

@JasonZhu1313 (Contributor, Author) commented Jan 2, 2024

@chu-tianxiang @WoosukKwon @zhuohan123 Could you help review this PR? Thanks!

@chu-tianxiang (Contributor)

I am not a maintainer of vLLM, but I would suggest moving the AutoGPTQ kernels under vllm_extension if there's no special reason not to. Besides, I think the Triton kernels and some CUDA kernels should work for all precisions, so maybe this could address this issue as well?

Beyond that, it's really @WoosukKwon's decision regarding the direction of future quantization development: whether to incorporate a range of different kernel implementations, whether Triton will be preferred, and whether to adopt unified packing and kernel usage across different quantization methods like AWQ and GPTQ.

@lapp0 commented Jan 9, 2024

In docker build,

193.3 ERROR: Cannot install -r requirements.txt (line 8) and triton==2.0.0 because these package versions have conflicting dependencies.


193.3 The conflict is caused by:
193.3     The user requested triton==2.0.0
193.3     torch 2.1.2 depends on triton==2.1.0; platform_system == "Linux" and platform_machine == "x86_64"

Is there a reason the triton version can't be 2.1.0? I upgraded the dependency on my end, but I've only tested the CUDA kernel.

@JasonZhu1313 (Contributor, Author)

> In docker build,
>
> 193.3 ERROR: Cannot install -r requirements.txt (line 8) and triton==2.0.0 because these package versions have conflicting dependencies.
>
> 193.3 The conflict is caused by:
> 193.3     The user requested triton==2.0.0
> 193.3     torch 2.1.2 depends on triton==2.1.0; platform_system == "Linux" and platform_machine == "x86_64"
>
> Is there a reason the triton version can't be 2.1.0? I upgraded the dependency on my end, but I've only tested the CUDA kernel.

Hey @lapp0, I think 2.1.0 works; you can change the dependency in requirements.txt to

triton

instead of using a pinned version.
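
For reference, the change amounts to unpinning the line in requirements.txt, shown here as an illustrative diff (the exact line position in the file may differ):

-triton==2.0.0
+triton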

@lapp0 commented Jan 18, 2024

Smoke tested CUDA and ExLlama kernels on A100. Saw a substantial memory reduction. Worked without problems.

@JasonZhu1313 (Contributor, Author)

> Smoke tested CUDA and ExLlama kernels on A100. Saw a substantial memory reduction. Worked without problems.

cc @WoosukKwon @simon-mo

@hmellor (Member) commented Mar 6, 2024

Should this be closed as this functionality was added by #2330?

@simon-mo (Collaborator) commented Mar 8, 2024

Closing as #2330 (and Marlin) is merged. However, we look forward to a separate PR if this PR does have better kernels!

@simon-mo closed this Mar 8, 2024