Add GPTQ quantization kernels for 2, 3, 8-bit use cases #2223
Conversation
Hi @JasonZhu1313, is this PR ready for review? If so, could you fix the formatting issue? You can simply run the formatting script in the root dir of the repo.
Hey @WoosukKwon, the PR is ready for review. Thanks for the reminder; all checks are passing now.
@WoosukKwon Could you help review this PR? Thanks a lot!
I'm also working on similar things in the gptq_8bit branch. It's not quite ready yet. The code is adapted from exllamav2, which contains the main components necessary to accelerate GPTQ models at other bit widths; 3-bit is almost able to match the speed of the 4-bit model.
@chu-tianxiang Thanks for chiming in and sharing your work. I am happy to collaborate on this. Right now my PR is ready for review and tested; we could work together on getting the exllama version in as a follow-up enhancement after this, or you could merge your changes into this PR and we can co-author it. Either works.
@chu-tianxiang @WoosukKwon @zhuohan123 Could you help review this PR? Thanks!
I am not a maintainer of vLLM, but I would suggest moving the AutoGPTQ kernels under the existing quantization kernel directory. It's also more about @WoosukKwon's decision regarding the direction of future development on quantization, such as whether to incorporate a range of different kernel implementations, whether Triton will be preferred, and whether to adopt unified packing and kernel usage across different quantization methods like AWQ and GPTQ.
In the docker build, is there a reason the triton version can't be 2.1.0? I upgraded the dependency on my end, but I've only tested the CUDA kernel.
Hey @lapp0, I think 2.1.0 works, and you can change the dependency in requirements.txt to a version constraint that allows 2.1.0 instead of using a pinned version.
Smoke tested the CUDA and ExLlama kernels on an A100. Saw a substantial memory reduction. Worked without problems.
Should this be closed as this functionality was added by #2330? |
Closing as #2330 (and Marlin) is merged. However, we look forward to a separate PR if this PR does have better kernels!
Earlier, there was an awesome PR #916 on supporting the GPTQ Exllama kernel in a 4-bit quantization setup. This PR introduces additional kernels for use cases with different quantization bit widths, sourced from the AutoGPTQ repository, which is used by HF for GPTQ quantization.
The same kernels can also be leveraged by our recent post-training quantization work, QuantEase (https://arxiv.org/abs/2309.01885; we'll release the QuantEase algorithm repo soon), where we achieved better zero-shot accuracy for 3-bit quantization.
We are adding two additional flags to GPTQConfig that are well aligned with the AutoGPTQ & HF convention.
Test:
Tested on a Llama 7B model.
You need to add the additional args to the saved quantize_config.json after GPTQ quantization; a sketch of an example is shown below.
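The snippet below is a minimal sketch written in Python, assuming AutoGPTQ's standard quantize_config.json fields (bits, group_size, desc_act, sym, true_sequential); the final key is a hypothetical placeholder for the kernel-selection flags added by this PR, not their actual names.

```python
import json

# Minimal sketch of an AutoGPTQ-style quantize_config.json for a 3-bit run.
quantize_config = {
    "bits": 3,               # 2, 3, 4, or 8 with the kernels in this PR
    "group_size": 128,
    "desc_act": False,
    "sym": True,
    "true_sequential": True,
    # Hypothetical placeholder for the additional kernel-selection flags
    # introduced by this PR; the real flag names come from its GPTQConfig changes.
    "kernel_backend": "cuda",
}

with open("quantize_config.json", "w") as f:
    json.dump(quantize_config, f, indent=2)
```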
Test script (a minimal sketch is shown below):
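A minimal sketch of such a smoke test using vLLM's offline API, assuming a locally saved GPTQ-quantized Llama 7B checkpoint; the model path and prompts are placeholders.

```python
from vllm import LLM, SamplingParams

# Hypothetical path to a GPTQ-quantized Llama 7B checkpoint whose
# quantize_config.json carries the extra args described above.
MODEL_PATH = "./llama-7b-gptq-3bit"

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# quantization="gptq" makes vLLM load the GPTQ weights and dispatch to the
# kernel matching the configured bit width.
llm = LLM(model=MODEL_PATH, quantization="gptq")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```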
Output from exllama kernel under 4-bit quantization
Output from triton kernel under 4-bit quantization
Output from CUDA kernel under 4-bit quantization
Output from CUDA kernel under 3-bit quantization