8bit quantization #3261
Does vLLM support 8-bit quantization? We need to use vLLM with a large context window (>1K tokens). We tried AWQ, but the generation quality is not good. Any pointers would be greatly appreciated.

Comments
Try GPTQ? We support 2/3/4/8 bits.
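For reference, loading a pre-quantized GPTQ checkpoint with vLLM's offline API looks roughly like the sketch below. The checkpoint name and sampling settings are illustrative, not a recommendation; any GPTQ model on the Hub should work the same way.

```python
# Minimal sketch: generating from a GPTQ-quantized model with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-GPTQ",  # illustrative pre-quantized GPTQ checkpoint
    quantization="gptq",                    # load the GPTQ weights instead of FP16
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain 8-bit quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```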
@simon-mo is it possible to support eetq, like huggingface/text-generation-inference? https://github.com/NetEase-FuXi/EETQ It's super useful because you don't even need an offline quantization step: you just point it at a normal unquantized model and pass a flag at launch. Here's the PR where they added it in TGI:
Good idea. Is it possible to also integrate the W4A16 kernel optimization from TensorRT-LLM?
That's a good idea. EETQ works out of the box and we'd like to integrate it into vLLM.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!