[Feature]: bitsandbytes support #4033
Comments
BNB 4-bit is a very useful feature. Many models don't have GPTQ or AWQ quantized versions, and quantizing a large model with post-training methods takes real work. Everyone knows post-training quantization gives better performance, but many people like me don't care about the small quality loss when trying out a demo.
After the release of Llama 3, I can only run the 8B version with vLLM; I have to switch to Ollama to run the 70B version.
want +1
+1
want +1
+1 Would be great to run CohereForAI/c4ai-command-r-plus-4bit.
+1
+1
+1
It would be very useful for QLoRA fine-tuned models. Is there a roadmap for this addition?
+1
+1
+1
+1
Please stop commenting.
Refer to: #4776
want +1
Related to #3339.
What's required to implement this? FP4 and NF4 support? It seems like bnb uses a 2-exponent-bit, 1-mantissa-bit format for FP4.
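For anyone curious about the two 4-bit formats, here is a quick sketch using bitsandbytes' functional API to compare them; it assumes a CUDA GPU with bitsandbytes and torch installed, and the tensor shape is illustrative:

```python
# Minimal sketch: round-trip a weight tensor through bnb's blockwise
# 4-bit quantization for both quant types (fp4 and nf4).
import torch
import bitsandbytes.functional as F

w = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

for quant_type in ("fp4", "nf4"):
    # quantize_4bit packs two 4-bit codes per byte and returns per-block
    # absmax statistics in `state`, which are needed to dequantize later.
    w_4bit, state = F.quantize_4bit(w, blocksize=64, quant_type=quant_type)
    w_restored = F.dequantize_4bit(w_4bit, state)
    err = (w - w_restored).abs().mean().item()
    print(f"{quant_type}: mean abs error = {err:.4f}")
```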
+1
Hi, those who need this feature should check out what @chenqianfzh is working on here: #4776
Hi team, when can we expect this feature?
+1, any update on this? It seems @chenqianfzh's #4776 is not working with Llama 3.
|
It's not working for Llama 3: take https://github.com/bd-iaas-us/vllm/blob/e16bcb69495540b21a3bd9423cdd5df8a78405ea/tests/quantization/test_bitsandbytes.py and replace the model with Llama 3 8B, and the tests fail. @hmellor @chenqianfzh
@hmellor, how do you load in 8-bit? This version seems to only be able to load in 4-bit via
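Not a vLLM answer, but for reference, the 8-bit vs. 4-bit switch on the Hugging Face transformers side looks roughly like this (a sketch assuming a recent transformers with bitsandbytes installed; the model id is illustrative, and vLLM's PR may expose different options):

```python
# Sketch of the Hugging Face transformers loading path (not vLLM's API):
# BitsAndBytesConfig selects between bnb's 8-bit (LLM.int8()) and 4-bit
# (fp4/nf4) code paths.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit: weights stored as int8 with LLM.int8() outlier handling.
cfg_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit: weights stored as packed fp4/nf4 codes with per-block absmax.
cfg_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",       # illustrative model id
    quantization_config=cfg_8bit,       # or cfg_4bit
    device_map="auto",
)
```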
🚀 The feature, motivation and pitch
Bitsandbytes 4-bit quantization support.
I know many people want this, and it has been discussed before and marked as unplanned, but then I looked at how TGI implemented it:
https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/layers.py#L285
And TGI is based on vLLM, of course.
Alternatives
I know that GPTQ gives better quantization quality than b&b 4-bit, but b&b is great for QLoRA-merged PEFT models, while it is almost impossible to GPTQ/AWQ-quantize a b&b 4-bit model (and I'm not even getting into the nf4-vs-fp4 perplexity question), since that isn't officially supported (others sometimes do successfully quantize a merged b&b QLoRA model to GPTQ or AWQ, but I, for example, don't).
Additional context
As I mentioned above,
https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/layers.py#L285
looks like a very simple implementation of the Linear4bit class for b&b. I could open a PR to vLLM myself; I just wondered why it hasn't been added yet. Maybe I'm missing something?
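For reference, the TGI layer linked above boils down to something like the following simplified sketch (based on TGI's Linear4bit, not vLLM code; it relies on bitsandbytes' public Params4bit and matmul_4bit API and assumes the weight already lives on a CUDA device):

```python
# Simplified sketch of a TGI-style Linear4bit wrapper around bitsandbytes.
from typing import Optional

import torch
import bitsandbytes as bnb
from bitsandbytes.nn import Params4bit


class Linear4bit(torch.nn.Module):
    def __init__(self, weight: torch.Tensor, bias: Optional[torch.Tensor],
                 quant_type: str = "nf4"):
        super().__init__()
        # Params4bit packs the weight into 4-bit codes (with per-block
        # absmax statistics) when it is moved onto a CUDA device.
        self.weight = Params4bit(weight.data, requires_grad=False,
                                 quant_type=quant_type)
        self.weight.cuda(weight.device)
        self.bias = bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # matmul_4bit dequantizes blockwise on the fly using quant_state.
        return bnb.matmul_4bit(x, self.weight.t(), bias=self.bias,
                               quant_state=self.weight.quant_state)
```

The heavy lifting (packing, absmax bookkeeping, fused dequant-matmul) is all inside bitsandbytes, which is why the TGI layer is so short; the real work for vLLM would presumably be wiring this into its tensor-parallel linear layers and weight loader.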