
[Feature]: bitsandbytes support #4033

Closed
orellavie1212 opened this issue Apr 12, 2024 · 26 comments

Comments

@orellavie1212
Contributor

orellavie1212 commented Apr 12, 2024

🚀 The feature, motivation and pitch

Bitsandbytes 4bit quantization support.
I know many want this, and it has been discussed before and marked as unplanned, but I looked at how TGI implemented it:
https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/layers.py#L285
And TGI is, of course, based on vLLM.

Alternatives

I know that GPTQ is a better quantization than bitsandbytes 4-bit, but bitsandbytes is great for QLoRA-merged PEFT models, and it is almost impossible to GPTQ/AWQ-quantize a bitsandbytes 4-bit model (and I am not even talking about the NF4 vs. FP4 perplexity question), since those tools don't officially support it. Others sometimes manage to quantize a merged bitsandbytes QLoRA model to GPTQ or AWQ, but I, for example, don't.

Additional context

As I mentioned above,
https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/layers.py#L285
It looks like a fairly simple implementation of the Linear4bit class for bitsandbytes. I could open a PR to vLLM myself; I just wondered why it hasn't been added yet. Am I missing something?
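
For reference, here is a rough sketch (not the TGI or vLLM code itself; class and argument names are illustrative) of what such a bitsandbytes Linear4bit wrapper can look like when built on the public bitsandbytes APIs, in the spirit of the TGI layer linked above:

```python
# Rough sketch of a bnb 4-bit linear wrapper; illustrative, not the TGI/vLLM code.
from typing import Optional

import torch
import bitsandbytes as bnb
from bitsandbytes.nn import Params4bit


class Linear4bit(torch.nn.Module):
    def __init__(self, weight: torch.Tensor, bias: Optional[torch.Tensor],
                 quant_type: str = "nf4"):
        super().__init__()
        # Params4bit packs the fp16/bf16 weight into 4-bit blocks when it is
        # moved to the GPU; the scaling factors live in weight.quant_state.
        self.weight = Params4bit(weight.data, requires_grad=False,
                                 quant_type=quant_type)
        self.bias = bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # matmul_4bit dequantizes block-wise on the fly using the saved quant_state.
        return bnb.matmul_4bit(x, self.weight.t(), bias=self.bias,
                               quant_state=self.weight.quant_state)
```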

@EvilPsyCHo
Contributor

BNB 4-bit is a very useful feature. Many models don't have GPTQ or AWQ quantized versions, and it takes some hard work to quantize a large model using post-training methods.

Everyone knows post-training quantization gives better performance, but many people like me don't care about the small quality loss when trying out a demo product.

@EvilPsyCHo
Contributor

After the release of Llama 3, I can only run the 8B version with vLLM, and I have to switch to Ollama to run the 70B version.

@oushu1zhangxiangxuan1
Contributor

want +1

@kevaldekivadiya2415

+1

@Lu0Key

Lu0Key commented Apr 27, 2024

want +1

@timbmg

timbmg commented Apr 27, 2024

+1

Would be great to run CohereForAI/c4ai-command-r-plus-4bit.

@cheney369

+1

@warlockedward

+1

@aaron-imani

+1

@javierquin

It would be very useful for QLoRA fine-tuned models. Is there a roadmap for this addition?

@dhruvil237

+1

@dariemp

dariemp commented May 6, 2024

+1

@qashzar

qashzar commented May 6, 2024

+1

@salt00n9

salt00n9 commented May 8, 2024

+1

@qdm12

qdm12 commented May 10, 2024

Please stop commenting +1; just react to the original post with the thumbs-up emoji. Such comments don't add any value and they notify everyone subscribed to this issue.

@jeejeelee
Contributor

Refer to: #4776

@Vegetable-Chicken-Coder

want +1

@duchengyao

Related to #3339

@epignatelli

What's required to implement this? FP4 and NF4 support?

It seems like bnb uses a format with 2 exponent bits and 1 mantissa bit for FP4:
https://github.com/TimDettmers/bitsandbytes/blob/25abf8d95f8a33f38e2ce6f637768b442379ccd9/bitsandbytes/functional.py#L1049-L1059
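
For context, a minimal sketch (assuming a CUDA device and bitsandbytes installed; the calls below are from the public bitsandbytes.functional API) showing how the fp4 and nf4 code books are selected through the quant_type argument:

```python
# Illustrative only: compare bnb's fp4 and nf4 4-bit code books on a random weight.
import torch
import bitsandbytes.functional as F

w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Block-wise 4-bit quantization; quant_type picks the 16-entry code book.
w_fp4, state_fp4 = F.quantize_4bit(w, blocksize=64, quant_type="fp4")
w_nf4, state_nf4 = F.quantize_4bit(w, blocksize=64, quant_type="nf4")

# Dequantize and compare reconstruction error of the two code books.
err_fp4 = (F.dequantize_4bit(w_fp4, state_fp4) - w).abs().mean().item()
err_nf4 = (F.dequantize_4bit(w_nf4, state_nf4) - w).abs().mean().item()
print(f"mean abs error  fp4={err_fp4:.5f}  nf4={err_nf4:.5f}")
```

For roughly Gaussian weights, nf4 usually reconstructs with lower error, which is why QLoRA defaults to it.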

@hmellor hmellor mentioned this issue May 20, 2024
@flaviusburca

+1

@jeejeelee
Contributor

Hi, those who need this feature should check out what @chenqianfzh is working on here: #4776

chenqianfzh added a commit to bd-iaas-us/vllm that referenced this issue May 29, 2024
@VpkPrasanna

Hi team, when can we expect this feature?

@devlup

devlup commented Jul 1, 2024

+1, any update on this? It seems @chenqianfzh's #4776 is not working with Llama 3.

@hmellor
Collaborator

hmellor commented Jul 4, 2024

bitsandbytes is now supported: https://docs.vllm.ai/en/latest/quantization/supported_hardware.html
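
A minimal usage sketch (the model name is only an example; assumes a vLLM build with the bitsandbytes load format and bitsandbytes installed):

```python
# Illustrative: load a model with in-flight bitsandbytes 4-bit quantization.
from vllm import LLM, SamplingParams

llm = LLM(
    model="huggyllama/llama-7b",   # example model id
    quantization="bitsandbytes",   # quantize weights with bnb at load time
    load_format="bitsandbytes",    # stream weights through the bnb loader
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```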

@devlup

devlup commented Jul 8, 2024

It's not working for Llama 3. In https://github.com/bd-iaas-us/vllm/blob/e16bcb69495540b21a3bd9423cdd5df8a78405ea/tests/quantization/test_bitsandbytes.py, replace the model with Llama 3 8B and the tests fail. @hmellor @chenqianfzh

@junzhang-zj

@hmellor, how do you load in 8-bit? This version seems to only support 4-bit loading via quantization="bitsandbytes", load_format="bitsandbytes"?
