
Support 4bit on CPU backend #1206

Conversation


@Xia-Weiwen Xia-Weiwen commented May 10, 2024

Adds implementations of the following ops for the CPU backend:

  • quantize_4bit
  • dequantize_4bit
  • gemv_4bit

Limitations:

  • quant_storage must be torch.uint8
  • compress_statistics is not supported yet (bnb_4bit_use_double_quant must be false)
  • fp4 is currently slow because there is no fused kernel yet.

Difference from CUDA implementation:

  • On the CPU backend, A is not required to be a vector to use the fused dequant-gemm kernel, whereas CUDA does require that. The op is therefore still called gemv_4bit, but on the CPU backend it actually performs a GEMM.
  • Different numerical accuracy due to different kernel implementations

Here is an example code snippet for running Hugging Face models with 4-bit on the CPU backend: https://gist.github.com/Xia-Weiwen/592d6e24e03f904a18692b3e27794c53. You will have to bypass the CUDA checks in transformers to run it.
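For reference, a minimal sketch (not from the linked gist) of how the new ops might be exercised directly through the functional API on CPU, assuming quantize_4bit / dequantize_4bit keep the same signatures as on the CUDA path; shapes and dtypes here are illustrative only:

import torch
import bitsandbytes.functional as F

# Quantize a bf16 weight tensor to NF4 on CPU, then dequantize it back.
# Per the limitations above: quant_storage must be torch.uint8 and
# double quantization (compress_statistics) must stay off.
A = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cpu")
q, state = F.quantize_4bit(A, quant_type="nf4", quant_storage=torch.uint8,
                           compress_statistics=False)
A_dq = F.dequantize_4bit(q, state, quant_type="nf4")
print("max abs error:", (A.float() - A_dq.float()).abs().max().item())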


cc @jiqing-feng @jgong5 @jianan-gu

@Xia-Weiwen Xia-Weiwen changed the title [WIP] Support NF4 on CPU backend [WIP] Support 4bit on CPU backend May 11, 2024
Comment on lines +450 to +452
out_dq = torch.empty(out_uint8.shape).to(quant_state.dtype)
for i in range(len(quant_state.code)):
    out_dq[out_uint8 == i] = quant_state.code[i]

Contributor:

Using index select will be faster: out_dq = quant_state.code[out_uint8.to(torch.int32)].
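As a standalone illustration of the two approaches (tensor and table names below are made up for the example, not taken from the PR):

import torch

code = torch.linspace(-1.0, 1.0, 16)                          # stand-in for quant_state.code
out_uint8 = torch.randint(0, 16, (64, 64), dtype=torch.uint8)

# current approach: scan all 16 code values and fill matching positions
out_loop = torch.empty(out_uint8.shape).to(code.dtype)
for i in range(len(code)):
    out_loop[out_uint8 == i] = code[i]

# suggested approach: a single indexed lookup into the code table
out_index = code[out_uint8.to(torch.int32)]

print(torch.equal(out_loop, out_index))  # True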

Author:

It looks like the torch.compile result of this code gives wrong results, and removing torch.compile lowers performance. Let's keep this implementation for now.

Contributor:

A bug in torch.compile? Can you submit a bug to PyTorch? I will try to fix it.

Author:

However, I cannot reproduce the issue with the script below. It may need more investigation.

import torch


NF4_DEQUANT_TABLE = torch.Tensor([
  -1.0,
  -0.6961928009986877,
  -0.5250730514526367,
  -0.39491748809814453,
  -0.28444138169288635,
  -0.18477343022823334,
  -0.09105003625154495,
  0.0,
  0.07958029955625534,
  0.16093020141124725,
  0.24611230194568634,
  0.33791524171829224,
  0.44070982933044434,
  0.5626170039176941,
  0.7229568362236023,
  1.0,
])


@torch.compile
def dequant_nf4_compile(t_in: torch.Tensor, out_dtype):
  return NF4_DEQUANT_TABLE[t_in.to(torch.int)].to(out_dtype)


def dequant_nf4_eager(t_in: torch.Tensor, out_dtype):
  return NF4_DEQUANT_TABLE[t_in.to(torch.int)].to(out_dtype)


x = torch.randint(0, 16, (1024, 1024), dtype=torch.uint8)

# the compiled version is called twice: a warm-up call, then the run that is compared
y1 = dequant_nf4_compile(x, torch.bfloat16)
y1 = dequant_nf4_compile(x, torch.bfloat16)
y2 = dequant_nf4_eager(x, torch.bfloat16)

print(torch.equal(y1, y2))
print("max diff =", torch.abs(y1 - y2).max())

@Xia-Weiwen Xia-Weiwen changed the title [WIP] Support 4bit on CPU backend Support 4bit on CPU backend May 21, 2024
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review May 21, 2024 03:02

jiqing-feng commented May 23, 2024

Hi @Titus-von-Koeller. Here are the test results of this PR on an Intel 4th Gen Xeon CPU:
[benchmark results image]

The big difference between NF4 and FP4 is that we can use fused ops for NF4, but they are not yet available for FP4. FP4 will also support fused ops and should reach the same performance as NF4, possibly in the next IPEX release. Would you please review it? Thanks!

Test script:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import time

MAX_NEW_TOKENS = 64
model_id = "meta-llama/Llama-2-7b-chat-hf"

text = 'I am happy because'
tokenizer = AutoTokenizer.from_pretrained(model_id)
input_ids = tokenizer(text, return_tensors="pt").input_ids

print('Loading model {}...'.format(model_id))
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_quant_type="fp4",
                                         bnb_4bit_use_double_quant=False,
                                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, torch_dtype=torch.bfloat16)
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

print('model dtype = {}'.format(model.dtype))

with torch.no_grad():
    # warmup
    model.generate(input_ids, max_length=MAX_NEW_TOKENS)
    model.generate(input_ids, max_length=MAX_NEW_TOKENS)
    print("warm-up complite")
    t0 = time.time()
    generated_ids = model.generate(input_ids, max_length=MAX_NEW_TOKENS, do_sample=False, num_beams=1)
    latency = time.time() - t0
    print(input_ids.shape)
    print(generated_ids.shape)
    result = "| latency: " + str(round(latency * 1000, 3)) + " ms |"
    print('+' + '-' * (len(result) - 2) + '+')
    print(result)
    print('+' + '-' * (len(result) - 2) + '+')

output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"output: {output}")

@Titus-von-Koeller (Collaborator)

Dear @Xia-Weiwen et al,

Unfortunately we're (mostly me alone) quite resource-constrained and humbled by the workload associated with the multi-backend-refactor. I just talked with my colleague @younesbelkada about how to best handle the next steps.

We both took a look at this PR and the one from AMD and think that, at first glance, everything looks really good. At this time, neither Younes nor I is in a position to give detailed feedback, and I need to focus on concretizing the path forward for integrating with the PyTorch dispatcher (tensor-driven dispatch, as requested) through the torch.library Python-level APIs. After extensive research and yesterday's consultation with three PyTorch devs at Meta who are experts on the topic, I need to focus on making this new input concrete.

However, for the purpose of iterative progress (as agreed in our prior conversations), we've decided to go ahead and merge both the open Intel and AMD branches into multi-backend-refactor, where interested parties can then compile from source and give the new functionality (we're so excited about and grateful for this!) thorough testing.

Once we've made some progress on the torch.library-based refactor, I'll focus on enabling nightly releases for that branch as well. We're also looking forward to your feedback on this torch.library / tensor-driven dispatch topic once there is code on the basis of which to discuss (and to refactor the backend-specific code towards that new target, once we've all agreed that this is the right path).

Among other things, there has also been extensive ongoing work in the background on moving BNB to a new independent/non-profit GitHub org, under the umbrella of Hugging Face and with the support of their infra team for managing the complexities of the CI/CD backend and runners. We're also working to make GitHub runners for the different hardware platforms a reality (thanks for your help on that!).

Thanks again for the good work and active collaboration! ❤️ 🚀

@Titus-von-Koeller Titus-von-Koeller merged commit 701c5aa into bitsandbytes-foundation:multi-backend-refactor May 24, 2024
1 of 2 checks passed

Titus-von-Koeller commented May 24, 2024

P.S. Also see this: README: asking for help from volunteer alpha testers

Let us know if you have further thoughts on this and on how you think it would be best to communicate about it.

@Xia-Weiwen Xia-Weiwen requested a review from jgong5 May 27, 2024 00:46

Xia-Weiwen commented May 27, 2024

Hi @Titus-von-Koeller, thanks a lot for your help on this. We are glad to provide feedback on the adoption of torch.library. Please let us know when there is any update.
Also, we would love to volunteer as alpha testers and run regular tests on Intel CPU/GPU. I think we will need to align on many aspects of the tests, such as the test code base, methods, frequency, scope, and how we sync and publish the results. Maybe we can create an issue to track this. I will discuss with my colleagues and come back later.


jgong5 commented May 27, 2024

> At this time, neither Younes nor I is in a position to give detailed feedback, and I need to focus on concretizing the path forward for integrating with the PyTorch dispatcher (tensor-driven dispatch, as requested) through the torch.library Python-level APIs. After extensive research and yesterday's consultation with three PyTorch devs at Meta who are experts on the topic, I need to focus on making this new input concrete.

Hi @Titus-von-Koeller, may I learn more details about how you are going to refactor things via torch.library? I guess this is one of the official ways of integrating native backend implementations with PyTorch, providing Python bindings and a backend dispatching mechanism. I shared similar comments earlier: #898 (comment).

Meanwhile, it would also be beneficial to allow backend integration without explicitly adding native code to bitsandbytes, e.g., optimizing via torch.compile as this PR does, or integrating via third-party Python extensions like "ipex" (Intel Extension for PyTorch). This is a more lightweight approach than adding native code.
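As a rough illustration of what registering a backend-specific kernel through the torch.library Python APIs could look like (the namespace, op name, and implementation below are invented for the example and are not part of any actual bitsandbytes plan):

import torch

# Define a custom op in a demo namespace and register a CPU implementation.
lib = torch.library.Library("bnb_demo", "DEF")
lib.define("dequantize_nf4(Tensor A, Tensor code) -> Tensor")

def dequantize_nf4_cpu(A, code):
    # CPU path: plain table lookup; other backends would register their own kernels
    return code[A.to(torch.int32)]

lib.impl("dequantize_nf4", dequantize_nf4_cpu, "CPU")

x = torch.randint(0, 16, (4, 4), dtype=torch.uint8)
table = torch.linspace(-1.0, 1.0, 16)
# Dispatch picks the CPU kernel because the inputs live on CPU.
y = torch.ops.bnb_demo.dequantize_nf4(x, table)
print(y.shape, y.dtype)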
