quantization_cpu base version #1190

tombawor · 2024-09-23T21:46:38Z

Before submitting

Was this discussed/approved via a Github issue? Fixes quantization: process tensors on meta device directly, maybe implement CPU quantization (if it is easy) #1111.
Did you read the contributor guideline, Pull Request section?
Did you make sure to update the docs? No docs were updated as this is primarily a code implementation.
Did you write any new necessary tests? Yes, relevant tests for CPU quantization were added.

What does this PR do?

This PR addresses Issue #1111 by introducing an implementation of CPU-based 4-bit quantization in quantization_cpu.py. The code is adapted from a not-yet-merged branch of the bitsandbytes library. Key highlights include:

Implementing quantize_weight for CPU without returning the quantization state to optimize for inference.
A note has been added in the code to revisit returning the quantization state if fine-tuning or dequantization becomes necessary in future use cases.

PR review

Anyone in the community is welcome to review this PR once all tests have passed. Given the connection to issue #1111, feedback and suggestions for improvements are appreciated. If the PR hasn't been discussed in the issue, please reach out for context to improve its chances of merging.

Did you have fun?

Absolutely 🙃

for more information, see https://pre-commit.ci

t-vi

Hi @tombawor . Great first PR! Thank you and welcome!

I added a few comments for discussion.

Thanks again for taking the initiative on this and digging up the quantization code.

t-vi · 2024-09-24T07:58:17Z

thunder/tests/test_quantization.py

@@ -0,0 +1,81 @@
+import torch


Awesome to have comprehensive tests!
I think these could go into test_transforms (where there are also some tests that seem to need an update).

t-vi · 2024-09-24T08:01:41Z

thunder/tests/test_quantization.py

+    ), "Quantized tensor should have fewer or equal elements due to compression"
+
+
+# Optional: Performance tests


Hey, I think these are cool to demonstrate the value, but tests are maybe not the best place for it.
If we want to show the performance, maybe we could create a quantization notebook or so? Similar to the "what's going on under the hood in fsdp".

t-vi · 2024-09-24T08:04:27Z

thunder/transforms/quantization.py

-            w_work = torch.zeros_like(w, device="cuda")
-        elif w.device.type != "cuda":
+            num_elements = w.numel()
+            return torch.empty((num_elements, 1), device="meta", dtype=torch.uint8)


I think the formula to computue the size is not quite right?
Also, the quantize_weight function needs to return both the quantized weight and a quantization state.
Maybe it would be possible to use (or adapt) the CPU code in impl to handle meta?

t-vi · 2024-09-24T08:04:53Z

thunder/transforms/quantization.py

+        # if future use cases require more flexibility, such as further model training or analysis
+        # of quantization effects on the CPU.
+        if w.device.type == "cpu":
+            return quantize_4bit_impl(w, quant_type="nf4")[0]


I think we want both the quantized weight and quantization state.

t-vi · 2024-09-24T08:06:09Z

thunder/transforms/quantization_cpu.py

@@ -0,0 +1,224 @@
+# NOTE: The code for CPU quantization in this file has been adapted from a not-yet-merged branch of the


Can you please add the original copyright notice and a link to the license?
(also, referring to permalinks instead of branches is a bit more reliable)

into Issue-1111

…under into Issue-1111

into Issue-1111

for more information, see https://pre-commit.ci

…under into Issue-1111

t-vi · 2024-10-18T19:18:32Z

Hey, so something seems still up.
But there is some progress outside:

@ali-alshaar7 fixed PyTorch Lightning to work with newer bitsandbytes update BitsandBytes version pytorch-lightning#20313
if we get a version of litgpt using that and allowing newer bitsandbytes, we can bump our dependency if we want.

into Issue-1111

tombawor · 2024-10-21T11:22:01Z

The multi-backend refactor branch is still not included in the latest release of bitsandbytes as of version 0.44.x.
Last changes for this PR contains:

absmax dtype consistency. There is torch.float32 on CUDA.
quantize out tensor dimension consistency. There is out tensor as 2D shape [N, 1] on CUDA.
With above changes test_transforms.py::test_quantization_on_meta PASSED on version before last change with move litgpt imports into tests.

into Issue-1111

quantization_cpu base version

1e28a0d

tombawor requested review from mruberry, lantiga and t-vi as code owners September 23, 2024 21:46

[pre-commit.ci] auto fixes from pre-commit.com hooks

688c323

for more information, see https://pre-commit.ci

t-vi reviewed Sep 24, 2024

View reviewed changes

Tomasz Bawor and others added 8 commits October 3, 2024 17:56

Merge branch 'main' of https://github.com/Lightning-AI/lightning-thunder

80ffa9f

into Issue-1111

quantize_weight update for meta andd cpu

d3bcbf9

Merge branch 'Issue-1111' of https://github.com/tombawor/lightning-th…

fb86f37

…under into Issue-1111

Merge branch 'main' of https://github.com/Lightning-AI/lightning-thunder

8896d57

into Issue-1111

update

cdb1116

[pre-commit.ci] auto fixes from pre-commit.com hooks

d0b683f

for more information, see https://pre-commit.ci

Merge branch 'Issue-1111' of https://github.com/tombawor/lightning-th…

250f666

…under into Issue-1111

meta and cpu shape update

63fc6d4

Tomasz Bawor added 2 commits October 21, 2024 11:52

Merge branch 'main' of https://github.com/Lightning-AI/lightning-thunder

eab93bb

into Issue-1111

META, CPU and CUDA consistency changes

b4e7dae

Tomasz Bawor added 3 commits October 31, 2024 20:43

Merge branch 'main' of https://github.com/Lightning-AI/lightning-thunder

4186b0a

into Issue-1111

bitsandbytes import update

4b15a72

update

dbca6e9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

quantization_cpu base version #1190

quantization_cpu base version #1190

tombawor commented Sep 23, 2024

t-vi left a comment

t-vi Sep 24, 2024

t-vi Sep 24, 2024

t-vi Sep 24, 2024

t-vi Sep 24, 2024

t-vi Sep 24, 2024

t-vi commented Oct 18, 2024

tombawor commented Oct 21, 2024

		), "Quantized tensor should have fewer or equal elements due to compression"


		# Optional: Performance tests

		@@ -0,0 +1,224 @@
		# NOTE: The code for CPU quantization in this file has been adapted from a not-yet-merged branch of the

quantization_cpu base version #1190

Are you sure you want to change the base?

quantization_cpu base version #1190

Conversation

tombawor commented Sep 23, 2024

What does this PR do?

PR review

Did you have fun?

t-vi left a comment

Choose a reason for hiding this comment

t-vi Sep 24, 2024

Choose a reason for hiding this comment

t-vi Sep 24, 2024

Choose a reason for hiding this comment

t-vi Sep 24, 2024

Choose a reason for hiding this comment

t-vi Sep 24, 2024

Choose a reason for hiding this comment

t-vi Sep 24, 2024

Choose a reason for hiding this comment

t-vi commented Oct 18, 2024

tombawor commented Oct 21, 2024