
support W4A8 Marlin kernel #1113

Merged: 1 commit merged into pytorch:main on Nov 14, 2024

Conversation

HandH1998
Contributor

Summary

We introduce a mixed-precision GEMM kernel for INT4 weights and INT8 activations (W4A8), implemented on top of Marlin GEMM. The kernel is designed to support our W4A8 quantization method QQQ. For more details on the kernel implementation, you can refer to our paper. The kernel demonstrates excellent performance and has already been merged into the official vLLM project (see vllm-project/vllm#5218).

We hope the W4A8 GEMM can also provide a practical speedup for other W4A8 quantization methods in the community.
Additionally, since torchao is widely used by frameworks such as SGLang, those frameworks can gain W4A8 support once the kernel is integrated into torchao.
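
To make the setup concrete, below is a minimal, unfused PyTorch sketch of the arithmetic the kernel fuses: per-token symmetric INT8 activations, per-channel symmetric INT4 weights, and an FP16 output. This is illustrative only; it is not the kernel or the QQQ algorithm itself, and the names and shapes are made up for the example.

```python
import torch

def w4a8_reference_gemm(x_fp16: torch.Tensor, w_fp16: torch.Tensor) -> torch.Tensor:
    """Unfused reference for a symmetric W4A8 GEMM: per-token INT8 activations,
    per-channel INT4 weights, FP16 output. The fused Marlin QQQ kernel performs
    all of this in a single INT8 tensor-core GEMM."""
    # Per-token symmetric INT8 quantization of the (M, K) activations.
    x_scale = x_fp16.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / 127.0
    x_int8 = torch.clamp(torch.round(x_fp16 / x_scale), -128, 127)

    # Per-output-channel symmetric INT4 quantization of the (K, N) weights
    # (the real kernel additionally packs these values two per byte).
    w_scale = w_fp16.abs().amax(dim=0, keepdim=True).clamp(min=1e-5) / 7.0
    w_int4 = torch.clamp(torch.round(w_fp16 / w_scale), -8, 7)

    # The kernel does this product on INT8 tensor cores with INT32 accumulation;
    # here the same integer math is emulated in float for portability.
    acc = x_int8.float() @ w_int4.float()
    return (acc * x_scale * w_scale).to(torch.float16)

# Quick sanity check: the quantized result should be close to the FP16 GEMM.
x = torch.randn(32, 256, dtype=torch.float16)
w = torch.randn(256, 128, dtype=torch.float16)
print((w4a8_reference_gemm(x, w).float() - x.float() @ w.float()).abs().max())
```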

Performance

Here is the speedup over PyTorch FP16 GEMM (which calls CUTLASS) for all GEMMs across different numbers of input tokens. The weight matrix size is (N=8192, K=21760). You can reproduce the benchmark results with bench_w4a8.py in my repo.
[Figure gemm_performance: speedup over PyTorch FP16 GEMM across different numbers of input tokens]
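
For reference, the FP16 baseline side of such a benchmark can be sketched with torch.utils.benchmark as below. This is a rough illustration only, not the actual bench_w4a8.py; it assumes a CUDA device and omits the W4A8 call, whose Python binding name depends on the integration.

```python
import torch
from torch.utils.benchmark import Timer

# Shapes taken from the benchmark description above.
N, K = 8192, 21760
weight = torch.randn(K, N, dtype=torch.float16, device="cuda")

for m in (1, 16, 32, 64, 128, 256, 512, 1024):  # number of input tokens
    x = torch.randn(m, K, dtype=torch.float16, device="cuda")
    fp16_us = Timer(
        "torch.matmul(x, weight)", globals={"x": x, "weight": weight}
    ).blocked_autorange().median * 1e6
    # The W4A8 side would time the fused Marlin QQQ op here and report
    # speedup = fp16_time / w4a8_time; the exact op name depends on the binding.
    print(f"M={m:5d}  FP16 GEMM: {fp16_us:.1f} us")
```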


pytorch-bot bot commented Oct 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1113

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 2690ff4 with merge base 39f16f4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Oct 18, 2024
@drisspg
Contributor

drisspg commented Oct 18, 2024

can we do some comparisons between this and #880?

@jerryzh168 jerryzh168 requested a review from msaroufim October 21, 2024 18:34
@jerryzh168
Contributor

jerryzh168 commented Oct 21, 2024

Thanks, this looks pretty good. Can you add the benchmark code in https://github.com/pytorch/ao/tree/main/benchmarks as well?

And what GPU are the kernels benchmarked on?

@HandH1998
Contributor Author

@drisspg @jerryzh168 I am working on the benchmark and will provide the comparison with #880. The kernel is benchmarked on an A100-80GB GPU and works on SM >= 8.0.

@HandH1998
Contributor Author

HandH1998 commented Oct 25, 2024

@jerryzh168 @drisspg @msaroufim
I have made the following modifications (the code changes refer to #621 and #880):

  1. Added benchmark code for marlin_qqq_w4a8 GEMM in benchmarks/benchmark_marlin_qqq.py
  2. Summarized the main differences between marlin_qqq_w4a8 GEMM and marlin_w4a16 GEMM in torchao/quantization/marlin_qqq/README.md
  3. Supported marlin_qqq in torchao/quantization/quant_api.py (a usage sketch follows this list)
  4. Added some unit tests in test/quantization/test_marlin_qqq.py
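
As a rough usage illustration of item 3 above (hedged: the exact entry point and keyword names added in quant_api.py may differ from this sketch, and MarlinQQQLayout is assumed to be the layout exposed by this PR), quantizing a model to the W4A8 Marlin QQQ format could look roughly like this:

```python
import torch
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight
# Assumed export from this PR; the actual layout class name/location may differ.
from torchao.dtypes import MarlinQQQLayout

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096, bias=False)).half().cuda()

# W4A8: dynamic per-token INT8 activations plus symmetric INT4 weights with
# group_size=128, packed into the Marlin QQQ layout so the fused kernel runs.
# Additional kwargs (e.g. symmetric mapping types) may be required in practice.
quantize_(
    model,
    int8_dynamic_activation_int4_weight(group_size=128, layout=MarlinQQQLayout()),
)

with torch.no_grad():
    y = model(torch.randn(16, 4096, dtype=torch.float16, device="cuda"))
```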

w4a8-cutlass is great work. In comparison, marlin_qqq_w4a8 supports weight per-group quantization in addition to weight per-channel quantization (a sketch of the per-group scheme follows). However, marlin_qqq_w4a8 does have some limitations: it only supports symmetric quantization, and the output dtype can only be torch.float16.
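
To make the per-channel vs. per-group distinction concrete, here is a small illustrative sketch of symmetric INT4 weight quantization with one scale shared per group of input channels. This is not the QQQ packing itself (which also reorders and packs the INT4 values for the kernel); the shapes and names are made up for the example.

```python
import torch

def quantize_weight_int4_symmetric(w: torch.Tensor, group_size: int = 128):
    """Symmetric INT4 quantization of a (K, N) weight.

    group_size=-1  -> per-channel: one scale per output column.
    group_size=128 -> per-group: one scale per 128 input channels per column.
    """
    K, N = w.shape
    g = K if group_size == -1 else group_size
    grouped = w.reshape(K // g, g, N)
    scales = grouped.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / 7.0
    w_int4 = torch.clamp(torch.round(grouped / scales), -8, 7).to(torch.int8)
    return w_int4.reshape(K, N), scales.reshape(K // g, N)

w = torch.randn(1024, 512, dtype=torch.float16)
q_channel, s_channel = quantize_weight_int4_symmetric(w, group_size=-1)   # scales: (1, 512)
q_group, s_group = quantize_weight_int4_symmetric(w, group_size=128)      # scales: (8, 512)
```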

In addition, we also provide the performance of torchao/_models/llama/generate.py below. -g128 means weight per-group quantization with a group size of 128.

| -q parameter | Precision | Average tokens/sec | Average Bandwidth (GB/s) | Peak Memory Usage (GB) | Model Size (GB) |
| --- | --- | --- | --- | --- | --- |
| --compile | fp16 | 112.45 | 1486.00 | 13.93 | 13.21 |
| -q marlin_qqq --compile | w4a8 | 197.45 | 653.50 | 4.79 | 3.31 |
| -q marlin_qqq --compile | w4a8-g128 | 187.62 | 640.32 | 4.82 | 3.41 |

@HandH1998
Contributor Author

@jerryzh168 @msaroufim @drisspg I have resolved the conflicts. Looking forward to your further review.

@drisspg
Contributor

drisspg commented Nov 5, 2024

Hey @HandH1998, re-kicking off the internal CI/CD that failed here.

Overall this work looks very good; I will do a more thorough review.

@drisspg
Contributor

drisspg commented Nov 5, 2024

It seems that you also applied formatting to a lot of files. This makes it pretty hard to review since all the changes get mixed together.

I opened #1226

Would you mind opening a separate PR for some of the other files you touched? And let's first add the formatting and then we can merge in your changes and make it easier to review.

@HandH1998
Contributor Author

@drisspg I have removed the extra formatting in this PR, which should now simplify the review process.

@alexsamardzic
Collaborator

I've added a benchmarking script to #880 that makes it possible to compare the performance of the two W4A8 kernels. As the CUTLASS-based version doesn't support group quantization, at the moment the comparison is only possible with group_size=-1 in the Marlin-based version. The Marlin-based version performs clearly better for input sizes below 256, while the CUTLASS-based version is faster for input sizes of 256 and above. Consequently, the Marlin-based version performs better on the Llama generator too, with tokens/sec about 25% higher than the CUTLASS-based version (note that in both cases I ran the generator as python generate.py --compile --precision torch.float16 -q ...). Please note that the comparison is not completely apples-to-apples: besides group quantization support, there are other small differences between the kernels. Still, this seems to be pretty much the current state of affairs regarding performance.

Let me return the compliment by saying that this Marlin-based kernel is great work too. In particular, it clearly shows me where the CUTLASS-based kernel should be improved.

@drisspg
Contributor

drisspg commented Nov 7, 2024

Overall this is looking really good. Would you mind reporting the lib size increase from this PR? I plan to take another pass over the CUDA code tomorrow, and once CI is green this should be good to go :)

@HandH1998
Contributor Author

HandH1998 commented Nov 7, 2024

@drisspg @jerryzh168 Thanks for your reviews. I have resolved most of the issues according to your advice. The lib size increase is about 5 MB. Let's move forward :)

@drisspg added the inference and enhancement (New feature or request) labels on Nov 7, 2024
@drisspg
Contributor

drisspg commented Nov 7, 2024

For provenance: the lib size increases from 3.5 MB to 5.0 MB. I think this is acceptable.

@drisspg
Contributor

drisspg commented Nov 8, 2024

@HandH1998 It looks like there is one true failure on your PR. Would you mind fixing it so we can land?

@HandH1998
Contributor Author

> @HandH1998 It looks like there is one true failure on your PR. Would you mind fixing it so we can land?

I will try to fix it soon.

@jerryzh168
Contributor

jerryzh168 commented Nov 13, 2024

Can you add a table similar to https://github.com/pytorch/ao/tree/main/torchao/quantization#sparse-marlin to the README to show the performance? Otherwise this looks good to me.

@msaroufim added the topic: new feature label (use this tag if this PR adds a new feature) on Nov 13, 2024
@msaroufim msaroufim self-requested a review November 13, 2024 02:13
Member

@msaroufim msaroufim left a comment

Just need to fix lint.

```python
quantize_(model, int4_weight_only(group_size=groupsize))
if "marlin" in quantization:
    # NOTE(HandH1998): `marlin_qqq` should be put before `marlin` to avoid going to the wrong branch
```
Member


Note to self: this is the real code and it seems reasonable; the rest is linting changes.
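
For context on the NOTE in the snippet above: the quantization option string is matched by substring, and "marlin_qqq" also contains "marlin", so the more specific check has to come first. A hypothetical sketch of why the ordering matters (names here are illustrative, not the exact generate.py code):

```python
# Illustrative only: with substring matching, the marlin_qqq check must
# precede the generic marlin check, or the W4A8 path would never be taken.
quantization = "marlin_qqq"

if "marlin_qqq" in quantization:
    print("dispatch to the W4A8 Marlin QQQ path")
elif "marlin" in quantization:
    print("dispatch to the other Marlin paths (e.g. W4A16 sparse Marlin)")
```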

@HandH1998
Contributor Author

> Can you add a table similar to https://github.com/pytorch/ao/tree/main/torchao/quantization#sparse-marlin to the README to show the performance? Otherwise this looks good to me.

I have added it.

Contributor

@jerryzh168 jerryzh168 left a comment


thanks!


@msaroufim msaroufim merged commit 06e69f6 into pytorch:main Nov 14, 2024
18 checks passed
@zhyncs

zhyncs commented Nov 14, 2024

@HandH1998 It's so coooooool!

sunjiweiswift pushed a commit to sunjiweiswift/ao that referenced this pull request Nov 25, 2024
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024