
[TPU] Add Load-time W8A16 quantization for TPU Backend #7005

Merged (7 commits into vllm-project:main) on Aug 9, 2024

Conversation

lsy323 (Contributor) commented Jul 31, 2024

Add load-time W8A16 quantization for the TPU backend. The workflow is similar to the existing load-time fp8 quantization. Opening the PR to facilitate the discussion process.

  • Added a new quantization type, tpu_int8, for load-time int8 weight-only quantization on the TPU backend (e.g. LLM(model="google/gemma-2b", quantization="tpu_int8")); a usage sketch follows this list.
  • Added TPUInt8LinearMethod, which quantizes bfloat16 weights to int8 for linear layers and calls the TPU quantized ops in the forward pass.
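A minimal usage sketch of the example above, assuming vLLM is installed on a TPU host with the torch_xla backend available; the prompt and sampling settings are illustrative:

```python
from vllm import LLM, SamplingParams

# Load-time W8A16: weights are quantized to int8 when the model is loaded,
# while activations stay in bfloat16.
llm = LLM(model="google/gemma-2b", quantization="tpu_int8")

params = SamplingParams(temperature=0.0, max_tokens=16)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```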


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

WoosukKwon added the tpu (Related to Google TPUs) label on Jul 31, 2024
mgoin (Collaborator) left a comment


Interesting, this looks reasonable to me. The important note is to lazily import the torch xla function when needed, rather than at the top of the quant file.
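A rough sketch of the lazy-import pattern being suggested (hypothetical class and method names, not the code in this PR; the exact torch_xla module that registers the op is also an assumption):

```python
import torch

class TPUInt8LinearSketch:
    """Illustrative only; mirrors the idea of TPUInt8LinearMethod."""

    def apply(self, weight: torch.Tensor, scale: torch.Tensor,
              x: torch.Tensor) -> torch.Tensor:
        # Lazy import: pulled in only when the op is actually needed, so
        # importing the quantization config module does not require torch_xla.
        import torch_xla.experimental.xla_quantized_matmul  # noqa: F401 (assumed path)
        return torch.ops.xla.quantized_matmul(x, weight, scale)
```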

Review threads: examples/offline_inference_tpu.py, vllm/model_executor/layers/quantization/tpu_int8.py
robertgshaw2-neuralmagic (Collaborator)

Super cool!

As a follow-up, we can work on hooking this up to some of the existing checkpoints we have, in addition to in-place quantization.

robertgshaw2-neuralmagic (Collaborator)

By chance, what schemes does the following support:

  • torch.ops.xla.quantized_matmul(x, weight, scale)

Channelwise?
Activations?

lsy323 changed the title from "Add Load-time W8A16 quantization for TPU Backend" to "[TPU] Add Load-time W8A16 quantization for TPU Backend" on Aug 1, 2024
lsy323 force-pushed the lsiyuan/quant branch 2 times, most recently from f4b8dd7 to b1a04b3, on August 2, 2024 at 21:37
lsy323 requested a review from mgoin on August 2, 2024 at 23:47
lsy323 (Contributor, Author) commented Aug 2, 2024

Hi @mgoin, @robertgshaw2-neuralmagic,

Thank you for reviewing my PR! Excited to work with you to enable quantization for the TPU backend through compressed-tensors!

By chance, what schemes does the following support:

  • torch.ops.xla.quantized_matmul(x, weight, scale)

Channelwise? Activations?

We have the quantized ops (equivalent to the quantized CUDA kernels in vLLM, but for TPU) in PyTorch/XLA here. The quantized matmul kernel is registered as a torch op and is compatible with torch.compile. It can be configured to support per-channel or blockwise quantization, and both int8 and int4 are supported (int4 is not yet optimized). The quantized matmul support matrix is here.
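For context, a minimal sketch of what per-channel symmetric int8 weight quantization looks like in plain PyTorch (illustrative only; the actual scheme used by the torch_xla kernel may differ):

```python
import torch

def quantize_weight_per_channel_int8(w: torch.Tensor):
    # Symmetric per-output-channel int8 quantization of a [out, in] weight:
    # one scale per output channel, scale = max|w| along the input dim / 127.
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    w_int8 = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return w_int8, scale.to(w.dtype)

# Reference dequantized matmul: y ≈ x @ (w_int8.to(x.dtype) * scale).T
```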

robertgshaw2-neuralmagic (Collaborator)

Hi @mgoin, @robertgshaw2-neuralmagic,

Thank you for reviewing my PR! Excited to work with you to enable quantization for the TPU backend through compressed-tensors!

By chance, what schemes does the following support:

  • torch.ops.xla.quantized_matmul(x, weight, scale)

Channelwise? Activations?

We have the quantized ops (equivalent to the quantized CUDA kernels in vLLM, but for TPU) in PyTorch/XLA here. The quantized matmul kernel is registered as a torch op and is compatible with torch.compile. It can be configured to support per-channel or blockwise quantization, and both int8 and int4 are supported (int4 is not yet optimized). The quantized matmul support matrix is here.

This is so awesome!!!! Running the same compressed models on various hardware backends is going to be an awesome feature

WoosukKwon (Collaborator) left a comment


@lsy323 Thanks for the PR! I'm really looking forward to using this feature!

I think we have two things to figure out on this PR:

  1. Where to put this quantization config and the linear method? Do we want to put this as a new quantization config (like in the current PR) or in compressed-tensors?
  2. IIRC, this currently does not support the case where the BF16 weights exceed the TPU's HBM size but the INT8 weights would fit (e.g., Llama 8B on TPU v5e, which has 16 GB of HBM). Could you please remind us why this isn't supported?

Review threads: vllm/model_executor/layers/quantization/tpu_int8.py (4), vllm/model_executor/model_loader/loader.py
miladm commented Aug 5, 2024

TorchXLA:TPU FP8 support is WIP (partially supported). @lsy323 do we have an outlined plan somewhere that extends this effort to FP8?

lsy323 (Contributor, Author) commented Aug 5, 2024

  1. Where to put this quantization config and the linear method? Do we want to put this as a new quantization config (like in the current PR) or in compressed-tensors?

I think for the compressed-tensors config, we assume all checkpoints are in compressed-tensors format. This load-time quantization doesn't seem to belong to that flow, so keeping it in a separate file looks cleaner.

  1. IIRC, this currently does not support the case where the BF16 weights exceed the TPU's HBM size but the INT8 weights would fit (e.g., Llama 8B on TPU v5e, which has 16 GB of HBM). Could you please remind us why this isn't supported?

Sure, the current flow is:

  1. Move the bfloat16 weights from host to TPU.
  2. Quantize the bfloat16 weights to int8 on device.

When the BF16 weights exceed the TPU's HBM size, step 1 hits OOM. To avoid this, we could delay the weight transfer when load-time quantization is enabled.
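A schematic of the two orderings (a sketch with a simplified quantizer, not vLLM's loader code):

```python
import torch

def _quantize_int8(w: torch.Tensor):
    # Minimal symmetric per-channel quantizer, for illustration only.
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    return torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8), scale

def load_weight_current(w_bf16: torch.Tensor, device):
    # Current flow: transfer first, quantize second. OOMs if the bf16 copy
    # alone exceeds TPU HBM.
    w_tpu = w_bf16.to(device)                 # step 1: host -> TPU in bf16
    return _quantize_int8(w_tpu)              # step 2: quantize on device

def load_weight_quantize_on_host(w_bf16: torch.Tensor, device):
    # Possible alternative: quantize on the host, then move only the int8
    # weights (half the bytes of bf16) plus a small scale tensor.
    w_int8, scale = _quantize_int8(w_bf16)
    return w_int8.to(device), scale.to(device)
```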

lsy323 (Contributor, Author) commented Aug 6, 2024

TorchXLA:TPU FP8 support is WIP (partially supported). @lsy323 do we have an outlined plan somewhere that extends this effort to FP8?

I don't have a concrete plan for this yet; the alternatives are as follows:

  1. Reuse fp8.py, which works for the CUDA workflow (supports both load-time fp8 quantization and offline-quantized fp8 checkpoints using TensorRT-LLM; ref).
  2. Add another quantization config for fp8 on TPU (e.g. fp8_tpu).
  3. Extend compressed_tensors to support the TPU backend (this would support fp8 checkpoints in compressed_tensors format).

robertgshaw2-neuralmagic (Collaborator) commented Aug 6, 2024

TorchXLA:TPU FP8 support is WIP (partially supported). @lsy323 do we have an outlined plan somewhere that extends this effort to FP8?

I don't have a concrete plan for this yet; the alternatives are as follows:

  1. Reuse fp8.py, which works for the CUDA workflow (supports both load-time fp8 quantization and offline-quantized fp8 checkpoints using TensorRT-LLM; ref).
  2. Add another quantization config for fp8 on TPU (e.g. fp8_tpu).
  3. Extend compressed_tensors to support the TPU backend (this would support fp8 checkpoints in compressed_tensors format).

Hey guys - there are a couple considerations here. For vLLM, we want to support both cases:

  • In-place quantization
  • Pre-quantized checkpoints

We will be making all go-forward checkpoints inside the compressed-tensors integration for mixed precision, integer activation quantization, and floating-point activation quantization, so I think we should focus on this pathway.

Both fp8.py and compressed-tensors share the same backend code (you can see they both use apply_fp8_linear). We factored this utility out so that the kernel calls are shared by the various integrations. If you add the TPU calls to this function, you should get all of these integrations "for free".
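A purely illustrative sketch of the shared-utility idea (hypothetical helper and signature; apply_fp8_linear's real interface is not reproduced here): both the load-time config and the compressed-tensors config would call one function, so a TPU branch added there covers every integration routed through it.

```python
import torch

def apply_w8_linear(x, weight, scale, bias=None, *, use_tpu=False):
    # Hypothetical shared helper in the spirit of apply_fp8_linear: every
    # quantization integration calls this one function, so a TPU branch here
    # is picked up by all of them "for free".
    if use_tpu:
        # Assumed registration path for torch.ops.xla.quantized_matmul.
        import torch_xla.experimental.xla_quantized_matmul  # noqa: F401
        out = torch.ops.xla.quantized_matmul(x, weight, scale)
    else:
        # Stand-in for the CUDA scaled-matmul kernel call used on GPUs.
        out = x @ (weight.to(x.dtype) * scale).t()
    return out if bias is None else out + bias
```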

lsy323 (Contributor, Author) commented Aug 7, 2024

  1. IIRC, this currently does not support the case where the BF16 weights exceed the TPU's HBM size but the INT8 weights would fit (e.g., Llama 8B on TPU v5e, which has 16 GB of HBM). Could you please remind us why this isn't supported?

Hi @WoosukKwon, I looked into this in detail; it doesn't look like a straightforward change, so I think we can consider supporting it in a separate PR.

In the current flow, the weights are moved to the device as the model is initialized (ref), and load-time quantization is then done on device (ref). We would need to introduce a new flow to support this case.

WoosukKwon (Collaborator)

@lsy323 Seems like my previous comment was not addressed for some reason. Can you please check it again?

lsy323 and others added 3 commits August 8, 2024 10:09
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
lsy323 (Contributor, Author) commented Aug 8, 2024

@lsy323 Seems like my previous comment was not addressed for some reason. Can you please check it again?

@WoosukKwon Somehow I force-pushed without the suggested-change commits. It should be fixed now. Thank you for the reminder!

WoosukKwon (Collaborator) left a comment


LGTM! Thanks for the PR! It works well on my machine 🎉 🎉

Looking forward to the next step! (adding INT8 activation quantization in tpu-int8).

WoosukKwon merged commit 0fa1490 into vllm-project:main on Aug 9, 2024
27 checks passed
lsy323 deleted the lsiyuan/quant branch on August 9, 2024 at 21:09
sfc-gh-mkeralapura pushed a commit to sfc-gh-mkeralapura/vllm that referenced this pull request Aug 12, 2024
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Aug 22, 2024
Labels: tpu (Related to Google TPUs)
5 participants