
[FEAT] Add custom CUDA tinygemm unpacker #415

Merged: 21 commits merged into pytorch:main on Jul 4, 2024

Conversation

@jeromeku (Collaborator) commented Jun 21, 2024

Description

Adds CUDA custom ops to unpack weights that have been packed with torch.ops.aten._convert_weight_to_int4pack for use with torch.ops.aten._weight_int4pack_mm.

Currently there is only a packing function that permutes and prepacks the weights in tensor-core format. However, there is no equivalent unpacking function that reorders the weights back to the original logical layout.

The implementation is an adaptation of the original packing code (int4mm.cu), with modifications to simplify the indexing logic and to fuse unpacking and dequantization.
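For illustration, here is a toy Python sketch of what "unpacking to the logical layout" means at the bit level (two uint4 values per byte); it deliberately ignores the tensor-core tile permutation that the real kernel has to undo, so it is an illustration rather than the PR's implementation:

```python
import torch

def pack_uint4_pairs(w: torch.Tensor) -> torch.Tensor:
    # w: [N, K] integer tensor with values in [0, 15]; pack pairs along K -> [N, K // 2] uint8
    return ((w[:, ::2] << 4) | w[:, 1::2]).to(torch.uint8)

def unpack_uint4_pairs(packed: torch.Tensor) -> torch.Tensor:
    # inverse of the above: [N, K // 2] uint8 -> [N, K] int32 in the original logical layout
    hi = (packed >> 4).to(torch.int32)
    lo = (packed & 0xF).to(torch.int32)
    return torch.stack([hi, lo], dim=-1).reshape(packed.size(0), -1)

w = torch.randint(0, 16, (4, 8), dtype=torch.int32)
assert torch.equal(unpack_uint4_pairs(pack_uint4_pairs(w)), w)
```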

Motivation

Fast unpacking of packed weights is needed when switching quantized gemm backends during inference.

As workloads transition from memory-bound to compute-bound (e.g., as context length grows during decoding), users might wish to switch to a different kernel implementation that is more performant in this regime than tinygemm.

In order to do this, the weights need to be unpacked from the packed format. The alternative would be to store two copies of the weights -- one packed, one in the logical format -- but this is clearly not ideal given the added memory burden.

Features

Add 2 custom CUDA ops, registered per the instructions in the torchao custom op documentation (see the usage sketch after this list):

  1. torchao.ops.unpack_int4 - unpacks the packed weight to the original N x K logical layout with dtype torch.int. Can be used within TensorCoreTiledAQTLayout.get_plain to recover original layout of the (quantized) tensor.
  2. torchao.ops.dequantize_int4 - dequantizes the packed weight to bfloat16 tensor with original N x K logical layout. This is useful for developers who want to unpack and dequantize the packed weight when switching quantized matmul backends on the fly.
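A hypothetical usage sketch of the two ops (op names as in this description; the exact signatures, and whether torch.ops.aten._convert_weight_to_int4pack accepts the int32 quantized weight directly, vary by torch/torchao version, so treat this as shape plumbing only):

```python
import torch
import torchao.ops  # loads the torchao custom ops

N, K, group_size, inner_k_tiles = 256, 256, 64, 8
w_int4 = torch.randint(0, 16, (N, K), dtype=torch.int32, device="cuda")
scales_and_zeros = torch.randn(K // group_size, N, 2, dtype=torch.bfloat16, device="cuda")

# pack into the tensor-core tiled format expected by torch.ops.aten._weight_int4pack_mm
packed_w = torch.ops.aten._convert_weight_to_int4pack(w_int4, inner_k_tiles)

# recover the original N x K int tensor (e.g. inside TensorCoreTiledAQTLayout.get_plain)
w_unpacked = torchao.ops.unpack_int4(packed_w, inner_k_tiles)

# or unpack + dequantize to bfloat16 in one step when switching matmul backends on the fly
w_bf16 = torchao.ops.dequantize_int4(packed_w, scales_and_zeros, group_size, inner_k_tiles)
```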

Tests

Tests have been added to test/test_ops.py, covering both correctness and correct custom op registration.

Note: the opcheck test test_aot_dispatch_dynamic is currently failing; investigating.

TODO

  • Fuse dequant into unpacking kernel
    • Kernel works against a reference implementation per my understanding of dequantization but needs further verification (see notes in test/test_ops.py:test_dequant_int4_correctness)
  • Implement dequantize for ZeroPointDomain.Float per update
  • Debug test_aot_dispatch_dynamic opcheck failure
  • Integrate with AQT get_plain

@msaroufim

pytorch-bot bot commented Jun 21, 2024

🔗 Helpful Links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/415 (links to docs will display an error until the docs builds have completed).

❌ 1 New Failure as of commit e90e280 with merge base e5548b7. This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jun 21, 2024
@jerryzh168 (Contributor):

Great, after this is landed we can replace this workaround code:

  def get_plain(self):

with the op, I think.

@jeromeku (Collaborator, Author) commented Jun 22, 2024

@jerryzh168

Took a look at get_plain for int4 AQT type: will revise the kernel so that it more closely aligns with the API of get_plain.

Are there tests for get_plain that I can adapt to verify my implementation?

Currently working on fusing dequant into the unpacking kernel; however, a simple sanity check using the same logic as get_plain is failing.

That is, I'm using tinygemm to dequantize by passing in an identity matrix as operand a and packed weights, scales, and zeros, respectively. Comparing against a reference dequantize method that does (q - zero) * scale on the unpacked weights, scales, and zeros does not check out. Am I misinterpreting the scales and zeros?
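For reference, a rough reconstruction of that sanity check (my own sketch; it assumes torch.ops.aten._weight_int4pack_mm takes (activation, packed_weight, group_size, scales_and_zeros) and elides how the per-group scales and zeros broadcast against the unpacked weights):

```python
import torch

def dequant_via_identity(packed_w, scales_and_zeros, group_size, K):
    # let tinygemm itself do the dequant by multiplying the packed weight with an identity matrix
    eye = torch.eye(K, dtype=torch.bfloat16, device=packed_w.device)
    # eye @ W.t() is [K, N]; transpose back to the logical [N, K] layout
    return torch.ops.aten._weight_int4pack_mm(eye, packed_w, group_size, scales_and_zeros).t()

def dequant_reference(q, scales, zeros):
    # the "(q - zero) * scale" reference on the already-unpacked weights
    return (q - zeros) * scales
```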

@jeromeku jeromeku marked this pull request as draft June 22, 2024 16:26
@jerryzh168 (Contributor):

> @jerryzh168
>
> Took a look at get_plain for int4 AQT type: will revise the kernel so that it more closely aligns with the API of get_plain.
>
> Are there tests for get_plain that I can adapt to verify my implementation?
>
> Currently working on fusing dequant into the unpacking kernel; however, a simple sanity check using the same logic as get_plain is failing.
>
> That is, I'm using tinygemm to dequantize by passing in an identity matrix as operand a and packed weights, scales, and zeros, respectively. Comparing against a reference dequantize method that does (q - zero) * scale on the unpacked weights, scales, and zeros does not check out. Am I misinterpreting the scales and zeros?

we don't have get_plain() tests yet, but I'm planning to add some tests for AffineQuantizedTensor in the future

The way that tinygemm dequantize is implemented is a bit different from the normal path; here is how it's implemented:

function call to our primitive ops:

return dequantize_affine(w_int4x8, block_size, scales, zeros, input_dtype, quant_min, quant_max, zero_point_domain=ZeroPointDomain.FLOAT, output_dtype=scales.dtype)

code path:

The main difference is that the zero_point is in the floating-point domain (while the quant/dequant that we are more familiar with is in the integer domain):

integer domain: quantized_val = (float_val / scale) (integer) + zero_point (integer)
float domain: quantized_val = (float_val - (zero_point (float) - scale * mid_point)) / scale
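A paraphrased, self-contained sketch of the two conventions (my own reading of the formulas above, not torchao's quant_primitives; per-group handling is omitted):

```python
import torch

def int_domain_quant_dequant(x, n_bit=4):
    # "traditional" scheme: zero_point lives in the integer domain
    quant_min, quant_max = 0, 2 ** n_bit - 1
    min_val, max_val = x.min(), x.max()
    scale = (max_val - min_val).clamp(min=1e-6) / (quant_max - quant_min)
    zero_point = (quant_min - torch.round(min_val / scale)).clamp(quant_min, quant_max)
    q = torch.clamp(torch.round(x / scale) + zero_point, quant_min, quant_max)
    return (q - zero_point) * scale              # dequant: subtract, then multiply

def float_domain_quant_dequant(x, n_bit=4):
    # tinygemm-style scheme: zero_point lives in the floating-point domain
    quant_min, quant_max = 0, 2 ** n_bit - 1
    mid_point = 2 ** (n_bit - 1)
    min_val, max_val = x.min(), x.max()
    scale = (max_val - min_val).clamp(min=1e-6) / (quant_max - quant_min)
    zero = min_val + scale * mid_point           # the float-domain "zero" absorbs the mid-point
    q = torch.clamp(torch.round((x - min_val) / scale), quant_min, quant_max)
    return (q - mid_point) * scale + zero        # dequant: a single fma on (q - mid_point)
```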

@jeromeku (Collaborator, Author) commented Jun 24, 2024

@jerryzh168

Many thanks for the clarification!

This helps explain why the tinygemm kernel is able to use a single fma to dequantize (using an integer zero-point would require a sub then a mul, unless zeros were stored as scales * zeros, which is not the case).

This is good to know, as the original motivation for this PR was to help answer.ai / hqq developers who are using tinygemm as a quantized matmul backend. However, I believe hqq is using the integer zero-point derivation (but keeping the zero-point in the original floating-point dtype), which will produce incorrect results with the tinygemm kernel, since the kernel dequantizes based on the floating-point zero-point calculation.

What is the mathematical derivation of the float dequantization method vs the more common integer quantization scheme? Are there any papers / blogs that explain the reasoning for this difference?

@jerryzh168 (Contributor):

> This helps explain why the tinygemm kernel is able to use a single fma to dequantize (using an integer zero-point would require a sub then a mul, unless zeros were stored as scales * zeros, which is not the case).

Yes, the motivation for tinygemm to have the zero_point in the floating-point domain is exactly to be able to use a single fma (I talked to Jeff about this).

> This is good to know, as the original motivation for this PR was to help answer.ai / hqq developers who are using tinygemm as a quantized matmul backend. However, I believe hqq is using the integer zero-point derivation (but keeping the zero-point in the original floating-point dtype), which will produce incorrect results with the tinygemm kernel, since the kernel dequantizes based on the floating-point zero-point calculation.

For hqq, yeah, we need to make sure this detail is correct since they are using tinygemm kernels. cc @HDCharles @mobicham

> What is the mathematical derivation of the float dequantization method vs the more common integer quantization scheme? Are there any papers / blogs that explain the reasoning for this difference?

I'm not aware of any formal papers or blogs. The differences show up in our quant_primitives ops via these two flags:

preserve_zero (bool): a flag to indicate whether we need zero to be exactly representable or not. This is typically required for ops that need zero padding, like convolution; it's less important for ops that don't have zero padding in the op itself, like linear. For example, given a floating point Tensor [1.2, 0.1, 3.0, 4.0, 0.4, 0], if `preserve_zero` is True, we'll make sure there is an integer value corresponding to the floating point 0, e.g. [-3, -8, 3, 7, -7, -8]; 0 will be mapped to `-8` without loss. But if `preserve_zero` is not True, there is no such guarantee. If we don't need zero to be exactly representable, we won't do rounding and clamping for zero_point.

zero_point_domain (ZeroPointDomain): the domain that zero_point is in; should be either integer or float. If zero_point is in the integer domain, the zero point is added to the quantized integer value during quantization. If zero_point is in the floating point domain, the zero point is subtracted from the floating point (unquantized) value during quantization. The default is ZeroPointDomain.INT.

traditional integer quantization:

  1. preserve_zero is True: this is because traditionally we use quantization on conv, which can have zero padding, so there is a domain-specific requirement that the floating point zero be exactly representable: https://github.com/google/gemmlowp/blob/master/doc/quantization.md#domain-specific-constraint-the-real-value-0-must-be-exactly-representable

  2. zero_point is in the integer domain
    This is probably for static quantization, where there is hardware that only supports integer compute

tinygemm:

  1. preserve_zero is False because we mainly care about linear; this also helps improve accuracy in some cases, since we don't always need to include zero during quantization
  2. zero_point is in the floating-point domain
    this is because of the fma, I think

@jeromeku (Collaborator, Author):

@jerryzh168

Thanks!

Looking at gpt-fast quantization code here, scales / zeros are calculated as:

  scales = (max_val - min_val).clamp(min=1e-6) / max_int
  zeros = min_val + scales * (2 ** (n_bit - 1)) 

where 2 ** (n_bit - 1) is the mid-point. Hence the mid-point is fused into the zeros.

Then in the tinygemm kernel, dequantization is performed as q * scale + zero, which is close to, but does not exactly match, the ZeroPointDomain.FLOAT dequant. Can you elaborate on what zero_point (float) refers to here?

Comment on lines 265 to 266
m.impl("torchao::unpack_int4_to_int", &_unpack_int4_to_int);
m.impl("torchao::dequantize_int4", &_dequantize_int4);
Contributor:

nit: I feel we probably need to mention tensor_core_tiled layout in the name of these ops if these are specific to that packing format

@jerryzh168 (Contributor):

> @jerryzh168
>
> Thanks!
>
> Looking at gpt-fast quantization code here, scales / zeros are calculated as:
>
>   scales = (max_val - min_val).clamp(min=1e-6) / max_int
>   zeros = min_val + scales * (2 ** (n_bit - 1))
>
> where 2 ** (n_bit - 1) is the mid-point. Hence the mid-point is fused into the zeros.
>
> Then in the tinygemm kernel, dequantization is performed as q * scale + zero, which is close to, but does not exactly match, the ZeroPointDomain.FLOAT dequant. Can you elaborate on what zero_point (float) refers to here?

Sorry, I just wrote down the quantize function there, not the dequant function; I should probably add all the algorithms (choose_qparams, quant, dequant) there. The dequant we are using is here:

assert zero_point_domain == ZeroPointDomain.FLOAT, f"Unexpected zero point domain: {zero_point_domain}"
mid_point = (quant_max + quant_min + 1) / 2
# This should allocate new memory and avoid input modification
dequant = input - mid_point
dequant = dequant.to(output_dtype)
dequant *= scale
if zero_point is not None:
    dequant += zero_point

@jeromeku (Collaborator, Author):

@jerryzh168
Sorry for persisting on this matter, but there is still a gap in my understanding:

If we unpack what gpt-fast and tinygemm are doing:

  scales = (max_val - min_val).clamp(min=1e-6) / max_int
  zeros = min_val + scales * (2 ** (n_bit - 1)) = min_val + scales * mid_point

Then in tinygemm, dequantization is calculated as:

x = q * scales + zeros 
   =  q * scales + min_val + scales * mid_point

where x is the dequantized value and q is the quantized value.

Comparing this to torchao dequant per your link:

 x = (q - mid_point) * scales + zeros 
    = q * scales - scales * mid_point + zeros

Assuming zero_point is calculated per gpt_fast:

x = q * scales - scales * mid_point + min_val + scales * mid_point
   = q * scales + min_val

How do we reconcile these differences? How are zeros expected to be calculated in torchao?

@jerryzh168 (Contributor):

> @jerryzh168 Sorry for persisting on this matter, but there is still a gap in my understanding:
>
> If we unpack what gpt-fast and tinygemm are doing:
>
>   scales = (max_val - min_val).clamp(min=1e-6) / max_int
>   zeros = min_val + scales * (2 ** (n_bit - 1)) = min_val + scales * mid_point
>
> Then in tinygemm, dequantization is calculated as:
>
>   x = q * scales + zeros
>     = q * scales + min_val + scales * mid_point
>
> where x is the dequantized value and q is the quantized value.
>
> Comparing this to the torchao dequant per your link:
>
>   x = (q - mid_point) * scales + zeros
>     = q * scales - scales * mid_point + zeros
>
> Assuming zero_point is calculated per gpt_fast:
>
>   x = q * scales - scales * mid_point + min_val + scales * mid_point
>     = q * scales + min_val
>
> How do we reconcile these differences? How are zeros expected to be calculated in torchao?

Yeah, no problem. I think the main difference, as you listed, is this part:

> Then in `tinygemm`, dequantization is calculated as:
> 
> ```python
> x = q * scales + zeros 
>    =  q * scales + min_val + scales * mid_point
> ```

I feel the q here should probably be (q - mid_point)

I'm not very familiar with the tinygemm kernel implementation itself, but I think this should be accounted for either by pre-processing q or by post-processing the results after that dequant op.

Also, for some additional context: the current quant primitives in torchao are adapted from the original gpt-fast/tinygemm choose_qparams/quantize/dequantize implementations, and we have regression tests to make sure they match:

# Legacy tinygemm ops

@jerryzh168 (Contributor):

maybe related to https://github.com/pytorch/pytorch/blob/93a33bf3ac0b4c9560b49780eabcad2f76dcf43e/aten/src/ATen/native/cuda/int4mm.cu#L197

cc @HDCharles do you know how the tinygemm kernel dequant implementation matches up with the Python dequant implementation?

@jeromeku (Collaborator, Author):

@jerryzh168

For the purposes of this PR, then, what should be the expected behavior of dequantize_int4?

That is, given packed weights, scales, zeros, etc., what should be the calculation to dequantize the weights from int4 to bfloat16?

Checking the quant_primitives dequant methodology against calling tinygemm with an identity matrix to dequant gives a small error (~1e-2); see this script.

@jerryzh168 (Contributor):

@jeromeku OK, I just confirmed with Jeff Johnson that this code is actually doing both a uint4 -> int4 conversion ([0, 15] --> [-8, 7]), which is equivalent to (q_val - mid_point) in our dequant code, and also a conversion to bfloat16.

So I feel the dequantize op in this case should follow what we are doing in the dequantize_affine op, at least for int4. I also need to think a bit about this uint4 -> int4 conversion; I feel it should probably be done outside of the quant primitives op.

For tests, what you described makes sense; I think you can do two tests (a rough sketch follows the list):

  1. quant + dequant v.s. (quant + pack + packed_dequant_int4 (your op))
  2. quant + pack + tinygemm_int4_mm with identity matrix v.s. (quant + pack + packed_dequant_int4 (your op))
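A rough pytest-style sketch of those two comparisons (my own, not the PR's actual tests; the op name follows the PR description, and the helper signatures and tolerances are assumptions):

```python
import torch
import torchao.ops
from torchao.quantization.utils import (
    groupwise_affine_quantize_tensor,
    groupwise_affine_dequantize_tensor_from_qparams,
)

def test_unpack_and_dequant_ops(N=256, K=256, group_size=64, inner_k_tiles=8):
    w = torch.randn(N, K, dtype=torch.bfloat16, device="cuda")
    w_q, scales_and_zeros = groupwise_affine_quantize_tensor(w, 4, group_size)
    packed = torch.ops.aten._convert_weight_to_int4pack(w_q, inner_k_tiles)

    # 1. quant + dequant vs. quant + pack + packed dequant op
    ref = groupwise_affine_dequantize_tensor_from_qparams(w_q, scales_and_zeros, 4, group_size)
    out = torchao.ops.dequantize_int4(packed, scales_and_zeros, group_size, inner_k_tiles)
    torch.testing.assert_close(out, ref, atol=1e-2, rtol=1e-2)

    # 2. quant + pack + tinygemm mm with an identity matrix vs. the packed dequant op
    eye = torch.eye(K, dtype=torch.bfloat16, device="cuda")
    id_dequant = torch.ops.aten._weight_int4pack_mm(eye, packed, group_size, scales_and_zeros).t()
    torch.testing.assert_close(out, id_dequant)
```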

@jeromeku (Collaborator, Author):

@jerryzh168

Thanks for the clarification.

Will update the dequant kernel to reflect this change (u4 -> s4 conversion + upcast, followed by scale + shift) and add relevant tests.

It would be good to add some additional documentation explaining the pre-processing / post-processing needed to use quantized weights, scales, and zero-points prepared using "conventional" (ZeroPointDomain.INT) schemes with tinygemm, e.g., for hqq.
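A minimal Python reference of the planned behavior (my paraphrase; the actual CUDA kernel operates on the packed tensor-core-tiled layout, not on already-unpacked values):

```python
import torch

def dequant_u4_reference(q_u4: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor) -> torch.Tensor:
    # q_u4 holds already-unpacked 4-bit values in [0, 15]
    q_s4 = q_u4.to(torch.int32) - 8                 # u4 -> s4: [0, 15] -> [-8, 7], i.e. q - mid_point
    return q_s4.to(torch.bfloat16) * scale + zero   # upcast, then a single scale + shift (fma)
```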

@jerryzh168 (Contributor):

sure thanks, I'll add some docs for quant_primitives in our README

torch._check(scales_and_zeros.size(1) == N, lambda: "scales_and_zeros must have N at dim 1")
torch._check(scales_and_zeros.size(2) == 2, lambda: "scales_and_zeros must have 2 at dim 2")

return torch.empty((N, K), dtype=torch.bfloat16, device=packed_w.device)
Contributor:

Same for this: is this supposed to call dequantize_tensor_core_tiled_layout?

@jeromeku (Collaborator, Author) commented Jul 3, 2024:

I thought this was the expected pattern for registering a custom op? I was following the example of the pre-existing fp6_linear custom op already in ops.py.

Previously one would register an abstract impl for composability with torch.compile. I thought this was the expected interface with the new custom op registration API: that is, register a "fake" implementation that runs shape checks and just returns an empty tensor with the expected output shape.
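For context, a minimal sketch of that pattern using torch.library.register_fake directly (the PR goes through torchao's register_custom_op wrapper instead; the shape arithmetic below assumes the [N/8, K/(inner_k_tiles*16), 32, inner_k_tiles/2] packed layout):

```python
import torch
from torch import Tensor

@torch.library.register_fake("torchao::dequantize_tensor_core_tiled_layout")
def _(packed_w: Tensor, scales_and_zeros: Tensor, group_size: int, inner_k_tiles: int) -> Tensor:
    N = packed_w.size(0) * 8
    K = packed_w.size(1) * inner_k_tiles * 16
    torch._check(scales_and_zeros.dim() == 3, lambda: "scales_and_zeros must be 3D")
    torch._check(scales_and_zeros.size(1) == N, lambda: "scales_and_zeros must have N at dim 1")
    torch._check(scales_and_zeros.size(2) == 2, lambda: "scales_and_zeros must have 2 at dim 2")
    # the fake impl never touches real data: it only reports the output shape, dtype, and device
    return torch.empty((N, K), dtype=torch.bfloat16, device=packed_w.device)
```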

Contributor:

Oh OK, I think I understand now: register_custom_op is calling register_fake/impl_abstract. I feel we need to rename this util to something more accurate. cc @msaroufim

@jerryzh168 (Contributor) left a comment:

Looks good overall, thanks for working on this @jeromeku! Just had a few nits, requests for additional tests, and questions around the motivation for having two ops.

jerryzh168 added a commit to jerryzh168/ao that referenced this pull request Jul 2, 2024

Summary: att, per request in pytorch#415 (comment)
Test Plan: doc changes
jerryzh168 added a commit to jerryzh168/ao that referenced this pull request Jul 2, 2024

Summary: att, per request in pytorch#415 (comment)
Test Plan: doc changes
msaroufim pushed a commit that referenced this pull request Jul 3, 2024

Summary: att, per request in #415 (comment)
Test Plan: doc changes
@jeromeku (Collaborator, Author) commented Jul 3, 2024

@jerryzh168

Fixed all the above:

  • Renamed innerKTiles -> inner_k_tiles
  • Changed unpack test from testing for closeness to equality
  • Added additional (unpack_tensor_core_tiled_layout_op + dequant) vs. dequantize_tensor_core_tiled_layout_op test
  • Added comments clarifying the logic of the fused dequant kernel tests
    • Since the tinygemm identity-matrix dequant hack and the fused dequant kernel utilize the same underlying fast CUDA numeric conversion path from u4 -> s4 -> bf16, they have identical numerics
    • Both result in ~1e-2 discrepancies when compared with ao groupwise_affine_dequantize
    • These conditions are tested for in the dequantize_tensor_core_layout tests (see the sketch after this list):
      • difference between tinygemm id matrix dequant and dequant_tensor_core_layout is 0
      • the difference between tinygemm id matrix vs. groupwise_affine_dequant is the same as the difference between dequantize_tensor_core_layout and groupwise_affine_dequant.
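A small sketch of the comparison pattern described in the bullets above (my paraphrase of the test logic, with the three dequant outputs passed in as arguments):

```python
import torch

def check_dequant_numerics(op_out, tinygemm_id_out, ao_dequant_out):
    # the fused kernel and the tinygemm identity-matrix hack share the same
    # u4 -> s4 -> bf16 conversion path, so they should agree exactly
    assert torch.equal(op_out, tinygemm_id_out)
    # and both differ from groupwise_affine_dequantize by the same amount (~1e-2)
    diff_op_ao = (op_out - ao_dequant_out).abs().max()
    diff_id_ao = (tinygemm_id_out - ao_dequant_out).abs().max()
    assert diff_op_ao == diff_id_ao
```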


return torch.empty((N, K), dtype=torch.int32, device=packed_w.device)

def dequantize_tensor_core_tiled_layout(packed_w: Tensor, scales_and_zeros: Tensor, group_size: int, inner_k_tiles: int) -> Tensor:
@jerryzh168 (Contributor) commented Jul 3, 2024:

Is this specific to uint4, btw?

Looks like it; maybe we can add uint4 to the name as well in that case, unless this layout makes sense for other dtypes too and we want to extend it to them in the future.

@jerryzh168 (Contributor) left a comment:

LGTM! really appreciate adding this functionality and the thorough comments/testing!

@msaroufim (Member):

Just a minor merge conflict and this should be good to merge

@jeromeku (Collaborator, Author) commented Jul 4, 2024

@jerryzh168 @msaroufim

Getting CI failure unrelated to PR:

 =========================== short test summary info ============================
  FAILED test/integration/test_integration.py::SmoothquantIntegrationTest::test_on_dummy_distilbert - requests.exceptions.ReadTimeout: (ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 72162155-ccbe-44a5-9304-e6e08c061a4e)')
  ====== 1 failed, 202 passed, 556 skipped, 11 warnings in 93.47s (0:01:33) ======
  Error: Process completed with exit code 1.

@msaroufim msaroufim self-requested a review July 4, 2024 01:50
@msaroufim (Member) left a comment:

Cool, thank you for the awesome work @jeromeku, and thank you for the thorough review @jerryzh168.

The CI failure indeed seems unrelated, most likely a flake due to connection issues with HF

@msaroufim msaroufim merged commit 74846da into pytorch:main Jul 4, 2024
12 of 13 checks passed
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
…#469)

Summary: att, per request in pytorch#415 (comment)
Test Plan: doc changes
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
* add unpack cuda

* add tests

* fix tests

* refactor tinygemm unpacking kernel

* add dequant

* add additional dequant check

* update tinygemm dequantize test

* correct dequant kernel logic

* clean up kernel

* update dequantize kernel tests

* rename kernel ops to tensor_core_tiled_layout

* add renamed kernel source

* add back test_aot_dispatch opcheck

* rename innerKTiles to inner_k_tiles

* add unpack and dequant test

* additional numerical checks for unpack then dequant

* rebase test_ops on main

* remove commented out code

* skip dynamic opcheck unless torch>=2.5