Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for AQTLayout, PlainAQTLayout and TensorCoreTiledAQTLayout #278

Merged
merged 3 commits into from
May 29, 2024

Conversation

jerryzh168
Copy link
Contributor

@jerryzh168 jerryzh168 commented May 25, 2024

Summary:
Today AffineQuantizedTensor has hardcoded storage format of int_data, scale, zero_point. But this does not work if we want to support packed weight. In this PR, we added support to hide the storage details for AffineQuantizedTensor in a family of tensor subclasses, all should inherit from the base Layout storage type: AQTLayout (affine quantized tensor storage). This is part of the project of refactoring AffineQuantizedTensor to be a proper clean API for both developers (as a showcase of how people can add new dtypes/quantization techniques) and modeling users.

This PR just added support for

  • base AQTLayout
  • plain layout tensor (PlainAQTLayout) that stores int_data, scale and zero_point tensors directly
  • tensor core tiled storage tensor (TensorCoreTiledAQTLayout) storing packed weight (result of torch.ops.aten._convert_weight_to_int4pack) and the combined scale_and_zero tensor, for now we only have numeric checks, we'll do performance checks on example models later
    this improves performance for current AffineQuantizedTensor because we are doing packing beforehand, instead of on the fly.

AffineQuantizedTensor will have the following:

  • layout_tensor: AQTLayout (can store data of different storage formats)
  • layout: str (a string represents the type of storage_tensor we have, can be used in dispatch) (overwrite the original tensor layout property)

Users can add a new type of layout tensor for a specific layout with:

@register_aqt_layout_cls("my_layout")
class MyAQTLayout(AQTLayout):
    ...

Target Audience:
As mentioned before,
(1). this serves as an example for developers who are developing new quantization techniques/dtypes, and who also need packing / special storage layout for the weight, they can copy paste what we are doing here and apply to their own use cases
(2). for modeling users, packing weight before hand would give a performance boost to the model, but they'll need to specify the specific extended_layout they want to pack into, e.g. tensor_core_tiled

Test Plan:
python test/quantization/test_quant_api.py

Reviewers:

Subscribers:

Tasks:

Tags:

Copy link

pytorch-bot bot commented May 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/278

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4758982 with merge base 62745fc (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 25, 2024
@jerryzh168 jerryzh168 requested a review from cpuhrsch May 25, 2024 05:42
@jerryzh168 jerryzh168 force-pushed the layout branch 3 times, most recently from 4d232b8 to 5dd3200 Compare May 25, 2024 22:51
@jerryzh168 jerryzh168 changed the title Add support for AQTStorage and PlainAQTStorage Add support for AQTStorage, PlainAQTStorage and TiledAQTStorage May 25, 2024
@jerryzh168 jerryzh168 changed the title Add support for AQTStorage, PlainAQTStorage and TiledAQTStorage Add support for AQTStorage, PlainAQTStorage and TensorCoreTiledAQTStorage May 25, 2024
@cpuhrsch cpuhrsch requested a review from msaroufim May 28, 2024 17:09
@cpuhrsch
Copy link
Contributor

Nice! Nit: Let's please call it Layout :)

Summary:
Today `AffineQuantizedTensor` has hardcoded storage format of `int_data`, `scale`, `zero_point`. But this does not work if we want to support
packed weight. In this PR, we added support to hide the storage details for `AffineQuantizedTensor` in a family of tensor subclasses, all
should inherit from the base Storage type: `AQTStorage` (affine quantized tensor storage)

This PR just added support for a plain storage tensor (`PlainAQTStorage`) that stores `int_data`, `scale` and `zero_point` tensors directly,
in the next PR we'll also support storing packed weight (result of `torch.ops.aten._convert_weight_to_int4pack`) in a different
type of `AQTStorage`.

`AffineQuantizedTensor` will have the following:
- storage_tensor: AQTStorage (can store data of different storage formats)
- storage_layout: str (a string represents the type of storage_tensor we have, can be used in dispatch)

Test Plan:
python test/quantization/test_quant_api.py

Reviewers:

Subscribers:

Tasks:

Tags:
@jerryzh168 jerryzh168 changed the title Add support for AQTStorage, PlainAQTStorage and TensorCoreTiledAQTStorage Add support for AQTLayout, PlainAQTLayout and TensorCoreTiledAQTLayout May 29, 2024
@@ -496,7 +496,7 @@ def test_quantized_tensor_subclass_int8_dyn_quant(self):
m = ToyLinearModel(1024, 1024, 1024).eval().to(torch.bfloat16).to("cuda")
m_copy = copy.deepcopy(m)
# setting batch_size to 20 to be compatible with the kernel
example_inputs = tuple(map(lambda x: x.to(torch.bfloat16).to("cuda"), m.example_inputs(batch_size=20)))
example_inputs = m.example_inputs(batch_size=20, dtype=torch.bfloat16, device="cuda")
m = quantize(m, get_apply_int8dyn_quant())

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any tests explicitly testing TensorCoreTiledAQTLayout

self.int_data = int_data
self.scale = scale
self.zero_point = zero_point
self.layout_tensor = layout_tensor
Copy link
Member

@msaroufim msaroufim May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the benefit of group the above 3 configs into the same object but I don't think I'm quite clear on why this needs to be a subclass? Is it because we need to override the behavior of detach to do bit unpacking?

Copy link
Contributor Author

@jerryzh168 jerryzh168 May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need a subclass so that we can support different packing format, like the ones that's used by tensor core

so the ideal abstraction here I feel should be "storage subclass" cc @ezyang @albanD

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a non-standard layout, I don't doubt the need for a subclass (since you need to override the behavior of operations to know how to deal with the layout), but I doubt the need for a wrapper subclass, just do a basic storage and have all the smarts in the tensor. That's how non-contiguous layouts work on dense tensor!

Copy link
Contributor Author

@jerryzh168 jerryzh168 May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezyang IIUC you mean we just put the layout related logic to tensor subclass itself and create different tensor subclasses for different layout?

the problem here is that we still want the same tensor (AffineQuantizedTensor) but can be packed into different formats (different storage)

btw, this is just a use case I want to make sure you are aware, not a feature request since we can workaround with tensor subclass, even though not that "naturally"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the tensor is repacked multiple ways, what kind of lifetime do you have for the packing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh sorry, I meant we still use the same tensor subclass, not the same tensor instance

Copy link
Contributor

@cpuhrsch cpuhrsch May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezyang - the analogy here is changing from strided to COO to BSR to strided etc.

Essentially we want to introduce new layouts for this Tensor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also be useful for https://github.com/pytorch/ao/blob/4c1d568263582db3e48299d2eba78d8ad1f6274d/torchao/dtypes/uint4.py to implement different bitpacking formats (different layouts).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many ways to skin this cat. But the way I would think about it is that if you want a bunch of different behaviors to live in a single class (no extra subclass hierarchy), well, you can't use the class dispatch mechanism to dispatch them. So you need your own dispatch mechanism. Luckily, in Python, that's pretty easy to build.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agreed. Coming up with a good way to skin this cat could help us make it a generic recipe for PyTorch users. @jerryzh168 and I have been chatting a bit about it.

@@ -348,7 +348,7 @@ def apply_int4wo_quant(weight):
preserve_zero = False
zero_point_dtype = torch.bfloat16
zero_point_domain = ZeroPointDomain.FLOAT
return to_aq(weight, mapping_type, block_size, target_dtype, quant_min, quant_max, eps, zero_point_dtype=zero_point_dtype, preserve_zero=preserve_zero, zero_point_domain=zero_point_domain)
return to_aq(weight, mapping_type, block_size, target_dtype, quant_min, quant_max, eps, zero_point_dtype=zero_point_dtype, preserve_zero=preserve_zero, zero_point_domain=zero_point_domain, extended_layout="tensor_core_tiled")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msaroufim "tensor_core_tiled" is tested here

@jerryzh168 jerryzh168 merged commit 374fec4 into pytorch:main May 29, 2024
13 checks passed
@jerryzh168 jerryzh168 deleted the layout branch May 29, 2024 21:31
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
…Layout` (pytorch#278)

Summary:
Today `AffineQuantizedTensor` has hardcoded storage format of `int_data`, `scale`, `zero_point`. But this does not work if we want to support
packed weight. In this PR, we added support to hide the storage details for `AffineQuantizedTensor` in a family of tensor subclasses, all
should inherit from the base Storage type: `AQTStorage` (affine quantized tensor storage)

This PR just added support for a plain storage tensor (`PlainAQTStorage`) that stores `int_data`, `scale` and `zero_point` tensors directly,
in the next PR we'll also support storing packed weight (result of `torch.ops.aten._convert_weight_to_int4pack`) in a different
type of `AQTStorage`.

`AffineQuantizedTensor` will have the following:
- storage_tensor: AQTStorage (can store data of different storage formats)
- storage_layout: str (a string represents the type of storage_tensor we have, can be used in dispatch)

Test Plan:
python test/quantization/test_quant_api.py

Reviewers:

Subscribers:

Tasks:

Tags:
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants