Add sparsify API to torchao #473
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/473
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Pending as of commit 950c476 with merge base a895699.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
In some cases we rewrote popular GenAI models to be significantly faster in native PyTorch, as in no C++/CUDA, to achieve SOTA inference performance at the time. These involve more intrusive code changes.
```python
from torchao.sparsity import sparsify
from torchao.sparsity.prototype.dynamic_quant_sparse import int8_dynamic_activation_int8_2x4_sparse_weight
```
what's the plan to move this out of prototype? is this related to composing sparsity and quant properly?
Yeah I plan to move this out as part of 0.4, implementing a layout like here: https://github.com/pytorch/ao/compare/jcaip/affine-quantize-sparse?expand=1
This works, but I am running into a performance regression, so I need to debug that first before we can merge.
#### With intrusive code changes
m = sparsify(m, to_sparse_semi_structured)
also this is using `to_sparse_semi_structured` while the other one is using `int8_dynamic_activation_int8_2x4_sparse_weight()`, which might be a bit confusing, I'd suggest to just align
OK I'll add a `semi_sparse_weight()` wrapper function.
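For illustration, such a wrapper could mirror the existing weight-transform style. This is a minimal sketch under the assumption that `sparsify` applies the returned callable to each matching module's weight; the actual name and location in torchao may differ:

```python
import torch
from torch.sparse import to_sparse_semi_structured


def semi_sparse_weight():
    """Return a callable that converts a dense, already 2:4-pruned weight
    into a semi-structured sparse tensor, mirroring the call style of
    int8_dynamic_activation_int8_2x4_sparse_weight()."""
    def apply_fn(weight: torch.Tensor) -> torch.Tensor:
        return to_sparse_semi_structured(weight)
    return apply_fn
```

With a wrapper like this, the README example could read `m = sparsify(m, semi_sparse_weight())`, matching the call style of the int8 path.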
Cool! Minor questions above
@@ -37,7 +38,7 @@ def test_sparse(self):
        apply_fake_sparsity(model)
        dense_result = model(input)

-       apply_sparse_semi_structured(model)
+       model = sparsify(model, to_sparse_semi_structured)
is `sparsify` an in-place op? This came up recently since `quantize` is in place, and here it looks like the API used to be in place but now it's not
Yeah, I think there's some discussion on whether we can use `quantize_` or `quantize` for the in-place op, I'm not sure if we came to a conclusion. cc @jerryzh168 do you have a preference for what to use here?
we can use the in-place version for now I think
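To make the naming question concrete, here is a rough sketch of the two conventions being weighed, using hypothetical `sparsify_`/`sparsify` helpers rather than the actual torchao implementation:

```python
import torch


def sparsify_(model: torch.nn.Module, apply_fn) -> None:
    """Hypothetical in-place variant (trailing underscore, like quantize_):
    mutates the module's weights and returns nothing."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            module.weight = torch.nn.Parameter(apply_fn(module.weight))


def sparsify(model: torch.nn.Module, apply_fn) -> torch.nn.Module:
    """Hypothetical value-returning variant: callers write
    model = sparsify(model, ...), even if the weights are swapped in place."""
    sparsify_(model, apply_fn)
    return model
```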
In some cases we rewrote popular GenAI models to be significantly faster in native PyTorch, as in no C++/CUDA, to achieve SOTA inference performance at the time. These involve more intrusive code changes.
```python
from torchao.sparsity import sparsify
from torchao.sparsity.prototype.dynamic_quant_sparse import int8_dynamic_activation_int8_2x4_sparse_weight
```
We might have briefly chatted about this when we were discussing the quantize API, but just thinking out loud here: if I add `quantize(sparsify(m))`, is that different from `sparsify(quantize(m))`, and if so, in what order (if any) are optimizations applied in this case?
Right now, the composition of int8 quantization and 2:4 sparsity is treated as its own distinct technique, so you can either go `quantize(int8dynamic + 2:4 sparse)` or `sparsify(int8dynamic + 2:4 sparse)`. Once we implement sparsity as an AQTLayout we can add support for a "composable" API, where we go `quantize(int8 dynamic)` then `sparsify(to_sparse_semi_structured)`, or vice versa.
Mathematically, the order in which you apply the optimizations will matter, but I think we should make them equivalent so it doesn't matter for our API, for two reasons:
- Currently only quantize -> sparsify is supported; it would be extra work to support sparsify -> quantize.
- One of these orderings will be "better", and I can't really see a situation where the best order would differ across different layers. So we should always just default to the "best" one and not give users an option to shoot themselves in the foot.
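A minimal sketch of the two entry points described above, assuming the prototype import path shown earlier in this PR (module paths may change once this moves out of prototype):

```python
import torch
from torchao.sparsity import sparsify
from torchao.sparsity.prototype.dynamic_quant_sparse import (
    int8_dynamic_activation_int8_2x4_sparse_weight,
)

# Illustrative model; 2:4 sparsity generally requires half-precision
# weights on a supported CUDA GPU.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).half().cuda()

# Today the fused int8-dynamic + 2:4-sparse technique is applied in one shot:
model = sparsify(model, int8_dynamic_activation_int8_2x4_sparse_weight())

# The same transform is also reachable via the quantize entry point
# (quantize(model, int8_dynamic_activation_int8_2x4_sparse_weight())),
# rather than by composing quantize(...) and sparsify(...) as two steps.
```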
torchao/sparsity/sparse_api.py (Outdated)
m = sparsify(m, to_sparse_semi_structured, filter_fn)

# for int8 dynamic quantization + 2:4 sparsity
from torchao.sparity.prototype import int8_dynamic_activation_int8_2x4_sparse_weight
typo here sparity
mod.weight = torch.nn.Parameter(to_sparse_semi_structured(mod.weight))
Currently, we support two options for sparsity:
- semi-sturctured (2:4) sparsity with `to_sparse_semi_structured`
- int8 dynamic quantization + 2:4 sparsity with `int8_dynamic_activation_int8_2x4_sparse_weight`, which is also available via the quantize API
this is where I'm a bit confused, it's not clear from reading the docstrings whether the quantize and sparsify APIs compose
torchao/sparsity/sparse_api.py (Outdated)
if filter_fn(mod, name):
    mod.weight = torch.nn.Parameter(to_sparse_semi_structured(mod.weight))
Currently, we support two options for sparsity:
- semi-sturctured (2:4) sparsity with `to_sparse_semi_structured`
typo here as well sturctured
The failure in the nightly regression is a flake with smoothquant where it tries to download something from HF, so this is safe to merge.
* Add sparsify API to torchao
* fix typo
This PR removes the old `apply_sparse_semi_structured` API in favor of `sparsify`, so that we are more in line with the existing `quantize` API. I also updated the README so that it's a bit more clear that sparsity doesn't require intrusive code changes, and added a bit of example code.
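As a quick before/after sketch of the change described here (the model setup is illustrative, and `to_sparse_semi_structured` is assumed to come from `torch.sparse`):

```python
import torch
from torch.sparse import to_sparse_semi_structured
from torchao.sparsity import sparsify

# Illustrative model; weights would need to be pruned to a 2:4 pattern first.
model = torch.nn.Sequential(torch.nn.Linear(2048, 2048)).half().cuda()

# Before this PR (old API, now removed):
# apply_sparse_semi_structured(model)

# After this PR, mirroring the quantize-style API:
model = sparsify(model, to_sparse_semi_structured)
```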