Refactor int8 dynamic quantization with call to quantize
#294
Conversation
test/quantization/test_quant_api.py
benchmark_model(m_ref, WARMUP, input_tensor)
ref_elapsed_time = benchmark_model(m_ref, RUNS, input_tensor)

# recent measurement result:
Specify the device, I assume you mean A100? Also I'd much rather this be in docs than in code.
Yeah, this is on an A100. I can remove these as well.
# elapsed time: 0.22115007400512696, ref elapsed time: 0.32116992950439455
# elapsed time: 0.2349068832397461, ref elapsed time: 0.2812371253967285
print(f"elapsed time: {elapsed_time}, ref elapsed time: {ref_elapsed_time}")
self.assertTrue(elapsed_time < 1.05 * ref_elapsed_time)
This test might end up being flaky. Also, how long does it take? It seems strange to do a benchmark in unit tests.
This is pretty quick when I run it on my A100 machine; it finishes in a few seconds. I could also skip it by default and just have people run it locally when making changes to these APIs.
skipped this one by default
Yeah, doing benchmarks in unit tests is a known anti-pattern. Test environments don't need to be consistent, and it's likely a waste of resources to make them so.
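To make the intent above concrete, here is a minimal sketch of a perf-comparison test that is skipped by default and only run on demand (e.g. locally on an A100). The `RUN_PERF_TESTS` environment variable, the test class name, and the `bench` helper are illustrative assumptions, not the exact code in test_quant_api.py:

```python
import copy
import os
import time
import unittest

import torch

RUN_PERF = os.getenv("RUN_PERF_TESTS") == "1"


class TestInt8DynamicQuantPerf(unittest.TestCase):
    @unittest.skipUnless(RUN_PERF, "perf test; set RUN_PERF_TESTS=1 to run locally")
    @unittest.skipUnless(torch.cuda.is_available(), "requires a CUDA device (e.g. A100)")
    def test_perf(self):
        m_ref = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().eval()
        m = copy.deepcopy(m_ref)  # the quantized copy would be produced here
        example_input = torch.randn(1, 1024, device="cuda")

        def bench(model, runs):
            # crude wall-clock timing with CUDA sync; the repo's benchmark_model
            # helper plays a similar role
            torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(runs):
                model(example_input)
            torch.cuda.synchronize()
            return (time.perf_counter() - start) / runs

        bench(m, 5)        # warmup
        bench(m_ref, 5)    # warmup
        elapsed_time = bench(m, 20)
        ref_elapsed_time = bench(m_ref, 20)
        # allow a small tolerance so measurement noise does not flip the assertion
        self.assertTrue(elapsed_time < 1.05 * ref_elapsed_time)


if __name__ == "__main__":
    unittest.main()
```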
    args[1],
    args[2],
    args[0],
)
try:
    return _quantized_linear_op(input_tensor, weight_tensor, bias)
except:
There's a bunch of code duplication here. Also, why do we need the try/except block?
Oh, we actually need to call the function in different ways here; not sure when the change was reverted, will fix.
The try/except is used as a fallback for when the specific configuration of input and weight tensor is not caught by any of the special dispatches in _quantized_linear_op.
added some comments
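For context, here is a minimal sketch of the fallback pattern described above: try the specialized quantized linear dispatch first, and if the input/weight configuration is not handled by any special case, dequantize and fall back to the plain float path. The stubbed `_quantized_linear_op` body and the `F.linear` fallback are illustrative assumptions, not the exact code in this PR:

```python
import torch
import torch.nn.functional as F


def _quantized_linear_op(input_tensor, weight_tensor, bias):
    # stand-in for the real dispatch helper: it tries a series of specialized
    # int8/int4 kernels and raises when no special case matches this
    # input/weight configuration
    raise NotImplementedError("no specialized dispatch for this configuration")


def quantized_linear_with_fallback(input_tensor, weight_tensor, bias):
    try:
        return _quantized_linear_op(input_tensor, weight_tensor, bias)
    except Exception:
        # fallback path: run the unquantized op on dequantized tensors
        if hasattr(input_tensor, "dequantize"):
            input_tensor = input_tensor.dequantize()
        if hasattr(weight_tensor, "dequantize"):
            weight_tensor = weight_tensor.dequantize()
        return F.linear(input_tensor, weight_tensor, bias)


# usage sketch: with plain float tensors the specialized path raises and the
# float fallback produces the result
x, w, b = torch.randn(2, 8), torch.randn(4, 8), torch.randn(4)
print(quantized_linear_with_fallback(x, w, b).shape)  # torch.Size([2, 4])
```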
)

if func is aten.t.default:
    return return_and_correct_aliasing(
n00b q: could you remind me what return_and_correct_aliasing is doing?
The code is mostly copy-pasted from other places, but it looks like it's a trick for tensor subclasses to work with torch.compile and to make sure we have correct aliasing behavior: https://github.com/pytorch/pytorch/blob/214dd44608a92802f9c13471451ae09cf6b25fd0/torch/utils/_python_dispatch.py#L518
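For reference, a hedged sketch of how `return_and_correct_aliasing` is typically used inside a tensor subclass's `__torch_dispatch__` handler for `aten.t.default`; the `_apply_fn_to_data` helper is assumed to be a method on the subclass (as in the diff), and the handler below is illustrative rather than the exact implementation:

```python
import torch
from torch.utils._python_dispatch import return_and_correct_aliasing

aten = torch.ops.aten


def handle_t_default(func, types, args, kwargs):
    # Called from a tensor subclass's __torch_dispatch__ when func is
    # aten.t.default. The transposed result is first computed on the
    # subclass's inner data; return_and_correct_aliasing then rewrites the
    # output's storage/aliasing metadata so that view semantics and
    # functionalization under torch.compile match what the plain aten op
    # would have produced.
    transposed = args[0]._apply_fn_to_data(torch.t)  # assumed subclass helper
    return return_and_correct_aliasing(func, args, kwargs, transposed)
```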
input_tensor = input_tensor.dequantize()
if isinstance(weight_tensor, AffineQuantizedTensor):
    weight_tensor = weight_tensor.dequantize()
return func(input_tensor, weight_tensor)
So here is a difference in how we call the function: since we have aten.mm here, the order of the args is different from aten.addmm.
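A small sketch of that argument-ordering difference, assuming the usual decompositions (`aten.addmm(bias, input, weight_t)` for the biased path, `aten.mm(input, weight_t)` otherwise); the unpacked operands would then be handed to `_quantized_linear_op` as in the diff. The helper name below is illustrative:

```python
import torch

aten = torch.ops.aten


def _unpack_linear_args(func, args):
    # F.linear with a bias typically decomposes to aten.addmm(bias, input, weight_t),
    # while the bias-free path decomposes to aten.mm(input, weight_t), so the
    # operands have to be pulled out in a different order depending on func.
    if func is aten.addmm.default:
        bias, input_tensor, weight_tensor = args[0], args[1], args[2]
    elif func is aten.mm.default:
        input_tensor, weight_tensor = args[0], args[1]
        bias = None
    else:
        raise NotImplementedError(f"unsupported func: {func}")
    # returned as a tuple here to keep the sketch self-contained
    return input_tensor, weight_tensor, bias
```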
Replace implementation for int8 dynamic quantization with call to `quantize`
Summary: Previously we added `quantize` as a general API (pytorch#256) for the Affine Quantized tensor subclass, and for tensor-subclass-based dtype conversion in general. The plan is to use this to replace existing quant APIs including int4 weight only, int8 weight only, int8 dynamic quant and 8da4w (for executorch). In this PR we started replacing the implementation of the int8 dynamic quant API with the `quantize` API and the affine quantized tensor subclass. We'll make sure the performance does not regress for the vit model.
Test Plan: TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py
reference: elapsed_time: 1.4821058654785155 milliseconds
after refactor: elapsed_time: 1.4804757690429688 milliseconds
generated code diff: https://gist.github.com/jerryzh168/90c71107a5aaaa5d8dd2170c573e076d
Refactor int8 weight only quant to use `quantize`
Summary: Similar to pytorch#294, we replaced the implementation of int8 weight only quant to use the newly added `quantize` function, as part of the unification effort for affine quantization.
Test Plan:
1. unit perf test: python test/quantization/test_quant_api.py -k test_quantized_tensor_subclass_int8_wo_quant_perf
elapsed time: 0.23909856796264647, ref elapsed time: 0.25150911331176756
elapsed time: 0.24894208908081056, ref elapsed time: 0.2570047950744629
elapsed time: 0.21607391357421876, ref elapsed time: 0.22809568405151368
2. integration test: TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py
Reference: elapsed_time: 1.355208740234375 milliseconds
After refactor: elapsed_time: 1.32778857421875 milliseconds
code diff (gist): https://gist.github.com/jerryzh168/921a722cf20d476c8fc5888482e722dc
code diff (meta-only paste): https://www.internalfb.com/phabricator/paste/view/P1387333845
Refactor int4 weight only quantization with call to `quantize`
Summary: This is similar to pytorch#294 but applied to int4 weight only quantization.
Test Plan:
unit perf test: python test/quantization/test_quant_api.py -k test_quantized_tensor_subclass_int4_wo_quant_perf
elapsed time: 0.2166275215148926, ref elapsed time: 0.2191881561279297
elapsed time: 0.2376406478881836, ref elapsed time: 0.22721023559570314
elapsed time: 0.21919679641723633, ref elapsed time: 0.2154969596862793
integration perf test: TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py
reference: elapsed_time: 2.5900126953125 milliseconds
after refactor: elapsed_time: 2.56680078125 milliseconds
diff: no diff
Summary:
Previously we added `quantize` as a general API (#256) for the Affine Quantized tensor subclass, and also for tensor-subclass-based dtype conversion in general. The plan is to use this to replace existing quant APIs including int4 weight only, int8 weight only, int8 dynamic quant and 8da4w (for executorch).
In this PR we started replacing the implementation of the int8 dynamic quant API with the `quantize` API and the affine quantized tensor subclass. We'll make sure the performance does not regress for the vit model.
Next:
Test Plan:
TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py
reference: elapsed_time: 1.4821058654785155 milliseconds
after refactor: elapsed_time: 1.4826457214355468 milliseconds
generated code diff: https://gist.github.com/jerryzh168/edca9d421363582b66d41c9cc2db0f7b
(meta only) paste diff: https://www.internalfb.com/phabricator/paste/view/P1385188208
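To illustrate what the `quantize`-based flow does at a high level, here is a minimal conceptual sketch: a helper that walks a model and replaces each linear weight with whatever the supplied callable returns (in the real refactor this would be an AffineQuantizedTensor configured for int8 dynamic quantization). The function name and the identity callable below are hypothetical and only meant to show the shape of the flow, not the API added in #256:

```python
import torch


def quantize_model_sketch(model, to_quantized_weight):
    # conceptual version of the `quantize` flow: walk the module tree and swap
    # every nn.Linear weight for the tensor returned by `to_quantized_weight`
    # (an AffineQuantizedTensor in the real refactor)
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            module.weight = torch.nn.Parameter(
                to_quantized_weight(module.weight), requires_grad=False
            )
    return model


# usage sketch: an identity callable is used here only so the example runs
# end to end; the real callable would quantize the weight
model = torch.nn.Sequential(torch.nn.Linear(16, 16)).eval()
model = quantize_model_sketch(model, lambda w: w.detach())
print(type(model[0].weight))
```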