077 autoquant gpt fast #361
Conversation
Force-pushed from 39e2205 to 35e1509
Force-pushed from 35e1509 to ecdc1fe
Just a few more minor pieces of feedback
test/integration/test_integration.py
Outdated
@parameterized.expand(COMMON_DEVICE_DTYPE)
@unittest.skipIf(not TORCH_VERSION_AFTER_2_3, "autoquant requires 2.3+.")
def test_autoquant_manual(self, device, dtype):
    if device != "cuda" and dtype != torch.bfloat16:
So is the idea here to skip if the device is cpu and we're using bf16? If so, can we flip the negatives to make this clearer?
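For illustration only (not code from this PR), flipping the negatives as suggested might look something like this; the self.skipTest call and message are assumptions about the surrounding test body:

# Logically equivalent to `device != "cuda" and dtype != torch.bfloat16`,
# but phrased positively so the skip intent is easier to read.
if not (device == "cuda" or dtype == torch.bfloat16):
    self.skipTest("autoquant test requires CUDA or bfloat16")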
Removed this; I had copied it from other tests at some point.
Wraps the given model in an AutoQuantWrapper. If `example_input` is provided, performs a forward pass on the input.
Otherwise, returns the wrapped model. The AutoQuantWrapper manages cases where the model is torch-compiled by first
performing autoquantization on the original model and then allowing the torch.compile run/tracing to occur.
Autoquantization is a process which identifies the fastest way to quantize each layer of a model over some set of potential
Cool, this helped quite a bit. Can you also make sure it renders correctly here: https://github.com/pytorch/ao/tree/main/docs/source
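For context, here is a minimal usage sketch of the default (non-manual) flow the docstring describes; the toy model, shapes, and the torch.compile wrapping are illustrative assumptions rather than code from this PR:

import torch
import torchao

# Toy model and example input, for illustration only.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to("cuda").to(torch.bfloat16)
example_input = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)

# Default flow: wrap the (optionally torch.compiled) model, then a single
# forward pass triggers shape logging, benchmarking, and quantization.
model = torchao.autoquant(torch.compile(model))
model(example_input)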
Summary: we were hitting the peak memory stat upon model load, not during model runtime. This is an issue since users can load the model to cpu/meta, which significantly reduces memory usage during model load/quant.
Test Plan: sh benchmarks.sh
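To illustrate the kind of measurement this changes, a hedged sketch of resetting the peak-memory counter after load so the reported peak reflects runtime rather than model load; the toy model and loop are stand-ins, not the actual generate.py code:

import torch

# Toy stand-in for the loaded (and possibly quantized/compiled) model.
model = torch.nn.Linear(1024, 1024, device="cuda", dtype=torch.bfloat16)

# Reset the peak-memory counter only after model load/quantization,
# so the reported peak reflects runtime generation, not the load path.
torch.cuda.reset_peak_memory_stats()

# Toy stand-in for the benchmark's generate loop.
for _ in range(10):
    _ = model(torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16))

print(f"peak memory during run: {torch.cuda.max_memory_reserved() / 1e9:.2f} GB")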
Summary: autoquant wasn't working for the llama benchmarks for a few reasons, the main one being that we were doing logging on prefill, not decode_one_token. We also weren't torch.compiling prefill, which obviated the whole point of autoquant benchmarking torch.compiled prefill shapes. To fix this, new functionality was needed for autoquant: an option to not automatically end logging upon a single call to model.forward. The flag manual_do_autoquant now controls whether you have to manually call model.do_autoquant() after logging is done, or whether it happens automatically after a model forward run.
A few other small fixes were also made:
1) updated where generate.py resets cuda memory so as to not confound it with torch.compile memory usage
2) updated the README with new numbers
3) improved the autoquant docstring
4) reordered benchmarks so they match what's in the README
Test Plan:
sh benchmarks.sh
python test_integration.py -k "test_autoquant_manual"
Force-pushed from 0913f14 to 8b29cd5
torchao/quantization/autoquant.py
Outdated
torchao.autoquant(model, manual=True)
model(*example_input1)
model(*example_input2)
model.do_autoquant()
Having both autoquant and do_autoquant seems a bit confusing. Also, can do_autoquant (maybe with a different name) also be a function like autoquant?
Looks good overall, just the do_autoquant API feels a bit weird; I think we can make it more intuitive.
torchao/_models/llama/generate.py
Outdated
model = autoquant(model, manual=True)

generate(
    model,
    encode_tokens(tokenizer, prompt, bos=True, device=device),
-   2,
-   interactive=False
+   max_new_tokens,
+   interactive=False,
+   temperature=temperature,
+   top_k=top_k,
)

# do autoquantization
model.do_autoquant()
Can you also comment on why this has to do autoquant in this way?
Because we optimize for the shapes autoquant sees during shape calibration, we have to run the full generate loop, which means we need a way to manually end shape calibration and then initialize benchmarking/quantization.
We need to do shape calibration with the actual shapes used by generate, so we set up autoquant, set it to wait until we manually end shape calibration, run generate so it logs the correct shapes, and then call do_autoquant to actually run the benchmarks with the shapes we've logged.
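To make that flow concrete, a minimal sketch with a toy model and made-up calibration shapes (the real benchmark runs the llama generate loop instead, and do_autoquant was later renamed finalize_autoquant in this PR):

import torch
import torchao

# Toy stand-in for the real model; the shapes below are illustrative only.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to("cuda").to(torch.bfloat16)
model = torchao.autoquant(model, manual=True)

# Shape calibration: run the shapes that matter at inference time,
# e.g. a prefill-like batch and a decode-like single-token batch.
model(torch.randn(128, 1024, device="cuda", dtype=torch.bfloat16))  # prefill-like
model(torch.randn(1, 1024, device="cuda", dtype=torch.bfloat16))    # decode-like

# End calibration: benchmark the logged shapes, pick the fastest option
# per layer, and quantize.
model.do_autoquant()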
What terms would you use? torchao.autoquant(model) seems fine. torchao.autoquant(model, manual=True) seems ok, maybe manual could be different? It doesn't seem super off though: the flag means 'prolonging shape calibration for multiple inputs (rather than ending after a single input and doing the benchmarks + quantization), and the user has to manually end shape calibration'. Other terms could be 'manual_shape_calibration_end', 'multi_input', 'defer_finalization'... of these, manual seems like the best of a bad bunch tbh.

Lastly there's model.do_autoquant(), which ends shape calibration, does benchmarking on the calibrated shapes, picks the best option, and then quantizes the layers. Feels like this could be 'finalize', 'finalize_autoquant' or something along those lines. This is the step where autoquantization actually happens though, so do_autoquant is literal; it's just that, when manual=True, the original autoquant api is less than completely accurate, but I don't see a good way around that.
Force-pushed from 87ef697 to 89527b4
Force-pushed from 89527b4 to c16593e
* fixing peak memory stats for benchmark
* Autoquantization work for benchmarks
* updating api name and improving docstrings
* oops missed a few manual_do_autoquant -> manual
* fix forward_log_only
* improving test conditions
* fixing nits
* final tests and change do_autoquant to finalize_autoquant
Summary:
autoquant wasn't working for the llama benchmarks for a few reasons, the main
one being that we were doing autoquant logging on prefill, not decode_one_token, which is an issue since the two
have different shapes. We also weren't torch.compiling prefill, which obviated the whole point of
autoquant benchmarking torch.compiled prefill shapes.
To fix this, new functionality was needed for autoquant: an option to not
automatically end logging upon a single call to model.forward. The flag
manual now controls whether you have to manually call model.finalize_autoquant() after logging is done, or
whether it happens automatically after a model forward run.
A few other small fixes were also made:
1) updated where generate.py resets cuda memory so as to not confound it with torch.compile memory usage
2) updated the README with new numbers
3) improved the autoquant docstring
4) reordered benchmarks so they match what's in the README
Test Plan: sh benchmarks.sh
python test_integration.py -k "test_autoquant_manual"
Reviewers: