077 autoquant gpt fast #361
Conversation
Force-pushed from 39e2205 to 35e1509
Force-pushed from 35e1509 to ecdc1fe
Just a few more minor pieces of feedback
test/integration/test_integration.py
Outdated
@parameterized.expand(COMMON_DEVICE_DTYPE)
@unittest.skipIf(not TORCH_VERSION_AFTER_2_3, "autoquant requires 2.3+.")
def test_autoquant_manual(self, device, dtype):
    if device != "cuda" and dtype != torch.bfloat16:
So is the idea here to skip if the device is cpu and we're using bf16? If so, can we flip the negatives to make this clearer?
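For illustration only (not code from this PR), flipping the negatives as suggested might look something like this; the self.skipTest call and message are assumptions about the surrounding test body:

# Logically equivalent to `device != "cuda" and dtype != torch.bfloat16`,
# but phrased positively so the skip intent is easier to read.
if not (device == "cuda" or dtype == torch.bfloat16):
    self.skipTest("autoquant test requires CUDA or bfloat16")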
Removed this; I had copied it from other tests at some point.
Wraps the given model in an AutoQuantWrapper. If `example_input` is provided, performs a forward pass on the input.
Otherwise, returns the wrapped model. The AutoQuantWrapper manages cases where the model is torch-compiled by first
performing autoquantization on the original model and then allowing the torch.compile run/tracing to occur.
Autoquantization is a process which identifies the fastest way to quantize each layer of a model over some set of potential
Cool, this helped quite a bit. Can you also make sure it renders correctly here: https://github.com/pytorch/ao/tree/main/docs/source
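For context, here is a minimal usage sketch of the default (non-manual) flow the docstring describes; the toy model, shapes, and the torch.compile wrapping are illustrative assumptions rather than code from this PR:

import torch
import torchao

# Toy model and example input, for illustration only.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to("cuda").to(torch.bfloat16)
example_input = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)

# Default flow: wrap the (optionally torch.compiled) model, then a single
# forward pass triggers shape logging, benchmarking, and quantization.
model = torchao.autoquant(torch.compile(model))
model(example_input)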
Summary: we were hitting the peak memory stat upon model load, not during model runtime. This is an issue since users can load the model to cpu/meta, which significantly reduces memory usage during model load/quant.
Test Plan: sh benchmarks.sh
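To illustrate the kind of measurement this changes, a hedged sketch of resetting the peak-memory counter after load so the reported peak reflects runtime rather than model load; the toy model and loop are stand-ins, not the actual generate.py code:

import torch

# Toy stand-in for the loaded (and possibly quantized/compiled) model.
model = torch.nn.Linear(1024, 1024, device="cuda", dtype=torch.bfloat16)

# Reset the peak-memory counter only after model load/quantization,
# so the reported peak reflects runtime generation, not the load path.
torch.cuda.reset_peak_memory_stats()

# Toy stand-in for the benchmark's generate loop.
for _ in range(10):
    _ = model(torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16))

print(f"peak memory during run: {torch.cuda.max_memory_reserved() / 1e9:.2f} GB")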
Summary: autoquant wasn't working for the llama benchmarks for a few reasons, the main one being that we were doing logging on prefill, not decode_one_token. We also weren't torch.compiling prefill, which obviated the whole point of autoquant benchmarking torch.compiled prefill shapes. To fix this, new functionality was needed for autoquant: an option to not automatically end logging upon a single call to model.forward. The flag manual_do_autoquant now controls whether you have to manually call model.do_autoquant() after logging is done, or whether it happens automatically after a model forward run.
A few other small fixes were also made:
1) updated where generate.py resets cuda memory so as to not confound it with torch.compile memory usage
2) updated the README with new numbers
3) improved the autoquant docstring
4) reordered benchmarks so they match what's in the README
Test Plan:
sh benchmarks.sh
python test_integration.py -k "test_autoquant_manual"
Force-pushed from 0913f14 to 8b29cd5
torchao/quantization/autoquant.py
Outdated
torchao.autoquant(model, manual=True)
model(*example_input1)
model(*example_input2)
model.do_autoquant()
Having both autoquant and do_autoquant seems a bit confusing. Also, can do_autoquant (maybe with a different name) also be a function like autoquant?
Looks good overall, just the do_autoquant API feels a bit weird; I think we can make it more intuitive.
torchao/_models/llama/generate.py
Outdated
model = autoquant(model, manual=True)

generate(
    model,
    encode_tokens(tokenizer, prompt, bos=True, device=device),
-   2,
-   interactive=False
+   max_new_tokens,
+   interactive=False,
+   temperature=temperature,
+   top_k=top_k,
)

# do autoquantization
model.do_autoquant()
Can you also comment on why this has to do autoquant in this way?
Because we optimize for the shapes autoquant sees during shape calibration, we have to run the full generate loop, which means we need a way to manually end shape calibration and then initialize benchmarking/quantization.
We need to do shape calibration with the actual shapes used by generate, so we set up autoquant, set it to wait until we manually end shape calibration, run generate so it logs the correct shapes, and then call do_autoquant to actually run the benchmarks with the shapes we've logged.
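To make that flow concrete, a minimal sketch with a toy model and made-up calibration shapes (the real benchmark runs the llama generate loop instead, and do_autoquant was later renamed finalize_autoquant in this PR):

import torch
import torchao

# Toy stand-in for the real model; the shapes below are illustrative only.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to("cuda").to(torch.bfloat16)
model = torchao.autoquant(model, manual=True)

# Shape calibration: run the shapes that matter at inference time,
# e.g. a prefill-like batch and a decode-like single-token batch.
model(torch.randn(128, 1024, device="cuda", dtype=torch.bfloat16))  # prefill-like
model(torch.randn(1, 1024, device="cuda", dtype=torch.bfloat16))    # decode-like

# End calibration: benchmark the logged shapes, pick the fastest option
# per layer, and quantize.
model.do_autoquant()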
What terms would you use? torchao.autoquant(model) seems fine. torchao.autoquant(model, manual=True) seems ok, maybe manual could be different? It doesn't seem super off though: the flag means 'prolonging shape calibration for multiple inputs (rather than ending after a single input and doing the benchmarks + quantization), and the user has to manually end shape calibration'. Other terms could be 'manual_shape_calibration_end', 'multi_input', 'defer_finalization'... of these, manual seems like the best of a bad bunch tbh.

Lastly there's model.do_autoquant(), which ends shape calibration, does benchmarking on the calibrated shapes, picks the best option, and then quantizes the layers. Feels like this could be 'finalize', 'finalize_autoquant' or something along those lines. This is the step where autoquantization actually happens though, so do_autoquant is literal; it's just that, when manual=True, the original autoquant api is less than completely accurate, but I don't see a good way around that.
Force-pushed from 87ef697 to 89527b4
Force-pushed from 89527b4 to c16593e
* fixing peak memory stats for benchmark
* Autoquantization work for benchmarks
* updating api name and improving docstrings
* oops missed a few manual_do_autoquant -> manual
* fix forward_log_only
* improving test conditions
* fixing nits
* final tests and change do_autoquant to finalize_autoquant
Summary:
autoquant wasn't working for the llama benchmarks for a few reasons, the main
one being that we were doing autoquant logging on prefill, not decode_one_token, which is an issue since the two
have different shapes. We also weren't torch.compiling prefill, which obviated the whole point of
autoquant benchmarking torch.compiled prefill shapes.
To fix this, new functionality was needed for autoquant: an option to not
automatically end logging upon a single call to model.forward. The flag
manual now controls whether you have to manually call model.finalize_autoquant() after logging is done, or
whether it happens automatically after a model forward run.
A few other small fixes were also made:
1) updated where generate.py resets cuda memory so as to not confound it with torch.compile memory usage
2) updated the README with new numbers
3) improved the autoquant docstring
4) reordered benchmarks so they match what's in the README
Test Plan: sh benchmarks.sh
python test_integration.py -k "test_autoquant_manual"
Reviewers: