Add Auto-Round support #581

Merged on Sep 4, 2024 (77 commits)

Conversation

yiliu30 (Contributor) commented Jul 31, 2024

Resolve #533

Description

  • Integrated Auto-Round with the quantize_ API using hooks + MultiTensor.
  • Exported the optimized qweight to AffineQuantizedTensor to leverage the tinygemm and Uintx kernels.
  • Evaluated the accuracy of Llama2/3/3.1 on 5 popular lm-eval tasks (more tests are on the way).
  • Added Auto-Round to the generation benchmarking for Llama2/3 (Llama 3.1 is not yet tested, as it landed only a few days ago).
  • Small fix for the Llama model (see "Fixed the llama model", #769).

Usage

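# `model`, `dataloader`, `device`, `model_device`, and `is_target_module` are supplied by the caller.
from torchao.prototype.autoround.core import prepare_model_for_applying_auto_round_
from torchao.prototype.autoround.core import apply_auto_round
# The MultiTensor module path is assumed from this PR's file layout ("rename mul_tensor to multi_tensor").
from torchao.prototype.autoround.multi_tensor import MultiTensor
from torchao.quantization import quantize_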

prepare_model_for_applying_auto_round_(
    model,
    is_target_module=is_target_module,
    bits=4,
    group_size=128,
    iters=200,
    device=device,
)

input_ids_lst = []
for data in dataloader:
    input_ids_lst.append(data["input_ids"].to(model_device))

multi_t_input_ids = MultiTensor(input_ids_lst)
out = model(multi_t_input_ids)

quantize_(model, apply_auto_round(), is_target_module)

For E2E examples, please refer to README.md.

cc @thuang6 @ftian1 @wenhuach21

yiliu30 added 19 commits July 24, 2024 02:56
Signed-off-by: yiliu30 <yi4.liu@intel.com>

pytorch-bot bot commented Jul 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/581

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 96f745d with merge base 05224a9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 31, 2024
yiliu30 (Contributor, Author) commented Jul 31, 2024

Hi @jerryzh168 @msaroufim, I'm reaching out to request a preliminary review of this PR. Although some refactoring is still in progress, I'd like to get your feedback to ensure we're on the right track before moving forward.

This draft PR includes:

  • 1. An end-to-end example that quantizes facebook/opt-125m with Auto-Round optimized qweight, scales, and zeros, and performs inference with torchao's Int4WeightOnlyQuantizedLinearWeight AffineQuantizedTensor.
  • 2. Cleaned up the dependencies of auto-round in the patch-for-ao-2 branch.

Some TODOs:

  • 3. Reduce the GPU memory consumption
  • 4. Support other bits and data types (currently, the weight bits are hardcoded to 4, and activations are not quantized)
  • 5. Further refactor auto-round
  • 6. Rearrange the code structure

Regarding 3), GPU memory consumption: in the current flow, I use hooks to capture the inputs and outputs of each block during the calibration stage. This differs from the original auto-round implementation, which captures only the input of the first decoder block and defers block inference to the quantization stage (similar to AutoAWQ's implementation). The approach in this PR has two limitations: a) GPU memory consumption is large when the calibration dataset is large. b) We cannot use the output of a previously quantized block as the input to the following block.
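To make the two limitations concrete, here is a minimal sketch of hook-based capture. It is not the PR's code: the three `nn.Linear` "blocks", the `captured` dictionary, and the `make_hook` helper are all illustrative assumptions. Only `register_forward_hook` is the real PyTorch API.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a decoder stack: three "blocks" run in sequence.
model = nn.Sequential(*[nn.Linear(8, 8) for _ in range(3)])

# Hypothetical cache: block name -> {"inputs": [...], "outputs": [...]}
captured = {}


def make_hook(name):
    def hook(module, args, output):
        record = captured.setdefault(name, {"inputs": [], "outputs": []})
        record["inputs"].append(args[0].detach())
        record["outputs"].append(output.detach())

    return hook


handles = [
    block.register_forward_hook(make_hook(f"block_{i}"))
    for i, block in enumerate(model)
]

# Calibration pass: every batch leaves two cached tensors per block.
calibration_batches = [torch.randn(4, 8) for _ in range(5)]
with torch.no_grad():
    for batch in calibration_batches:
        model(batch)

for handle in handles:
    handle.remove()

num_cached = sum(len(r["inputs"]) + len(r["outputs"]) for r in captured.values())
```

The cache grows as blocks × batches × 2 (here 3 × 5 × 2 = 30 tensors), which is limitation a). The inputs recorded for `block_1` are exactly the outputs of the unquantized `block_0`, which is limitation b).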

This approach mainly aims to align with the static quantization flow and use the quantize_ API. Would you be open to refactoring the flow a bit to resolve these limitations, or do you have other suggestions? I think AutoAWQ might also need such adjustments.

Signed-off-by: yiliu30 <yi4.liu@intel.com>
yiliu30 (Contributor, Author) commented Aug 1, 2024

Hi @jerryzh168, for 3), I noticed that GPTQ has a similar complication. #577

"Instead, we want to run the model for each input, but ONLY up to the first linear, then pause, do the algorithm to update the weight, get the outputs for the updated weight and then, unpause and continue on until we hit the next linear….etc."

The main difference is that GPTQ handles a single Linear layer, whereas auto-round works on a decoder block (it may also work on a Linear layer when quantizing the lm-head).

Inspired by HDCharles's proposal, I tried to extend it to auto-round. With MultiTensor in place, the remaining issue is enabling the dispatcher to identify a decoder block, such as OPTDecoderLayer.
I resolved this by defining a custom operation called general_decoder and swapping all decoder blocks with it. We then run inference with the calibration dataset. When the dispatcher encounters general_decoder, it jumps to auto-round's optimization process with all accumulated inputs and returns the outputs of either the optimized model or the original model.
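The pause-optimize-continue flow described above can be sketched in plain Python. This is a toy model under loud assumptions: each "block" is a single scalar weight, `quantize_weight` is a stand-in for the real per-block optimization, and the `use_optimized_output` flag is a hypothetical name for the choice between propagating optimized or original outputs.

```python
def quantize_weight(w, step=0.5):
    """Toy stand-in for per-block optimization: snap the weight to a coarse grid."""
    return round(w / step) * step


def run_blockwise(block_weights, calibration_inputs, use_optimized_output=True):
    """Visit blocks in order, pausing at each one to optimize it.

    Only the inputs of the current block are held at any time.
    """
    current_inputs = list(calibration_inputs)
    quantized = []
    for w in block_weights:
        # "Pause" here: all accumulated inputs for this block are available.
        q = quantize_weight(w)
        quantized.append(q)
        # Decide which version of the block produces the next block's inputs.
        weight_for_next = q if use_optimized_output else w
        current_inputs = [weight_for_next * x for x in current_inputs]
    return quantized, current_inputs


weights = [1.2, 0.8]
inputs = [1.0, 2.0]
quantized, outputs_optimized = run_blockwise(weights, inputs, use_optimized_output=True)
_, outputs_original = run_blockwise(weights, inputs, use_optimized_output=False)
```

With `use_optimized_output=True`, the second block sees what the already-quantized first block produces, which the hook-based capture could not provide. Memory use also stays at one set of activations instead of one per block.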

I have prepared a full demo here. Could you please take a look? Thanks a lot!

jerryzh168 (Contributor) commented

@yiliu30 sorry for the late reply, I think using MultiInput from @HDCharles's GPTQ issue makes sense for your use case, since Auto-Round flow is similar to GPTQ flow but does not fit into the static quant flow (with observers) very well

jerryzh168 (Contributor) commented

one small nit for the "general_decoder", we can use

if func is torch.ops.transformers_ops.general_decoder:
   outputs = optimize_decoder(func, grouped_args, spec)

instead of looking at func.__name__
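A small plain-Python illustration of why an identity check is safer than a name check. The two functions below are hypothetical stand-ins, not real torch ops; they only share a `__name__`.

```python
def make_op(namespace):
    # Every function produced here has __name__ == "general_decoder".
    def general_decoder(x):
        return f"{namespace}:{x}"

    return general_decoder


registered_op = make_op("transformers_ops")  # the op we actually want to intercept
unrelated_op = make_op("some_other_lib")     # a different op with the same name


def matches_by_name(func):
    return func.__name__ == "general_decoder"


def matches_by_identity(func):
    return func is registered_op


name_check_is_fooled = matches_by_name(unrelated_op)
identity_check_is_fooled = matches_by_identity(unrelated_op)
```

The name check accepts both functions, while the identity check accepts only the registered one.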

also after this is done, I think we can improve our current utils for operator implementation:

import functools

import torch


def _implements(cls, aten_ops_or_torch_fns):
    """Use this decorator to implement a function for an aten ops in __torch_dispatch__
    (if user passed in a list of ops)
    or torch function in __torch_function__ (if user passed in a single object)

    class MyTensor(torch.Tensor):
        ...
        implements = classmethod(_implements)

    implements = MyTensor.implements

    @implements(torch.nn.functional.linear)
    def _(func, types, args, kwargs):
        ...
    """
    # Create the per-class lookup table on first use.
    if not hasattr(cls, "_ATEN_OP_OR_TORCH_FN_TABLE"):
        cls._ATEN_OP_OR_TORCH_FN_TABLE = {}

    # Accept either a single op/function or a list of them.
    if not isinstance(aten_ops_or_torch_fns, (list, tuple)):
        aten_ops_or_torch_fns = [aten_ops_or_torch_fns]

    def decorator(func):
        for op in aten_ops_or_torch_fns:
            @functools.wraps(op)
            def wrapper(*args, **kwargs):
                return func(*args, **kwargs)

            cls._ATEN_OP_OR_TORCH_FN_TABLE[op] = wrapper
        return func

    return decorator


def _dispatch__torch_function__(cls, func, types, args=(), kwargs=None):
    """Use this util function for a common `__torch_function__` implementation
    that dispatches to ops/functions registered with `_implements`

    class MyTensor(torch.Tensor):
        ...
        __torch_function__ = classmethod(_dispatch__torch_function__)
    """
    kwargs = {} if kwargs is None else kwargs
    if hasattr(cls, "_ATEN_OP_OR_TORCH_FN_TABLE") and \
            func in cls._ATEN_OP_OR_TORCH_FN_TABLE:
        return cls._ATEN_OP_OR_TORCH_FN_TABLE[func](func, types, *args, **kwargs)

    # Fall back to the default behavior for unregistered functions.
    with torch._C.DisableTorchFunctionSubclass():
        return func(*args, **kwargs)


def _dispatch__torch_dispatch__(cls, func, types, args, kwargs):
    """Use this util function for a common `__torch_dispatch__` implementation
    that dispatches to ops/functions registered with `_implements`

    class MyTensor(torch.Tensor):
        ...
        __torch_dispatch__ = classmethod(_dispatch__torch_dispatch__)
    """
    if hasattr(cls, "_ATEN_OP_OR_TORCH_FN_TABLE") and \
            func in cls._ATEN_OP_OR_TORCH_FN_TABLE:
        return cls._ATEN_OP_OR_TORCH_FN_TABLE[func](func, types, *args, **kwargs)

    raise NotImplementedError(f"{cls.__name__} dispatch: attempting to run unimplemented operator/function: {func}")
and incorporate this use case so you can reduce boilerplate code
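As a rough, torch-free sketch of how such a registration table could cover the general_decoder case: the names `MyContainer`, `_OP_TABLE`, and the handler signature below are assumptions made for illustration, not the torchao utilities themselves.

```python
def _implements(cls, ops):
    """Register a handler for one op or a list of ops on `cls`."""
    # Use cls.__dict__ so each class gets its own table instead of a parent's.
    if "_OP_TABLE" not in cls.__dict__:
        cls._OP_TABLE = {}
    if not isinstance(ops, (list, tuple)):
        ops = [ops]

    def decorator(func):
        for op in ops:
            cls._OP_TABLE[op] = func
        return func

    return decorator


def _dispatch(cls, func, args=(), kwargs=None):
    """Route `func` to its registered handler, or fail loudly."""
    kwargs = {} if kwargs is None else kwargs
    handler = getattr(cls, "_OP_TABLE", {}).get(func)
    if handler is None:
        raise NotImplementedError(f"{cls.__name__}: no handler for {func}")
    return handler(func, args, kwargs)


class MyContainer:
    implements = classmethod(_implements)
    dispatch = classmethod(_dispatch)


def general_decoder(x):
    # Hypothetical stand-in for the custom decoder op.
    return x


@MyContainer.implements(general_decoder)
def _(func, args, kwargs):
    # A real handler would run the block optimization here.
    return ("optimized", func(*args, **kwargs))


result = MyContainer.dispatch(general_decoder, (3,))
```

Keying the table on the op object itself means the lookup is an identity-based match, so the handler needs no `func.__name__` comparison.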

yiliu30 added 2 commits August 8, 2024 04:44
Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30 yiliu30 requested a review from jerryzh168 August 26, 2024 03:52
wenhuach21 commented

I was curious about the compute dtype supported by the AO kernel. If it only supports FP16, I recommend forcing the dtype to FP16 before passing it to AutoRound. However, if BF16 is also supported, it would be preferable to set the scale_type in AutoRound to align with the original model.

Additionally, the accuracy data slightly differs from the results of our recipe, which may not be solely due to changes in hyperparameters. We should investigate this further.

Signed-off-by: yiliu30 <yi4.liu@intel.com>
jerryzh168 (Contributor) commented

Quoting @wenhuach21:

"I was curious about the compute dtype supported by the AO kernel. If it only supports FP16, I recommend forcing the dtype to FP16 before passing it to AutoRound. However, if BF16 is also supported, it would be preferable to set the scale_type in AutoRound to align with the original model.

Additionally, the accuracy data slightly differs from the results of our recipe, which may not be solely due to changes in hyperparameters. We should investigate this further."

It depends on the kernel. I think int4 weight-only quantization, which uses the tinygemm kernel, supports only bfloat16.

quantize_(model, apply_auto_round(), is_target_module)
```

## End-to-End Results
Contributor commented on the lines above:

so what about performance results?

jerryzh168 (Contributor) left a comment

The code changes look good to me. One comment: please include performance data (tokens/s, memory, etc.) in the README as well, similar to https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks

Signed-off-by: yiliu30 <yi4.liu@intel.com>
yiliu30 (Contributor, Author) commented Aug 28, 2024

The benchmark depends on #769

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30 yiliu30 mentioned this pull request Sep 3, 2024
…#16)

* wrap model's buffers and params to `MultiTensor` and update the results

Signed-off-by: yiliu30 <yi4.liu@intel.com>
)
else:
is_target_module = lambda mod, fqn: isinstance(mod, TransformerBlock)
quantize_model_with_autoround_(
Contributor commented on the lines above:

nit: should we use the same flow everywhere to reduce confusion? I think the flow in https://github.com/pytorch/ao/pull/581/files#diff-af129d63635a3b5b0a95f1a3831f852fbd7bedfd66b38d41bf4975fb49aad246 would be the recommended one.

jerryzh168 (Contributor) commented

Thanks @yiliu30 for addressing all the comments!

@jerryzh168 jerryzh168 merged commit f5703b0 into pytorch:main Sep 4, 2024
17 checks passed
yiliu30 (Contributor, Author) commented Sep 4, 2024

@jerryzh168 Thanks for your patient guidance and detailed examples. This joint effort will allow more users to benefit from AO and auto-round!

jerryzh168 pushed a commit to jerryzh168/ao that referenced this pull request Sep 4, 2024
* initial flow for autoround

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* update flow

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* use int4 kernel

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* remove debug code

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* update the forward

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* clean code

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* e2e example

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* refine code

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add requirements for test

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* update test

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* update the readme

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add readme

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* update the filenames

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* update the np version

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add demo

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* format

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add more docs

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* format

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add doc

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* use `AffineQuantizedTensor`

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* impl ar using multensors

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* clean code

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* use hook + multensors

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* separate mul_tensors into a new file

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* fix typos

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* rename mul_tensor to multi_tensor

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* enable amp

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* eval model

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add gen examples

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add warmup to benchmark

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add benchmark

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* clean code

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* format code

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* use tiny kernel

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add more note

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* format

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* correct typos

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* remove hard code

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* use intx

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* enable offload for multitensor

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* update the default config

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* refine note

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* update the version check

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* format

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* update

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add ut

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* format

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add scripts

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* format code

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* format

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* update

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* fix typo

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* refine bench code

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* Enable `use_optimized_layer_output` and AO' llama (pytorch#12)

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* Refine the Doc (pytorch#14)

---------

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add more docstring

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add paper link

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* correct some note

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add cmd

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* udpdate the scripts

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* revert some change

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* Add a lightweight configuration for quick benchmarking (pytorch#15)

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* update quant method name

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* Wrap model's buffers and params to `MultiTensor` & update the results (pytorch#16)

* wrap model's buffers and params to `MultiTensor` and update the results

Signed-off-by: yiliu30 <yi4.liu@intel.com>

---------

Signed-off-by: yiliu30 <yi4.liu@intel.com>
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
* executable README

* fix title of CI workflow

* markup commands in markdown

* extend the markup-markdown language

* Automatically identify cuda from nvidia-smi in install-requirements (pytorch#606)

* Automatically identify cuda from nvidia-smi in install-requirements

* Update README.md

---------

Co-authored-by: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com>

* Unbreak zero-temperature sampling (pytorch#599)

Fixes pytorch#581.

* Improve process README

* [retake] Add sentencepiece tokenizer (pytorch#626)

* Add sentencepiece tokenizer

* Add white space

* Handle white space:

* Handle control ids

* More cleanup

* Lint

* Use unique_ptr

* Use a larger runner

* Debug

* Debug

* Cleanup

* Update install_utils.sh to use python3 instead of python (pytorch#636)

As titled. On some devices `python` and `python3` are pointing to different environments so good to unify them.

* Fix quantization doc to specify dytpe limitation on a8w4dq (pytorch#629)

Co-authored-by: Kimish Patel <kimishpatel@fb.com>

* add desktop.json (pytorch#622)

* add desktop.json

* add fast

* remove embedding

* improvements

* update readme from doc branch

* tab/spc

* fix errors in updown language

* fix errors in updown language, and [skip]: begin/end

* fix errors in updown language, and [skip]: begin/end

* a storied run

* stories run on readme instructions does not need HF token

* increase timeout

* check for hang un hf_login

* executable README improvements

* typo

* typo

---------

Co-authored-by: Ian Barber <ian.barber@gmail.com>
Co-authored-by: Scott Wolchok <swolchok@meta.com>
Co-authored-by: Mengwei Liu <larryliu0820@users.noreply.github.com>
Co-authored-by: Kimish Patel <kimishpatel@fb.com>
Co-authored-by: Scott Roy <161522778+metascroy@users.noreply.github.com>
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFC] Add Auto-Round support
4 participants