
Add Auto-Round support #581

Merged: 77 commits, Sep 4, 2024
Changes shown are from 22 commits.

Commits
be78a08
initial flow for autoround
yiliu30 Jul 24, 2024
49f8075
update flow
yiliu30 Jul 25, 2024
62834a2
use int4 kernel
yiliu30 Jul 26, 2024
6433e75
remove debug code
yiliu30 Jul 26, 2024
65f46e5
update the forward
yiliu30 Jul 29, 2024
1e22c11
clean code
yiliu30 Jul 29, 2024
b8d37b9
e2e example
yiliu30 Jul 30, 2024
8d388fb
refine code
yiliu30 Jul 30, 2024
07a95a0
add requirements for test
yiliu30 Jul 30, 2024
6baa62f
update test
yiliu30 Jul 30, 2024
78a5067
update the readme
yiliu30 Jul 30, 2024
37e9f5f
add readme
yiliu30 Jul 30, 2024
8bfe76a
update the filenames
yiliu30 Jul 30, 2024
e25d6eb
update the np version
yiliu30 Jul 30, 2024
16a901d
add demo
yiliu30 Jul 30, 2024
5f16e8d
format
yiliu30 Jul 30, 2024
f3442c5
add more docs
yiliu30 Jul 31, 2024
432da79
format
yiliu30 Jul 31, 2024
7ee9f9b
add doc
yiliu30 Jul 31, 2024
e5ffcca
use `AffineQuantizedTensor`
yiliu30 Jul 31, 2024
cec375b
impl ar using multensors
yiliu30 Aug 8, 2024
a8f5681
clean code
yiliu30 Aug 8, 2024
ab08cb3
use hook + multensors
yiliu30 Aug 12, 2024
5ee2e06
separate mul_tensors into a new file
yiliu30 Aug 12, 2024
a5a3544
fix typos
yiliu30 Aug 12, 2024
e406ee8
rename mul_tensor to multi_tensor
yiliu30 Aug 13, 2024
7b6908e
enable amp
yiliu30 Aug 13, 2024
6a4d67c
eval model
yiliu30 Aug 13, 2024
c1fa230
add gen examples
yiliu30 Aug 13, 2024
5eef0a6
merge with main
yiliu30 Aug 13, 2024
e4cfa7d
add warmup to benchmark
yiliu30 Aug 13, 2024
41d9afd
add benchmark
yiliu30 Aug 13, 2024
e1cec58
Merge branch 'main' into re-a3
yiliu30 Aug 13, 2024
6f20e25
Merge branch 'auto_round_support-3' of https://github.com/yiliu30/tor…
yiliu30 Aug 13, 2024
ee1510c
Merge branch 'auto_round_support-3' into auto_round_support-3-bench
yiliu30 Aug 13, 2024
ca5bb30
clean code
yiliu30 Aug 13, 2024
e01e028
format code
yiliu30 Aug 13, 2024
8532af0
Merge pull request #3 from yiliu30/auto_round_support-3-bench
yiliu30 Aug 13, 2024
5106fe0
use tiny kernel
yiliu30 Aug 16, 2024
b82b638
add more note
yiliu30 Aug 16, 2024
bb08957
format
yiliu30 Aug 16, 2024
b5f08c5
Merge pull request #4 from yiliu30/auto_round_support-3-tinygemm-kernel
yiliu30 Aug 16, 2024
c8fc3f6
correct typos
yiliu30 Aug 16, 2024
7ee493f
remove hard code
yiliu30 Aug 19, 2024
1f75897
use intx
yiliu30 Aug 19, 2024
48d0903
Merge pull request #6 from yiliu30/auto_round_support-3-intx
yiliu30 Aug 19, 2024
34e6b49
enable offload for multitensor
yiliu30 Aug 20, 2024
eeca10b
update the default config
yiliu30 Aug 20, 2024
2b94608
refine note
yiliu30 Aug 21, 2024
0d38b20
Merge pull request #8 from yiliu30/enable-llama3
yiliu30 Aug 21, 2024
f04b594
Merge branch 'main' into auto_round_support-3
yiliu30 Aug 21, 2024
0e0b06d
update the version check
yiliu30 Aug 21, 2024
d0a4920
format
yiliu30 Aug 22, 2024
1e8a081
update
yiliu30 Aug 22, 2024
5b3374f
add ut
yiliu30 Aug 22, 2024
4ef0cdc
format
yiliu30 Aug 22, 2024
5f78c73
add scripts
yiliu30 Aug 22, 2024
5baae13
format code
yiliu30 Aug 22, 2024
6feb975
format
yiliu30 Aug 22, 2024
f6ed1e0
Merge pull request #9 from yiliu30/auto_round_support-3-unified-api
yiliu30 Aug 22, 2024
03cd9fc
update
yiliu30 Aug 22, 2024
e60b815
fix typo
yiliu30 Aug 22, 2024
b20e6d9
refine bench code
yiliu30 Aug 22, 2024
fabe8d2
Merge branch 'main' into auto_round_support-3
yiliu30 Aug 25, 2024
9ae5392
Enable `use_optimized_layer_output` and AO' llama (#12)
yiliu30 Aug 26, 2024
157c189
Refine the Doc (#14)
yiliu30 Aug 26, 2024
2df3f5f
add more docstring
yiliu30 Aug 26, 2024
d719460
add paper link
yiliu30 Aug 26, 2024
d7ba39e
correct some note
yiliu30 Aug 26, 2024
a2c6b28
add cmd
yiliu30 Aug 27, 2024
896d87f
resolve conflicts
yiliu30 Aug 28, 2024
6a8e073
udpdate the scripts
yiliu30 Aug 28, 2024
9e48d1a
revert some change
yiliu30 Aug 28, 2024
5ca125e
Add a lightweight configuration for quick benchmarking (#15)
yiliu30 Aug 29, 2024
b6d95ce
merge with main
yiliu30 Aug 30, 2024
21686f1
update quant method name
yiliu30 Aug 30, 2024
96f745d
Wrap model's buffers and params to `MultiTensor` & update the results…
yiliu30 Sep 3, 2024
12 changes: 12 additions & 0 deletions torchao/prototype/autoround/README.md
@@ -0,0 +1,12 @@
### Usage

Contributor:

I think more explanation of the API and a summary of the perf and accuracy of llama2/llama3 would be helpful, similar to https://github.com/pytorch/ao/tree/main/torchao/prototype/quant_llm

Contributor Author:
Hi @jerryzh168, I summarized the usage and posted partial results in the README.md. More tests are WIP.

> [!NOTE]
> The current implementation requires installation of `Auto-round`.

```bash
pip install -r requirements.txt
```

### Quantize `facebook/opt-125m` with Auto-round
```bash
python autoround_demo.py
```
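
For readers skimming the README diff, here is a condensed sketch of the programmatic flow, distilled from `autoround_demo.py` shown in full below. Treat it as a sketch rather than the canonical API: the names and call signatures are taken from the demo as-is, and the dataloader setup mirrors the demo's `ar_utils.get_dataloader` call.

```python
# Condensed sketch of the flow in autoround_demo.py (this PR); the demo
# itself is the authoritative version.
import torchao
import torchao.prototype.autoround.utils as ar_utils
from torchao.prototype.autoround.core import (
    auto_round_config,
    MultiTensor,
    post_process_model_after_applying_auto_round_,
    prepare_model_for_applying_auto_round_,
)

model, tokenizer, decoder_cls = ar_utils.get_float_model_info("facebook/opt-125m")

# 1. Tag the decoder blocks that Auto-Round should optimize.
is_decoder = lambda mod, fqn: isinstance(mod, decoder_cls)
prepare_model_for_applying_auto_round_(model, is_decoder)

# 2. A single forward pass over MultiTensor-wrapped calibration batches
#    performs the calibration and weight optimization.
dataloader = ar_utils.get_dataloader(
    tokenizer,
    auto_round_config.seqlen,
    seed=auto_round_config.seed,
    bs=auto_round_config.train_bs,
    nsamples=auto_round_config.nsamples,
)
batches = list(dataloader)
input_ids = MultiTensor([d["input_ids"] for d in batches])
attn_mask = MultiTensor([d["attention_mask"] for d in batches])
model(input_ids, attn_mask)

# 3. Swap the optimized weights for quantized AffineQuantizedTensor weights.
post_process_model_after_applying_auto_round_(model)
assert ar_utils.has_tensor_of_type(model, torchao.dtypes.AffineQuantizedTensor)
```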
95 changes: 95 additions & 0 deletions torchao/prototype/autoround/autoround_demo.py
@@ -0,0 +1,95 @@
import argparse

import torchao

from torchao.prototype.autoround.core import (
    auto_round_config,
    MultiTensor,
    post_process_model_after_applying_auto_round_,
    prepare_model_for_applying_auto_round_,
)


def main(args):
    # 0. Get the model, tokenizer, and decoder_cls
    import torchao.prototype.autoround.utils as ar_utils

    model_name_or_path = args.model_name_or_path
    model, tokenizer, decoder_cls = ar_utils.get_float_model_info(model_name_or_path)
    # Workaround: disable the `kv_cache`, as it causes OOM.
    model.config.use_cache = False
    ar_utils.gen_text(model, tokenizer, "Float model", device="cuda", max_length=50)

    auto_round_config.iters = args.iters
    auto_round_config.nsamples = args.nsamples
    auto_round_config.seqlen = args.seqlen

    # 1. Prepare the model for applying auto-round
    # The user should provide an `is_decoder` function to identify the decoder blocks.
    # It can be extended to other modules, such as `lm_head`, e.g.:
    # is_target_module = lambda mod, fqn: isinstance(mod, decoder_cls) or "lm_head" in fqn
    if args.quant_lm_head:
        is_decoder = lambda mod, fqn: isinstance(mod, decoder_cls) or "lm_head" in fqn
    else:
        is_decoder = lambda mod, fqn: isinstance(mod, decoder_cls)
    prepare_model_for_applying_auto_round_(model, is_decoder)

Contributor Author:
@jerryzh168 here is the new flow using `MultiTensor`; it's quite similar to the prepare-and-convert flow. Does that sound okay to you?

Contributor:
Yeah, using `MultiTensor` sounds fine, and the flow makes sense; see my suggestion on the implementation below.
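
As background for this exchange, here is an editorial sketch of the `MultiTensor` idea: a tensor subclass that wraps a list of calibration tensors and fans each intercepted torch op out across them. This is an illustration only, not the PR's implementation; the class name, the use of `_make_wrapper_subclass`, and the handling of non-tensor results are all assumptions, and nested container arguments are not handled.

```python
import torch

# Illustrative sketch only (not the PR's implementation): a wrapper subclass
# that holds a list of tensors and replays every torch op once per element.
class MultiTensorSketch(torch.Tensor):
    @staticmethod
    def __new__(cls, values):
        first = values[0]
        # Wrapper subclass: metadata is borrowed from the first wrapped tensor.
        return torch.Tensor._make_wrapper_subclass(
            cls, first.shape, dtype=first.dtype, device=first.device
        )

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        flat = [*args, *kwargs.values()]
        size = max(len(a.values) for a in flat if isinstance(a, MultiTensorSketch))

        def pick(a, i):  # broadcast plain tensors/scalars to every call
            return a.values[i] if isinstance(a, MultiTensorSketch) else a

        outs = [
            func(*[pick(a, i) for a in args],
                 **{k: pick(v, i) for k, v in kwargs.items()})
            for i in range(size)
        ]
        # Assume non-tensor results (e.g. sizes) agree across elements.
        if not isinstance(outs[0], torch.Tensor):
            return outs[0]
        return cls(outs)


# Usage: one call runs the matmul once per wrapped calibration sample.
x = MultiTensorSketch([torch.randn(2, 3) for _ in range(4)])
w = torch.randn(3, 5)
y = x @ w
assert len(y.values) == 4
```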


    # 2. Calibration and optimization
    dataloader = ar_utils.get_dataloader(
        tokenizer,
        auto_round_config.seqlen,
        seed=auto_round_config.seed,
        bs=auto_round_config.train_bs,
        nsamples=auto_round_config.nsamples,
    )

    input_ids_lst = []
    attn_mask_lst = []
    for data in dataloader:
        input_ids_lst.append(data["input_ids"])
        attn_mask_lst.append(data["attention_mask"])

    mul_t_input_ids = MultiTensor(input_ids_lst)
    mul_t_attn_mask = MultiTensor(attn_mask_lst)

    # The optimization is applied during the forward pass
    out = model(mul_t_input_ids, mul_t_attn_mask)

    # 3. Post-process the model after applying auto-round
    post_process_model_after_applying_auto_round_(model)
    assert ar_utils.has_tensor_of_type(model, torchao.dtypes.AffineQuantizedTensor)

    # 4 (Optional). Generate text using the optimized model
    ar_utils.gen_text(model, tokenizer, "Quantized model", device="cuda", max_length=50)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument(
        "-m",
        "--model_name_or_path",
        type=str,
        default="facebook/opt-125m",
        help="Model name or path",
    )
    parser.add_argument("--seed", default=0, type=int, help="Random seed for torch")
    parser.add_argument(
        "--iters", default=20, type=int, help="Number of iterations for optimization"
    )
    parser.add_argument(
        "--nsamples", default=128, type=int, help="Number of samples for optimization"
    )
    parser.add_argument(
        "--seqlen", default=2048, type=int, help="Sequence length for optimization"
    )
    parser.add_argument(
        "--quant_lm_head",
        default=False,
        action="store_true",
        help="Whether to quantize the `lm_head`",
    )
    args = parser.parse_args()
    main(args)
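
For reference, an example invocation using the flags defined above (the argument values here are illustrative, not recommendations):

```bash
python autoround_demo.py -m facebook/opt-125m --iters 20 --nsamples 128 --seqlen 2048 --quant_lm_head
```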