adding default inductor config settings #423

Merged: 13 commits, Jun 25, 2024

@HDCharles (Contributor) commented Jun 22, 2024

Summary:

Make the autoquant and quantize APIs (as well as eval and generate) call a new
recommended_inductor_config_setter util that applies the recommended inductor config settings.

Also rename groupsize -> group_size in generate.py
and handle errors in autoquant (to pass CI).
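
To illustrate the idea of a single setter that applies a bundle of recommended settings, here is a minimal sketch. It is not the code from this PR: the config object is a plain stand-in so the example runs without PyTorch, the starting values are placeholders, and the chosen values are an assumption modeled on the fastest combination in the test plan below. The flag names come from the test plan labels; `RECOMMENDED` and the returned dictionary of previous values are hypothetical.

```python
from types import SimpleNamespace

# Stand-in for a global compiler config. In torchao the real target would be
# torch._inductor.config; a plain namespace keeps this sketch runnable
# without PyTorch. The starting values here are placeholders.
inductor_config = SimpleNamespace(
    coordinate_descent_tuning=False,
    coordinate_descent_check_all_directions=False,
    use_mixed_mm=False,
    mixed_mm_choice="default",
)
# Stand-in for the float32 matmul precision setting.
matmul_precision = {"value": "highest"}

# Hypothetical recommended values (an assumption, not taken from the PR diff).
RECOMMENDED = {
    "coordinate_descent_tuning": True,
    "coordinate_descent_check_all_directions": True,
    "use_mixed_mm": True,
    "mixed_mm_choice": "heuristic",
}


def recommended_inductor_config_setter(config=inductor_config,
                                       precision=matmul_precision):
    """Apply all recommended settings at once and return the old values."""
    previous = {name: getattr(config, name) for name in RECOMMENDED}
    previous["matmul_precision"] = precision["value"]
    for name, value in RECOMMENDED.items():
        setattr(config, name, value)
    precision["value"] = "high"
    return previous
```

Returning the previous values is a design choice of this sketch: it lets a caller (or a test) restore the original state afterwards.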

Test Plan:

sh benchmarks.sh

comparison of different config combinations for matmul precision,
mixed_mm and coordinate_descent

high precision

tok/s= 9.14, mem/s= 60.55 GB/s, peak_mem= 8.33 GB, model_size= 6.62 GB quant: int8dq, mod: Llama-2-7b-chat-hf,
tok/s=147.02, mem/s= 973.53 GB/s, peak_mem= 8.95 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,

medium precision

tok/s= 9.23, mem/s= 61.11 GB/s, peak_mem= 8.33 GB, model_size= 6.62 GB quant: int8dq, mod: Llama-2-7b-chat-hf,
tok/s=139.59, mem/s= 924.33 GB/s, peak_mem= 8.95 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,

high + mixed_mm_choice heuristic

tok/s= 9.10, mem/s= 60.26 GB/s, peak_mem= 8.33 GB, model_size= 6.62 GB quant: int8dq, mod: Llama-2-7b-chat-hf,
tok/s=146.98, mem/s= 973.23 GB/s, peak_mem= 8.95 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,

high + false use_mixed_mm

tok/s= 9.28, mem/s= 61.48 GB/s, peak_mem= 8.33 GB, model_size= 6.62 GB quant: int8dq, mod: Llama-2-7b-chat-hf,
tok/s=146.90, mem/s= 972.73 GB/s, peak_mem= 8.95 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,

high + default mixed_mm_choice

tok/s= 9.08, mem/s= 60.09 GB/s, peak_mem= 8.33 GB, model_size= 6.62 GB quant: int8dq, mod: Llama-2-7b-chat-hf,
tok/s=137.58, mem/s= 911.00 GB/s, peak_mem= 8.95 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,

high + heuristic + coordinate_descent_check_all_directions

tok/s= 9.19, mem/s= 60.87 GB/s, peak_mem= 8.61 GB, model_size= 6.62 GB quant: int8dq, mod: Llama-2-7b-chat-hf,
tok/s=166.02, mem/s=1099.30 GB/s, peak_mem= 8.97 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,

high + false use_mixed_mm + coordinate_descent_check_all_directions

tok/s= 9.28, mem/s= 61.46 GB/s, peak_mem= 8.33 GB, model_size= 6.62 GB quant: int8dq, mod: Llama-2-7b-chat-hf,
tok/s=161.66, mem/s=1070.43 GB/s, peak_mem= 8.95 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,
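
To make the comparison easier to read, the int8wo throughput of each combination can be expressed relative to the "high precision" run. The sketch below parses the int8wo lines copied from the results above; the helper names (`tokens_per_second`, `speedups`) are made up for this example and are not part of benchmarks.sh.

```python
import re

# int8wo result lines copied from the test plan, keyed by their config label.
RESULTS = {
    "high precision":
        "tok/s=147.02, mem/s= 973.53 GB/s, peak_mem= 8.95 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,",
    "medium precision":
        "tok/s=139.59, mem/s= 924.33 GB/s, peak_mem= 8.95 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,",
    "high + mixed_mm_choice heuristic":
        "tok/s=146.98, mem/s= 973.23 GB/s, peak_mem= 8.95 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,",
    "high + false use_mixed_mm":
        "tok/s=146.90, mem/s= 972.73 GB/s, peak_mem= 8.95 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,",
    "high + default mixed_mm_choice":
        "tok/s=137.58, mem/s= 911.00 GB/s, peak_mem= 8.95 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,",
    "high + heuristic + coordinate_descent_check_all_directions":
        "tok/s=166.02, mem/s=1099.30 GB/s, peak_mem= 8.97 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,",
    "high + false use_mixed_mm + coordinate_descent_check_all_directions":
        "tok/s=161.66, mem/s=1070.43 GB/s, peak_mem= 8.95 GB, model_size= 6.62 GB quant: int8wo, mod: Llama-2-7b-chat-hf,",
}

TOK_PER_S = re.compile(r"tok/s=\s*([\d.]+)")


def tokens_per_second(line):
    """Extract the tok/s value from one benchmark output line."""
    match = TOK_PER_S.search(line)
    if match is None:
        raise ValueError(f"no tok/s field in line: {line!r}")
    return float(match.group(1))


def speedups(results, baseline="high precision"):
    """Return each config's tok/s divided by the baseline config's tok/s."""
    base = tokens_per_second(results[baseline])
    return {label: tokens_per_second(line) / base
            for label, line in results.items()}


if __name__ == "__main__":
    # Print the configs from fastest to slowest relative to the baseline.
    ranked = sorted(speedups(RESULTS).items(), key=lambda kv: -kv[1])
    for label, ratio in ranked:
        print(f"{ratio:6.3f}x  {label}")
```

On these numbers, the two runs with coordinate_descent_check_all_directions are the only ones that beat the "high precision" baseline for int8wo; the int8dq results stay within roughly 9.1 to 9.3 tok/s across all combinations.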


pytorch-bot bot commented Jun 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/423

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d105072 with merge base 96d49cd (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@HDCharles HDCharles requested a review from msaroufim June 22, 2024 18:30
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 22, 2024
@HDCharles HDCharles requested a review from jerryzh168 June 22, 2024 18:30
@msaroufim (Member) left a comment

Nice! Mind also updating any relevant documentation pages? Also if some of those flags are GPU specific just gate those behind a cuda flag

torchao/quantization/utils.py (2 review threads, resolved)

@msaroufim (Member) left a comment

Cool LGTM

@HDCharles HDCharles force-pushed the 066_set_inductor_config branch 2 times, most recently from 5ab0b95 to bfe2ea2 Compare June 25, 2024 03:13
@HDCharles (Contributor, Author) commented:

> Nice! Mind also updating any relevant documentation pages? Also if some of those flags are GPU specific just gate those behind a cuda flag

I don't think anything is GPU-specific.

@HDCharles HDCharles force-pushed the 066_set_inductor_config branch from bfe2ea2 to 10a5c4a Compare June 25, 2024 03:17
@HDCharles HDCharles force-pushed the 066_set_inductor_config branch from 10a5c4a to 0e5fc3e Compare June 25, 2024 05:08
HDCharles added 12 commits June 24, 2024 22:51
@@ -689,6 +696,7 @@ def test_int8_dynamic_quant_subclass(self, device, dtype):

    @parameterized.expand(COMMON_DEVICE_DTYPE)
    def test_int8_weight_only_quant_subclass(self, device, dtype):
        undo_recommended_configs()

why do we need these?
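
The thread does not record an answer to this question. One plausible reading, consistent with commit titles such as "reset inductor config for tests", is that the recommended settings are process-wide state, so a test that triggers the setter can change the behavior of tests that run after it. The sketch below shows that leak and the reset pattern using a stand-in config; `CONFIG`, `apply_recommended_configs`, and the body of `undo_recommended_configs` are hypothetical and are not the helpers from this PR.

```python
import unittest
from types import SimpleNamespace

# Stand-in for process-wide compiler settings (hypothetical; this is not the
# real torch._inductor.config object).
CONFIG = SimpleNamespace(coordinate_descent_tuning=False)
# Snapshot of the defaults, taken once at import time.
_DEFAULTS = dict(vars(CONFIG))


def apply_recommended_configs():
    """Hypothetical analogue of the recommended setter: mutates global state."""
    CONFIG.coordinate_descent_tuning = True


def undo_recommended_configs():
    """Restore every field to the value captured in the import-time snapshot."""
    for name, value in _DEFAULTS.items():
        setattr(CONFIG, name, value)


class ConfigLeakTest(unittest.TestCase):
    def test_a_quantize_sets_global_state(self):
        apply_recommended_configs()
        self.assertTrue(CONFIG.coordinate_descent_tuning)

    def test_b_expects_defaults(self):
        # Without this reset, the flag set in test_a would still be True,
        # because both tests share the same module-level CONFIG.
        undo_recommended_configs()
        self.assertFalse(CONFIG.coordinate_descent_tuning)


if __name__ == "__main__":
    unittest.main()
```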


@HDCharles HDCharles merged commit 211b6bc into main Jun 25, 2024
13 checks passed
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
* adding default inductor config settings

* fixing tests
* fix weight only failures
* fixing new broken test
* fixing autoquant test
* testing if inductor config is the issue
* are inductor configs somehow being set?
* when is coordinate descent tuning being enabled?
* reset inductor config for tests
* more test fixes
* adding warning
* handling of errors
* option to suppress autoquant errors
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
Summary:
Link against quantized ops lib

Test Plan:
python torchchat.py download stories15M
export PRMT="Once upon a time in a land far away"
python torchchat.py export stories15M --quant '{"linear:a8w4dq" :
{"groupsize": 32}, "embedding" : {"bitwidth": 8, "groupsize": 0}}'
--output-pte-path ./model.pte
./scripts/install_et.sh
rm -rf build/cmake-out/
cmake -S ./runner-et -B ./runner-et/cmake-out -G Ninja
cmake --build ./runner-et/cmake-out
./runner-et/cmake-out/run ./model.pte -z ./tokenizer.bin -t 0 -i
"${PRMT}"
