Add uintx quant to generate and eval #811

jerryzh168 · 2024-09-05T00:02:01Z

Summary:
As the title says.

Also reran the benchmarks/evals for llama2/llama3 to get the most recent perf/accuracy data.

Test Plan:
torchao/_models/llama/generate.py
torchao/_models/llama/eval.py

llama2:

# torch.uint4, group_size = 64
python generate.py --compile --precision bfloat16 --quantization uintx-4-64
Average tokens/sec: 48.25
Average Bandwidth: 189.32 GB/s
Peak Memory Usage: 6.29 GB
Model Size: 3.92 GB

wikitext: {'word_perplexity,none': 12.890544846479484, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.612969956510788, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6897195668279897, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}

# torch.uint2, group_size = 8
python generate.py --compile --precision bfloat16 --quantization uintx-2-8
Average tokens/sec: 36.11
Average Bandwidth: 238.58 GB/s
Peak Memory Usage: 9.26 GB
Model Size: 6.61 GB

python eval.py --compile --precision bfloat16 --quantization uintx-2-8
wikitext: {'word_perplexity,none': 28.766343716897, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.8742120465648264, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.9062841873734042, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
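
As a quick sanity check on the wikitext output above, the reported `bits_per_byte` values appear to be the base-2 logarithm of `byte_perplexity`. A minimal sketch, using only the llama2 numbers quoted above (the dictionary labels and the helper name `bits_per_byte` are illustrative, not part of eval.py):

```python
import math

# (byte_perplexity, bits_per_byte) pairs copied from the llama2 results above.
# The keys are just labels for this sketch.
reported = {
    "uintx-4-64": (1.612969956510788, 0.6897195668279897),
    "uintx-2-8": (1.8742120465648264, 0.9062841873734042),
}


def bits_per_byte(byte_perplexity: float) -> float:
    """Convert a byte-level perplexity into bits per byte."""
    return math.log2(byte_perplexity)


for name, (byte_ppl, reported_bpb) in reported.items():
    derived = bits_per_byte(byte_ppl)
    print(f"{name}: derived={derived:.6f} reported={reported_bpb:.6f}")
```

If the two columns agree, the two metrics carry the same information, so only one of them needs to be compared across quantization settings.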

llama3:

# torch.uint4, group_size = 64
python generate.py --compile --precision bfloat16 --checkpoint_path=../../../checkpoints/meta-llama/Meta-Llama-3-8B/model.pth --quantization uintx-4-64
Average tokens/sec: 47.77
Average Bandwidth: 212.90 GB/s
Peak Memory Usage: 11.85 GB
Model Size: 4.46 GB

wikitext: {'word_perplexity,none': 8.112931736704462, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.479179221121259, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.5647968636325521, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}


# torch.uint2, group_size = 8
python generate.py --compile --precision bfloat16 --checkpoint_path=../../../checkpoints/meta-llama/Meta-Llama-3-8B/model.pth --quantization uintx-2-8
Average tokens/sec: 33.21
Average Bandwidth: 249.22 GB/s
Peak Memory Usage: 15.04 GB
Model Size: 7.51 GB

wikitext: {'word_perplexity,none': 39.36764348732592, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.98746296691363, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.9909279784106695, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
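
The bandwidth figures above look consistent with model size multiplied by tokens per second, that is, one full read of the weights per generated token. This is an observation from the reported numbers, not a claim about how generate.py computes the value. A minimal sketch that checks it (the helper names are hypothetical):

```python
# (label, tokens/sec, model size in GB, reported bandwidth in GB/s),
# all copied from the generate.py output above.
runs = [
    ("llama2 uintx-4-64", 48.25, 3.92, 189.32),
    ("llama2 uintx-2-8", 36.11, 6.61, 238.58),
    ("llama3 uintx-4-64", 47.77, 4.46, 212.90),
    ("llama3 uintx-2-8", 33.21, 7.51, 249.22),
]


def estimated_bandwidth(tokens_per_sec: float, model_size_gb: float) -> float:
    """Assumes every generated token reads the whole model once."""
    return tokens_per_sec * model_size_gb


def relative_error(estimate: float, reported: float) -> float:
    return abs(estimate - reported) / reported


for label, tok_s, size_gb, reported_bw in runs:
    est = estimated_bandwidth(tok_s, size_gb)
    err = relative_error(est, reported_bw)
    print(f"{label}: estimated={est:.2f} GB/s reported={reported_bw:.2f} GB/s "
          f"rel_err={err:.4%}")
```

The small residual differences are what one would expect from the inputs being rounded to two decimal places.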

Reviewers:

Subscribers:

Tasks:

Tags:

pytorch-bot · 2024-09-05T00:02:04Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/811

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d5ebc0e with merge base 317392d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

HDCharles · 2024-09-05T02:04:22Z

I would put the generate/eval results in a table somewhere. If you want to add them to the standard benchmarks, you can add them to benchmarks.sh.

Also, I would rebase on mine or you will have merge issues.
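
One possible way to act on the table suggestion is sketched below: it formats the numbers from the PR description as a Markdown table. The data is copied from the description (word perplexity rounded to two decimals); the helper `to_markdown_table` and the column names are hypothetical and not part of torchao.

```python
HEADERS = [
    "model", "quantization", "tokens/sec", "bandwidth (GB/s)",
    "peak memory (GB)", "model size (GB)", "wikitext word perplexity",
]

# Values copied from the PR description; perplexity rounded to 2 decimals.
rows = [
    ("llama2", "uintx-4-64", 48.25, 189.32, 6.29, 3.92, 12.89),
    ("llama2", "uintx-2-8", 36.11, 238.58, 9.26, 6.61, 28.77),
    ("llama3", "uintx-4-64", 47.77, 212.90, 11.85, 4.46, 8.11),
    ("llama3", "uintx-2-8", 33.21, 249.22, 15.04, 7.51, 39.37),
]


def to_markdown_table(headers, data_rows):
    """Render a header list and row tuples as a Markdown table string."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in data_rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


print(to_markdown_table(HEADERS, rows))
```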

HDCharles · 2024-09-05T02:05:05Z

if eval is broken for you, can you send me the error?

torchao/_models/llama/generate.py

jerryzh168 · 2024-09-05T03:33:59Z

> if eval is broken for you, can you send me the error?

Seems to be fine. int8wo and bfloat16 are just very close; I thought they were exactly the same before, but there is actually a slight difference.

jerryzh168 · 2024-09-05T05:47:09Z

Right now these are slow; I think we can add them to benchmarks.sh later, once the perf is better.

facebook-github-bot added the CLA Signed label Sep 5, 2024

jerryzh168 requested review from HDCharles and msaroufim September 5, 2024 00:02

HDCharles approved these changes Sep 5, 2024

HDCharles reviewed Sep 5, 2024

torchao/_models/llama/generate.py (resolved)

Add uintx quant to generate and eval

d5ebc0e

jerryzh168 force-pushed the benchmarks branch from 5a4a915 to d5ebc0e September 5, 2024 05:46

jerryzh168 merged commit e05635e into pytorch:main Sep 5, 2024
17 checks passed

jerryzh168 deleted the benchmarks branch September 5, 2024 16:46

yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024

clean up unused files (pytorch#811)

9e52152
