Add int4 weight-only QAT flow targeting tinygemm kernel #1570

andrewor14 · 2024-09-12T22:47:08Z

Summary: This commit adds an int4 weight-only QAT flow targeting the efficient tinygemm kernel. During fine-tuning we only simulate the kernel's numerics in bf16; the kernel itself is called only after the model has been quantized. For more detail, see pytorch/ao#383.
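
To illustrate what "simulating the numerics" means, here is a minimal pure-Python sketch of groupwise 4-bit fake quantization: each group of weights is rounded to one of 16 levels and immediately mapped back to floating point, so training sees the rounding error without ever calling an int4 kernel. The function name is hypothetical, and the unsigned 0..15 range with a per-group scale and minimum is an assumption for illustration; the actual flow in torchao/torchtune operates on bf16 tensors and may use a different zero-point and rounding convention.

```python
def fake_quantize_int4_groupwise(weights, groupsize):
    """Quantize-then-dequantize each group of `groupsize` values to 16 levels.

    Hypothetical helper for illustration only; not the torchao implementation.
    """
    if len(weights) % groupsize != 0:
        raise ValueError("len(weights) must be divisible by groupsize")
    qmin, qmax = 0, 15  # assumed unsigned 4-bit range
    out = []
    for start in range(0, len(weights), groupsize):
        group = weights[start:start + groupsize]
        lo, hi = min(group), max(group)
        # One scale per group; fall back to 1.0 when the group is constant.
        scale = (hi - lo) / (qmax - qmin) or 1.0
        for w in group:
            q = round((w - lo) / scale)      # map to an integer level
            q = max(qmin, min(qmax, q))      # clamp to the 4-bit range
            out.append(lo + q * scale)       # dequantize back to float
    return out


row = [0.10, -0.32, 0.57, 0.04, -0.91, 0.33, 0.78, -0.15]
approx = fake_quantize_int4_groupwise(row, groupsize=4)
max_err = max(abs(a - b) for a, b in zip(row, approx))
print(approx, max_err)
```

In this sketch the per-weight error is bounded by half of the group's scale, which is why a smaller group size (such as the `groupsize=128` used in the commands below, versus one scale per whole row) generally tracks the original weights more closely.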

Test Plan:

Fine-tune QAT command:

tune run --nnodes 1 --nproc_per_node 6 --rdzv_endpoint="localhost:8900" qat_distributed --config llama3/8B_qat_full \
    batch_size=8 \
    fake_quant_after_n_steps=1000 \
    checkpointer.output_dir="/tmp/qat_results" \
    quantizer._component_=torchtune.training.quantization.Int4WeightOnlyQATQuantizer \
    quantizer.groupsize=128

Quantize command:

tune run quantize --config recipes/configs/quantization.yaml \
    model._component_=torchtune.models.llama3.llama3_8b \
    quantizer._component_=torchtune.training.quantization.Int4WeightOnlyQuantizer \
    quantizer.groupsize=128 \
    checkpointer._component_=torchtune.training.FullModelMetaCheckpointer \
    checkpointer.checkpoint_dir="/tmp/qat_results" \
    checkpointer.output_dir="/tmp/qat_results" \
    checkpointer.checkpoint_files=[meta_model_2.pt] \
    checkpointer.model_type=LLAMA3

Eval command:

tune run eleuther_eval --config eleuther_evaluation \
    tasks="[hellaswag, wikitext]" \
    model._component_=torchtune.models.llama3.llama3_8b \
    quantizer._component_=torchtune.training.quantization.Int4WeightOnlyQuantizer \
    quantizer.groupsize=128 \
    checkpointer._component_=torchtune.training.FullModelTorchTuneCheckpointer \
    checkpointer.checkpoint_dir="/tmp/qat_results" \
    checkpointer.output_dir="/tmp/qat_results" \
    checkpointer.checkpoint_files=[meta_model_2-4w.pt] \
    checkpointer.model_type=LLAMA3 \
    tokenizer._component_=torchtune.models.llama3.llama3_tokenizer \
    tokenizer.path=/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model

Evaluation results:

# Full fine-tune (quantized)
|    Tasks     |Version|Filter|n-shot|Metric|Value |   |Stderr|
|--------------|------:|------|-----:|------|-----:|---|-----:|
|truthfulqa_mc2|      2|none  |     0|acc   |0.4806|±  |0.0167|

# QAT subclass
|    Tasks     |Version|Filter|n-shot|Metric|Value |   |Stderr|
|--------------|------:|------|-----:|------|-----:|---|-----:|
|truthfulqa_mc2|      2|none  |     0|acc   |0.4914|±  |0.0164|

# QAT module swap
|    Tasks     |Version|Filter|n-shot|Metric|Value |   |Stderr|
|--------------|------:|------|-----:|------|-----:|---|-----:|
|truthfulqa_mc2|      2|none  |     0|acc   |0.4872|±  |0.0167|
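
As a rough way to read the three tables together, the sketch below computes each QAT variant's accuracy difference against the quantized full fine-tune baseline. The numbers are copied from the tables above; combining the standard errors in quadrature assumes the runs are independent, which is a simplification since all three are evaluated on the same benchmark.

```python
import math

# (accuracy, stderr) on truthfulqa_mc2, copied from the tables above
results = {
    "Full fine-tune (quantized)": (0.4806, 0.0167),
    "QAT subclass": (0.4914, 0.0164),
    "QAT module swap": (0.4872, 0.0167),
}

baseline = "Full fine-tune (quantized)"
base_acc, base_se = results[baseline]

deltas = {}
for name, (acc, se) in results.items():
    if name == baseline:
        continue
    delta = acc - base_acc
    # Assumption: independent runs, so standard errors add in quadrature.
    combined_se = math.sqrt(se ** 2 + base_se ** 2)
    deltas[name] = (delta, combined_se)
    print(f"{name}: {delta:+.4f} (combined stderr ~{combined_se:.4f})")
```

Both QAT variants score higher than the baseline (+0.0108 and +0.0066), but each difference is smaller than the per-run standard error reported in the tables, so these results are best read as directional.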

pytorch-bot · 2024-09-12T22:47:12Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1570

✅ No Failures

As of commit c0c4252 with merge base 6b43a1c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

felipemello1 · 2024-09-24T00:43:46Z

@andrewor14 @ebsmothers can we merge?

andrewor14 · 2024-09-24T14:32:36Z

It's good to go from my side

facebook-github-bot added the CLA Signed label Sep 12, 2024

andrewor14 force-pushed the int4-qat branch from 29871bb to c0c4252 September 13, 2024 15:09

ebsmothers approved these changes Sep 23, 2024

ebsmothers merged commit a899da2 into pytorch:main Sep 26, 2024
17 checks passed
