Use `torch.inference_mode()` for lower memory usage during calibration #20

mgoin · 2024-06-17T16:31:09Z

On an H100 80GB, calibrating Llama 3 8B with a ~8192-token input caused OOM errors. After wrapping the calibration loop in `with torch.inference_mode():`, I see a peak usage of only ~15GB.

Snippet used for testing:

from transformers import AutoTokenizer

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
seq_len = 8192
examples = ["hello " * seq_len]
examples = tokenizer(examples, return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=["re:.*lm_head"],
)

model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)
model.quantize(examples)
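
For readers unfamiliar with the pattern, here is a minimal sketch of what wrapping a calibration loop in `torch.inference_mode()` looks like. It is not the actual diff from this PR: the `calibrate` helper, the toy model, and the random batches are hypothetical stand-ins so the example runs on CPU. The point is only that forward passes inside the context manager are not tracked by autograd, so no graph is recorded for a backward pass.

```python
import torch
from torch import nn


def calibrate(model: nn.Module, batches):
    """Hypothetical calibration loop: forward passes only, no backward."""
    model.eval()
    outputs = []
    # Inside inference_mode, autograd does not record operations,
    # so nothing is saved for a backward pass that will never run.
    with torch.inference_mode():
        for batch in batches:
            outputs.append(model(batch))
    return outputs


# Tiny stand-in model and data (assumptions, not the Llama 3 setup above).
toy_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
toy_batches = [torch.randn(4, 16) for _ in range(3)]

results = calibrate(toy_model, toy_batches)
print(results[0].requires_grad)   # outputs are detached from autograd
print(results[0].is_inference())  # tensors created under inference mode
```

On a GPU, peak usage could be compared with and without the context manager via `torch.cuda.max_memory_allocated()`; the ~15GB figure above comes from the PR author's own run, not from this sketch.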


Use torch.inference_mode() for lower memory usage during calibration

544ca2d

mgoin linked an issue Jun 17, 2024 that may be closed by this pull request

Memory requirements for long sequences #19

Closed

mgoin mentioned this pull request Jun 17, 2024

Memory requirements for long sequences #19

Closed

mgoin merged commit b1c6ad6 into main Jun 17, 2024
4 checks passed

mgoin added a commit that referenced this pull request Jul 18, 2024

Use torch.inference_mode() for lower memory usage during calibration (#20)

57c31bb

mgoin added a commit that referenced this pull request Jul 18, 2024

Use torch.inference_mode() for lower memory usage during calibration (#20)

e6c2225
