Add offload for 8-bit model #1699

Merged (7 commits) on Jul 11, 2023

Conversation

@SunMarc (Member) commented Jul 10, 2023

What does this PR do?

This PR makes offload to cpu/disk possible with 8-bit models, saving even more memory. Previously, we did not quantize the modules placed on cpu/disk, so their weights stayed at full precision. With cpu/disk offload, we offload the quantized weights to cpu/disk and move them back to the gpu when needed using hooks. This should work out of the box with device_map="auto", but we require the user to specify enable_offload=True to make sure they know what they are doing. Furthermore, no modification is needed in the bitsandbytes library.

The input weights (weights_location) can be quantized or not. If the weights are not quantized, we first quantize them before offloading them to cpu/disk. If a module should not be quantized, the user should add it to the skip_modules arg (see the sketch after the example below).

PS: 4-bit model offload will be added once we are able to serialize 4-bit models.
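
To make the hook mechanism concrete, here is a minimal, schematic sketch of the idea (not accelerate's actual implementation; the attach_offload_hooks helper and device names are purely illustrative): a forward pre-hook moves a module's weights to the execution device just before its forward pass, and a forward hook moves them back to cpu right after.

import torch

def attach_offload_hooks(module: torch.nn.Module, execution_device="cuda:0"):
    # Illustrative only: accelerate implements this with its own hook classes.
    def pre_hook(mod, args):
        # Move the (quantized) weights onto the execution device right before forward;
        # inputs are assumed to already live on that device.
        mod.to(execution_device)

    def post_hook(mod, args, output):
        # Offload the weights back to cpu once this module's forward pass is done.
        mod.to("cpu")
        return output

    module.register_forward_pre_hook(pre_hook)
    module.register_forward_hook(post_hook)

The usage example below shows the actual API added in this PR.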

# Imports assumed by this example (transformers, huggingface_hub, accelerate):
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
from huggingface_hub import hf_hub_download
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

input_text = "Hello my name is"
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b7")
encoded_input = tokenizer(input_text, return_tensors="pt")

model_name = "marcsun13/bloom-1b7_with_lm_head"
weights_location = hf_hub_download(model_name, "pytorch_model.bin")

# Instantiate the model on the meta device, then tie the weights before dispatch
with init_empty_weights():
    model_8bit = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(model_name))
model_8bit.tie_weights()

device_map = {'transformer.word_embeddings': 'cpu',
              'transformer.word_embeddings_layernorm': 'cpu',
              'transformer.h.0': 0,
              'transformer.h.1': 0,
              'transformer.h.2': 0,
              'transformer.h.3': 0,
              'transformer.h.4': 0,
              'transformer.h.5': 0,
              'transformer.h.6': 0,
              'transformer.h.7': 0,
              'transformer.h.8': 0,
              'transformer.h.9': 0,
              'transformer.h.10': 0,
              'transformer.h.11': 0,
              'transformer.h.12': 0, 
              'transformer.h.13': 'cpu', 
              'transformer.h.14': 'cpu',
              'transformer.h.15': 'cpu', 
              'transformer.h.16': 'cpu',
              'transformer.h.17': 'cpu',
              'transformer.h.18': 'cpu',
              'transformer.h.19': 'cpu',
              'transformer.h.20': 'cpu', 
              'transformer.h.21': 'cpu', 
              'transformer.h.22': 'cpu',
              'transformer.h.23': 'disk', 
              'transformer.ln_f': 'cpu',
              'lm_head':"cpu"}

bnb_quantization_config = BnbQuantizationConfig(load_in_8bit=True, enable_offload=True)

model_8bit = load_and_quantize_model(model_8bit,
                            bnb_quantization_config,
                            weights_location=weights_location,
                            device_map = device_map,
                            no_split_module_classes=["BloomBlock"],
                            offload_state_dict=True,
                            offload_folder="tmp"
                            )
output_parallel = model_8bit.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
output_text = tokenizer.decode(output_parallel[0], skip_special_tokens=True)
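
As mentioned above, modules that should not be quantized can be listed in the skip_modules arg of BnbQuantizationConfig. A minimal sketch, assuming (purely for illustration) that the lm_head should stay in full precision:

bnb_quantization_config = BnbQuantizationConfig(
    load_in_8bit=True,
    enable_offload=True,
    skip_modules=["lm_head"],  # illustrative choice: keep lm_head un-quantized
)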

@HuggingFaceDocBuilderDev commented Jul 10, 2023

The documentation is not available anymore as the PR was closed or merged.

@younesbelkada (Contributor) left a comment

Looks great on my side!
As a small comment, it might be worth making it clear to users on the relevant documentation page that the computation is still done on the GPU, to avoid any confusion. Also, to be on the safe side, can you run the transformers slow tests of the bnb integration and make sure they pass?

@sgugger (Collaborator) left a comment

Thanks for working on this. Quick question on my side: why do we need the user to set enable_offload=True in their config? They are already indicating their intent to offload weights with the device_map, so this is asking the same thing twice. Is there any downside to removing that flag?

Review comments on src/accelerate/utils/modeling.py (outdated, resolved)
SunMarc and others added 4 commits on July 11, 2023 at 09:09 (co-authored by Sylvain Gugger).
@SunMarc (Member, Author) commented Jul 11, 2023

> Thanks for working on this. Quick question on my side: why do we need the user to set enable_offload=True in their config? They are already indicating their intent to offload weights with the device_map, so this is asking the same thing twice. Is there any downside to removing that flag?

No, there should not be any downside to removing that flag; I just removed it. It was something used in the transformers integration for 8-bit models, so I had kept it initially (a minimal sketch of the resulting call is below).
cc @younesbelkada
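
For reference, a minimal sketch of what the call from the example above would presumably look like once the flag is removed, with the offload intent carried entirely by the device_map (all names reused from the example above):

bnb_quantization_config = BnbQuantizationConfig(load_in_8bit=True)

model_8bit = load_and_quantize_model(model_8bit,
                                     bnb_quantization_config,
                                     weights_location=weights_location,
                                     device_map=device_map,  # 'cpu'/'disk' entries already express the offload intent
                                     no_split_module_classes=["BloomBlock"],
                                     offload_state_dict=True,
                                     offload_folder="tmp")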

Review comment on docs/source/usage_guides/quantization.md (outdated, resolved)
@SunMarc (Member, Author) commented Jul 11, 2023

> Looks great on my side! As a small comment, it might be worth making it clear to users on the relevant documentation page that the computation is still done on the GPU, to avoid any confusion. Also, to be on the safe side, can you run the transformers slow tests of the bnb integration and make sure they pass?

I added a section in the doc for offload, and the transformers slow tests pass (61 in total).

@younesbelkada (Contributor) left a comment

Very cool work! Thanks for confirming that the transformers tests pass.

@SunMarc merged commit 27d2908 into huggingface:main on Jul 11, 2023; 24 checks passed.
@SunMarc deleted the offload_8_bit branch on July 11, 2023 at 17:46.