
Change --auto-devices and model offload behavior #24

Closed
wants to merge 6 commits

Conversation


ghost commented Jan 24, 2023

Fixes #23

This makes device_map='auto' optional by ensuring it is only passed to AutoModelForCausalLM.from_pretrained() in the following cases (a sketch of this logic follows the traceback below):

  • As before, if the model's name starts with gpt-neo, opt- or galactica and has 13B/20B/30B parameters, or
  • The --auto-devices argument is passed, or
  • The --load-in-8bit argument is passed (launching with load_in_8bit=True but without a device map would fail, see the traceback below)
Traceback (most recent call last):
  File "server.py", line 301, in <module>
    model, tokenizer = load_model(model_name)
  File "server.py", line 169, in load_model
    model = eval(command)
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 463, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 1989, in from_pretrained
    raise ValueError(
ValueError: A device map needs to be passed to run convert models into mixed-int8 format. Please run`.from_pretrained` with `device_map='auto'`
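
For illustration, the selection logic described in the list above might look roughly like the following sketch; the names args, model_name, and settings are assumed to mirror those in server.py, and this is not the literal PR diff:

# Hedged sketch: include device_map='auto' only in the cases listed above.
name = model_name.lower()
big_offload_model = name.startswith(('gpt-neo', 'opt-', 'galactica')) and \
                    any(size in name for size in ('13b', '20b', '30b'))

if big_offload_model or args.auto_devices or args.load_in_8bit:
    settings.append("device_map='auto'")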

It also ensures that, unless --cpu or --auto-devices is passed, the model will always be moved to the GPU first (a sketch follows the traceback below). This is to avoid scenarios such as:

/home/81300/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py:1470: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cuda, whereas the model is on cpu. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cpu') before running `.generate()`.
  warnings.warn(
Traceback (most recent call last):
(...)
  File "/home/81300/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)
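
A minimal sketch of that second change, again assuming the same hypothetical args object, so that device mismatches like the one above surface immediately:

# Hedged sketch: unless CPU mode, auto-devices, or 8-bit loading is requested,
# move the loaded model to the GPU explicitly before generation.
if not any((args.cpu, args.auto_devices, args.load_in_8bit)):
    model = model.cuda()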

May also impact #14

oobabooga commented Jan 25, 2023

@81300, thanks for the PR. Can you help me understand it? This is what I was thinking with the original implementation:

            settings.append("device_map='auto'")
            if args.gpu_memory is not None:
                if args.cpu_memory is not None:
                    settings.append(f"max_memory={{0: '{args.gpu_memory}GiB', 'cpu': '{args.cpu_memory}GiB'}}")
                else:
                    settings.append(f"max_memory={{0: '{args.gpu_memory}GiB', 'cpu': '99GiB'}}")
            if args.disk:
                if args.disk_cache_dir is not None:
                    settings.append(f"offload_folder='{args.disk_cache_dir}'")
                else:
                    settings.append("offload_folder='cache'")
            if args.load_in_8bit:
                settings.append("load_in_8bit=True")
            else:
                settings.append("torch_dtype=torch.float16")

At this point, the user doesn't want to load the entire model into the GPU or the CPU. He either wants to use the model in 8-bit mode (in which case things remain the same in your PR), or he wants to offload layers to the CPU and maybe the disk because the model doesn't fit entirely into his GPU. In the latter case, it seemed to me that device_map='auto' was always necessary.

What is gained by passing a max_memory dict and/or an offload_folder without device_map=auto?

About offload_state_dict=True, it defaults to true when there is a disk offload, so it seems unnecessary to include it.

Also, what is gained by adding .cuda() here? I thought that .cuda() was only relevant when the model was sent entirely to the GPU, and that it would generate an error if added when the model is loaded in 8-bit mode.

ghost commented Jan 25, 2023

At this point, the user doesn't want to load the entire model into the GPU or the CPU.

From what I understand, right now passing --auto-devices doesn't do anything explicit except skip this part of the logic. I thought it would be clearer if we made it explicit.

Also, what is gained by adding .cuda() here? I thought that .cuda() was only relevant when the model was sent entirely to the GPU, and that it would generate an error if added when the model is loaded in 8-bit mode.

Yeah. Your GPU might have enough VRAM to load the model but not enough to generate responses. You may want to experiment with --gpu-memory while still trying to load the model onto the GPU first (as happens with the use of .cuda() here), but if you pass --gpu-memory the current logic presumes you also want --auto-devices (device_map='auto') and skips the .cuda() call here. In that case, if your GPU doesn't actually have enough VRAM to hold the model in the first place, the problem is hidden: the auto device map takes care of it and stores some weights on the CPU before you even begin generating with the UI.
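
To make the contrast concrete, the two behaviours being described are roughly the following (a hedged sketch with placeholder names, not code from the PR):

import torch
from transformers import AutoModelForCausalLM

path, gpu_memory = "models/some-model", 8     # placeholder path and limit

# Current behaviour when --gpu-memory is passed: the auto device map silently
# parks the overflow weights on the CPU, so a too-small GPU goes unnoticed
# until generation becomes slow or fails.
model = AutoModelForCausalLM.from_pretrained(
    path, device_map="auto", max_memory={0: f"{gpu_memory}GiB", "cpu": "99GiB"}
)

# Explicit move to the GPU instead: a model that doesn't fit raises a CUDA
# out-of-memory error immediately rather than later inside .generate().
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16).cuda()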

About offload_state_dict=True, it defaults to true when there is a disk offload, so it seems unnecessary to include it.

It's not really necessary, you're right. It was just added to ensure that if you pass --disk then disk offload happens anyway, irrespective of what device_map='auto' might decide.

oobabooga commented Jan 25, 2023

I think that this boils down to a matter of semantics. With gpu-memory and cpu-memory, the goal is to limit the amount of memory allocated by auto-devices, allowing the user to get the model working by following the low VRAM guide in a simple way.

If you set the max_memory dict with a maximum GPU memory but no auto devices, there are two options:

  1. The limit is greater than the model's size, and the model loads onto the GPU as if no command-line flag had been used.
  2. The limit is smaller than the model's size, and the program will segfault because it wasn't instructed by auto-devices to offload weights to the CPU.

I think that features should be constructed with common use cases in mind rather than supporting every possible combination of parameters in Hugging Face's pipelines. Right now, changing the behavior of auto-devices does not seem to support a new use case.

oobabooga closed this Jan 25, 2023
ghost commented Jan 25, 2023

As an alternative do you think it would be worthwhile to have an option for specifying a custom device_map from a config file?

oobabooga commented:

I think not, as this would also not support a new use case.

In the current implementation, the only use case that I can think of that is not supported is running in CPU mode with layers offloaded to the disk. But I am not sure if anyone is interested in that (it would be painfully slow).

If you can think of other relevant use cases that are not currently possible, please let me know.

ghost commented Jan 26, 2023

As was said here too, sometimes the layers do fit in GPU VRAM initially, but generation fails either immediately or after a few outputs. So the use case would be allowing the user to supply a tweaked map, produced in the following manner:

from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import infer_auto_device_map, init_empty_weights
import pprint

config = AutoConfig.from_pretrained("models/blenderbot-3B")
with init_empty_weights():
    # Build the model skeleton without allocating any real weights.
    model = AutoModelForCausalLM.from_config(config)

# Let accelerate propose a placement given the available GPU/CPU memory.
device_map = infer_auto_device_map(model)
pprint.pprint(device_map)
$ python infer-device-map.py
{'lm_head': 'cpu',
 'model.decoder.embed_positions': 0,
 'model.decoder.embed_tokens': 0,
 'model.decoder.layer_norm': 'cpu',
 'model.decoder.layers.0': 0,
 'model.decoder.layers.1': 0,
 'model.decoder.layers.10': 0,
 'model.decoder.layers.11': 0,
 'model.decoder.layers.12': 0,
 'model.decoder.layers.13': 0,
 'model.decoder.layers.14': 0,
 'model.decoder.layers.15': 0,
 'model.decoder.layers.16.activation_fn': 0,
 'model.decoder.layers.16.encoder_attn': 0,
 'model.decoder.layers.16.encoder_attn_layer_norm': 0,
 'model.decoder.layers.16.fc1': 'cpu',
 'model.decoder.layers.16.fc2': 'cpu',
 'model.decoder.layers.16.final_layer_norm': 'cpu',
 'model.decoder.layers.16.self_attn': 0,
 'model.decoder.layers.16.self_attn_layer_norm': 0,
 'model.decoder.layers.17': 'cpu',
 'model.decoder.layers.18': 'cpu',
 'model.decoder.layers.19': 'cpu',
 'model.decoder.layers.2': 0,
 'model.decoder.layers.20': 'cpu',
 'model.decoder.layers.21': 'cpu',
 'model.decoder.layers.22': 'cpu',
 'model.decoder.layers.23': 'cpu',
 'model.decoder.layers.3': 0,
 'model.decoder.layers.4': 0,
 'model.decoder.layers.5': 0,
 'model.decoder.layers.6': 0,
 'model.decoder.layers.7': 0,
 'model.decoder.layers.8': 0,
 'model.decoder.layers.9': 0}

The distribution can then be edited to offload more intelligently, perhaps by taking a couple more layers off the GPU to make room for generation. Or perhaps the user has a fast NVMe drive and wants to prioritize offloading there instead of eating up all available CPU RAM and causing OOM (the latter happens on Google Colab...). A sketch of loading with such an edited map is at the end of this comment.

While PyTorch is very good at managing GPU RAM efficiently (and giving it back when not needed), it's not entirely true with Python and CPU RAM. Therefore, an automatically computed device map might be too intense on the CPU. Move a few modules to the disk device if you get crashes due to lack of RAM.
https://huggingface.co/docs/accelerate/main/en/usage_guides/big_modeling#designing-a-device-map

Seems more granular than simply limiting the max memory or passing --disk (which may not even offload right now if the auto device_map hasn't set some layers to disk).

Of course the average person shouldn't need to bother with this, and it's obviously not a common use case by any stretch. But it's possible that, as model sizes grow, most people running consumer rigs will start having resource allocation issues.
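
To make the suggestion concrete, here is a minimal hedged sketch of loading with a hand-edited map stored in a file; device_map.json and the cache folder are illustrative names, not existing options:

import json
import torch
from transformers import AutoModelForCausalLM

# Load a hand-edited device map (illustrative file name); keys are module
# names as printed above, values are a GPU index (e.g. 0), 'cpu', or 'disk'.
with open("device_map.json") as f:
    device_map = json.load(f)

# Passing an explicit dict instead of 'auto' gives full control over which
# layers stay on the GPU and which are offloaded to CPU RAM or disk.
model = AutoModelForCausalLM.from_pretrained(
    "models/blenderbot-3B",
    device_map=device_map,
    offload_folder="cache",       # required when any entry maps to 'disk'
    torch_dtype=torch.float16,
)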

Touch-Night pushed a commit to Touch-Night/text-generation-webui that referenced this pull request Sep 13, 2024
Successfully merging this pull request may close these issues.

device_map='auto' is inserted by default