Change `--auto-devices` and model offload behavior #24
Conversation
@81300, thanks for the PR. Can you help me understand it? This is what I was thinking with the original implementation:
At this point, the user doesn't want to load the entire model into the GPU or the CPU. He either wants to use the model in 8-bit mode (in which case things remain the same in your PR), or he wants to offload layers to the CPU and maybe the disk because the model doesn't fit entirely into his GPU. In the latter case, it seemed to me that
What is gained by passing a
About
Also, what is gained by adding
From what I understand, right now passing
Yeah. Your GPU might have enough VRAM to load the model but not enough to generate responses. You may want to experiment with
It's not really necessary, you're right. It was just added to ensure that if you pass
I think that this boils down to a matter of semantics. With
If you set the
I think that features should be constructed with common use cases in mind rather than supporting every possible combination of parameters in Hugging Face's pipelines. Right now, changing the behavior of
As an alternative, do you think it would be worthwhile to have an option for specifying a custom
I think not, as this would also not support a new use case. In the current implementation, the only use case that I can think of that is not supported is running in CPU mode with layers offloaded to the disk. But I am not sure if anyone is interested in that (it would be painfully slow). If you can think of other relevant use cases that are not currently possible, please let me know.
As was said here too, sometimes layers do fit on GPU VRAM initially but generations fail either immediately or after a few outputs. So the use case would be allowing the user to present a tweaked map in the following manner:

```python
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import infer_auto_device_map, init_empty_weights
import pprint

config = AutoConfig.from_pretrained("models/blenderbot-3B")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
device_map = infer_auto_device_map(model)
pprint.pprint(device_map)
```
The distribution can then be edited to offload more intelligently, perhaps by taking a couple more layers off the GPU to make room for generations. Or perhaps the user has a fast NVMe drive and wants to prioritize offloading there instead of eating up all available CPU RAM and causing OOM (the latter happens on Google Colab...)
Seems more granular than simply limiting the max memory or passing
Of course the average person shouldn't need to bother with this, and it's obviously not a common use case by any stretch. But it's possible that, as model sizes grow, most people running consumer rigs will start having resource allocation issues.
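To illustrate, here is a minimal sketch of how such a hand-edited map could be fed back into `from_pretrained()`, continuing from the snippet above (the layer keys and the `offload` folder path are placeholders, not something from this thread):

```python
from transformers import AutoModelForCausalLM

# Hypothetical tweak: reassign a couple of entries that infer_auto_device_map()
# placed on the GPU (device 0) so they are offloaded instead. The exact keys
# depend on the model architecture; these names are placeholders.
device_map["model.decoder.layers.22"] = "cpu"
device_map["model.decoder.layers.23"] = "disk"

model = AutoModelForCausalLM.from_pretrained(
    "models/blenderbot-3B",
    device_map=device_map,     # hand-edited map instead of 'auto'
    offload_folder="offload",  # e.g. a directory on a fast NVMe drive
)
```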
Fixes #23

This makes `device_map='auto'` optional by ensuring it is only used as a setting for `AutoModelForCausalLM.from_pretrained()` in these cases:

- the model is `gpt-neo`, `opt-` or `galactica` and has 13B/20B/30B parameters, or
- the `--auto-devices` argument is passed, or
- the `--load-in-8bit` argument is passed (launch would fail only with `load_in_8bit=True`, see below)

It also ensures that unless `--cpu` or `--auto-devices` is passed, the model will always be moved to the GPU first. This is to avoid scenarios such as:

May also impact #14
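For reference, a minimal sketch of the loading rules described above (the function and flag names are assumptions for illustration, not the actual diff):

```python
from transformers import AutoModelForCausalLM

def load_model(model_name, auto_devices=False, load_in_8bit=False, cpu=False):
    # Hypothetical flag names; the real script reads these from command-line arguments.
    params = {}
    name = model_name.lower()

    # Use device_map='auto' only for the large gpt-neo/opt/galactica checkpoints,
    # or when --auto-devices / --load-in-8bit is passed.
    big_model = (any(t in name for t in ("gpt-neo", "opt-", "galactica"))
                 and any(s in name for s in ("13b", "20b", "30b")))
    if big_model or auto_devices or load_in_8bit:
        params["device_map"] = "auto"
    if load_in_8bit:
        params["load_in_8bit"] = True

    model = AutoModelForCausalLM.from_pretrained(f"models/{model_name}", **params)

    # Unless --cpu or --auto-devices is passed, move the model to the GPU first.
    # (In 8-bit mode the model is already placed on the GPU by device_map='auto'.)
    if not (cpu or auto_devices or load_in_8bit):
        model = model.cuda()
    return model
```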