Add offload for 8-bit model #1699
Conversation
The documentation is not available anymore as the PR was closed or merged.
Looks great on my side!
One small comment: it may be worth making it clear on the relevant documentation page that the computation will still be done on the GPU, to avoid any confusion. Also, to be on the safe side, can you run the transformers slow tests for the bnb integration and make sure they pass?
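To illustrate the "computation still happens on the GPU" point, here is a minimal sketch using accelerate's generic `cpu_offload` utility (not the 8-bit path added in this PR; the model choice is arbitrary): the weights rest on the CPU, and hooks copy each submodule to the execution device just before its forward pass.

```python
import torch
from accelerate import cpu_offload
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Weights stay on the CPU; hooks copy each submodule to cuda:0
# right before its forward pass, so the matmuls run on the GPU.
cpu_offload(model, execution_device=torch.device("cuda:0"))

input_ids = torch.tensor([[1, 2, 3]], device="cuda:0")
outputs = model(input_ids)  # computation happens on cuda:0
```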
Thanks for working on this. Quick question on my side: why do we need the user to set `enable_offload=True` in their config file? They are already indicating their intent to offload weights with the `device_map`, so this is asking for the same thing twice. Is there any downside to removing that flag?
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
No, there should not be any downside to removing that flag. Just removed it. It was something that was used in the transformers integration for 8-bit models, so I kept it initially.
Added a section in the doc for offload, and the transformers slow tests passed (61 in total).
Very cool work! Thanks for confirming that the tests pass on transformers.
What does this PR do?
This PR makes cpu/disk offload possible with 8-bit models, thus saving even more memory. Previously, we did not quantize the modules offloaded to cpu/disk, and their weights stayed at full precision. With cpu/disk offload, we offload the quantized weights to cpu/disk and move them back to the gpu when needed using hooks. This should work out of the box with `device_map="auto"`, but we make the user specify `enable_offload=True` to be sure that they know what they are doing. Furthermore, no modification is needed in the `bitsandbytes` library.

The input weights (`weights_location`) can be quantized or not. If the weights are not quantized, we will first quantize them before offloading them to the cpu/disk. If we don't want to quantize a module, the user should add it to the `skip_modules` arg.

PS: 4-bit model offload will be added when we are able to serialize 4-bit models.
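As a usage sketch (assuming the `BnbQuantizationConfig` / `load_and_quantize_model` entry points from accelerate's bnb integration; the checkpoint name, paths, and skipped module are hypothetical), loading an 8-bit model with cpu/disk offload could look like this:

```python
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
from transformers import AutoConfig, AutoModelForCausalLM

# Build an empty (meta-device) model, then load + quantize with offload.
config = AutoConfig.from_pretrained("facebook/opt-350m")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

bnb_config = BnbQuantizationConfig(
    load_in_8bit=True,
    skip_modules=["lm_head"],  # modules we do not want to quantize
)

# device_map="auto" keeps what fits on the GPU and offloads the rest;
# offloaded weights are quantized, and hooks move them back to the GPU
# when they are needed for computation.
model = load_and_quantize_model(
    model,
    bnb_quantization_config=bnb_config,
    weights_location="path/to/checkpoint",  # quantized or full-precision
    device_map="auto",
    offload_folder="offload_dir",
)
```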