Initial kernel changes to support GaLore #1137

Open · wants to merge 7 commits into main
Conversation

matthewdouglas (Member) commented Mar 18, 2024

This is a draft containing some of the initial changes to support GaLore. So far this covers 2-state optimizers.

Optimizer2State.update_step() now takes an additional argument, return_updates. When given a tensor to hold the updates, the raw update is written into it instead of being applied, so p is left unchanged; additionally, no weight decay is applied.
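
A minimal plain-PyTorch sketch of the new contract, with simplified names (the real method runs fused CUDA kernels and takes group, p, gindex, pindex); whether the learning rate is folded into the returned update is my assumption, inferred from the project-back hunk below:

import torch

def update_step(p, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                weight_decay=1e-2, return_updates=None):
    # Hypothetical, simplified AdamW step illustrating the new semantics.
    state["step"] += 1
    exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
    exp_avg.mul_(betas[0]).add_(p.grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(p.grad, p.grad, value=1 - betas[1])
    bc1 = 1 - betas[0] ** state["step"]
    bc2 = 1 - betas[1] ** state["step"]
    update = (exp_avg / bc1) / ((exp_avg_sq / bc2).sqrt() + eps)
    if return_updates is not None:
        # New path: write the (lr-scaled, by assumption) update into the
        # caller's buffer; p stays untouched and weight decay is skipped.
        return_updates.copy_(update).mul_(-lr)
        return
    if weight_decay > 0:
        p.data.mul_(1 - lr * weight_decay)
    p.data.add_(update, alpha=-lr)

p = torch.nn.Parameter(torch.randn(4, 4))
p.grad = torch.randn_like(p)
state = {"step": 0, "exp_avg": torch.zeros_like(p), "exp_avg_sq": torch.zeros_like(p)}
buf = torch.empty_like(p)
update_step(p, state, return_updates=buf)  # buf now holds the update; p unchanged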

Needs tests, feedback welcome.

cc: @TimDettmers @jiaweizzhao @Titus-von-Koeller


self.prefetch_state(p)

if "rank" in group:
self.update_step(group, p, gindex, pindex, return_updates=lor_update)
matthewdouglas (Member, Author) commented on the hunk above:

The main addition in this PR is the new return_updates kwarg. This gives us the update from AdamW in lor_update, while p.data is left unchanged.

Corresponds to this step in Algorithm 1 from the paper:
lor_update = update(lor_grad)

self.update_step(group, p, gindex, pindex, return_updates=lor_update)

# GaLore Projection Back
p.data.add_(state["projector"].project_back(lor_update))
matthewdouglas (Member, Author) commented on the hunk above:

From Algorithm 1 in the paper:

update = project_back(lor_update)
weight.data += update
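
For context, here is a self-contained sketch of the projector pair these hunks assume; it is an illustrative SVD-based low-rank projector in the spirit of Algorithm 1, not the actual galore_torch GaLoreProjector (which additionally handles update_proj_gap, scale, and the left- vs. right-projection choice):

import torch

class LowRankProjector:
    # Illustrative projector; not the galore_torch implementation.
    def __init__(self, rank):
        self.rank = rank
        self.ortho = None  # orthogonal factor; GaLore refreshes it periodically

    def project(self, full_grad):
        if self.ortho is None:
            u, _, _ = torch.linalg.svd(full_grad.float(), full_matrices=False)
            self.ortho = u[:, : self.rank].to(full_grad.dtype)
        return self.ortho.t() @ full_grad   # (m, n) -> (r, n) low-rank gradient

    def project_back(self, lor_update):
        return self.ortho @ lor_update      # (r, n) -> (m, n) full-rank update

grad = torch.randn(256, 128)
proj = LowRankProjector(rank=16)
lor_grad = proj.project(grad)              # fed to the optimizer update
full_update = proj.project_back(lor_grad)  # added to p.data, as in the hunk above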

Titus-von-Koeller (Collaborator) commented:

@matthewdouglas Tim said he could review your work this weekend.

matthewdouglas (Member, Author) commented:

Updated with changes for the 1-state optimizers (Momentum, RMSProp, Adagrad, Lion).

TimDettmers marked this pull request as ready for review on July 17, 2024.
TimDettmers (Collaborator) left a review:

This looks like a solid, straightforward implementation. Good work, @matthewdouglas!

I've wanted to overhaul the optimizers, since changing or adding implementations is a pain when everything is split into 1-state and 2-state optimizers. You probably ran into this too, Matthew. Still, it is probably fine to keep things as they are for now and refactor when the next change comes along.

Separating the update computation from applying the update could also make some new optimizers easier to implement. But I think we can leave that to future work and favor getting GaLore out quickly together with the QLoRA fix.

The last remaining thing is testing. The original tests are all green, but a GaLore test would be good. The best approach would be to test the original repo code against the bitsandbytes implementation.

The steps for that would be to add galore-torch to the dev dependencies and only execute the tests when the dependency is present, so that other devs are not forced to install galore-torch.

Otherwise, the tests can mirror the existing ones and check that the results stay close to each other. For that you can probably just add the original GaLore and your GaLore to the dictionary of optimizers in test_optim.py and check that the errors are in line with the other optimizer comparisons.

Let me know if you have any other concerns with this, but I think with a test this is all ready to go.
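
For concreteness, a rough sketch of what such a test could look like. The galore_torch side follows its documented param-group format; the bitsandbytes-side entry point, group keys, and tolerance are placeholders until this PR's public API settles:

import pytest
import torch

galore_torch = pytest.importorskip("galore_torch")  # skip when the dep is absent
import bitsandbytes as bnb

def test_galore_adamw_close_to_reference():
    torch.manual_seed(0)
    w_ref = torch.nn.Parameter(torch.randn(64, 32, device="cuda"))
    w_bnb = torch.nn.Parameter(w_ref.detach().clone())
    galore_args = dict(rank=8, update_proj_gap=200, scale=0.25, proj_type="std")

    opt_ref = galore_torch.GaLoreAdamW([{"params": [w_ref], **galore_args}], lr=1e-3)
    # Placeholder: the bnb-side optimizer and group keys depend on this PR's final API.
    opt_bnb = bnb.optim.AdamW([{"params": [w_bnb], **galore_args}], lr=1e-3)

    for _ in range(10):
        g = torch.randn(64, 32, device="cuda")
        w_ref.grad, w_bnb.grad = g.clone(), g.clone()
        opt_ref.step()
        opt_bnb.step()

    # Loose tolerance, in line with the other optimizer comparisons.
    assert torch.allclose(w_ref, w_bnb, atol=2e-2)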

matthewdouglas added the "enhancement" label Aug 15, 2024
matthewdouglas self-assigned this Aug 15, 2024
jiaweizzhao commented Sep 27, 2024
Hi @matthewdouglas, thanks for your great effort! I would like to follow up on this PR and finalize our integration as soon as possible.

To make benchmarking GaLore easy, I created a new branch, https://github.com/jiaweizzhao/GaLore/tree/bitsandbytes, where I integrated your GaLore implementation into the most recent GaLoreAdamW8bit: https://github.com/jiaweizzhao/GaLore/blob/bitsandbytes/galore_torch/adamw8bit.py

For the environment, I installed both your bitsandbytes branch and the modified GaLore locally. However, when I tried to test a baseline, the following error came up:

[rank0]: File "/data/home/jwzhao/.conda/envs/galore_new/lib/python3.8/site-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank0]:   out = func(*args, **kwargs)
[rank0]: File "/data/home/jwzhao/.conda/envs/galore_new/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:   return func(*args, **kwargs)
[rank0]: File "/opt/hpcaas/.mounts/fs-0565f60d669b6a2d3/home/jwzhao/projects/bitsandbytes/bitsandbytes/optim/optimizer.py", line 288, in step
[rank0]:   self.update_step(group, p, gindex, pindex)
[rank0]: File "/data/home/jwzhao/.conda/envs/galore_new/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:   return func(*args, **kwargs)
[rank0]: File "/opt/hpcaas/.mounts/fs-0565f60d669b6a2d3/home/jwzhao/projects/bitsandbytes/bitsandbytes/optim/optimizer.py", line 552, in update_step
[rank0]:   F.optimizer_update_8bit_blockwise(
[rank0]: File "/opt/hpcaas/.mounts/fs-0565f60d669b6a2d3/home/jwzhao/projects/bitsandbytes/bitsandbytes/functional.py", line 1789, in optimizer_update_8bit_blockwise
[rank0]:   and len(str2optimizer8bit_blockwise[optimizer_name]) == 3
[rank0]: NameError: name 'str2optimizer8bit_blockwise' is not defined

Do you have any idea why this occurs? It seems to happen only on this old PR branch; I also tried the latest bitsandbytes with regular GaLore and it works.

matthewdouglas (Member, Author) commented:

Thanks @jiaweizzhao! This error indicates there was a problem loading the CUDA library. Were you able to run these build steps on this branch?

pip install -r requirements-dev.txt
cmake -DCOMPUTE_BACKEND=cuda -S .
cmake --build .
pip install -e .

If you run python -m bitsandbytes, you may get a better idea of why it did not load.

I plan to follow up shortly and rebase with main!

jiaweizzhao commented:
Not sure why I cannot load the CUDA library correctly using your steps, @matthewdouglas. Maybe it is due to the machine I am using. Could you try running a simple GaLore benchmark on your end? I have packed everything into this branch: https://github.com/jiaweizzhao/GaLore/tree/bitsandbytes. Once installed, you can simply run sh scripts/verify_bitsandbytes/llama_60m_galore_adam8bit_new.sh to verify that the new galore_adam8bit works. Once it works, I will make the changes across all optimizers.

jiaweizzhao commented:
It seems the problem is that I have no root access to compile from source (with pip install -e .) on my machine. Alternatively, if you could give me a compiled GaLore-enabled bitsandbytes build, I could try it from my end, @matthewdouglas.
