
Optim: APOLLO optimizer integration #36062

Merged: 8 commits merged into huggingface:main on Feb 12, 2025

Conversation

zhuhanqing
Contributor

What does this PR do?

This PR integrates APOLLO (Approximated Gradient Scaling for Memory-Efficient LLM Optimization) into Hugging Face's Transformers.
APOLLO is a memory-efficient optimizer designed for LLM pre-training and full-parameter fine-tuning, offering SGD-like memory cost with AdamW-level performance.

📜 Paper: https://arxiv.org/abs/2412.05270
💻 Code: https://github.com/zhuhanqing/APOLLO

Why APOLLO?

APOLLO introduces a new level of memory efficiency for LLM optimization:
✅ Ultra-Low Memory Usage → Achieves significant savings, even beyond GaLore, approaching SGD-level efficiency.
✅ Adam(W)-Level Performance → Maintains or surpasses Adam(W) performance, validated on LLaMA models up to 7B scale.
✅ No Expensive SVD Computation → Unlike GaLore, APOLLO leverages lightweight random projection, avoiding training stalls in large-scale LLM fine-tuning.

Third-Party Validation of APOLLO

APOLLO has been merged into LLaMA-Factory and FluxML, with its performance validated in a third-party post.

With these validations, merging APOLLO into Transformers would offer Hugging Face users an efficient, memory-friendly optimizer for training LLMs—reducing GPU memory requirements and making large-scale model training more accessible! 🚀

Test of the integration

Following the approach in PR #29588 for GaLore, I have successfully integrated APOLLO into Hugging Face Transformers.

✅ Ensuring Correctness
To verify the integration, I have:
1️⃣ Added multiple unit tests in tests/trainer/test_trainer.py.
2️⃣ Manually tested the API using the following script:

import torch
import datasets
from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM
import trl

train_dataset = datasets.load_dataset('imdb', split='train')

# Select the APOLLO optimizer and restrict it to the attention and MLP modules.
args = TrainingArguments(
    output_dir="./test-apollo",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="apollo_adamw",
    optim_target_modules=["attn", "mlp"]
)

model_id = "mistralai/Mistral-7B-v0.1"

# Build the model from its config (random weights) to mimic pre-training.
config = AutoConfig.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)

# dataset_text_field / max_seq_length follow the trl SFTTrainer API at the time of this PR;
# the tokenizer is passed explicitly so the trainer does not have to re-load it.
trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()

Who can review?

@muellerzr and @SunMarc – Would love your feedback on this integration! 😊
Let me know if any modifications are needed. Thanks! 🚀

@zhuhanqing
Contributor Author

zhuhanqing commented Feb 6, 2025

Hello @muellerzr,

The automatic code checks show some failures above, but they seem to come from other parts of the codebase rather than from our changes,
e.g., tests_torch and check_repository_consistency.

Member

@SunMarc SunMarc left a comment

Thanks for the PR! Just had a look, and it seems the integration follows the GaLore implementation quite closely. Can you simplify the code so that it is not redundant?

@zhuhanqing
Contributor Author

Hi @SunMarc and @muellerzr,

I have updated the commit to address your comments!
Thank you so much for your quick response and your effort in reviewing our code.

Yes, I closely followed the GaLore implementation when adding APOLLO support, so that the code quality meets Hugging Face's standards.
APOLLO reduces memory cost through a low-rank space in a way similar to GaLore, which makes following the GaLore implementation natural. The methods still differ fundamentally: APOLLO only uses the low-rank space to approximate gradient scaling factors that are then applied to the full-rank raw gradient, whereas GaLore follows a projected-gradient-descent approach and performs the update itself in the low-rank space, which can cost accuracy because the updates are low-rank. A rough sketch of this difference follows.
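
For intuition, here is a minimal conceptual sketch of that difference on toy tensors, with a simple normalization step standing in for the Adam update; none of the names below come from the actual APOLLO or GaLore code, and the real algorithms are in the paper and repository linked above.

# Conceptual sketch only (toy tensors, illustrative names).
import torch

torch.manual_seed(0)
m, n, r = 64, 32, 4                    # weight shape and low-rank dimension
grad = torch.randn(m, n)               # raw full-rank gradient

# GaLore-style: project the gradient with an SVD-derived matrix, adapt it in
# the low-rank space, and project the update back, so the update is low-rank.
U, _, _ = torch.linalg.svd(grad, full_matrices=False)
P_svd = U[:, :r]                                       # (m, r) projection from SVD
low_rank_grad = P_svd.T @ grad                         # (r, n) optimizer state
adapted = low_rank_grad / (low_rank_grad.norm(dim=0, keepdim=True) + 1e-8)  # stand-in for an Adam step
galore_update = P_svd @ adapted                        # low-rank update

# APOLLO-style: use a cheap random projection (no SVD), estimate channel-wise
# scaling factors in the low-rank space, and apply them to the full-rank raw
# gradient, so the update stays full-rank.
P_rand = torch.randn(m, r) / r**0.5                    # random projection
proj_grad = P_rand.T @ grad                            # (r, n) auxiliary low-rank state
adapted = proj_grad / (proj_grad.norm(dim=0, keepdim=True) + 1e-8)          # stand-in for an Adam step
scale = adapted.norm(dim=0) / (proj_grad.norm(dim=0) + 1e-8)                # per-channel scaling factors
apollo_update = grad * scale                           # scaled full-rank gradient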

Addressing your comments
I have addressed your comments in the new commit.

Fixing redundancy
The redundancy was in src/transformers/trainer.py: the APOLLO optimizer is constructed with the same logic as GaLore, separating modules into low-rank and non-low-rank ones and appending their parameters as parameter groups. I modularized this construction logic so that GaLore and APOLLO share the same code by calling a single helper function, sketched below.
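
To make that concrete, here is a simplified, purely illustrative sketch of the shared pattern; it is not the actual trainer.py code (the real helper added in this PR is setup_low_rank_optimizer), and every name in it is hypothetical.

# Illustrative sketch (hypothetical names): split parameters into those that
# get the low-rank treatment and those that do not, and let both GaLore and
# APOLLO reuse the same parameter-group builder.
import re
from typing import Any, Dict, List
import torch.nn as nn

def build_low_rank_param_groups(
    model: nn.Module,
    target_modules: List[str],
    low_rank_kwargs: Dict[str, Any],
) -> List[Dict[str, Any]]:
    target_params, regular_params = [], []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and any(re.search(p, name) for p in target_modules):
            target_params.append(module.weight)
            if module.bias is not None:
                regular_params.append(module.bias)
        else:
            regular_params.extend(module.parameters(recurse=False))
    return [
        {"params": regular_params},
        # GaLore and APOLLO each pass their own projection/scaling kwargs here.
        {"params": target_params, **low_rank_kwargs},
    ]

The two optimizers then differ only in the extra kwargs they pass and in which optimizer class finally receives the parameter groups.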

Thank you again for your time and help!
Let me know if any modifications are needed. Thanks! 🚀

Best wishes,
Hanqing

Member

@SunMarc SunMarc left a comment


Thanks for iterating! Left a couple of comments. This is much better.

@zhuhanqing
Contributor Author

Hi @SunMarc,

Thanks so much for your feedback!

I submitted two commits. The first one adds type annotations to the setup_low_rank_optimizer function.

The second commit addresses your comments on the documentation. I agree that the previous description of APOLLO in trainer.md was lazy and simply mirrored the GaLore one; I rewrote it to be clean and clear. In particular, it now shows precisely how to configure APOLLO-Mini by passing the right arguments, along the lines of the example below.
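
For reference, APOLLO-Mini could be configured along these lines, assuming the optim_args keys mirror the hyperparameters used in the APOLLO repository (proj, rank, scale_type, scale, update_proj_gap); the exact key names and values below are illustrative rather than quoted from trainer.md.

from transformers import TrainingArguments

# APOLLO-Mini: rank-1 random projection with tensor-wise gradient scaling and a
# larger scale to compensate for the extreme rank reduction (illustrative values).
args = TrainingArguments(
    output_dir="./test-apollo-mini",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="apollo_adamw",
    optim_target_modules=["attn", "mlp"],
    optim_args="proj=random,rank=1,scale=128.0,scale_type=tensor,update_proj_gap=200",
)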

Thank you again for your time and help! Looking forward to your feedback!

Best wishes,
Hanqing

Member

@SunMarc SunMarc left a comment


Thanks a lot for iterating @zhuhanqing! LGTM!

@SunMarc SunMarc requested a review from ArthurZucker February 10, 2025 09:48
@SunMarc
Member

SunMarc commented Feb 10, 2025

Letting @ArthurZucker have a quick look if possible. Otherwise, I'll merge in 2-3 days!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zhuhanqing
Contributor Author

zhuhanqing commented Feb 11, 2025

Hi @SunMarc, thank you so much for your help! By the way, APOLLO has been accepted to MLSys 2025, announced just today. I am eager to see APOLLO contribute more to the open-source community, especially by democratizing LLM training through this integration with the HF Trainer!

@SunMarc
Member

SunMarc commented Feb 12, 2025

Congrats on your acceptance!

@SunMarc SunMarc merged commit 08c4959 into huggingface:main Feb 12, 2025
25 checks passed
Collaborator

@ArthurZucker ArthurZucker left a comment


Nice!!! 🚀
