
Optim: APOLLO optimizer integration #36062

Merged: 8 commits merged into huggingface:main on Feb 12, 2025

Conversation

zhuhanqing
Contributor

What does this PR do?

This PR integrates APOLLO (Approximated Gradient Scaling for Memory-Efficient LLM Optimization) into Hugging Face's Transformers.
APOLLO is a memory-efficient optimizer designed for LLM pre-training and full-parameter fine-tuning, offering SGD-like memory cost with AdamW-level performance.

📜 Paper: https://arxiv.org/abs/2412.05270
💻 Code: https://github.com/zhuhanqing/APOLLO

Why APOLLO?

APOLLO introduces a new level of memory efficiency for LLM optimization:
✅ Ultra-Low Memory Usage → Achieves significant savings, even beyond GaLore, approaching SGD-level efficiency.
✅ Adam(W)-Level Performance → Maintains or surpasses Adam(W) performance, validated on LLaMA models up to 7B scale.
✅ No Expensive SVD Computation → Unlike GaLore, APOLLO leverages lightweight random projection, avoiding training stalls in large-scale LLM fine-tuning.

Third-Party Validation of APOLLO

APOLLO has been merged into LLaMA-Factory and FluxML, with its performance validated in a third-party post.

With these validations, merging APOLLO into Transformers would offer Hugging Face users an efficient, memory-friendly optimizer for training LLMs—reducing GPU memory requirements and making large-scale model training more accessible! 🚀

Test of the integration

Following the approach in PR #29588 for GaLore, I have successfully integrated APOLLO into Hugging Face Transformers.

✅ Ensuring Correctness
To verify the integration, I have:
1️⃣ Added multiple unit tests in tests/trainer/test_trainer.py.
2️⃣ Manually tested the API using the following script:

import torch
import datasets
from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM
import trl

train_dataset = datasets.load_dataset('imdb', split='train')

# Select the APOLLO optimizer and restrict it to the attention and MLP modules.
args = TrainingArguments(
    output_dir="./test-apollo",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="apollo_adamw",
    optim_target_modules=["attn", "mlp"]
)

model_id = "mistralai/Mistral-7B-v0.1"

# Build the model from its config (random weights) to mimic pre-training.
config = AutoConfig.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)

# dataset_text_field / max_seq_length follow the trl SFTTrainer API at the time of this PR;
# the tokenizer is passed explicitly so the trainer does not have to re-load it.
trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()

Who can review?

@muellerzr and @SunMarc – Would love your feedback on this integration! 😊
Let me know if any modifications are needed. Thanks! 🚀

@zhuhanqing
Contributor Author

zhuhanqing commented Feb 6, 2025

Hello @muellerzr,

The automatic code checks show some failures above, but they seem to come from other parts of the codebase rather than from our changes,
e.g., tests_torch and check_repository_consistency.

Member

@SunMarc SunMarc left a comment

Thanks for the PR! Just had a look, and it seems the integration follows the GaLore implementation quite closely. Can you simplify the code so that it is not redundant?

@zhuhanqing
Contributor Author

Hi @SunMarc and @muellerzr,

I have updated the commit to address your comments!
Thank you so much for your quick response and your effort in reviewing our code.

Yes, I closely followed the GaLore implementation when adding APOLLO support, so that the code quality meets Hugging Face's standards.
APOLLO reduces memory cost through a low-rank space in a way similar to GaLore, which makes following the GaLore implementation natural. The methods still differ fundamentally: APOLLO only uses the low-rank space to approximate gradient scaling factors that are then applied to the full-rank raw gradient, whereas GaLore follows a projected-gradient-descent approach and performs the update itself in the low-rank space, which can cost accuracy because the updates are low-rank. A rough sketch of this difference follows.
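
For intuition, here is a minimal conceptual sketch of that difference on toy tensors, with a simple normalization step standing in for the Adam update; none of the names below come from the actual APOLLO or GaLore code, and the real algorithms are in the paper and repository linked above.

# Conceptual sketch only (toy tensors, illustrative names).
import torch

torch.manual_seed(0)
m, n, r = 64, 32, 4                    # weight shape and low-rank dimension
grad = torch.randn(m, n)               # raw full-rank gradient

# GaLore-style: project the gradient with an SVD-derived matrix, adapt it in
# the low-rank space, and project the update back, so the update is low-rank.
U, _, _ = torch.linalg.svd(grad, full_matrices=False)
P_svd = U[:, :r]                                       # (m, r) projection from SVD
low_rank_grad = P_svd.T @ grad                         # (r, n) optimizer state
adapted = low_rank_grad / (low_rank_grad.norm(dim=0, keepdim=True) + 1e-8)  # stand-in for an Adam step
galore_update = P_svd @ adapted                        # low-rank update

# APOLLO-style: use a cheap random projection (no SVD), estimate channel-wise
# scaling factors in the low-rank space, and apply them to the full-rank raw
# gradient, so the update stays full-rank.
P_rand = torch.randn(m, r) / r**0.5                    # random projection
proj_grad = P_rand.T @ grad                            # (r, n) auxiliary low-rank state
adapted = proj_grad / (proj_grad.norm(dim=0, keepdim=True) + 1e-8)          # stand-in for an Adam step
scale = adapted.norm(dim=0) / (proj_grad.norm(dim=0) + 1e-8)                # per-channel scaling factors
apollo_update = grad * scale                           # scaled full-rank gradient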

Addressing your comments
I have addressed your comments in the new commit.

Fixing redundancy
The redundancy was in src/transformers/trainer.py: the APOLLO optimizer is constructed with the same logic as GaLore, separating modules into low-rank and non-low-rank ones and appending their parameters as parameter groups. I modularized this construction logic so that GaLore and APOLLO share the same code by calling a single helper function, sketched below.
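
To make that concrete, here is a simplified, purely illustrative sketch of the shared pattern; it is not the actual trainer.py code (the real helper added in this PR is setup_low_rank_optimizer), and every name in it is hypothetical.

# Illustrative sketch (hypothetical names): split parameters into those that
# get the low-rank treatment and those that do not, and let both GaLore and
# APOLLO reuse the same parameter-group builder.
import re
from typing import Any, Dict, List
import torch.nn as nn

def build_low_rank_param_groups(
    model: nn.Module,
    target_modules: List[str],
    low_rank_kwargs: Dict[str, Any],
) -> List[Dict[str, Any]]:
    target_params, regular_params = [], []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and any(re.search(p, name) for p in target_modules):
            target_params.append(module.weight)
            if module.bias is not None:
                regular_params.append(module.bias)
        else:
            regular_params.extend(module.parameters(recurse=False))
    return [
        {"params": regular_params},
        # GaLore and APOLLO each pass their own projection/scaling kwargs here.
        {"params": target_params, **low_rank_kwargs},
    ]

The two optimizers then differ only in the extra kwargs they pass and in which optimizer class finally receives the parameter groups.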

Thank you again for your time and help!
Let me know if any modifications are needed. Thanks! 🚀

Best wishes,
Hanqing

Member

@SunMarc SunMarc left a comment


Thanks for iterating! Left a couple of comments. This is much better.

@zhuhanqing
Contributor Author

Hi @SunMarc,

Thanks so much for your feedback!

I submitted two commits. The first one adds type annotations to the setup_low_rank_optimizer function.

The second commit addresses your comments on the documentation. I agree that the previous description of APOLLO in trainer.md was lazy and simply mirrored the GaLore one; I rewrote it to be clean and clear. In particular, it now shows precisely how to configure APOLLO-Mini by passing the right arguments, along the lines of the example below.
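
For reference, APOLLO-Mini could be configured along these lines, assuming the optim_args keys mirror the hyperparameters used in the APOLLO repository (proj, rank, scale_type, scale, update_proj_gap); the exact key names and values below are illustrative rather than quoted from trainer.md.

from transformers import TrainingArguments

# APOLLO-Mini: rank-1 random projection with tensor-wise gradient scaling and a
# larger scale to compensate for the extreme rank reduction (illustrative values).
args = TrainingArguments(
    output_dir="./test-apollo-mini",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="apollo_adamw",
    optim_target_modules=["attn", "mlp"],
    optim_args="proj=random,rank=1,scale=128.0,scale_type=tensor,update_proj_gap=200",
)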

Thank you again for your time and help! Looking forward to your feedback!

Best wishes,
Hanqing

Member

@SunMarc SunMarc left a comment


Thanks a lot for iterating @zhuhanqing! LGTM!

@SunMarc SunMarc requested a review from ArthurZucker February 10, 2025 09:48
@SunMarc
Member

SunMarc commented Feb 10, 2025

Letting @ArthurZucker have a quick look if possible. Otherwise, I'll merge in 2-3 days!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zhuhanqing
Contributor Author

zhuhanqing commented Feb 11, 2025

Hi @SunMarc, thank you so much for your help! By the way, APOLLO has been accepted to MLSys 2025, announced just today. I am eager to see APOLLO contribute more to the open-source community, especially by democratizing LLM training through this integration with the HF Trainer!

@SunMarc
Member

SunMarc commented Feb 12, 2025

Congrats on your acceptance!

@SunMarc SunMarc merged commit 08c4959 into huggingface:main Feb 12, 2025
25 checks passed
Collaborator

@ArthurZucker ArthurZucker left a comment


Nice!!! 🚀
