
Arde/fsdp activation checkpointing #25771

Merged

Conversation

@arde171 (Contributor) commented Aug 25, 2023

What does this PR do?

Currently, the HF Trainer does not support FSDP activation checkpointing; this PR adds that support.
Please see the details about FSDP activation checkpointing here.
I saw an improvement in training performance for large LLMs (e.g., LLaMA 70B) with FSDP activation checkpointing compared to the existing gradient_checkpointing option. FSDP activation_checkpointing is also easy to enable:

Just add "activation_checkpointing": "True" to the FSDP config, as shown in the example fsdp_config.json file below.

fsdp_config.json

{
  "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
  ...
  "activation_checkpointing": "True"
}
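
For context, here is a minimal, hypothetical sketch of how such a config file could be passed to the Trainer. The model, dataset, and output path below are placeholders and are not part of this PR:

from transformers import Trainer, TrainingArguments

# Hypothetical usage sketch: enable FSDP and point the Trainer at the
# fsdp_config.json shown above. `model` and `train_dataset` are placeholders.
training_args = TrainingArguments(
    output_dir="./llama-fsdp",        # placeholder output path
    fsdp="full_shard auto_wrap",      # enable FSDP sharding with auto wrapping
    fsdp_config="fsdp_config.json",   # config containing activation_checkpointing
    gradient_checkpointing=False,     # leave off; FSDP handles checkpointing here
    per_device_train_batch_size=1,
    bf16=True,
)

trainer = Trainer(
    model=model,                  # placeholder: a causal LM such as LLaMA
    args=training_args,
    train_dataset=train_dataset,  # placeholder dataset
)
trainer.train()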

Please see the following PR in the accelerate repo for more details about FSDP activation checkpointing:
PR: huggingface/accelerate#1891

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ArthurZucker (Collaborator) commented:

Cc @pacman100 if you think this is relevant 🤗

@pacman100 (Contributor) left a comment:

Thank you @arde171 for adding this. Please raise an error if both activation_checkpointing in FSDP config and training arg gradient_checkpointing are set to True. The error should mention that both can't be set to True and to use FSDP's checkpointing logic when using FSDP.
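
A minimal sketch of the kind of check being requested (attribute names and error wording are illustrative, not the code that was actually merged):

# Illustrative sketch of the requested validation, not the merged implementation.
# Assumes `args` is a TrainingArguments instance whose `fsdp_config` is a dict.
if args.fsdp_config.get("activation_checkpointing", False) and args.gradient_checkpointing:
    raise ValueError(
        "The combination of FSDP `activation_checkpointing` and "
        "`gradient_checkpointing` is not supported. Please use FSDP's "
        "activation checkpointing logic when using FSDP."
    )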

src/transformers/trainer.py (review thread, outdated, resolved)
@pacman100 (Contributor) left a comment:

Thank you @arde171!

@pacman100 merged commit 738ecd1 into huggingface:main on Aug 29, 2023
21 checks passed
parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
* add FSDP config option to enable activation-checkpointing

* update docs

* add checks and remove redundant code

* fix formatting error
blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023 (same commit message as above)
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 18, 2023 (same commit message as above)