Fix FSDP gradient calculation with orig params #9335

janEbert · 2024-05-29T08:52:41Z

As observed by some users in #8487, training with FSDP causes differences in loss. This is because gradients are not correctly calculated when `fsdp_use_orig_params=True`. Currently, any parameter that is not a `FlatParameter` is treated as unsharded, when in reality it may be sharded when `use_orig_params=True`. This leads to a double reduction of sharded gradients. We thus simply adjust which parameters are reduced manually. This requires access to a private variable in the FSDP module. Alternatively, the value of `self.kwargs['ignored_states']` in `nlp_overrides.py` could be set as an attribute on the model to avoid this private variable access.
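To make the double reduction concrete, here is a toy, pure-Python sketch (no PyTorch) under stated assumptions. It is not the code changed in this PR: plain lists stand in for tensors, two simulated ranks stand in for the data-parallel group, and every helper name (`reduce_scatter_mean`, `all_reduce_mean`, `ToyParam`, `params_to_reduce_manually`) is hypothetical. It assumes the manual reduction is only meant for parameters that FSDP was told to ignore, and it includes a `grad is not None` check for parameters that never acquired a gradient.

```python
from dataclasses import dataclass
from typing import List, Optional


def reduce_scatter_mean(local_grads: List[List[float]]) -> List[List[float]]:
    """Average full gradients across ranks, then hand each rank one shard."""
    world = len(local_grads)
    n = len(local_grads[0])
    mean = [sum(g[i] for g in local_grads) / world for i in range(n)]
    shard = n // world
    return [mean[r * shard:(r + 1) * shard] for r in range(world)]


def all_reduce_mean(per_rank: List[List[float]]) -> List[List[float]]:
    """Average same-shaped buffers element-wise; every rank gets the result."""
    world = len(per_rank)
    mean = [sum(vals) / world for vals in zip(*per_rank)]
    return [list(mean) for _ in range(world)]


# Each rank computed a local gradient for the same 4-element parameter.
local_grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]

# What a sharded reduction yields: rank 0 holds [2.0, 3.0], rank 1 holds [4.0, 5.0].
sharded = reduce_scatter_mean(local_grads)

# Mistaking the shards for unsharded gradients and reducing again mixes
# unrelated elements: both ranks end up with [3.0, 4.0].
double_reduced = all_reduce_mean(sharded)


@dataclass(eq=False)
class ToyParam:
    """Hypothetical stand-in for a parameter; not a torch.nn.Parameter."""
    name: str
    grad: Optional[List[float]]


def params_to_reduce_manually(params, ignored_params):
    """Keep only ignored (unsharded) parameters that actually have a gradient."""
    ignored_ids = {id(p) for p in ignored_params}
    return [p for p in params if id(p) in ignored_ids and p.grad is not None]


sharded_param = ToyParam("sharded_orig_param", [2.0, 3.0])
ignored_with_grad = ToyParam("ignored_with_grad", [1.0])
ignored_without_grad = ToyParam("ignored_without_grad", None)
all_params = [sharded_param, ignored_with_grad, ignored_without_grad]

selected = params_to_reduce_manually(
    all_params, ignored_params=[ignored_with_grad, ignored_without_grad]
)
```

In this toy, only `ignored_with_grad` is selected for the manual reduction. The sharded parameter is left to the sharded reduction alone, and the parameter without a gradient is skipped.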

Thanks to @ofivite for suggesting that use_orig_params=True could be the cause of the issue, which greatly helped with analysis.

What does this PR do?

Fix FSDP gradient calculation when fsdp_use_orig_params=True.

Collection: nlp

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to issue #8487, "LoRA training with FSDP has spike in train loss".

akoumpa · 2024-05-29T08:53:53Z

Nice, can you fix the DCO?

@ofivite

The `param.grad is not None` check also fixes gradient reduction in the case of parameters not having acquired gradients (as parameters could become empty tensors in FSDP). Thanks to @ofivite for suggesting that `use_orig_params=True` could be the cause of the issue, which greatly helped with analysis.

Signed-off-by: janEbert <janpublicebert@posteo.net>

janEbert · 2024-05-29T11:04:22Z

Done, thanks :)

@ofivite

The `param.grad is not None` check also fixes gradient reduction in the case of parameters not having acquired gradients (as parameters could become empty tensors in FSDP). Thanks to @ofivite for suggesting that `use_orig_params=True` could be the cause of the issue, which greatly helped with analysis.

Signed-off-by: janEbert <janpublicebert@posteo.net>
Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>

@ofivite

The `param.grad is not None` check also fixes gradient reduction in the case of parameters not having acquired gradients (as parameters could become empty tensors in FSDP). Thanks to @ofivite for suggesting that `use_orig_params=True` could be the cause of the issue, which greatly helped with analysis.

Signed-off-by: janEbert <janpublicebert@posteo.net>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

github-actions bot added the NLP label May 29, 2024

akoumpa self-requested a review May 29, 2024 08:53

akoumpa added the Run CICD label May 29, 2024

akoumpa approved these changes May 29, 2024

janEbert force-pushed the fix-fsdp-grads branch from 49684a6 to e1fd00c Compare May 29, 2024 11:04

ericharper added Run CICD and removed Run CICD labels May 29, 2024

ericharper merged commit 2e39606 into NVIDIA:main May 30, 2024
128 of 130 checks passed

ko3n1g mentioned this pull request Jul 18, 2024: Release 2.0.0rc1 #9786 (closed, 2 tasks)

janEbert mentioned this pull request Aug 5, 2024: Add FSDP for NeMo 2.0 #9748 (merged, 8 tasks)

janEbert mentioned this pull request Aug 16, 2024: FSDP is not integrated across code base #10178 (closed)
