Distributed optimizer reduces GPT embedding grads in FP32 #8792

timmoon10 · 2024-04-02T22:51:42Z

What does this PR do?

When training Megatron-core GPT with the distributed optimizer, the embedding gradients were previously reduced in the grad sync dtype (usually BF16). This PR reduces them in FP32 instead, to improve convergence.
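As a rough illustration of why the reduction dtype can matter, here is a toy sketch in pure Python. It is not the distributed optimizer's code: the helpers `round_to_bf16`, `round_to_fp32`, and `sequential_reduce` are hypothetical, the per-rank values are made up, and a real all-reduce does not sum strictly left to right. The PR itself does not state the mechanism behind the convergence difference, so treat this only as one plausible effect of low-precision accumulation.

```python
import struct


def round_to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x: float) -> float:
    """Round to the nearest BF16 value using round-to-nearest-even.

    BF16 keeps the top 16 bits of the FP32 bit pattern.
    NaN and infinity are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even bias
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sequential_reduce(values, rounder):
    """Sum values left to right, rounding after every addition."""
    total = rounder(values[0])
    for v in values[1:]:
        total = rounder(total + rounder(v))
    return total


# Hypothetical per-rank contributions: one large value and 63 small ones.
per_rank_grads = [1.0] + [0.001] * 63

bf16_sum = sequential_reduce(per_rank_grads, round_to_bf16)
fp32_sum = sequential_reduce(per_rank_grads, round_to_fp32)
print(f"BF16 reduction: {bf16_sum}")
print(f"FP32 reduction: {fp32_sum}")
```

In this toy setup each small addend is less than half the spacing between BF16 values near 1.0 (which is 2^-7), so the BF16 running sum never moves off 1.0, while the FP32 sum lands close to 1.063.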

Collection: NLP

Changelog

Make sure GPT embedding grads are reduced in FP32.

Usage

Run GPT, e.g. with the config at https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml.

Enable mcore with model.mcore_gpt=True and the distributed optimizer with model.optim.name=distributed_fused_adam.
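The two overrides above can be assembled into a launch command roughly as follows. This is a minimal sketch: the override keys and values come from this description, but the script path is an assumption (the pretraining entry point that usually sits next to the linked config directory), and `build_command` is a hypothetical helper, not part of NeMo.

```python
import shlex

# Hydra-style overrides named in the PR description.
overrides = {
    "model.mcore_gpt": "True",
    "model.optim.name": "distributed_fused_adam",
}

# Assumption: path of the GPT pretraining script; adjust to your checkout.
SCRIPT = "examples/nlp/language_modeling/megatron_gpt_pretraining.py"


def build_command(script, overrides):
    """Join the script path and key=value overrides into one shell command."""
    args = ["python", script] + [f"{key}={value}" for key, value in overrides.items()]
    return shlex.join(args)


command = build_command(SCRIPT, overrides)
print(command)
```

Any further settings (data paths, parallelism sizes, precision) would be appended as additional `key=value` overrides in the same way.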

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

- Make sure you read and followed Contributor guidelines
- Did you write any new necessary tests?
- Did you add or update any necessary documentation?
- Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
  - Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

- New Feature
- Bugfix
- Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

mcore support was added in "Use GPTModel from mcore" (#7093).
Distopt support for mixed grad dtypes was added in "GPT support for BF16 grad reductions" (#5920).

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 · 2024-04-02T22:51:55Z

jenkins

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 · 2024-04-03T03:52:12Z

jenkins

timmoon10 · 2024-04-03T18:19:01Z

jenkins

timmoon10 · 2024-04-04T20:54:10Z

jenkins

ericharper

LGTM. Thanks!

erhoo82

LGTM given it gives the same convergence.

* Make sure embedding grads are reduced in FP32

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Access correct attr to get position embeddings

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Make sure embedding grads are reduced in FP32

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Access correct attr to get position embeddings

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>

timmoon10 and others added 2 commits April 2, 2024 15:41

Make sure embedding grads are reduced in FP32

cd46495

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Merge branch 'main' into fp32-embedding-grads

2f7f51d

github-actions bot added the NLP label Apr 2, 2024

timmoon10 and others added 2 commits April 3, 2024 03:50

Access correct attr to get position embeddings

9e149bd

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Merge branch 'main' into fp32-embedding-grads

0c8ef9e

Merge branch 'main' into fp32-embedding-grads

abe993a

Merge branch 'main' into fp32-embedding-grads

bdc257f

timmoon10 changed the title ~~Make sure GPT embedding grads are reduced in FP32~~ Make distributed optimizer reduces GPT embedding grads in FP32 Apr 5, 2024

timmoon10 changed the title ~~Make distributed optimizer reduces GPT embedding grads in FP32~~ Distributed optimizer reduces GPT embedding grads in FP32 Apr 5, 2024

ericharper approved these changes Apr 5, 2024

erhoo82 self-requested a review April 5, 2024 20:34

erhoo82 approved these changes Apr 5, 2024

erhoo82 merged commit cf3b3a5 into NVIDIA:main Apr 5, 2024
120 of 124 checks passed

This was referenced Apr 9, 2024

Check if model has position embed before accessing param #8857

Merged

Move logic for distopt FP32 grads to models #8867

Merged

timmoon10 mentioned this pull request Apr 16, 2024

Add config option for FP32 embedding grads #8946

Merged

8 tasks

github-actions bot mentioned this pull request Apr 17, 2024

Add config option for FP32 embedding grads #8953

Merged

8 tasks
