
Refactor LLM pretraining examples #7159

Merged: 22 commits merged into NVIDIA:main on Aug 16, 2023

Conversation

maanug-nv (Collaborator) commented Aug 3, 2023

What does this PR do?

Simplify the LLM pretraining example scripts by moving common logic into a new TrainerBuilder and into exp_manager.

Collection: [Note which collection this PR will affect]

Changelog

  • add a TrainerBuilder type that hides common logic for setting up a Trainer (see the sketch below)
  • move the logic handling the resume_from_checkpoint arg into exp_manager. The resume_from_checkpoint logic is currently disabled, with a TODO comment, since its present behavior is not a desirable user experience; a separate PR will improve this.
  • use the above refactors to shorten the Megatron LLM pretraining example scripts
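
For illustration, the builder pattern looks roughly like this (a minimal sketch; class and method names here are assumptions, not necessarily the exact ones merged):

    # Minimal sketch of the builder pattern -- names and details are
    # assumptions for illustration, not the exact merged implementation.
    from pytorch_lightning import Trainer


    class TrainerBuilder:
        """Centralizes Trainer setup that each pretraining example script used to duplicate."""

        def __init__(self, cfg):
            self.cfg = cfg

        def _training_strategy(self):
            # In NeMo this would build the distributed (Megatron/DDP) strategy from cfg.
            return "ddp"

        def _plugins(self):
            # e.g. precision plugins derived from cfg.trainer.precision.
            return []

        def create_trainer(self) -> Trainer:
            return Trainer(
                strategy=self._training_strategy(),
                plugins=self._plugins(),
                max_steps=self.cfg.trainer.max_steps,
                devices=self.cfg.trainer.devices,
                accelerator=self.cfg.trainer.accelerator,
            )


    class BertTrainerBuilder(TrainerBuilder):
        """Model-specific subclasses override only the pieces that differ."""

        def _plugins(self):
            return super()._plugins()  # plus any BERT-specific plugins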

Usage

  • An illustrative usage sketch follows below
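
A plausible shape for a refactored GPT pretraining script (a sketch, not a verbatim copy of the merged example; TrainerBuilder refers to the sketch above, and the other imports reflect NeMo as of this PR's timeframe):

    # Sketch of a refactored GPT pretraining entry point.
    from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
    from nemo.core.config import hydra_runner
    from nemo.utils.exp_manager import exp_manager


    @hydra_runner(config_path="conf", config_name="megatron_gpt_config")
    def main(cfg) -> None:
        # Common Trainer setup (strategy, plugins, precision) lives in the builder.
        trainer = TrainerBuilder(cfg).create_trainer()
        # exp_manager owns logging/checkpointing and, after this PR,
        # the resume_from_checkpoint handling as well.
        exp_manager(trainer, cfg.exp_manager)
        model = MegatronGPTModel(cfg.model, trainer)
        trainer.fit(model)


    if __name__ == '__main__':
        main()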

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Refactor

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

github-actions bot added the NLP label Aug 3, 2023
maanug-nv force-pushed the llm-trainer-builder branch 2 times, most recently from 3408fc8 to 38f31bd on August 7, 2023 22:17
maanug-nv (Collaborator, Author) commented:

Rebased to include PTL 2.0 changes from #6433

maanug-nv marked this pull request as ready for review August 7, 2023 22:20
maanug-nv force-pushed the llm-trainer-builder branch from 80868c2 to f139a3a on August 7, 2023 22:32
ericharper requested a review from athitten August 8, 2023 20:13
maanug-nv force-pushed the llm-trainer-builder branch 3 times, most recently from 186bab8 to 2f303ba on August 9, 2023 19:42
arendu requested a review from arendu and then removed the request August 9, 2023 23:26
maanug-nv force-pushed the llm-trainer-builder branch from a7352cc to f3cf9bf on August 10, 2023 05:10
maanug-nv force-pushed the llm-trainer-builder branch 4 times, most recently from 3204f7f to e13b0af on August 11, 2023 01:02
Comment on lines +350 to +351:

    # if cfg.resume_from_checkpoint is not None:
    #     trainer.ckpt_path = cfg.resume_from_checkpoint

Check notice (Code scanning / CodeQL): Commented-out code. This comment appears to contain commented-out code.
github-actions bot added the CI label Aug 11, 2023
maanug-nv force-pushed the llm-trainer-builder branch from 1020637 to e13b0af on August 12, 2023 01:24
github-actions bot removed the CI label Aug 12, 2023
maanug-nv force-pushed the llm-trainer-builder branch 4 times, most recently from 7b6dfd7 to 752aa3c on August 14, 2023 23:26
maanug-nv and others added 17 commits August 15, 2023 17:44
maanug-nv force-pushed the llm-trainer-builder branch from 752aa3c to a9c4a65 on August 15, 2023 22:44
    if cfg.resume_from_checkpoint is not None:
        trainer.ckpt_path = cfg.resume_from_checkpoint
    # TODO: this behavior is undesirable, need ckpts in exp_dir to take priority if present over resume_from_checkpoint
    # if cfg.resume_from_checkpoint is not None:
    #     trainer.ckpt_path = cfg.resume_from_checkpoint
A collaborator commented:
@maanug-nv where are we taking care of the below lines then:

    if cfg.model.resume_from_checkpoint is not None:
        trainer.ckpt_path = cfg.model.resume_from_checkpoint
    logging.info(f'Resuming training from checkpoint: {trainer.ckpt_path}')

Since the pretraining scripts were assigning the checkpoint to trainer.ckpt_path when a checkpoint path was passed for resume_from_checkpoint under model in the config.

maanug-nv (Collaborator, Author) commented Aug 16, 2023:
Yes, initially I moved those lines exactly as-is to this place in exp_manager.py. After testing and discussing with @titu1994, having those lines here (or in the pretraining scripts, as they were before and currently are on main) has some undesirable behavior, detailed below. I wanted to keep this PR purely a refactor (thinking that would get it merged faster), so I'll correct the behavior in another PR. I can uncomment these lines if you prefer.

If resume_from_checkpoint is set, that checkpoint is always used regardless of what is in the log dir. What makes more sense is that resume_from_checkpoint is used if no log_dir is present, but the log_dir takes priority if present.
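
In code, the desired priority could look roughly like this (a sketch with a hypothetical helper name, not the follow-up PR's actual implementation):

    import os


    def pick_resume_ckpt_path(cfg, log_dir):
        """Hypothetical sketch: checkpoints already in the experiment log_dir
        take priority; cfg.resume_from_checkpoint is only a fallback."""
        ckpt_dir = os.path.join(log_dir, "checkpoints")
        if os.path.isdir(ckpt_dir) and os.listdir(ckpt_dir):
            # A previous run left checkpoints here; let auto-resume pick the latest.
            return None
        # No checkpoints in log_dir: fall back to the explicitly supplied path.
        return cfg.resume_from_checkpoint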

ericharper (Collaborator) left a comment:

LGTM. Thanks!

@maanug-nv will address resume_from_checkpoint in a follow-up PR

ericharper merged commit f90eea1 into NVIDIA:main Aug 16, 2023
dorotat-nv pushed a commit to dorotat-nv/NeMo that referenced this pull request Aug 24, 2023
* add builder class
* formatting
* use trainer builder for gpt pretraining example
* subclass trainer builder for bert
* use trainer builder for bert pretraining example
* subclass t5 builder and use in t5 pretraining
* move resume_from_checkpoint logic to exp_manager
* add docstring for resume_from_checkpoint
* set resume_from_checkpoint with interpolation
* remove refactored lines
* unused import
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* another unused import
* bug fix
* another bug missed in rebase
* add copyright
* add type annotation
* docstrings for trainer builder
* move trainer builder file
* not needed for ptl 2.0
* disable resume_from_checkpoint logic in exp_manager

---------

Signed-off-by: Maanu Grover <maanug@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Signed-off-by: dorotat <dorotat@nvidia.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024 (same commit list as above)