Refactor LLM pretraining examples #7159
Conversation
Rebased to include PTL 2.0 changes from #6433
if cfg.resume_from_checkpoint is not None:
    trainer.ckpt_path = cfg.resume_from_checkpoint
# TODO: this behavior is undesirable, need ckpts in exp_dir to take priority if present over resume_from_checkpoint
# if cfg.resume_from_checkpoint is not None:
@maanug-nv where are we taking care of the lines below, then?

if cfg.model.resume_from_checkpoint is not None:
    trainer.ckpt_path = cfg.model.resume_from_checkpoint
    logging.info(f'Resuming training from checkpoint: {trainer.ckpt_path}')

The pretraining scripts were assigning the checkpoint to trainer.ckpt_path whenever a checkpoint path was passed for resume_from_checkpoint under model in the config.
Yes, initially I moved those lines exactly as-is to this place in exp_manager.py. After testing and discussing with @titu1994, having those lines here (or in the pretraining scripts, as they were before / are currently on main) has some undesirable behavior, details below. I wanted to keep this PR purely a refactor (thinking that would get it merged faster), so I'll correct the behavior in another PR. I can uncomment these lines if you prefer.

If resume_from_checkpoint is set, that checkpoint is always used regardless of what is in the log dir. What makes more sense is that resume_from_checkpoint is used only when no log_dir is present, and log_dir takes priority when it is.
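The preferred priority described above could be sketched roughly as follows. This is a hypothetical helper illustrating the proposed behavior, not the actual exp_manager implementation:

```python
import os


def resolve_ckpt_path(resume_from_checkpoint, log_dir):
    """Pick the checkpoint to resume training from.

    Checkpoints found in the experiment log_dir take priority;
    resume_from_checkpoint is used only when log_dir has none.
    (Illustrative sketch of the behavior proposed in this thread.)
    """
    if log_dir is not None and os.path.isdir(log_dir):
        ckpts = sorted(f for f in os.listdir(log_dir) if f.endswith(".ckpt"))
        if ckpts:
            # Resume from the checkpoint already present in the log dir.
            return os.path.join(log_dir, ckpts[-1])
    # Fall back to the explicitly configured checkpoint, if any.
    return resume_from_checkpoint
```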
LGTM. Thanks!
@maanug-nv will address resume_from_checkpoint in a follow up PR
* add builder class
* formatting
* use trainer builder for gpt pretraining example
* subclass trainer builder for bert
* use trainer builder for bert pretraining example
* subclass t5 builder and use in t5 pretraining
* move resume_from_checkpoint logic to exp_manager
* add docstring for resume_from_checkpoint
* set resume_from_checkpoint with interpolation
* remove refactored lines
* unused import
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* another unused import
* bug fix
* another bug missed in rebase
* add copyright
* add type annotation
* docstrings for trainer builder
* move trainer builder file
* not needed for ptl 2.0
* disable resume_from_checkpoint logic in exp_manager

Signed-off-by: Maanu Grover <maanug@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Signed-off-by: dorotat <dorotat@nvidia.com>
What does this PR do?
Simplify LLM pretraining example scripts by moving common logic into a new TrainerBuilder and exp_manager.

Collection: [Note which collection this PR will affect]

Changelog
- Add a TrainerBuilder type that hides some common logic for setting up a Trainer
- Move logic handling the resume_from_checkpoint arg to exp_manager
- Currently disabling the resume_from_checkpoint logic and leaving a TODO comment, since its current behavior is not a desirable user experience. Will make a separate PR to improve this.

Usage
# Add a code snippet demonstrating how to use this
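As a rough illustration of the builder pattern the changelog describes (class and method names here are hypothetical stand-ins, not the PR's actual API), the pretraining examples would go from constructing a Trainer inline to something like:

```python
# Hypothetical sketch of the TrainerBuilder pattern described above;
# names are illustrative, not the PR's actual API.
from dataclasses import dataclass


@dataclass
class TrainerConfig:
    devices: int = 1
    max_steps: int = 1000
    precision: str = "bf16"


class TrainerBuilder:
    """Hides common Trainer setup shared by the pretraining examples."""

    def __init__(self, cfg: TrainerConfig):
        self.cfg = cfg

    def _plugins(self):
        # Subclasses (e.g. a BERT- or T5-specific builder) override this
        # to contribute model-family-specific plugins.
        return []

    def create_trainer(self):
        # Stand-in for constructing pytorch_lightning.Trainer(...);
        # returns a plain dict here so the sketch stays self-contained.
        return {
            "devices": self.cfg.devices,
            "max_steps": self.cfg.max_steps,
            "precision": self.cfg.precision,
            "plugins": self._plugins(),
        }


trainer = TrainerBuilder(TrainerConfig(devices=2)).create_trainer()
```

Each model family (GPT, BERT, T5) subclasses the builder and overrides only the hooks it needs, which is what lets the example scripts shrink to a few lines.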
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information