Call ckpt_to_weights_subdir from MegatronCheckpointIO #10897

Merged
merged 36 commits into from
Nov 5, 2024

Conversation

ashors1
Collaborator

@ashors1 ashors1 commented Oct 16, 2024

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
artbataev and others added 2 commits October 16, 2024 04:21
Signed-off-by: artbataev <artbataev@users.noreply.github.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
@ashors1 ashors1 added Run CICD and removed Run CICD labels Oct 16, 2024
@@ -8,7 +8,7 @@
from filelock import FileLock, Timeout
from pytorch_lightning.trainer.states import TrainerFn

-from nemo.lightning.ckpt_utils import ckpt_to_context_subdir, ckpt_to_weights_subdir
+from nemo.lightning.ckpt_utils import ckpt_to_context_subdir

Check notice

Code scanning / CodeQL

Cyclic import Note

Import of module
nemo.lightning.ckpt_utils
begins an import cycle.
nemo/lightning/io/pl.py Fixed
Signed-off-by: ashors1 <ashors@nvidia.com>
@ashors1 ashors1 added Run CICD and removed Run CICD labels Oct 16, 2024
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
@ashors1 ashors1 added Run CICD and removed Run CICD labels Oct 16, 2024
nemo/lightning/io/pl.py Fixed
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
@ashors1 ashors1 added Run CICD and removed Run CICD labels Oct 29, 2024
Contributor

[🤖]: Hi @ashors1 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully

So it might be time to merge this PR or get some approvals

I'm just a bot so I'll leave it to you what to do next.

//cc @pablo-garay @ko3n1g

@ashors1 ashors1 added Run CICD and removed Run CICD labels Oct 30, 2024
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
@@ -37,7 +37,7 @@
from torch import nn
from typing_extensions import Self, override

-from nemo.lightning.ckpt_utils import ckpt_to_dir
+from nemo.lightning.ckpt_utils import WEIGHTS_PATH, ckpt_to_dir

Check notice

Code scanning / CodeQL

Cyclic import Note

Import of module
nemo.lightning.ckpt_utils
begins an import cycle.
@ashors1 ashors1 added Run CICD and removed Run CICD labels Oct 30, 2024
@cuichenx cuichenx self-assigned this Nov 4, 2024
@ashors1 ashors1 added Run CICD and removed Run CICD labels Nov 4, 2024
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx
Collaborator

cuichenx commented Nov 5, 2024

PEFT test passed: https://github.com/NVIDIA/NeMo/actions/runs/11672168085/job/32500417335. Launching full CI now.

Contributor

github-actions bot commented Nov 5, 2024

[🤖]: Hi @ashors1 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully

So it might be time to merge this PR or get some approvals

I'm just a bot so I'll leave it to you what to do next.

//cc @pablo-garay @ko3n1g

@akoumpa akoumpa merged commit fb00406 into main Nov 5, 2024
303 of 305 checks passed
@akoumpa akoumpa deleted the ashors/ckpt-subdirs branch November 5, 2024 18:01
lilyw97 pushed a commit to lilyw97/NeMo that referenced this pull request Nov 13, 2024
* locate weights path within MegatronCheckpointIO

Signed-off-by: ashors1 <ashors@nvidia.com>

* small refactor

Signed-off-by: ashors1 <ashors@nvidia.com>

* remove another instance of ckpt_to_weights_subdir

Signed-off-by: ashors1 <ashors@nvidia.com>

* move ckpt_to_weights_subdir

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* Apply isort and black reformatting

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

* add weights path in save_checkpoint

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix circular import

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* handle saving in ckpt_to_weights_subdir

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix minor typo

Signed-off-by: ashors1 <ashors@nvidia.com>

* bug fixes

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix undefined variable

Signed-off-by: ashors1 <ashors@nvidia.com>

* move function

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* fix adapter meta file path

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* fix mixtral test

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix mixtral test

Signed-off-by: ashors1 <ashors@nvidia.com>

* use function for weights subdir

Signed-off-by: Chen Cui <chcui@nvidia.com>

* address comments

Signed-off-by: ashors1 <ashors@nvidia.com>

* move asserts

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix undefined vars

Signed-off-by: ashors1 <ashors@nvidia.com>

* bug fix

Signed-off-by: ashors1 <ashors@nvidia.com>

---------

Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Signed-off-by: artbataev <artbataev@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: ashors1 <ashors1@users.noreply.github.com>
Co-authored-by: artbataev <artbataev@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
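
The commit messages above mention locating the weights path inside MegatronCheckpointIO and handling the saving case in ckpt_to_weights_subdir. The following is only a minimal sketch of that general idea, not the code merged in this PR: the function name `resolve_weights_subdir`, the constant `WEIGHTS_DIRNAME`, the subdirectory name "weights", and the fallback to the checkpoint root are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Union

# Hypothetical subdirectory name used only in this sketch.
WEIGHTS_DIRNAME = "weights"


def resolve_weights_subdir(ckpt_dir: Union[str, Path], is_saving: bool) -> Path:
    """Return the directory expected to hold model weights.

    Hypothetical helper for illustration; not NeMo's implementation.
    """
    base = Path(ckpt_dir)
    if base.name == WEIGHTS_DIRNAME:
        # The caller already passed the weights subdirectory.
        return base
    candidate = base / WEIGHTS_DIRNAME
    if is_saving or candidate.is_dir():
        # When saving, always target the subdirectory; when loading,
        # use it only if it exists on disk.
        return candidate
    # Assumed older layout: weights sit directly in the checkpoint directory.
    return base
```

Resolving the path in one place, inside the checkpoint I/O layer, means callers can pass the checkpoint root for both saving and loading without appending the subdirectory themselves.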
HuiyingLi pushed a commit to HuiyingLi/NeMo that referenced this pull request Nov 15, 2024
yashaswikarnati pushed a commit that referenced this pull request Nov 21, 2024