Stateless timer fix for PTL 1.6 #3925

Merged: 11 commits, Apr 4, 2022

Conversation

MaximumEntropy
Contributor

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

What does this PR do?

Fixes an issue with restoring the stateless timer from a checkpoint, caused by a change to the Timer API in PyTorch Lightning (PTL) 1.6.

Collection: all

Changelog

  • Override save_state_dict() and load_state_dict() so that the timer's elapsed-time information is not restored from a checkpoint.
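
A minimal sketch of the idea behind this change, not the code merged in this PR. `BaseTimer` is a hypothetical stand-in for a timer callback that checkpoints its elapsed time; it is not PTL's `Timer`, and the method names `state_dict()` / `load_state_dict()` used on it are an assumption made for illustration. The subclass shows how overriding both hooks makes the time budget apply per run instead of across restarts.

```python
from typing import Any, Dict


class BaseTimer:
    """Hypothetical stand-in for a timer callback that checkpoints elapsed time."""

    def __init__(self, budget_seconds: float) -> None:
        self.budget_seconds = budget_seconds
        self.elapsed_seconds = 0.0

    def tick(self, seconds: float) -> None:
        # Record time spent in the current run.
        self.elapsed_seconds += seconds

    def should_stop(self) -> bool:
        # Training should stop once the budget is used up.
        return self.elapsed_seconds >= self.budget_seconds

    def state_dict(self) -> Dict[str, Any]:
        # Stateful behaviour: elapsed time is written into the checkpoint.
        return {"elapsed_seconds": self.elapsed_seconds}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Stateful behaviour: elapsed time carries over to the resumed run.
        self.elapsed_seconds = state_dict.get("elapsed_seconds", 0.0)


class StatelessTimer(BaseTimer):
    """Timer whose budget applies per run, not across restarts."""

    def state_dict(self) -> Dict[str, Any]:
        # Save nothing, so no time information ends up in the checkpoint.
        return {}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Ignore any time information found in an existing checkpoint.
        return None


# A run that used its whole budget, then checkpointed.
first_run = BaseTimer(budget_seconds=10.0)
first_run.tick(10.0)
checkpoint = first_run.state_dict()

# The stateful timer would stop immediately on resume; the stateless one would not.
resumed_stateful = BaseTimer(budget_seconds=10.0)
resumed_stateful.load_state_dict(checkpoint)
resumed_stateless = StatelessTimer(budget_seconds=10.0)
resumed_stateless.load_state_dict(checkpoint)
```

With the stateful stand-in, a job resumed from `checkpoint` has already exhausted its budget, whereas the stateless variant starts counting from zero again. That per-run behaviour is what the override in this changelog entry is meant to preserve.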

Usage

N/A

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

lgtm-com bot commented Apr 4, 2022

This pull request introduces 2 alerts when merging da1fe22 into cfbb5f9 - view on LGTM.com

new alerts:

  • 2 for Unused import

MaximumEntropy and others added 10 commits April 4, 2022 12:12
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: ericharper <complex451@gmail.com>

lgtm-com bot commented Apr 4, 2022

This pull request fixes 2 alerts when merging 3453f12 into 5d2e6dd - view on LGTM.com

fixed alerts:

  • 2 for Unused import

Collaborator

ericharper left a comment

LGTM. Thanks!

ericharper merged commit 5979013 into r1.8.0 Apr 4, 2022
ericharper deleted the stateless_timer_fix branch April 4, 2022 21:35
ericharper added a commit that referenced this pull request Apr 8, 2022
* Stateless timer fix for PTL 1.6

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Stateless timer PTL test

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix year

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Remove unused imports

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* GPU test

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* clean import

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: ericharper <complex451@gmail.com>
titu1994 added a commit that referenced this pull request Apr 9, 2022
* update version

Signed-off-by: ericharper <complex451@gmail.com>

* Stateless timer fix for PTL 1.6 (#3925)

* Stateless timer fix for PTL 1.6

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Stateless timer PTL test

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix year

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Remove unused imports

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* GPU test

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* clean import

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: ericharper <complex451@gmail.com>

* fix save_best missing chpt bug, update for setup_tokenizer() changes (#3932)

* fix save_best missing chpt bug, update for setup_tokenizer() changes

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* style fix

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* Fix divide by world size (#3941)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* remove old doc (#3946)

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* Fix issues with librosa deprecations (#3950)

Signed-off-by: smajumdar <titu1994@gmail.com>

* Fix issue with Segfault in ASR models (#3956)

* Fix issue with Segfault in ASR models

Signed-off-by: smajumdar <titu1994@gmail.com>

* Add docstring

Signed-off-by: smajumdar <titu1994@gmail.com>

* Fix notebook bugs for branch r1.8.0 (#3948)

* load the model from ngc

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix all biomegatron notebook

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix the typos

Signed-off-by: Yi Dong <doyend@gmail.com>

* remove output

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix isort

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix merge error

Signed-off-by: Yi Dong <doyend@gmail.com>

* change ntpath for isort workaround

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix unit test

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix ci

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix ci bert pretraining

Signed-off-by: Yi Dong <doyend@gmail.com>

* make it compatible with main

Signed-off-by: Yi Dong <doyend@gmail.com>

* add the teste for biomegatron ner

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix argument

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix usablity issue

Signed-off-by: Yi Dong <doyend@gmail.com>

* work around

Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Yi Dong <doyend@gmail.com>
Co-authored-by: Eric Harper <complex451@gmail.com>

* Fix global batch fit loop (#3936)

* add lightning module hooks for global batch

Signed-off-by: ericharper <complex451@gmail.com>

* clean scripts

Signed-off-by: ericharper <complex451@gmail.com>

* style

Signed-off-by: ericharper <complex451@gmail.com>

* remove unused import

Signed-off-by: ericharper <complex451@gmail.com>

* DP=1 fix

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* set num dataset workers to 2

Signed-off-by: ericharper <complex451@gmail.com>

* update validation_loop with GlobalDataFetcher

Signed-off-by: ericharper <complex451@gmail.com>

* add test global data fetcher

Signed-off-by: ericharper <complex451@gmail.com>

* Drop last for test ds

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix test epoch end

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix eval

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix reconfigure microbatch in the complete method

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* add comments

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Set init consumed samples

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* fix shuffle

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* add save_restore_connector arg

Signed-off-by: ericharper <complex451@gmail.com>

* Fix padding for labels and loss mask

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* GLUE/XNLI CI tests

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* limit val batches in hydra fix

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Restart CI

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix unittest

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Update max_epochs on megatron configs (#3958)

* update config

Signed-off-by: ericharper <complex451@gmail.com>

* update config

Signed-off-by: ericharper <complex451@gmail.com>

* update branch

Signed-off-by: ericharper <complex451@gmail.com>

* update version

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
Co-authored-by: Yi Dong <doyend@gmail.com>
ericharper added a commit that referenced this pull request Apr 20, 2022
* Stateless timer fix for PTL 1.6

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Stateless timer PTL test

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix year

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Remove unused imports

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* GPU test

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* clean import

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: ericharper <complex451@gmail.com>
ericharper added a commit that referenced this pull request Apr 20, 2022
* update version

Signed-off-by: ericharper <complex451@gmail.com>

* Stateless timer fix for PTL 1.6 (#3925)

* Stateless timer fix for PTL 1.6

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Stateless timer PTL test

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix year

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Remove unused imports

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* GPU test

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* clean import

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: ericharper <complex451@gmail.com>

* Fix issues with librosa deprecations (#3950)

Signed-off-by: smajumdar <titu1994@gmail.com>

* Fix notebook bugs for branch r1.8.0 (#3948)

* load the model from ngc

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix all biomegatron notebook

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix the typos

Signed-off-by: Yi Dong <doyend@gmail.com>

* remove output

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix isort

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix merge error

Signed-off-by: Yi Dong <doyend@gmail.com>

* change ntpath for isort workaround

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix unit test

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix ci

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix ci bert pretraining

Signed-off-by: Yi Dong <doyend@gmail.com>

* make it compatible with main

Signed-off-by: Yi Dong <doyend@gmail.com>

* add the teste for biomegatron ner

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix argument

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix usablity issue

Signed-off-by: Yi Dong <doyend@gmail.com>

* work around

Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Yi Dong <doyend@gmail.com>
Co-authored-by: Eric Harper <complex451@gmail.com>

* Fix global batch fit loop (#3936)

* add lightning module hooks for global batch

Signed-off-by: ericharper <complex451@gmail.com>

* clean scripts

Signed-off-by: ericharper <complex451@gmail.com>

* style

Signed-off-by: ericharper <complex451@gmail.com>

* remove unused import

Signed-off-by: ericharper <complex451@gmail.com>

* DP=1 fix

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* set num dataset workers to 2

Signed-off-by: ericharper <complex451@gmail.com>

* update validation_loop with GlobalDataFetcher

Signed-off-by: ericharper <complex451@gmail.com>

* add test global data fetcher

Signed-off-by: ericharper <complex451@gmail.com>

* Drop last for test ds

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix test epoch end

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix eval

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix reconfigure microbatch in the complete method

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* add comments

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Set init consumed samples

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* fix shuffle

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* add save_restore_connector arg

Signed-off-by: ericharper <complex451@gmail.com>

* Fix padding for labels and loss mask

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* GLUE/XNLI CI tests

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* limit val batches in hydra fix

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Restart CI

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix unittest

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Exports 22.03 war (#3957)

* Fixed fastpitch for 22.03

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* cleanup

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Restored mask expansion; added WAR for test container images

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* style

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Refactor restorefrom (#3927)

* update package info (#3926)

Signed-off-by: ericharper <complex451@gmail.com>

* Refactor restore_from

Signed-off-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>

* Move export related python files to scripts/export/

Signed-off-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>

* Return state dict after modification function

* Remove Megatron legacy parameter in common.py restore_from function

Signed-off-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>

* ability to set log_predictions to false (#3929)

* Bumping Python version

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>

* fixing style

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>

* load the model from ngc

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix all biomegatron notebook

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix the typos

Signed-off-by: Yi Dong <doyend@gmail.com>

* remove output

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix isort

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix merge error

Signed-off-by: Yi Dong <doyend@gmail.com>

* change ntpath for isort workaround

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix unit test

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix ci

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix ci bert pretraining

Signed-off-by: Yi Dong <doyend@gmail.com>

* Rearrage export files; Style fix; Extend legacy MegatronBert conversion to NLP models nemo version updation

* Glu activation variants (#3951)

* Temp

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Add reglu and swiglu activations

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style on unrelated file

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* CI changes to test activations

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix unused import

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style fix beacuse of merge from main

Signed-off-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>

* make it compatible with main

Signed-off-by: Yi Dong <doyend@gmail.com>

* add the teste for biomegatron ner

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix argument

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix usablity issue

Signed-off-by: Yi Dong <doyend@gmail.com>

* FastPitch FT notebook - Improving Speech Quality clarifications (#3954)

* FastPitch FT notebook - Improving Speech Quality clarifications

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Add pynini dependency install to FastPitch FT notebook

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Pin pynini install for FastPitch FT tutorial

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* work around

Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Eric Harper <complex451@gmail.com>
Co-authored-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>
Co-authored-by: Dima Rekesh <bmwshop@gmail.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Co-authored-by: Yi Dong <doyend@gmail.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
Co-authored-by: Boris Fomitchev <borisfom@users.noreply.github.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Jocelyn <jocelynh@nvidia.com>

* Bump TTS deprecation version to 1.9 (#3955)

* bump deprecation version

Signed-off-by: Jason <jasoli@nvidia.com>

* update talknet depre

Signed-off-by: Jason <jasoli@nvidia.com>

* added conformer for zh. (#3970)

Signed-off-by: Vahid <vnoroozi@nvidia.com>

* Add pinned pynini and scipy installs to TTS training tutorial (#3967)

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Fix variable name and move models to CPU in Change partition (#3972)

* fixes

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

* add CI

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>

* fix misconfiguration (#3975)

Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Yi Dong <doyend@gmail.com>

* Fix NMT variable passing bug (#3985)

* fix

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

* stylefix

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

* Compatability override to load_state_dict for old TTS checkpoints (#3978)

* Compatability override to load_state_dict for old TTS checkpoints

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Tacotron2 training notebook fix - add GPU argument

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Add hann window override warning for old model loading

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Notebook Bug Fixes for r1.8.0 (#3989)

* Made config related bug fixes

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* Fixed cfg.get syntax

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* Fix compat override for TalkNet Aligner (#3993)

* Fix compatibility override for TalkNet Aligner

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Remove extraneous logging import

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* docs fixes (#3987)

* docs fixes

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* rename files in docs

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* docs improvement

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* arg renamed

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* Fix nemo megatron restore with artifacts (#3997)

* update config_path in register_artifact

Signed-off-by: ericharper <complex451@gmail.com>

* fix register_artifact calls

Signed-off-by: ericharper <complex451@gmail.com>

* fix register_artifact calls

Signed-off-by: ericharper <complex451@gmail.com>

* update log messages to include merges file

Signed-off-by: ericharper <complex451@gmail.com>

* add default prompts to config

Signed-off-by: ericharper <complex451@gmail.com>

* Fixes val_check_interval, skip loading train data during eval (#3968)

* Change stage check

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix bugs in megatron t5 glue eval scripts

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Fix reconfigure

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Change check

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix hasattr

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix typo in cfg structure

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Update megatron t5 glue eval config file

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Reconfigure to avoid drop last

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix for train step reconfigure as well

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Update megatron t5 glue eval config file drop_last to False

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* limit test batches

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>

* LogProb calculation performance fix (#3984)

* performance fix for logprob computation

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix redandant assign

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix bug to add gather from TP workers

Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Yi Dong <doyend@gmail.com>

* Fix link issues in export example notebook and fix pretrained model info for MegatronBert (#4004)

Signed-off-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>

Co-authored-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>

* Fix single GPU training issue + change deprecated Lightning args (#4010)

* change vars

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

* style fix

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

* Fix P-Tune T5 model (#4001)

* fix ptune t5

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix ci test

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix the ci fail because of the order problem

Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Yi Dong <doyend@gmail.com>
Co-authored-by: Eric Harper <complex451@gmail.com>

* Megatron work-arounds (#3998)

* WAR around Apex issue, and making sure output is FP32

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Fixing merge issues; moving dummy Trainer; adding float() casts

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Fixing ColumnParallelLinear call

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Cleanup

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Cleanup#2

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* fix the broadcast shape mismatch (#4017)

Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Yi Dong <doyend@gmail.com>
Co-authored-by: Eric Harper <complex451@gmail.com>

* add known issues (#4024)

Signed-off-by: ericharper <complex451@gmail.com>

* update readme with conda env setup instructions

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* update package info

Signed-off-by: ericharper <complex451@gmail.com>

* update branch

Signed-off-by: ericharper <complex451@gmail.com>

* update package info

Signed-off-by: ericharper <complex451@gmail.com>

* revert apex guard removal

Signed-off-by: ericharper <complex451@gmail.com>

* revert --language to --lang

Signed-off-by: ericharper <complex451@gmail.com>

* fix apex guard

Signed-off-by: ericharper <complex451@gmail.com>

* remove set_trace

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* fix apex guard

Signed-off-by: ericharper <complex451@gmail.com>

* remove unreachable statement

Signed-off-by: ericharper <complex451@gmail.com>

* remove duplicate lines

Signed-off-by: ericharper <complex451@gmail.com>

* remove duplicate lines

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
Co-authored-by: Yi Dong <doyend@gmail.com>
Co-authored-by: Boris Fomitchev <borisfom@users.noreply.github.com>
Co-authored-by: Ramanathan Arunachalam <ramanathan.arun@rutgers.edu>
Co-authored-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>
Co-authored-by: Dima Rekesh <bmwshop@gmail.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Co-authored-by: Jocelyn <jocelynh@nvidia.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Co-authored-by: Vahid Noroozi <VahidooX@users.noreply.github.com>
Co-authored-by: Abhinav Khattar <aklife97@gmail.com>
Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Co-authored-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>