Upgrade to PyTorch Lightning 2.0 (NVIDIA#6433)
* Upgrade the pytorch lightning version in requirements.
* Initial fixes for PTL 2.0.
* Add further fixes to support lightning 2.0.
* Add replacements for replace_sampler_ddp, resume_from_checkpoint_fit_path and a few occurrences of validation_epoch_end.
* Replace all occurrences of validation_epoch_end with on_validation_epoch_end.
* Replace training_epoch_end and test_epoch_end with on_train_epoch_end and on_test_epoch_end respectively.
* Change logger=None to logger=False in Trainer objects.
* Remove PTL 2.0 deprecated Trainer args from the TrainerConfig dataclass.
* Modify the trainer.precision check and other small edits.
* Replace logger=None with logger=False in the test_ptl_stateless_timer.py Trainer.
* Add default values for args to fix an AttributeError.
* Add the following modifications: 1) Remove the outputs arg from on_validation_epoch_end and on_test_epoch_end and make it an instance attribute of the class. 2) Replace resume_from_checkpoint with ckpt_path as needed. 3) Explicitly set accelerator to 'CPU' in unit tests run on CPU.
* Remove the outputs arg from on_validation_epoch_end and on_test_epoch_end.
* Remove the outputs arg of on_validation_epoch_end in the MultiBinaryAccuracy docstrings.
* Add val and test outputs as instance vars in PunctuationCapitalizationModel and TokenClassificationModel.
* Replace trainer.fit_loop.max_steps with trainer.fit_loop.epoch_loop.max_steps in test_optimizers_schedulers.py.
* Revert an extra space that was mistakenly added.
* Use self.validation_step_outputs and self.test_step_outputs in test_ema.py for uniformity.
* Use self.validation_step_outputs and self.test_step_outputs in test_ptl_stateless_timer.py and check_for_ranks.py for uniformity.
* Add self.validation_step_outputs.clear() and self.test_step_outputs.clear() wherever missing.
* Remove the outputs arg from on_train_epoch_end.
* Remove outputs from on_validation_epoch_end in multi_binary_acc.py.
* Remove output args from on_validation_epoch_end in the docstrings of some ASR files.
* Remove output args from on_validation_epoch_end and clear memory from validation_step_outputs.
* Add on_validation_epoch_end and remove outputs args for NLP models.
* Append the output of validation_step to validation_step_outputs in EncDecClassificationModel.
* Add the following changes: 1) Index self.validation_step_outputs and self.test_step_outputs with dataloader_idx wherever needed. 2) Initialize self.validation_step_outputs and self.test_step_outputs as empty lists and add support for multiple dataloaders if they exist. 3) Remove self.pre_configure_ddp from the NLPDDPStrategy class, as it was removed in PTL 2.0.
* Add the default value dataloader_idx=0 for on_validation_batch_end() in megatron_base_model.py.
* Typecast precision to str in attention.py and utils_funcs.py to avoid a TypeError.
* Add an if-condition check for multiple dataloaders when appending to validation outputs.
* Separate the validation pass so it can be used by both validation_step and test_step.
* Add an if-condition check for multiple dataloaders while appending to test_step_outputs in punctuation_capitalization_model.py.
* Add condition checks for multiple dataloaders based on the type of trainer.val/test_dataloaders or self._validation/test_dl instead of len.
* Comment out Megatron T5 IA3 PP=2 in the CI pipeline due to a dataloader_iter issue with PTL 2.0.
* Modify precision checks to account for 16-mixed and bf16-mixed.
* Append the output of validation/test_step to self.validation/test_step_outputs in CTCG2PModel.
* Modify find_unused_parameters in the g2p_heteronym models: 1) Add find_unused_parameters=True for the DDP strategy in g2p_heteronym_classification_train_and_evaluate.py. 2) Remove the outputs arg in validation/test_step and add instance variables instead in heteronym_classification.py.
* Remove outputs from on_test_epoch_end in DialogueGPTClassificationModel.
* Add validation/test outputs in sgdqa_model and modify dialogue_config.yaml.
* Add the split arg self.test_step_outputs to TextClassificationModel.
* Add test_step_outputs to the dialogue and text classification models.
* Change the condition check for multiple dataloaders: 1) Make ds_item a list in dialogue_config.yaml. 2) Check the len of val/test_dataloaders or validation/test_dl along with a type check for list in sgdqa_model.py while appending outputs of validation/test_step. 3) Check the len of _validation/test_dl when creating self.validation/test_step_outputs in ModelPT and punctuation_capitalization_model.py.
* Add an additional condition for multiple dataloaders: check len(self.trainer.val/test_dataloaders) > 1 along with type(self.trainer.val/test_dataloaders) == list in validation/test_step.
* Add val step outputs and a default value for dataloader_idx: 1) Append the validation_step output to self.validation_step_outputs in MultiLabelIntentSlotClassificationModel. 2) Add a default value for dataloader_idx in on_test_batch_start/end in TimingCallback. 3) Add self.validation/test_step_outputs in BERTQAModel and remove the outputs arg.
* Add val/test_step_outputs to S2SQAModel and GPTQAModel.
* Edit the Jenkinsfile for the bert_pretraining.py test to disable validation as a workaround for a trainer.val_dataloader None error.
* Modify precision to support 16-mixed and bf16-mixed in megatron_gpt_pretraining.py.
* Add ddp_find_unused_parameters_true and remove output args: 1) Add ddp_find_unused_parameters_true for trainer.strategy in self_alignment_pretraining.py, as it has unused parameters. 2) Remove output args and add self.validation/test_step_outputs to validation/test_step in mt_enc_dec_model.py. 3) Comment out tests in the Jenkinsfile that need to be fixed.
* Precision fix in megatron_nmt_training.py for 16-mixed and bf16-mixed.
* Precision fix for megatron_bert_pretraining.py and megatron_bert_model.py.
* Precision fix and validation/test_step_outputs: 1) Account for 16-mixed and bf16-mixed in megatron_retro_mutransfer_pretrain.py and megatron_retro_pretraining.py. 2) Reset ckpt_path for test in enc_dec_nmt.py. 3) Remove outputs args and add validation/test_step_outputs in megatron_retrieval_model.py. 4) Comment out Megatron Bert Pretraining and Resume Training with Pipeline Parallelism, and add back NMT Training Post-LN.
* Precision fix and skip a few failing tests.
* Add missing comment lines in the Jenkinsfile.
* Comment out Jenkins tests and super().on_validation_epoch_end() in megatron_gpt_sft_model.py.
* Several minor Jenkinsfile edits, including commenting out missed lines.
* Fix precision and validation/test outputs: 1) Add a precision fix to account for 16-mixed and bf16-mixed in megatron_t5_pretraining.py. 2) Remove outputs args and append the loss to self.validation/test_step_outputs in megatron_lm_encoder_decoder_model.py. 3) Add back resume_from_checkpoint in megatron_t5_config.yaml. 4) Comment out certain tests in the Jenkinsfile.
* Fix precision and validation/test/predict errors in megatron_t5_prompt_learning.py.
* Precision fixes in all files: 1) Account for 16-mixed and bf16-mixed in megatron_bart_pretraining.py and megatron_t5_seq2seq_finetune.py. 2) Fix a precision typo in all files.
* Fix all CI TTS tests and comment out a few Jenkins tests.
* Combine xx_epoch_end and on_xx_epoch_end: add on_inference_epoch_end to the inference_epoch_end function and keep a single on_validation/test_epoch_end in megatron_finetune_model.py and megatron_gpt_sft_model.py.
* Add a missing comment in the Jenkinsfile.
* Add try/except StopIteration in validation_step for models with dataloader_iter.
* Remove pyyaml from requirements.
* Add try/except for inference_step in megatron_finetune_model.py.
* Remove limit_val_batches for the MockGPTDataset test.
* Add a new self.validation_step_outputs for MegatronGPTSFTModel.
* Initialize self.validation/test_step_outputs in the setup of MegatronGPTSFTModel to handle cases where the dataloaders are not set up in ModelPT, for example while restoring the model.
* Remove resume_from_checkpoint as a trainer arg in conf yaml files.
* Remove resume_from_checkpoint as a trainer arg in the GPT and T5 configs.
* Remove resume_from_checkpoint in duplex_tn_config.yaml.
* Fix typos and unused imports, and refactor code to remove redundant functions.
* Remove commented code in megatron_nmt_model.py.
* Fix overridden functions to match their parent class signatures.
* Prefetch dataloader_iter to prevent a hang for PP>1.
* Override setup() in NLPDDPStrategy to avoid a hang during predict with PP>1.
* Uncomment tests in the Jenkinsfile.
* Add '16' to precision checks and other minor fixes.
* Clear validation/test_step_outputs with dataloader_idx for multiple dataloaders.
* Minor edits.
* Modify precision checks to avoid indexing.
* Remove self.validation_step_outputs_sft and add dataloader_idx to clear outputs.
* Reference the checkpoint with trainer.ckpt_path.
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci).
* Add _prefetch to NLPModel and minor fixes.
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci).
* Add limit_val_batches in the Jenkinsfile for NMT: 1) Add trainer.limit_val_batches in Megatron NMT Training TP=2. 2) Remove an unused import in ModelPT.

---------

Signed-off-by: Abhishree <abhishreetm@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
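Many of the commits above apply the same PTL 2.0 migration pattern: epoch-end hooks no longer receive an `outputs` argument, so each model accumulates its `validation_step` results in an instance-level list and consumes (then clears) it in `on_validation_epoch_end`. A minimal framework-free sketch of that pattern (the `MyModel` class and its toy "loss" are hypothetical; the real models subclass `LightningModule`):

```python
class MyModel:
    """Hypothetical stand-in for a LightningModule, showing the PTL 2.0
    step-outputs pattern: accumulate per-step results yourself."""

    def __init__(self):
        # PTL 2.0: on_validation_epoch_end receives no `outputs`,
        # so the model keeps its own list of step results.
        self.validation_step_outputs = []

    def validation_step(self, batch, batch_idx):
        loss = sum(batch) / len(batch)  # placeholder "loss"
        self.validation_step_outputs.append(loss)
        return loss

    def on_validation_epoch_end(self):
        # Replaces validation_epoch_end(self, outputs) from PTL 1.x.
        avg = sum(self.validation_step_outputs) / len(self.validation_step_outputs)
        # Clear the list to free memory, as done throughout the commits above.
        self.validation_step_outputs.clear()
        return avg
```

The same shape applies to `test_step` / `on_test_epoch_end` with `self.test_step_outputs`.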
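Several commits extend that pattern to multiple dataloaders: the outputs become one list per dataloader, `validation_step` appends under `dataloader_idx`, and the epoch-end hook clears each sub-list separately. A hypothetical sketch of the multi-dataloader variant (again framework-free; the class name and "loss" are illustrative):

```python
class MultiLoaderModel:
    """Hypothetical sketch: one output list per validation dataloader,
    indexed by dataloader_idx, as in the multi-dataloader commits above."""

    def __init__(self, num_val_dataloaders):
        self.num_val_dataloaders = num_val_dataloaders
        if num_val_dataloaders > 1:
            # One empty list per dataloader.
            self.validation_step_outputs = [[] for _ in range(num_val_dataloaders)]
        else:
            self.validation_step_outputs = []

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        loss = float(sum(batch))  # placeholder "loss"
        if self.num_val_dataloaders > 1:
            self.validation_step_outputs[dataloader_idx].append(loss)
        else:
            self.validation_step_outputs.append(loss)
        return loss

    def on_validation_epoch_end(self):
        if self.num_val_dataloaders > 1:
            means = [sum(o) / len(o) for o in self.validation_step_outputs if o]
            # Clear per dataloader_idx, keeping the outer list structure.
            for o in self.validation_step_outputs:
                o.clear()
            return means
        mean = sum(self.validation_step_outputs) / len(self.validation_step_outputs)
        self.validation_step_outputs.clear()
        return mean
```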
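The recurring "precision fix" commits stem from PTL 2.0 reporting precision as strings such as `"16-mixed"` and `"bf16-mixed"` rather than `16` or `"16"`, which is why the changes typecast precision to `str` and compare against whole strings instead of indexing into the value. A hypothetical helper (not part of the PR) illustrating that check:

```python
def is_half_precision(precision):
    """Hypothetical helper: True for any 16-bit precision setting.

    PTL 2.0 uses strings like '16-mixed' and 'bf16-mixed', while older
    code may pass the int 16, so the value is cast to str first and
    matched against whole strings rather than sliced (e.g. precision[:2]),
    which would raise a TypeError on an int or misread 'bf16-mixed'.
    """
    return str(precision) in ("16", "16-mixed", "bf16", "bf16-mixed")
```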
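The Trainer-argument changes scattered through the commits follow a small set of renames: `replace_sampler_ddp` became `use_distributed_sampler`, `resume_from_checkpoint` is no longer a Trainer arg and is instead passed as `trainer.fit(model, ckpt_path=...)`, and logging is disabled with `logger=False` rather than `logger=None`. A hypothetical helper (the `migrate_trainer_kwargs` function is illustrative, not part of the PR) summarizing those moves as a dict transform:

```python
def migrate_trainer_kwargs(kwargs):
    """Hypothetical helper illustrating the PTL 1.x -> 2.0 Trainer-arg
    changes applied throughout the commits above. Returns the migrated
    Trainer kwargs and the ckpt_path that now goes to trainer.fit()."""
    kwargs = dict(kwargs)
    # replace_sampler_ddp was renamed to use_distributed_sampler.
    if "replace_sampler_ddp" in kwargs:
        kwargs["use_distributed_sampler"] = kwargs.pop("replace_sampler_ddp")
    # resume_from_checkpoint is no longer a Trainer arg;
    # pass it to trainer.fit(model, ckpt_path=...) instead.
    ckpt_path = kwargs.pop("resume_from_checkpoint", None)
    # logger=None no longer disables logging; logger=False does.
    if kwargs.get("logger", False) is None:
        kwargs["logger"] = False
    return kwargs, ckpt_path
```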