Virtual pipeline parallel support for LoRA in NLPAdapterModelMixin #11128

vysarge · 2024-11-01T03:56:18Z

What does this PR do?

Fixes LoRA handling when model.virtual_pipeline_model_parallel_size > 1 is combined with model.peft.peft_scheme=lora. Before this fix, adapters were attached only to the first virtual pipeline chunk.

Collection: nlp

Changelog

- Replace the parameter-based check for the first pipeline stage with parallel_state.is_pipeline_first_stage(), which accounts for the virtual pipeline stage.
- Update _get_all_keys, _check_and_add_peft_cfg, setup_optimizer_param_groups, set_tunable_base_params, tie_weights, get_peft_state_dict, and load_state_dict so that LoRA changes apply to the layers in every chunk rather than only those in the first chunk, and so that checkpoints are written and read correctly when virtual pipeline parallelism is used.
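The chunk-related part of the changelog can be pictured with a small sketch. It assumes, for illustration only, that a model built with virtual pipeline parallelism is held as a list of chunks on each rank. `Chunk`, `attach_lora`, `adapted_layers`, and `make_chunks` are hypothetical names and are not part of NLPAdapterModelMixin or Megatron; the layer names are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for one virtual pipeline model chunk."""

    layer_names: List[str]
    adapters: Dict[str, str] = field(default_factory=dict)


def attach_lora(chunks: List[Chunk], first_chunk_only: bool = False) -> None:
    """Attach a placeholder adapter to each layer of the selected chunks."""
    targets = chunks[:1] if first_chunk_only else chunks
    for chunk in targets:
        for name in chunk.layer_names:
            chunk.adapters[name] = f"lora:{name}"


def adapted_layers(chunks: List[Chunk]) -> List[str]:
    """List every layer that has an adapter, across all chunks."""
    return [name for chunk in chunks for name in chunk.adapters]


def make_chunks() -> List[Chunk]:
    # Two chunks on one rank, as with a virtual pipeline size of 2 (assumed layout).
    return [Chunk(["layer_0", "layer_1"]), Chunk(["layer_4", "layer_5"])]


before = make_chunks()
attach_lora(before, first_chunk_only=True)  # behaviour the PR describes as the bug
after = make_chunks()
attach_lora(after)  # behaviour the PR describes as the fix

print(adapted_layers(before))
print(adapted_layers(after))
```

In this toy layout, the first call leaves the second chunk without adapters, which mirrors the symptom described above; the second call covers every chunk.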

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this
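As a minimal sketch, the two settings named in the description can be written as dotted overrides and expanded into a nested config. The value 2 is an assumption (the description only requires a value greater than 1), `apply_overrides` and `uses_vp_with_lora` are hypothetical helpers, and the other settings a real training run needs are not shown in this PR and are omitted here.

```python
from typing import Any, Dict, List


def apply_overrides(overrides: List[str]) -> Dict[str, Any]:
    """Expand dotted key=value strings into a nested dict."""
    cfg: Dict[str, Any] = {}
    for item in overrides:
        dotted_key, value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = cfg
        for part in parents:
            node = node.setdefault(part, {})
        # Store purely numeric values as integers, everything else as strings.
        node[leaf] = int(value) if value.isdigit() else value
    return cfg


def uses_vp_with_lora(cfg: Dict[str, Any]) -> bool:
    """True when the config combines virtual pipeline parallelism with LoRA."""
    model = cfg.get("model", {})
    vp_size = model.get("virtual_pipeline_model_parallel_size")
    scheme = model.get("peft", {}).get("peft_scheme")
    return vp_size is not None and vp_size > 1 and scheme == "lora"


overrides = [
    "model.virtual_pipeline_model_parallel_size=2",  # assumed value; any value > 1 applies
    "model.peft.peft_scheme=lora",
]
cfg = apply_overrides(overrides)
print(uses_vp_with_lora(cfg))
```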

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

erhoo82 · 2024-11-01T05:55:32Z

nemo/collections/nlp/parts/mixins/nlp_adapter_mixins.py

+            and "model_0" in state_dict
+            and len(state_dict["model_0"]) == 0


Can you explain the logic of this code in a comment?
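One possible reading of the two quoted lines, offered as an assumption and not as the author's explanation: with virtual pipeline parallelism the state dict may hold one entry per chunk under keys such as model_0, and the condition detects the case where that entry exists but carries no weights. The sketch below isolates that test. `is_empty_chunked_state_dict` is a hypothetical helper, the `vp_size` guards are assumed because the rest of the condition is not visible in the excerpt, and the `model_1` key in the usage lines is likewise assumed.

```python
from typing import Any, Dict, Optional


def is_empty_chunked_state_dict(state_dict: Dict[str, Any], vp_size: Optional[int]) -> bool:
    """Return True when a per-chunk state dict exists but its first chunk is empty."""
    return (
        vp_size is not None  # assumed guard; not part of the quoted lines
        and vp_size > 1  # assumed guard; not part of the quoted lines
        and "model_0" in state_dict
        and len(state_dict["model_0"]) == 0
    )


# Per-chunk entries with no weights: the condition holds.
print(is_empty_chunked_state_dict({"model_0": {}, "model_1": {}}, vp_size=2))
# A chunk entry that carries a weight: the condition does not hold.
print(is_empty_chunked_state_dict({"model_0": {"weight": 1.0}}, vp_size=2))
```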

…11128)
* Update NLPAdapterModelMixin to handle model structure for virtual pipeline parallel + LoRA
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Clean up assert guard
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Clean up ValueError raise
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Apply isort and black reformatting
  Signed-off-by: vysarge <vysarge@users.noreply.github.com>
* documentation
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
---------
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>

…11128) (#11135)
* Update NLPAdapterModelMixin to handle model structure for virtual pipeline parallel + LoRA
* Clean up assert guard
* Clean up ValueError raise
* Apply isort and black reformatting
* documentation
---------
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>

…VIDIA#11128)
* Update NLPAdapterModelMixin to handle model structure for virtual pipeline parallel + LoRA
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Clean up assert guard
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Clean up ValueError raise
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Apply isort and black reformatting
  Signed-off-by: vysarge <vysarge@users.noreply.github.com>
* documentation
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
---------
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Hainan Xu <hainanx@nvidia.com>

vysarge added 3 commits October 31, 2024 20:38

Update NLPAdapterModelMixin to handle model structure for virtual pip…

680b1ce

…eline parallel + LoRA Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

Clean up assert guard

354daf5

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

Clean up ValueError raise

e40ded5

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

github-actions bot added the NLP label Nov 1, 2024

Apply isort and black reformatting

3e8f490

Signed-off-by: vysarge <vysarge@users.noreply.github.com>

vysarge added Run CICD r2.0.0 labels Nov 1, 2024

vysarge requested a review from cuichenx November 1, 2024 04:11

erhoo82 reviewed Nov 1, 2024

documentation

96b4e47

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

vysarge added Run CICD and removed Run CICD labels Nov 1, 2024

pablo-garay enabled auto-merge (squash) November 1, 2024 16:51

erhoo82 approved these changes Nov 1, 2024

pablo-garay merged commit 2c42fc3 into NVIDIA:main Nov 1, 2024
159 of 160 checks passed

ko3n1g mentioned this pull request Nov 1, 2024

Cherry pick Virtual pipeline parallel support for LoRA in NLPAdapterModelMixin (11128) into r2.0.0 #11135

Merged
