
feat(activation_checkpointing): add non_reentrant_checkpoint to support inputs require no grad #4118

Merged 27 commits into microsoft:master on Aug 29, 2023

Conversation

@hughpu (Contributor) commented Aug 9, 2023

The added function is a union of torch.utils.checkpoint._checkpoint_without_reentrant and CheckpointFunction in the checkpointing module.

  • This aims to solve the backpropagation error raised when none of the inputs require grad, such as the first layer in normal training, LoRA training up to the layer where LoRA is injected, traditional finetuning with only the last layer trainable, etc. (see the sketch after this list).
  • _checkpoint_without_reentrant has already been implemented in PyTorch for a while; the solution is stable in most cases, except for JIT script modules.
  • It can help solve the issue that is currently hacked around by deepspeed.runtime.pipe.module.PipelineModule._is_checkpointable.
  • A unit test was added and proven to work; it was also tested in a pipeline-module LoRA tuning project as activation_checkpoint_func, without specifying checkpointable_layers.
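A minimal sketch of the failure mode, shown with PyTorch's public checkpoint API rather than the DeepSpeed wrapper added by this PR; the toy layers are illustrative, not from the PR:

```python
import torch
from torch.utils.checkpoint import checkpoint

# A checkpointed segment with trainable parameters whose *inputs* carry no
# grad (an integer id tensor can never require grad).
layer = torch.nn.Sequential(torch.nn.Embedding(10, 4), torch.nn.Linear(4, 4))
ids = torch.randint(0, 10, (2, 3))

# Reentrant checkpoint: the forward pass runs under no_grad, so autograd only
# sees the inputs; none require grad, the output gets no grad_fn, and the
# trainable parameters inside `layer` would receive no gradients.
out = checkpoint(layer, ids, use_reentrant=True)
print(out.requires_grad)  # False -> out.sum().backward() would raise here

# Non-reentrant checkpoint (the behavior this PR ports into DeepSpeed as
# non_reentrant_checkpoint): implemented with saved-tensor hooks, so the graph
# through the internal parameters survives even though no input requires grad.
out = checkpoint(layer, ids, use_reentrant=False)
out.sum().backward()  # works; layer parameters receive gradients
print(layer[1].weight.grad is not None)  # True
```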

hughpu and others added 15 commits August 7, 2023 19:20

* Pass correct node size
* formatting
* add deepspeed chat arxiv report
* add zeroquant v2 and fp
* add selective enhancement
* add ignore for 'Youn' in spell checker

Co-authored-by: Connor Holmes <development@cmikeh2.me>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
@hughpu (Contributor, Author) commented Aug 9, 2023

@microsoft-github-policy-service agree

@hughpu (Contributor, Author) commented Aug 19, 2023

@tjruwase @mrwyattii may I know whether there is any chance this PR can get some feedback? It has been pending for two weeks without any review.

  • This PR is implemented as an extension, with no modification to existing code or features; it is just a new option for the checkpointing function. Not harmful.
  • It is not something brand new; it is essentially a re-implementation of PyTorch's non-reentrant checkpoint feature.
  • This can help solve issues such as: 1) the first stage, containing the embedding, cannot be checkpointed; 2) PEFT tuning with the first stages untrainable causes output tensors that require no grad, losing grad_fn, etc. (see the usage sketch below).
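A hedged usage sketch for the pipeline case, assuming the new function is importable from DeepSpeed's activation checkpointing module as named in the PR title; the layer list, sizes, and stage count are placeholders:

```python
import torch
import deepspeed
from deepspeed.pipe import PipelineModule
from deepspeed.runtime.activation_checkpointing.checkpointing import non_reentrant_checkpoint

deepspeed.init_distributed()  # PipelineModule needs an initialized process group

# Placeholder stages: a frozen embedding stage feeding trainable layers, the
# case where the first stage's outputs would otherwise require no grad.
layers = [
    torch.nn.Embedding(1000, 64).requires_grad_(False),
    torch.nn.Linear(64, 64),
    torch.nn.Linear(64, 1000),
]

model = PipelineModule(
    layers=layers,
    num_stages=2,
    activation_checkpoint_interval=1,
    # Pass the non-reentrant variant so every stage can be checkpointed,
    # without listing checkpointable_layers or relying on _is_checkpointable.
    activation_checkpoint_func=non_reentrant_checkpoint,
)
```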

@tjruwase tjruwase requested a review from tohtana August 19, 2023 17:28
@tjruwase (Contributor) commented

@hughpu, apologies for the delay. We will review ASAP.

@tohtana (Contributor) left a comment

Thank you @hughpu for this great PR!

@tjruwase tjruwase added this pull request to the merge queue Aug 23, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 23, 2023
@hughpu commented Aug 23, 2023

@tohtana @tjruwase thank you for the review!

@hughpu commented Aug 23, 2023

hi @tjruwase, it seems the merge was stopped by an HTTP error raised from Hugging Face; would you mind merging it again?

@hughpu commented Aug 29, 2023

hi @tjruwase, shall we move forward with merging this PR? Feel free to let me know if there is anything I can do to facilitate it.

@tjruwase tjruwase added this pull request to the merge queue Aug 29, 2023
@tjruwase (Contributor) commented

@hughpu, apologies for the delay. It is now queued for merging.

Merged via the queue into microsoft:master with commit 42c1e91 Aug 29, 2023
16 checks passed