
feat(activation_checkpointing): add non_reentrant_checkpoint to support inputs require no grad #4118

Merged 27 commits into microsoft:master on Aug 29, 2023

Conversation

@hughpu (Contributor) commented Aug 9, 2023

The added function is a union of torch.utils.checkpoint._checkpoint_without_reentrant and CheckpointFunction in the checkpointing module.

  • This aims to solve the backpropagation error raised when none of the inputs require grad, such as the first layer in normal training, LoRA training up to the layer where LoRA is injected, traditional finetuning with only the last layer trainable, etc. (see the sketch after this list).
  • _checkpoint_without_reentrant has already been implemented in PyTorch for a while; the solution is stable in most cases, except for JIT script modules.
  • It can help solve the issue that is currently hacked around by deepspeed.runtime.pipe.module.PipelineModule._is_checkpointable.
  • A unit test was added and proven to work; it was also tested in a pipeline-module LoRA tuning project as activation_checkpoint_func, without specifying checkpointable_layers.
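A minimal sketch of the failure mode, shown with PyTorch's public checkpoint API rather than the DeepSpeed wrapper added by this PR; the toy layers are illustrative, not from the PR:

```python
import torch
from torch.utils.checkpoint import checkpoint

# A checkpointed segment with trainable parameters whose *inputs* carry no
# grad (an integer id tensor can never require grad).
layer = torch.nn.Sequential(torch.nn.Embedding(10, 4), torch.nn.Linear(4, 4))
ids = torch.randint(0, 10, (2, 3))

# Reentrant checkpoint: the forward pass runs under no_grad, so autograd only
# sees the inputs; none require grad, the output gets no grad_fn, and the
# trainable parameters inside `layer` would receive no gradients.
out = checkpoint(layer, ids, use_reentrant=True)
print(out.requires_grad)  # False -> out.sum().backward() would raise here

# Non-reentrant checkpoint (the behavior this PR ports into DeepSpeed as
# non_reentrant_checkpoint): implemented with saved-tensor hooks, so the graph
# through the internal parameters survives even though no input requires grad.
out = checkpoint(layer, ids, use_reentrant=False)
out.sum().backward()  # works; layer parameters receive gradients
print(layer[1].weight.grad is not None)  # True
```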

hughpu and others added 15 commits August 7, 2023 19:20

* Pass correct node size
* formatting
* add deepspeed chat arxiv report
* add zeroquant v2 and fp
* add selective enhancement
* add ignore for 'Youn' in spell checker

Co-authored-by: Connor Holmes <development@cmikeh2.me>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
@hughpu (Contributor, Author) commented Aug 9, 2023

@microsoft-github-policy-service agree

@hughpu (Contributor, Author) commented Aug 19, 2023

@tjruwase @mrwyattii may I know whether there is any chance this PR can get some feedback? It has been pending for two weeks without any review.

  • This PR is implemented as an extension, with no modification to existing code or features; it is just a new option for the checkpointing function. Not harmful.
  • It is not something brand new; it is essentially a re-implementation of PyTorch's non-reentrant checkpoint feature.
  • This can help solve issues such as: 1) the first stage, containing the embedding, cannot be checkpointed; 2) PEFT tuning with the first stages untrainable causes output tensors that require no grad, losing grad_fn, etc. (see the usage sketch below).
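A hedged usage sketch for the pipeline case, assuming the new function is importable from DeepSpeed's activation checkpointing module as named in the PR title; the layer list, sizes, and stage count are placeholders:

```python
import torch
import deepspeed
from deepspeed.pipe import PipelineModule
from deepspeed.runtime.activation_checkpointing.checkpointing import non_reentrant_checkpoint

deepspeed.init_distributed()  # PipelineModule needs an initialized process group

# Placeholder stages: a frozen embedding stage feeding trainable layers, the
# case where the first stage's outputs would otherwise require no grad.
layers = [
    torch.nn.Embedding(1000, 64).requires_grad_(False),
    torch.nn.Linear(64, 64),
    torch.nn.Linear(64, 1000),
]

model = PipelineModule(
    layers=layers,
    num_stages=2,
    activation_checkpoint_interval=1,
    # Pass the non-reentrant variant so every stage can be checkpointed,
    # without listing checkpointable_layers or relying on _is_checkpointable.
    activation_checkpoint_func=non_reentrant_checkpoint,
)
```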

@tjruwase tjruwase requested a review from tohtana August 19, 2023 17:28
@tjruwase (Contributor) commented

@hughpu, apologies for the delay. We will review ASAP.

@tohtana (Contributor) left a comment

Thank you @hughpu for this great PR!

@tjruwase tjruwase added this pull request to the merge queue Aug 23, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 23, 2023
@hughpu commented Aug 23, 2023

@tohtana @tjruwase thank you for the review!

@hughpu commented Aug 23, 2023

hi @tjruwase, it seems the merge was stopped by an HTTP error raised from Hugging Face; would you mind merging it again?

@hughpu commented Aug 29, 2023

hi @tjruwase, shall we move forward with merging this PR? Feel free to let me know if there is anything I can do to facilitate it.

@tjruwase tjruwase added this pull request to the merge queue Aug 29, 2023
@tjruwase (Contributor) commented

@hughpu, apologies for the delay. It is now queued for merging.

Merged via the queue into microsoft:master with commit 42c1e91 Aug 29, 2023
16 checks passed