
Fix requires_grad of input must be true for activation checkpoint layer in pipeline train. #4128

Closed
inkcherry wants to merge 11 commits

Conversation

@inkcherry (Contributor) commented Aug 10, 2023

Mechanism FYI: https://github.com/prigoyal/pytorch_memonger/blob/master/tutorial/Checkpointing_for_PyTorch_models.ipynb?short_path=e1e8ee7#L282C244-L282C257

For discrete input, this makes activation checkpointing work for the embedding layer.
For image input, there is no need to calculate and save grads for the input.

Verified on Megatron-DeepSpeed training and the cifar10 training unit test (test_pipe.py). Under the same seed, the parameter and gradient updates are exactly equal to those before this change.

It can slightly reduce memory on rank 0 (stage 0 contains the embedding layer).
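For context, the limitation can be reproduced with stock PyTorch (a minimal sketch, assuming a version that still exposes the reentrant checkpoint path):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(4, 4)
x = torch.randn(2, 4)  # e.g. an image batch: requires_grad is False

# With the reentrant checkpoint, the layer's parameters are invisible to
# autograd at the segment boundary; gradients flow through the segment
# only if some *input* requires grad.
out = checkpoint(layer, x, use_reentrant=True)
print(out.requires_grad)  # False: backward cannot flow through this segment
```

PyTorch also emits a warning here ("None of the inputs have requires_grad=True"), which is exactly the situation this PR works around.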

@tjruwase tjruwase requested a review from tohtana August 10, 2023 09:50
@tohtana (Contributor) left a comment:

Thank you for submitting this PR, @inkcherry!
Is the goal of this to stop creating gradients of inputs when it is unnecessary?

Can you also clarify intentions of some changes?

@@ -638,7 +638,6 @@ def _exec_forward_pass(self, buffer_id):

# Zero out the gradients each time we use the tensor because only the data in
# tensor changes across batches
self._zero_grads(inputs)

Can you explain why we can delete this? We could also delete the comment before this line.

@inkcherry (Author) replied Aug 17, 2023:

if isinstance(self.pipe_buffers['inputs'][buffer_id], tuple):
inputs = tuple(t.clone() for t in self.pipe_buffers['inputs'][buffer_id])
else:
inputs = self.pipe_buffers['inputs'][buffer_id].clone()
Here `inputs` has become a non-leaf tensor via the `clone()` op, and its gradient is not saved, so there is no need to call `_zero_grads` again.

And if `self.is_pipe_partitioned and not self.is_first_stage()` is False, meaning no new leaf tensor named `inputs` is created, accessing the gradient of the current (non-leaf) `inputs` in `_zero_grads` triggers a warning.
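The non-leaf behavior can be checked in isolation, independent of DeepSpeed:

```python
import torch
import warnings

leaf = torch.randn(2, requires_grad=True)
cloned = leaf.clone()  # clone() is an autograd op, so its result is non-leaf
print(leaf.is_leaf, cloned.is_leaf)  # True False

# Non-leaf tensors do not retain .grad; merely reading the attribute
# emits the "not a leaf Tensor" UserWarning and returns None.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    grad = cloned.grad
print(grad is None, len(caught) > 0)  # True True
```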

class CkptLayer_Enum(Enum):
not_ckpt_layer = 0
normal_ckpt_layer = 1
warp_ckpt_layer = 2

Does warp mean wrap?

@inkcherry (Author):

fixed.

@@ -320,7 +343,7 @@ def forward(self, forward_input):
# will see a different offset.
self.micro_offset += 1

def exec_range_func(start, end):
def exec_range_func(start, end, warp_layer=False):

Do you mean wrap, not warp?

@inkcherry (Author):

fixed.

@@ -342,7 +365,29 @@ def exec_func(*inputs):
inputs = layer(inputs)
return inputs

return exec_func
def exec_func_warp(*inputs):

Can we consolidate this with exec_func?
Most of the code is duplicated.

@inkcherry (Author):

merged.

@inkcherry (Author) commented Aug 17, 2023

Thank you for submitting this PR, @inkcherry! Is the goal of this to stop creating gradients of inputs when it is unnecessary?

Can you also clarify intentions of some changes?

@tohtana, thanks for your review, and apologies for not explaining clearly.

Yes. Currently, `requires_grad` of the input to the first checkpoint layer must be true. I think this is a limitation, and the current code has the following two workarounds for it. Please correct me if I'm wrong.

  1. For image input, set requires_grad=True for the input loaded from the dataset:

    loaded.requires_grad = loaded.is_floating_point()

    This calculates and saves grads for the input image, which I think may cost memory and compute time with large inputs. Related issue: "The .grad attribute of a Tensor that is not a leaf Tensor is being accessed" CERC-AAI/multimodal#16

  2. Secondly, for LLMs, notice the comment

    # This is an unfortunate hack related to torch and deepspeed activation checkpoint implementations.
    # Some layers like torch.nn.Embedding will not receive grads if checkpointed, which breaks things.
    # I presume it's related to the discrete inputs that cannot require_grad? Need to revisit.

    Here the transformer layer is hooked: the embedding layer is forced to be a non-checkpoint layer, which takes an input with requires_grad=False and outputs a tensor with requires_grad=True.

To remove this limitation, we can pass a dummy input that requires grad but isn't necessarily used in the computation. FYI:
https://github.com/prigoyal/pytorch_memonger/blob/master/tutorial/Checkpointing_for_PyTorch_models.ipynb?short_path=e1e8ee7#L282C244-L282C257

This PR wraps the first layer in a WrapModule to achieve this.
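The dummy-input trick can be sketched as follows (illustrative only; this `WrapModule` is a simplified stand-in for the PR's actual wrapper, not its exact code):

```python
import torch
from torch.utils.checkpoint import checkpoint

class WrapModule(torch.nn.Module):
    """Thread a grad-requiring dummy tensor through the checkpointed call
    so autograd records the segment even for discrete inputs."""
    def __init__(self, module):
        super().__init__()
        self.module = module
        self.dummy = torch.ones(1, requires_grad=True)

    def forward(self, dummy, x):
        # dummy exists only to satisfy checkpoint's requirement; 0 * dummy
        # keeps it in the graph without changing the result
        return self.module(x) + 0 * dummy

emb = torch.nn.Embedding(10, 4)
wrapped = WrapModule(emb)
tokens = torch.tensor([[1, 2, 3]])  # discrete input, cannot require grad

out = checkpoint(wrapped, wrapped.dummy, tokens, use_reentrant=True)
out.sum().backward()
print(emb.weight.grad is not None)  # True: the embedding receives grads
```

The extra multiply is essentially free, and the checkpointed embedding now participates in backward without forcing `requires_grad=True` on the data itself.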

@inkcherry (Author):

@tohtana, I've made some changes to the code. Any suggestions? I would appreciate your feedback.

@tohtana (Contributor) commented Aug 23, 2023

Hi @inkcherry, thank you for the fixes. Overall, this looks okay to me.

On the other hand, I just reviewed #4118 and this might solve a part of the issue you addressed. I didn't notice this PR until very recently, and I'm sorry that I couldn't share it with you.

We want to merge the PR, and I am wondering if we can simplify the changes in this PR using the new features proposed in #4118. Can you share your thoughts on this?

@inkcherry (Author) commented Aug 23, 2023

@tohtana, yes, I agree; #4118 is great!
I could retry based on #4118 once it is merged.

@tjruwase (Contributor):

@inkcherry, #4118 is now merged. Also, could you please add a unit test when you retry? Thanks!

@inkcherry (Author):

Moved to #4224; closing this one.

@inkcherry inkcherry closed this Aug 26, 2023