Fix checkpointable_layers Logic #6881

Merged: loadams merged 8 commits into deepspeedai:master from Quentin-Anthony:qanthony/fix-act-recomp on Jan 4, 2025
Conversation
tjruwase approved these changes on Dec 17, 2024

loadams reviewed tests/unit/runtime/activation_checkpointing/test_activation_checkpointing.py on Dec 17, 2024
siqi654321 pushed a commit to siqi654321/DeepSpeed that referenced this pull request on Feb 7, 2025
traincheck-team pushed a commit to traincheck-team/DeepSpeed that referenced this pull request on Feb 9, 2025
Problem

There's an edge case in DeepSpeed where, if all three of the following are true:

1. DeepSpeed activation checkpointing is applied
2. The user passes `checkpointable_layers` (e.g. https://github.com/EleutherAI/gpt-neox/blob/f5325805678c2b9e35aae4528283e0132c5f5bbc/megatron/model/gpt2_model.py#L175)
3. The user's model class is named `GPT2ModelPipe` or `GPTModelPipe`

then the layers listed in `checkpointable_layers` will not be activation checkpointed.
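For context, a minimal sketch of how `checkpointable_layers` is passed to DeepSpeed's `PipelineModule` (the `TransformerBlock` class and all sizes here are illustrative, and this assumes a distributed environment already initialized via `deepspeed.init_distributed()`):

```python
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

class TransformerBlock(nn.Module):
    """Illustrative stand-in for a user-defined transformer layer."""
    def __init__(self, d=512):
        super().__init__()
        self.ff = nn.Linear(d, d)

    def forward(self, x):
        return self.ff(x).relu()

# Only layers whose class name appears in checkpointable_layers should have
# their activations recomputed in backward instead of cached; this is the
# argument whose handling this PR fixes.
model = PipelineModule(
    layers=[LayerSpec(TransformerBlock) for _ in range(24)],
    num_stages=4,
    activation_checkpoint_interval=1,
    checkpointable_layers=['TransformerBlock'],
)
```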
Reason

This is because in the current logic, `_is_checkpointable` will short-circuit to just return layers matching `ParallelTransformerLayerPipe` in the case of `self.__class__.__name__ in ('GPTModelPipe', 'GPT2ModelPipe')`. See https://github.com/microsoft/DeepSpeed/blob/da771ed42e41a44d5047813ca4672f1cfe9d1731/deepspeed/runtime/pipe/module.py#L653
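A condensed paraphrase of that short-circuit (not the verbatim source; see the link above for the actual code):

```python
import torch

# Paraphrase of the pre-fix PipelineModule._is_checkpointable: the early
# return for GPT(2)ModelPipe ignores self.checkpointable_layers entirely.
def _is_checkpointable(self, funcs):
    if self.__class__.__name__ in ('GPTModelPipe', 'GPT2ModelPipe'):
        # Short-circuit: only ParallelTransformerLayerPipe is ever treated as
        # checkpointable, even when the user passed checkpointable_layers.
        return all('ParallelTransformerLayerPipe' in f.__class__.__name__ for f in funcs)
    if self.checkpointable_layers is not None:
        return all(f.__class__.__name__ in self.checkpointable_layers for f in funcs)
    # Fallback: checkpoint any stage whose layers have parameters.
    params = [f.parameters() for f in funcs if isinstance(f, torch.nn.Module)]
    return any(len(list(p)) > 0 for p in params)
```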
Proposed Fixes

I think that `checkpointable_layers` should always be checked for, and added logic to this effect. I also found the documentation for `checkpointable_layers` confusing and contradictory, so I updated the docstring. Lastly, I added a unit test for `checkpointable_layers`.
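A sketch of the direction the fix takes, per the description above (my reading, not the merged diff): consult the user-supplied `checkpointable_layers` before any class-name short-circuit, so it can no longer be shadowed.

```python
import torch

# Sketch of the fixed ordering: checkpointable_layers is checked first, and
# the GPT(2)ModelPipe short-circuit only applies when the user did not
# specify their own list.
def _is_checkpointable(self, funcs):
    if self.checkpointable_layers is not None:
        return all(f.__class__.__name__ in self.checkpointable_layers for f in funcs)
    if self.__class__.__name__ in ('GPTModelPipe', 'GPT2ModelPipe'):
        return all('ParallelTransformerLayerPipe' in f.__class__.__name__ for f in funcs)
    params = [f.parameters() for f in funcs if isinstance(f, torch.nn.Module)]
    return any(len(list(p)) > 0 for p in params)
```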