add selective activation checkpointing #97
Conversation
Great addition to the codebase! LGTM!
Two minor proposals:
- Propose a more Pythonic/cleaner use of defaultdict for the meta dict.
- Longer term, the AC policy probably wants to live at a more global level for future re-use and/or customization.
@@ -67,12 +66,54 @@ def partition_fn(name, module, device_mesh):
)


# AC/selective AC
no_recompute_list = {
Not vital for now, but longer term I think this policy should be exposed at a higher level if we expect other models to be added and/or expect people to customize this policy list. I.e., if we have parallelize_gpt or similar, it's awkward to pull the recompute list from parallelize_llama.
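As a rough illustration of this suggestion (module path, function name, and op contents are assumptions, not code from this PR), a shared policy module might look like:

```python
import torch

aten = torch.ops.aten

# e.g. torchtrain/ac_policy.py (hypothetical): one place that owns the SAC op
# list, so parallelize_llama, a future parallelize_gpt, etc. can import it
# instead of each hard-coding its own.
DEFAULT_NO_RECOMPUTE_LIST = {
    aten.mm.default,
}


def get_no_recompute_list(extra_ops=()):
    """Shared set of ops whose outputs are saved (not recomputed),
    optionally extended with model-specific ops."""
    return DEFAULT_NO_RECOMPUTE_LIST | set(extra_ops)
```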
def _get_custom_policy(meta):
    def _custom_policy(mode, func, *args, **kwargs):
        mm_count_key = f"{mode}_mm_count"
        if mm_count_key not in meta:
I believe it is cleaner to use a defaultdict here:
meta = defaultdict(int)
and then there is no need for this check of whether the key is present and the init:
if mm_count_key not in meta: meta[mm_count_key] = 0
    return _custom_policy


def selective_checkpointing_context_fn():
    meta = {}
propose
meta = defaultdict(int)
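Putting the two inline suggestions together, a minimal sketch of the defaultdict version of the policy; the op set and the save-every-other-mm rule are illustrative assumptions, not necessarily the exact policy in this PR:

```python
from collections import defaultdict

import torch

aten = torch.ops.aten

# Ops whose outputs are kept (not recomputed) in backward; contents are
# illustrative, not the PR's full list.
no_recompute_list = {
    aten.mm.default,
}


def _get_custom_policy(meta):
    def _custom_policy(mode, func, *args, **kwargs):
        if func == aten.mm.default:
            # defaultdict(int) starts each counter at 0, so the explicit
            # "if mm_count_key not in meta" initialization goes away.
            meta[f"{mode}_mm_count"] += 1
        # Illustrative rule: save (don't recompute) ops in the list,
        # e.g. every other mm; the real policy in the PR may differ.
        return func in no_recompute_list and meta[f"{mode}_mm_count"] % 2 == 0

    return _custom_policy


# Replaces `meta = {}` plus the manual per-key init inside the policy.
meta = defaultdict(int)
policy = _get_custom_policy(meta)
```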
Please see inline comments
torchtrain/config_manager.py
@@ -215,4 +215,9 @@ def init_args_from_command_line(
            "is an empty string, checkpointing is disabled."
        ),
    )
    parser.add_argument(
        "--metrics.enable_selective_ac",
this is not a metrics flag, but rather a training flag
This is a mistake. I'll correct this.
@@ -37,3 +37,4 @@ checkpoint_interval = 3600
checkpoint_interval_type = "steps"
checkpoint_folder = ""
dataset = "alpaca"
enable_selective_ac = false
Hmm, you are putting it in the training section, but the cmd arg parser does not register it under training, so I think this will not work as expected. (If it does, then we need to figure out why the TOML parsing works incorrectly.)
In our setting, the cmd arg parser is a backup way of providing args. The toml file has it in the training section, and in the code training.enable_selective_ac is used, so no problem there -- the cmd arg metrics.enable_selective_ac is parsed but not used.
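For illustration of the fix being discussed, a sketch of registering the flag under the training section so the CLI name matches the TOML section; the corrected option name is an assumption about the follow-up, not the PR's final code:

```python
import argparse

parser = argparse.ArgumentParser()
# Registering under "training" keeps the CLI name aligned with the
# [training] section of the TOML config.
parser.add_argument(
    "--training.enable_selective_ac",
    action="store_true",
    help="whether to use selective activation checkpointing instead of full AC",
)

args = parser.parse_args(["--training.enable_selective_ac"])
# argparse keeps the dotted name as-is in the namespace dict.
assert vars(args)["training.enable_selective_ac"] is True
```

With this, `--training.enable_selective_ac` on the command line and `enable_selective_ac = false` under the training section of the TOML file refer to the same config key.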
    return ptd_checkpoint_wrapper(
        module, checkpoint_impl=CheckpointImpl.NO_REENTRANT, preserve_rng_state=False
    )


def checkpoint_wrapper(module, enable_selective_ac=False):
super nit: prefer not to make enable_selective_ac a default arg if we always pass it explicitly.
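A sketch of the signature with this nit applied; the selective-AC branch is omitted here, this only illustrates dropping the default value:

```python
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    checkpoint_wrapper as ptd_checkpoint_wrapper,
)


def checkpoint_wrapper(module, enable_selective_ac):
    # `enable_selective_ac` is now a required argument, since call sites
    # always pass it explicitly. The selective-AC branch from the PR
    # (building a context_fn) is omitted in this sketch; this path is full AC.
    return ptd_checkpoint_wrapper(
        module, checkpoint_impl=CheckpointImpl.NO_REENTRANT, preserve_rng_state=False
    )
```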
    checkpoint,
)


def _get_custom_policy(meta):
For my understanding, does this policy also run in eager mode?
Actually it only works in eager mode; with the compiler we got:
[rank0]:[rank0]: assert torch._dynamo.config._experimental_support_context_fn_in_torch_utils_checkpoint, \
[rank0]:[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[rank0]:[rank0]: AssertionError: Passing context_fn to torch.utils.checkpoint is currently not supported under torch.compile
NVM. Wanchao told me we need to set torch._dynamo.config._experimental_support_context_fn_in_torch_utils_checkpoint = True for it to work. Now SAC works with compile.
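For reference, the workaround from this exchange boils down to setting the experimental flag before compiling; the flag is experimental and may differ across PyTorch versions:

```python
import torch
import torch._dynamo

# This experimental dynamo flag lets torch.utils.checkpoint accept a
# context_fn (which SAC needs) under torch.compile, avoiding the assertion
# shown above.
torch._dynamo.config._experimental_support_context_fn_in_torch_utils_checkpoint = True

# model = torch.compile(model)  # with the flag set, SAC + compile works
```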
Stack from ghstack (oldest at bottom):
Selective activation checkpointing (SAC), compared with full AC which always does activation recomputation, selectively stores some intermediate activations to save training time, at the cost of more memory usage.

Here are some test results on llama 7B.

With full activation checkpointing:
- [rank0]: Average iter time: 4.9126 seconds
- [rank0]: Peak Memory: Reserved 40.61%, Alloc 28.12%, Active: 29.61%

With selective activation checkpointing:
- [rank0]: Average iter time: 4.5459 seconds
- [rank0]: Peak Memory: Reserved 80.45%, Alloc 62.0%, Active: 63.43%
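As a self-contained illustration of how a block-level AC wrapper like the one in this PR gets applied, here is a toy model wrapped with full AC only; the toy classes are assumptions, and the selective variant would additionally pass a context_fn built from the custom policy:

```python
import torch
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    checkpoint_wrapper as ptd_checkpoint_wrapper,
)


class ToyBlock(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.w1 = nn.Linear(dim, dim)
        self.w2 = nn.Linear(dim, dim)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))


class ToyModel(nn.Module):
    def __init__(self, n_layers: int = 2, dim: int = 16):
        super().__init__()
        self.layers = nn.ModuleList(ToyBlock(dim) for _ in range(n_layers))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


model = ToyModel()
# Wrap each block with (full) activation checkpointing.
for i, block in enumerate(model.layers):
    model.layers[i] = ptd_checkpoint_wrapper(
        block, checkpoint_impl=CheckpointImpl.NO_REENTRANT, preserve_rng_state=False
    )

x = torch.randn(4, 16)
loss = model(x).sum()
loss.backward()  # activations inside each wrapped block are recomputed here
```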