
Remove optimizer step on initialization #5104

Merged
merged 5 commits into master on Feb 11, 2024
Conversation

tohtana (Contributor) commented Feb 8, 2024

All ZeRO 1/2/3 stages call the optimizer's `step()` during initialization. This increments a counter inside the optimizer and makes parameter updates differ from what normal PyTorch usage would produce. This PR removes the `step()` call from initialization and lazily configures some internal state (linking *hp_params*) after the first real `step()` call.
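As a rough illustration (plain PyTorch, not DeepSpeed code), the sketch below shows how one extra `step()` taken during initialization shifts Adam's internal step counter, so the first real update uses a different bias correction than it would under normal PyTorch usage:

```python
import torch

# Two identical parameters and optimizers; one optimizer takes an extra step()
# during "initialization", as the pre-fix ZeRO stages did.
p_plain = torch.nn.Parameter(torch.ones(3))
p_extra = torch.nn.Parameter(torch.ones(3))
opt_plain = torch.optim.Adam([p_plain], lr=0.1)
opt_extra = torch.optim.Adam([p_extra], lr=0.1)

# Extra step() at init: zero gradients leave the parameter unchanged,
# but Adam's internal step counter still advances.
p_extra.grad = torch.zeros_like(p_extra)
opt_extra.step()

# First real training step with identical gradients on both parameters.
for p, opt in ((p_plain, opt_plain), (p_extra, opt_extra)):
    p.grad = torch.full_like(p, 0.5)
    opt.step()

# The counters differ (1 vs 2), so the bias-corrected updates differ too.
print(opt_plain.state[p_plain]["step"], opt_extra.state[p_extra]["step"])
print(torch.allclose(p_plain, p_extra))  # False
```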

@tohtana tohtana marked this pull request as ready for review February 9, 2024 22:28
@tohtana tohtana added this pull request to the merge queue Feb 10, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 10, 2024
@tohtana tohtana added this pull request to the merge queue Feb 11, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 11, 2024
@tohtana tohtana merged commit 1817980 into master Feb 11, 2024
12 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
github-merge-queue bot pushed a commit that referenced this pull request Mar 13, 2024
This PR fixes the following two points regarding checkpoint loading.

- Load optimizer states
With [this PR](#5104), we removed the optimizer's `step()` call on initialization. This made DeepSpeed's parameter updates match PyTorch's normal behavior. However, the optimizer state no longer contains its keys when we load a checkpoint. For legacy/elastic checkpoints, that PR changed the checkpoint loaders to create the keys and buffers on loading, but the loader for universal checkpoints still relies on keys being present in the optimizer state. As a result, loading a universal checkpoint fails. This PR fixes the loader to find the optimizer state keys from a given checkpoint.

- Resume step count
2943e6a
The checkpoint loader for a universal checkpoint resumes the optimizer's step count only when the param group already has `step`. But some optimizers create the key `step` in a param group at the first call of `step()` (e.g. Apex [Fused Adam](https://github.com/NVIDIA/apex/blob/810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c/apex/optimizers/fused_adam.py#L154)). In that case, the step count is not restored. This PR changes this behavior to always set the step count in a param group (see the sketch after this description for the general idea). This PR also stops incrementing the step count when loading; I couldn't see why the step count needs to be incremented in my small example, but we may need a discussion to consider various cases.
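A minimal, hypothetical sketch of the second point (the function name `restore_step_count` and the checkpoint layout are illustrative assumptions, not the actual DeepSpeed loader API): always write the resumed step count into each param group, since optimizers like Apex FusedAdam only create the `step` key on their first `step()` call.

```python
from typing import Any, Dict

def restore_step_count(optimizer, checkpoint_state: Dict[str, Any]) -> None:
    """Unconditionally restore the step count from a loaded checkpoint.

    Guarding with `if "step" in group` would silently skip optimizers
    (e.g. Apex FusedAdam) that only create the key on their first step(),
    losing the resumed step count.
    """
    saved_step = checkpoint_state["step"]  # assumed key in this hypothetical checkpoint dict
    for group in optimizer.param_groups:
        group["step"] = saved_step  # always set, even if the key does not exist yet
        # For optimizers that track a per-parameter step (e.g. torch.optim.Adam),
        # keep that counter consistent as well.
        for p in group["params"]:
            state = optimizer.state.get(p)
            if state and "step" in state:
                state["step"] = saved_step
```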
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024