Training multiple models #7018

Open · wants to merge 13 commits into master

Conversation

tjruwase (Contributor) commented Feb 8, 2025

Support training multiple models, as in HF (Hugging Face) workflows.

Here is an update on supporting multiple DS engines with a single loss.backward(). The main message is that I think we can support this. First, some context. The backward pass in ZeRO is complicated because its optimizations and features require special handling of gradients, such as the following (a toy illustration follows the list):

  1. Gradient partitioning
  2. Overlapping backward and reduction
  3. Upcasting for fp32 grad accumulation
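
To make the interplay concrete, here is a toy illustration, not DeepSpeed's actual implementation (attach_grad_handling, fp32_grads, buckets, and pending_works are invented names), of how per-parameter autograd hooks can combine these three behaviors:

import torch
import torch.distributed as dist

def attach_grad_handling(model, fp32_grads, buckets, pending_works):
    # fp32_grads: dict of pre-zeroed fp32 tensors, one per parameter
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue

        def hook(grad, name=name):
            fp32_grads[name] += grad.float()        # upcast for fp32 grad accumulation
            buckets[name] = grad.detach().clone()   # stage grad (real ZeRO partitions these)
            pending_works.append(                   # async reduction overlaps with backward
                dist.all_reduce(buckets[name], async_op=True))
            return grad

        param.register_hook(hook)

def drain_reductions(pending_works):
    # epilogue-style cleanup: wait for all outstanding reductions
    for work in pending_works:
        work.wait()
    pending_works.clear()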

So we created engine.backward(loss) as a wrapper function that gives us fine-grained control over the backward pass, as below:

def backward(loss):
    backward_prologue()   # setup logic for special gradient handling
    loss.backward()
    backward_epilogue()   # cleanup/teardown logic

As demonstrated by @muellerzr, this approach breaks down when the loss originates from multiple DS engines. Our proposed solution is to use backward hooks on the module to launch backward_prologue() and backward_epilogue(). Specifically (see the sketch after this list):

  1. A backward pre-hook on engine.module launches backward_prologue() before any module gradient is created.
  2. A backward post-hook on engine.module launches backward_epilogue() after all module gradients are created.
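
A minimal sketch of this hook-based design, assuming the engine exposes backward_prologue() and backward_epilogue(); the exact registration mechanics in the PR may differ:

def install_backward_hooks(engine):
    def pre_hook(module, grad_output):
        engine.backward_prologue()     # before any module gradient is created

    def post_hook(module, grad_input, grad_output):
        engine.backward_epilogue()     # after module gradients are created

    engine.module.register_full_backward_pre_hook(pre_hook)
    engine.module.register_full_backward_hook(post_hook)

With the hooks installed on each engine's module, a plain loss.backward() on a loss combining outputs from several engines triggers each engine's prologue and epilogue, with no call to engine.backward() required.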

We plan for this solution to preserve backward compatibility, i.e., engine.backward() will remain correct for single-engine scenarios.
The current status is that (1) is complete, while (2) is in progress. To unblock e2e testing of multi-engine scenarios, since there are probably other issues, we have temporarily added engine._backward_prologue(). You can try this out via the following artifacts (a hypothetical usage sketch follows the list):

  1. Simple multi-engine test code: https://gist.github.com/tjruwase/f1adccf087b8fa269ffce2ab91c4f1c6#file-multi_engine-py
  2. DS branch: https://github.com/microsoft/DeepSpeed/tree/olruwase/zero_multi_models
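
For orientation, here is a hypothetical sketch of the flow the gist exercises; engine construction is elided, and the signature of the temporary engine._backward_prologue() is an assumption, not the gist's exact code:

loss_a = engine_a(batch_a)      # forward through first DS engine
loss_b = engine_b(batch_b)      # forward through second DS engine
loss = loss_a + loss_b          # single combined loss

engine_a._backward_prologue()   # temporary manual prologue (assumed signature)
engine_b._backward_prologue()   # temporary manual prologue (assumed signature)
loss.backward()                 # one backward pass across both engines

engine_a.step()
engine_b.step()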

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
@tjruwase tjruwase requested a review from tohtana as a code owner February 8, 2025 15:08
@tjruwase tjruwase requested a review from stas00 February 8, 2025 15:08
stas00 (Collaborator) left a comment

I am missing the context of what this PR is doing (other than that it tries to do something w.r.t. training multiple models).

But don't you need new tests?

tjruwase (Contributor, Author) commented Feb 8, 2025

I am missing the context of what this PR is doing (other than that it tries to do something w.r.t. training multiple models).

But don't you need new tests?

@stas00, thanks for the feedback. I have updated the OP with some background from earlier discussions.

I will work on converting the gist code into unit tests.

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
stas00 (Collaborator) commented Feb 10, 2025

That's much better after your OP expansion, Tunji.

The gists look good. Please ping me once they are tests, and I would be happy to review again.

That's a very important feature for quite a few users. Thank you for working on it.
