ORTModule memory improvement #18924
Conversation
The YAML file change looks good to me. I didn't look at the other parts.
Args:
    exported_model (ModelProto): The exported model.
    named_params (Optional[Dict[str, torch.nn.parameter.Parameter]]): The full parameter map.

Returns:
    tuple[bool, ModelProto]: A tuple of bool and ModelProto. The bool indicates whether the model is modified.
For future reference: no need to include type information in docstrings, since the types are already in the function signature.
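For illustration, a minimal sketch of that style applied to a hypothetical function (the name `post_process_model` and its body are invented for this example, not taken from the PR); the types stay in the signature and the docstring only describes meaning:

```python
from typing import Dict, Optional, Tuple

import torch
from onnx import ModelProto


def post_process_model(
    exported_model: ModelProto,
    named_params: Optional[Dict[str, torch.nn.parameter.Parameter]] = None,
) -> Tuple[bool, ModelProto]:
    """Post-process the exported model.

    Args:
        exported_model: The exported model.
        named_params: The full parameter map.

    Returns:
        A tuple whose bool indicates whether the model was modified, together with
        the (possibly modified) model.
    """
    # Hypothetical body, for the example only.
    return False, exported_model
```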
Dependency
#19007
ORTModule memory efficient gradient management
Previously I tried to solve the coarse-grained gradient accumulation/update problem in ORTModule with #8979, but that solution was not fully validated with DDP, or when there are user hooks on the gradient accumulation of torch parameters.
This PR addresses the problem with a similar approach to PR #8979, i.e. triggering gradient accumulation as soon as ORT has computed the grad, but instead of using an AccumulateGrad op, this time it uses an ONNX PythonOp operator that internally calls param.backward(grad), which handles all related hooks correctly.
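As a rough illustration of that idea (a minimal sketch under assumptions, not the PR's actual code; the helper `accumulate_gradient` and the demo hook are invented here), routing an externally computed gradient through `param.backward(grad)` lets autograd's own accumulation path update `param.grad` and fire any hooks registered on the parameter:

```python
import torch


def accumulate_gradient(param: torch.nn.Parameter, grad: torch.Tensor) -> None:
    # Illustrative helper: feed the externally computed gradient back through
    # autograd so that param.grad is accumulated natively and gradient hooks fire.
    param.backward(grad)


if __name__ == "__main__":
    p = torch.nn.Parameter(torch.zeros(3))

    def log_hook(g):
        print("gradient hook fired")  # DDP/ZeRO-style hooks would run here
        return g  # leave the gradient unchanged

    p.register_hook(log_hook)

    accumulate_gradient(p, torch.ones(3))
    accumulate_gradient(p, torch.ones(3))
    print(p.grad)  # tensor([2., 2., 2.]) -- accumulated across the two calls
```

Per the description above, a call of this shape would sit inside the PythonOp, so that PyTorch rather than ORT performs the accumulation and DDP/user hooks keep observing it.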
Design
See the details at
https://microsoftapc-my.sharepoint.com/:p:/g/personal/pengwa_microsoft_com/EaaBq4EzsFhOmsDEXCG7Ba4Bb9bwd0O2sFV_JXJ4jBLYLA?e=7Sz2g8&nav=eyJzSWQiOjI3MSwiY0lkIjozMjE4NzI1NDIzfQ
Convergence Validation:
Differences are mostly at the 0.000x level, occasionally 0.00x, which may come from the gradient apply happening in a different order before vs. after this change (on DeepSpeed ZeRO stage 2).
TODO
Consolidate this logic with the similar logic in ZeRO Stage 3.