[WIP] ORTModule memory refinement #8979

pengwa · 2021-09-07T14:47:30Z

Description: memory refinement

The measurements here use a simplified model constructed by stacking autograd.Function instances, so most of the computation runs on PyTorch kernels. That makes it a good baseline for comparing PyTorch with ORT. From memory profiling, we see that ORT takes 16% more memory than the PyTorch run, which suggests there are bugs on the ORT side.

ORTModuleFunction holds all the computed gradients until backward completes, whereas PyTorch accumulates each computed gradient into param.grad as soon as the gradient computation reaches the corresponding leaf node (the AccumulateGrad function).
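The effect on peak memory can be illustrated with a toy model. This is only a sketch under simplifying assumptions: it counts transient gradient buffers and nothing else, and the function name and sizes are hypothetical, not taken from ORT or PyTorch.

```python
def peak_transient_grad_size(grad_sizes, eager):
    """Return the peak total size of gradient buffers alive during backward.

    grad_sizes: sizes of the gradients in the order backward produces them.
    eager: if True, each gradient is accumulated into param.grad and released
           as soon as it is produced; if False, every gradient is held until
           backward has finished.
    """
    live = 0
    peak = 0
    for size in grad_sizes:
        live += size              # backward produces a new gradient buffer
        peak = max(peak, live)
        if eager:
            live -= size          # accumulated into param.grad, buffer released
    return peak


# Hypothetical model with three equally sized parameters.
sizes = [4, 4, 4]
held = peak_transient_grad_size(sizes, eager=False)
eager = peak_transient_grad_size(sizes, eager=True)
```

In this model, holding every gradient costs the sum of all gradient sizes at the peak, while eager accumulation costs only the largest single gradient.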

The commit (29eebc2) in this PR does two things. 1) It uses an idea similar to #8993: cut off the connection between ORTModuleFunction and its inputs' AccumulateGrad gradient functions, so that PyTorch will not accumulate the ORTModuleFunction backward outputs into param.grad. 2) It performs the gradient update in place (into param.grad) in the ONNX graph.

With these changes, memory consumption is on par between ORT and PyTorch in some cases. More detailed benchmarks will come later.

TODO: I may have missed DDP's requirements on the torch grad accumulator's post hook. I am not sure whether change 2 benefits real models, and want to confirm that before investing more. As a result, multi-GPU runs may currently fail on this branch.
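The hook concern can be sketched with a toy accumulator. The class and method names below are hypothetical stand-ins, not PyTorch or DDP APIs; the sketch only assumes that some component (such as DDP) learns that a gradient is ready through a hook that fires after accumulation.

```python
class ToyParam:
    """Hypothetical stand-in for a parameter and its gradient accumulator."""

    def __init__(self):
        self.grad = 0.0
        self._post_hooks = []

    def register_post_hook(self, hook):
        self._post_hooks.append(hook)

    def accumulate(self, grad):
        # Normal path: update the gradient, then notify every registered hook.
        self.grad += grad
        for hook in self._post_hooks:
            hook(self)

    def accumulate_bypassing_hooks(self, grad):
        # Path where the update happens elsewhere (for example inside another
        # runtime's graph): the value is correct, but no hook is notified.
        self.grad += grad


ready_events = []
param = ToyParam()
param.register_post_hook(lambda p: ready_events.append("grad ready"))

param.accumulate(1.0)                  # hook fires
param.accumulate_bypassing_hooks(1.0)  # hook does not fire
```

After both calls the gradient value is 2.0, but only one "grad ready" event was recorded, which is the kind of mismatch that could break a consumer relying on the hook.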

Benchmark

command: python bench.py --batch 1024 --hidden 8194 --layer 12 --tag test --ort

PT:

MEM_STAT - ====== 98 before forward pass ====== MA 9348.5493 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 22.81 GB, percent = 2.6%
MEM_STAT - ====== 98 after forward pass ====== MA 9348.5493 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 22.81 GB, percent = 2.6%
MEM_STAT - ====== 98 after loss ====== MA 9348.5493 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 22.81 GB, percent = 2.6%
MEM_STAT - ====== 98 after backward pass ====== MA 9348.5493 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 22.81 GB, percent = 2.6%
MEM_STAT - ====== 99 before forward pass ====== MA 9348.5493 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 22.81 GB, percent = 2.6%
MEM_STAT - ====== 99 after forward pass ====== MA 9348.5493 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 22.81 GB, percent = 2.6%
MEM_STAT - ====== 99 after loss ====== MA 9348.5493 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 22.81 GB, percent = 2.6%
MEM_STAT - ====== 99 after backward pass ====== MA 9348.5493 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 22.81 GB, percent = 2.6%

ORT (master):

MEM_STAT - ====== 98 before forward pass ====== MA 9348.5493 MB Max_MA 12550.0869 MB CPU Virtual Memory: used = 23.13 GB, percent = 2.6%
MEM_STAT - ====== 98 after forward pass ====== MA 9348.5493 MB Max_MA 12550.0869 MB CPU Virtual Memory: used = 23.13 GB, percent = 2.6%
MEM_STAT - ====== 98 after loss ====== MA 9348.5493 MB Max_MA 12550.0869 MB CPU Virtual Memory: used = 23.13 GB, percent = 2.6%
MEM_STAT - ====== 98 after backward pass ====== MA 9348.5493 MB Max_MA 12550.0869 MB CPU Virtual Memory: used = 23.13 GB, percent = 2.6%
MEM_STAT - ====== 99 before forward pass ====== MA 9348.5493 MB Max_MA 12550.0869 MB CPU Virtual Memory: used = 23.13 GB, percent = 2.6%
MEM_STAT - ====== 99 after forward pass ====== MA 9348.5493 MB Max_MA 12550.0869 MB CPU Virtual Memory: used = 23.13 GB, percent = 2.6%
MEM_STAT - ====== 99 after loss ====== MA 9348.5493 MB Max_MA 12550.0869 MB CPU Virtual Memory: used = 23.13 GB, percent = 2.6%
MEM_STAT - ====== 99 after backward pass ====== MA 9348.5493 MB Max_MA 12550.0869 MB CPU Virtual Memory: used = 23.13 GB, percent = 2.6%

ORT (this PR):

MEM_STAT - ====== 98 before forward pass ====== MA 9316.5415 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 23.14 GB, percent = 2.6%
MEM_STAT - ====== 98 after forward pass ====== MA 9316.5415 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 23.14 GB, percent = 2.6%
MEM_STAT - ====== 98 after loss ====== MA 9316.5415 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 23.14 GB, percent = 2.6%
MEM_STAT - ====== 98 after backward pass ====== MA 9316.5415 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 23.14 GB, percent = 2.6%
MEM_STAT - ====== 99 before forward pass ====== MA 9316.5415 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 23.14 GB, percent = 2.6%
MEM_STAT - ====== 99 after forward pass ====== MA 9316.5415 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 23.14 GB, percent = 2.6%
MEM_STAT - ====== 99 after loss ====== MA 9316.5415 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 23.14 GB, percent = 2.6%
MEM_STAT - ====== 99 after backward pass ====== MA 9316.5415 MB Max_MA 9700.6987 MB CPU Virtual Memory: used = 23.14 GB, percent = 2.6%

Conclusion:

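With this PR, the max memory allocated (Max_MA) matches the PyTorch run (9700.6987 MB, down from 12550.0869 MB on master), and the memory allocated (MA) is slightly below PyTorch's (9316.5415 MB vs. 9348.5493 MB).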
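The relative overhead can be read directly off the MEM_STAT lines. The helper below is a small sketch, not part of bench.py: the function names are hypothetical, and the regular expression only assumes the `MA ... MB Max_MA ... MB` layout shown in the logs above.

```python
import re

# Matches the "MA <n> MB Max_MA <n> MB" part of a MEM_STAT line.
MEM_STAT_RE = re.compile(r"MA (?P<ma>[\d.]+) MB Max_MA (?P<max_ma>[\d.]+) MB")


def parse_mem_stat(line):
    """Return (MA, Max_MA) in MB from a MEM_STAT log line."""
    match = MEM_STAT_RE.search(line)
    if match is None:
        raise ValueError("not a MEM_STAT line: " + line)
    return float(match.group("ma")), float(match.group("max_ma"))


def relative_overhead(value, baseline):
    """Fraction by which value exceeds baseline."""
    return (value - baseline) / baseline


# Step 99, after backward pass, copied from the three runs above.
pt_line = "MEM_STAT - ====== 99 after backward pass ====== MA 9348.5493 MB Max_MA 9700.6987 MB"
master_line = "MEM_STAT - ====== 99 after backward pass ====== MA 9348.5493 MB Max_MA 12550.0869 MB"
pr_line = "MEM_STAT - ====== 99 after backward pass ====== MA 9316.5415 MB Max_MA 9700.6987 MB"

_, pt_peak = parse_mem_stat(pt_line)
_, master_peak = parse_mem_stat(master_line)
_, pr_peak = parse_mem_stat(pr_line)

master_overhead = relative_overhead(master_peak, pt_peak)
pr_overhead = relative_overhead(pr_peak, pt_peak)
```

For this particular run, peak allocation on master is roughly 29% above PyTorch, and with this PR the peak is identical to PyTorch's.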


pengwa · 2021-09-07T15:31:15Z

I may have missed DDP's requirements on the torch grad accumulator's post hook. I am not sure whether change 2 benefits real models, and want to confirm that before investing more.

stale · 2022-04-16T05:53:35Z

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

## Dependency

#19007

## ORTModule memory efficient gradient management

Previously I tried to solve the coarse-grained gradient accumulation/update problem in ORTModule with #8979, but that solution was never fully validated with DDP or with user hooks on gradient accumulation for torch parameters.

This PR addresses the problem with an approach similar to PR 8979, i.e. it triggers gradient accumulation as soon as ORT has computed the grad. Instead of using an AccumulateGrad op, this time it uses the ONNX operator PythonOp, which internally calls param.backward(grad) so that all related hooks are handled correctly.

## Design

Check the details at https://microsoftapc-my.sharepoint.com/:p:/g/personal/pengwa_microsoft_com/EaaBq4EzsFhOmsDEXCG7Ba4Bb9bwd0O2sFV_JXJ4jBLYLA?e=7Sz2g8&nav=eyJzSWQiOjI3MSwiY0lkIjozMjE4NzI1NDIzfQ

## Convergence Validation

![image](https://github.com/microsoft/onnxruntime/assets/10530022/ccf3a213-e815-4b23-b759-165033b2d9fe)

Differences are mostly on the order of 0.000x, sometimes 0.00x, which may come from the different order in which gradients are applied before and after this change (on DeepSpeed ZeRO stage 2).

## TODO

Consolidate the logic with Stage3's similar logic.

pengwa added 5 commits August 31, 2021 03:15

remove PythonOpGrad control dependency && avoid segement fault

4cef4b4

comment alignment

34f4054

fix bugs

e04baa4

Release torch tensor referenced by torch gradient graph (created in P…

3b74747

…ythonOp)

Accumulate gradient in ORT execution

29eebc2

pengwa requested review from wschin, SherlockNoMad and tlh20 September 7, 2021 14:47

pengwa requested review from baijumeswani, BowenBao, liqunfu, thiagocrepaldi and a team as code owners September 7, 2021 14:47

pengwa added the component:ortmodule and training labels Sep 7, 2021

pengwa changed the title ~~custom autograd func memory refinement~~ [WIP] custom autograd func memory refinement Sep 7, 2021

pengwa marked this pull request as draft September 8, 2021 00:53

pengwa changed the title ~~[WIP] custom autograd func memory refinement~~ [WIP] ORTModule memory refinement Sep 8, 2021

pengwa and others added 8 commits September 9, 2021 05:52

Merge branch 'master' of https://github.com/microsoft/onnxruntime int…

25f8872

…o pengwa/custom_fnc_mem

Merge branch 'master' of https://github.com/microsoft/onnxruntime int…

ca3857f

…o pengwa/custom_fnc_mem

Adding preprocessor checks for torch version during torch cpp extensi…

5c23796

…ons compilation (cherry picked from commit b125e80)

minor fix

6053743

enable the flag

a1e1e92

fix cpu run

6c36626

fix embedding gradient atenop output missing shape

7c3c451

minor fix

41a7551

pengwa closed this Sep 13, 2021

Merge branch 'master' of https://github.com/microsoft/onnxruntime int…

4d37a95

…o pengwa/custom_fnc_mem

pengwa reopened this Nov 3, 2021

garymm removed the request for review from a team February 11, 2022 01:33

stale bot added the stale label Apr 16, 2022

sophies927 removed the component:ortmodule label Aug 12, 2022

stale bot removed the stale label Aug 12, 2022

pengwa closed this Dec 30, 2022

pengwa deleted the pengwa/custom_fnc_mem branch April 11, 2023 11:37

This was referenced Dec 21, 2023

ORTModule memory efficient gradient management #18907

Closed

ORTModule memory improvement #18924

Merged
