custom autograd func memory refinement #8993
Merged
Conversation
…o pengwa/pythonop_mem
…o pengwa/pythonop_mem
wschin previously approved these changes on Sep 9, 2021
...ng/python/training/ortmodule/torch_cpp_extensions/torch_interop_utils/torch_interop_utils.cc (outdated, resolved)
...ng/python/training/ortmodule/torch_cpp_extensions/torch_interop_utils/torch_interop_utils.cc (outdated, resolved)
wschin reviewed on Sep 9, 2021
orttraining/orttraining/python/training/ortmodule/_custom_autograd_function_runner.py (resolved)
…tensions/torch_interop_utils/torch_interop_utils.cc Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
…o pengwa/pythonop_mem
…nxruntime into pengwa/pythonop_mem
wschin approved these changes on Sep 9, 2021
Thanks for the hard work. Looks super great.
SherlockNoMad approved these changes on Sep 9, 2021
wangyems pushed a commit that referenced this pull request on Sep 9, 2021
* Release torch tensor referenced by torch gradient graph (created in PythonOp)
* Update orttraining/orttraining/python/training/ortmodule/torch_cpp_extensions/torch_interop_utils/torch_interop_utils.cc
* refine with comments

Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
wangyems added a commit that referenced this pull request on Sep 9, 2021
* fast reduction for reducemean (#8976)
* Adding preprocessor checks for torch version during torch cpp extensions compilation (#8989)
* custom autograd func memory refinement (#8993)
* Release torch tensor referenced by torch gradient graph (created in PythonOp)
* Update orttraining/orttraining/python/training/ortmodule/torch_cpp_extensions/torch_interop_utils/torch_interop_utils.cc
* refine with comments
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
* Fix issues in TensorRT EP (#8996)
* fix big engine load issue and add cuda_cpu_alloc
* remove redundancy
* fix minor issues
* [js/web] fix karma launch with chrome headless (#8998)
* Update Nuget Packge Pipline to CUDA11.4 and TensorRT8 on Windows (#9000)
* Update to CUDA11.4 and TensorRT-8.0.3.4
* update trt pool, remove cudnn from setup_env_gpu.bat
* revert pool
* test gpu package pipeline on t4
* back out changes
* back out changes
Co-authored-by: George Wu <jywu@microsoft.com>
* Fix fuzz testing build blocking release. (#9008)
* add model local function support (#8540)
* updates for picking pnnx commit
* add tests filter to c# tests
* plus test fixes
* fix versioning for contrib ops
* fix tests
* test filter for optional ops
* more versioning related updates
* fix test
* fix layernorm spec
* more updates
* update docs
* add more test filters
* more filters
* update binary size threshold
* update docs
* draft - enable model local function
* enable model local functions in ORT
* update to latest rel onnx commit
* plus tests
* plus more updates
* plus updates
* test updates
* Fix for nested functions + shape inference
* plus bug fix and updates per review
* plus fixes per review
* plus test updates
* plus updates per review
* plus fixes
* fix a test

Co-authored-by: Vincent Wang <wangwchpku@outlook.com>
Co-authored-by: baijumeswani <bmeswani@microsoft.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: stevenlix <38092805+stevenlix@users.noreply.github.com>
Co-authored-by: Yulong Wang <yulongw@microsoft.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
Description: custom autograd func memory refinement
In the PythonOp glue code, we run the PyTorch code on a torch tensor constructed from the upstream ORTValue. A tensor constructed via DLPack in this way is a leaf node, so when the forward function runs, autograd edges are connected to that leaf tensor and it gets an AccumulateGrad gradient function. The AccumulateGrad function holds a reference to the leaf variable, which means the leaf variable cannot be released until the AccumulateGrad function itself is destroyed, and that only happens after PythonOpGrad completes and unregister_grad_fn is called. This extends the lifetime of the variable considerably.
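To make the lifetime issue concrete, here is a minimal sketch in plain PyTorch (not the PR's glue code; the DLPack round trip through ORT is replaced by an ordinary leaf tensor) showing that the AccumulateGrad node created during the forward pass holds a strong reference to the leaf variable:

```python
import torch

# Stand-in for the tensor that ORT constructs from an ORTValue via DLPack:
# like that tensor, it is a leaf that requires grad.
leaf = torch.randn(4, requires_grad=True)

out = leaf.sum()  # the forward pass builds the autograd graph

# The output's grad_fn has an edge to an AccumulateGrad node, and that node
# keeps a strong reference to the leaf variable.
acc = out.grad_fn.next_functions[0][0]
print(type(acc).__name__)    # AccumulateGrad
print(acc.variable is leaf)  # True: the leaf lives at least as long as acc does
```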
The change in this PR: after THPFunction_apply completes, we cut off the edge connection to the leaf variable, so the AccumulateGrad gradient function is released immediately.
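The lifetime effect can also be observed directly through the CUDA caching allocator's bookkeeping. The sketch below is only an illustration and assumes a CUDA device is available; it is not the PR's code, it just demonstrates that the leaf's storage stays allocated for as long as the graph (and therefore the AccumulateGrad node) is alive, and is freed once that reference is dropped:

```python
import torch

assert torch.cuda.is_available()  # illustration only; assumes a CUDA device
base = torch.cuda.memory_allocated()

leaf = torch.randn(1024, 1024, device="cuda", requires_grad=True)  # ~4 MB
out = leaf.sum()  # graph: out.grad_fn -> AccumulateGrad -> leaf

del leaf
# Still ~4 MB allocated: the AccumulateGrad node keeps the leaf's storage alive.
print(torch.cuda.memory_allocated() - base)

del out  # dropping the graph releases AccumulateGrad, and with it the leaf
print(torch.cuda.memory_allocated() - base)  # back to (nearly) zero
```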
PT:
MEM_STAT - ====== 99 before forward pass ====== MA 9344.0005 MB Max_MA 9696.001 MB CPU Virtual Memory: used = 69.16 GB, percent = 15.7%
MEM_STAT - ====== 99 after forward pass ====== MA 9344.0005 MB Max_MA 9696.001 MB CPU Virtual Memory: used = 69.23 GB, percent = 15.7%
MEM_STAT - ====== 99 after loss ====== MA 9344.0005 MB Max_MA 9696.001 MB CPU Virtual Memory: used = 69.23 GB, percent = 15.7%
MEM_STAT - ====== 99 after backward pass ====== MA 9344.0005 MB Max_MA 9696.001 MB CPU Virtual Memory: used = 69.34 GB, percent = 15.7%
ORT (master):
MEM_STAT - ====== 99 before forward pass ====== MA 9344.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.94 GB, percent = 16.1%
MEM_STAT - ====== 99 after forward pass ====== MA 9696.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.66 GB, percent = 16.0%
MEM_STAT - ====== 99 after loss ====== MA 9696.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.66 GB, percent = 16.0%
MEM_STAT - ====== 99 after backward pass ====== MA 9344.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.71 GB, percent = 16.0%
ORT (This PR):
MEM_STAT - ====== 99 before forward pass ====== MA 9344.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.1 GB, percent = 15.9%
MEM_STAT - ====== 99 after forward pass ====== MA 9344.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.18 GB, percent = 15.9%
MEM_STAT - ====== 99 after loss ====== MA 9344.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.18 GB, percent = 15.9%
MEM_STAT - ====== 99 after backward pass ====== MA 9344.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.23 GB, percent = 15.9%
MA (Memory Allocated) when running with ORT dropped from 9696 MB to 9344 MB, now in parity with PyTorch.
Max_MA (Max Memory Allocated) with ORT is still higher than with PyTorch. This is because ORTModuleFunction performs a coarse-grained gradient accumulation operation, which makes some of the earlier-generated gradients live longer than they do in PyTorch runs.
Motivation and Context