custom autograd func memory refinement #8993
Merged
Conversation
…o pengwa/pythonop_mem
…o pengwa/pythonop_mem
wschin previously approved these changes on Sep 9, 2021
...ng/python/training/ortmodule/torch_cpp_extensions/torch_interop_utils/torch_interop_utils.cc (outdated, resolved)
...ng/python/training/ortmodule/torch_cpp_extensions/torch_interop_utils/torch_interop_utils.cc (outdated, resolved)
wschin reviewed on Sep 9, 2021
orttraining/orttraining/python/training/ortmodule/_custom_autograd_function_runner.py (resolved)
…tensions/torch_interop_utils/torch_interop_utils.cc Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
…o pengwa/pythonop_mem
…nxruntime into pengwa/pythonop_mem
wschin approved these changes on Sep 9, 2021
Thanks for the hard work. Looks super great.
SherlockNoMad approved these changes on Sep 9, 2021
wangyems pushed a commit that referenced this pull request on Sep 9, 2021
* Release torch tensor referenced by torch gradient graph (created in PythonOp)
* Update orttraining/orttraining/python/training/ortmodule/torch_cpp_extensions/torch_interop_utils/torch_interop_utils.cc
* refine with comments

Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
wangyems added a commit that referenced this pull request on Sep 9, 2021
* fast reduction for reducemean (#8976)
* Adding preprocessor checks for torch version during torch cpp extensions compilation (#8989)
* custom autograd func memory refinement (#8993)
* Release torch tensor referenced by torch gradient graph (created in PythonOp)
* Update orttraining/orttraining/python/training/ortmodule/torch_cpp_extensions/torch_interop_utils/torch_interop_utils.cc
* refine with comments
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
* Fix issues in TensorRT EP (#8996)
* fix big engine load issue and add cuda_cpu_alloc
* remove redundancy
* fix minor issues
* [js/web] fix karma launch with chrome headless (#8998)
* Update Nuget Packge Pipline to CUDA11.4 and TensorRT8 on Windows (#9000)
* Update to CUDA11.4 and TensorRT-8.0.3.4
* update trt pool, remove cudnn from setup_env_gpu.bat
* revert pool
* test gpu package pipeline on t4
* back out changes
* back out changes
Co-authored-by: George Wu <jywu@microsoft.com>
* Fix fuzz testing build blocking release. (#9008)
* add model local function support (#8540)
* updates for picking pnnx commit
* add tests filter to c# tests
* plus test fixes
* fix versioning for contrib ops
* fix tests
* test filter for optional ops
* more versioning related updates
* fix test
* fix layernorm spec
* more updates
* update docs
* add more test filters
* more filters
* update binary size threshold
* update docs
* draft - enable model local function
* enable model local functions in ORT
* update to latest rel onnx commit
* plus tests
* plus more updates
* plus updates
* test updates
* Fix for nested functions + shape inference
* plus bug fix and updates per review
* plus fixes per review
* plus test updates
* plus updates per review
* plus fixes
* fix a test

Co-authored-by: Vincent Wang <wangwchpku@outlook.com>
Co-authored-by: baijumeswani <bmeswani@microsoft.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: stevenlix <38092805+stevenlix@users.noreply.github.com>
Co-authored-by: Yulong Wang <yulongw@microsoft.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
Description: custom autograd func memory refinement
In the PythonOp glue code, we run the PyTorch code on a torch tensor constructed from the upstream ORTValue. A tensor constructed via DLPack in this way is a leaf node, so when the forward function runs, autograd edges are connected to that leaf tensor and it gets an AccumulateGrad gradient function. The AccumulateGrad function holds a reference to the leaf variable, which means the leaf variable cannot be released until the AccumulateGrad function itself is destroyed, and that only happens after PythonOpGrad completes and unregister_grad_fn is called. This extends the lifetime of the variable considerably.
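To make the lifetime issue concrete, here is a minimal sketch in plain PyTorch (not the PR's glue code; the DLPack round trip through ORT is replaced by an ordinary leaf tensor) showing that the AccumulateGrad node created during the forward pass holds a strong reference to the leaf variable:

```python
import torch

# Stand-in for the tensor that ORT constructs from an ORTValue via DLPack:
# like that tensor, it is a leaf that requires grad.
leaf = torch.randn(4, requires_grad=True)

out = leaf.sum()  # the forward pass builds the autograd graph

# The output's grad_fn has an edge to an AccumulateGrad node, and that node
# keeps a strong reference to the leaf variable.
acc = out.grad_fn.next_functions[0][0]
print(type(acc).__name__)    # AccumulateGrad
print(acc.variable is leaf)  # True: the leaf lives at least as long as acc does
```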
The change in this PR: after THPFunction_apply completes, we cut off the edge connection to the leaf variable, so the AccumulateGrad gradient function is released immediately.
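The lifetime effect can also be observed directly through the CUDA caching allocator's bookkeeping. The sketch below is only an illustration and assumes a CUDA device is available; it is not the PR's code, it just demonstrates that the leaf's storage stays allocated for as long as the graph (and therefore the AccumulateGrad node) is alive, and is freed once that reference is dropped:

```python
import torch

assert torch.cuda.is_available()  # illustration only; assumes a CUDA device
base = torch.cuda.memory_allocated()

leaf = torch.randn(1024, 1024, device="cuda", requires_grad=True)  # ~4 MB
out = leaf.sum()  # graph: out.grad_fn -> AccumulateGrad -> leaf

del leaf
# Still ~4 MB allocated: the AccumulateGrad node keeps the leaf's storage alive.
print(torch.cuda.memory_allocated() - base)

del out  # dropping the graph releases AccumulateGrad, and with it the leaf
print(torch.cuda.memory_allocated() - base)  # back to (nearly) zero
```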
PT:
MEM_STAT - ====== 99 before forward pass ====== MA 9344.0005 MB Max_MA 9696.001 MB CPU Virtual Memory: used = 69.16 GB, percent = 15.7%
MEM_STAT - ====== 99 after forward pass ====== MA 9344.0005 MB Max_MA 9696.001 MB CPU Virtual Memory: used = 69.23 GB, percent = 15.7%
MEM_STAT - ====== 99 after loss ====== MA 9344.0005 MB Max_MA 9696.001 MB CPU Virtual Memory: used = 69.23 GB, percent = 15.7%
MEM_STAT - ====== 99 after backward pass ====== MA 9344.0005 MB Max_MA 9696.001 MB CPU Virtual Memory: used = 69.34 GB, percent = 15.7%
ORT (master):
MEM_STAT - ====== 99 before forward pass ====== MA 9344.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.94 GB, percent = 16.1%
MEM_STAT - ====== 99 after forward pass ====== MA 9696.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.66 GB, percent = 16.0%
MEM_STAT - ====== 99 after loss ====== MA 9696.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.66 GB, percent = 16.0%
MEM_STAT - ====== 99 after backward pass ====== MA 9344.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.71 GB, percent = 16.0%
ORT (This PR):
MEM_STAT - ====== 99 before forward pass ====== MA 9344.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.1 GB, percent = 15.9%
MEM_STAT - ====== 99 after forward pass ====== MA 9344.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.18 GB, percent = 15.9%
MEM_STAT - ====== 99 after loss ====== MA 9344.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.18 GB, percent = 15.9%
MEM_STAT - ====== 99 after backward pass ====== MA 9344.0005 MB Max_MA 12544.001 MB CPU Virtual Memory: used = 70.23 GB, percent = 15.9%
MA (Memory Allocated) when running with ORT dropped from 9696 MB to 9344 MB, now in parity with PyTorch.
Max_MA (Max Memory Allocated) with ORT is still higher than with PyTorch. This is because ORTModuleFunction performs a coarse-grained gradient accumulation operation, which makes some of the earlier-generated gradients live longer than they do in PyTorch runs.
Motivation and Context