Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix a bug in CachedOP. #12184

Merged
merged 4 commits into from
Aug 17, 2018
Merged

Fix a bug in CachedOP. #12184

merged 4 commits into from
Aug 17, 2018

Conversation

zheng-da
Copy link
Contributor

Description

As reported in #12116, the modification in #11951 causes excessive memory consumption. The reason is that #11951 actually turns on recording even in the inference mode.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@zheng-da zheng-da requested a review from anirudh2290 as a code owner August 15, 2018 18:22
@zheng-da
Copy link
Contributor Author

@piiswrong

RunGraph(false, idx, arrays, 0, idx.num_nodes(), std::move(array_reqs),
std::move(ref_count), &states, dispatch_modes,
!recording || inlining_);
recording);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend putting the recording logic in this line and not changing the semantics of recording variable in case in the future someone decides to use recording below RunGraph call.

  RunGraph(false, idx, arrays, 0, idx.num_nodes(), std::move(array_reqs),
           std::move(ref_count), &states, dispatch_modes,
           recording && inlining_);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm trying to follow the logic of the original code. A little logic error can cause unexpected problems.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it easier to understand the logic: RunGraph must record only when autograd is recording and CachedOp is inlining. An easier to understand logic can prevent future confusion. Up to you to change or not.

@@ -823,9 +823,10 @@ OpStatePtr CachedOp::DynamicForward(

// If we are already recording, we don't need RunGraph to record all
// computation again.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend clarifying in the comment that "When CachedOp is inlining, RunGraph is in charge of recording the graph, otherwise we are in charge of recording and RunGraph shall not be recording".

@safrooze
Copy link
Contributor

Do we have any unit-test framework to check the growth of the bufferpool that we can use to write a unit-test for this fix?

Copy link
Contributor

@safrooze safrooze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested the fix with my code and the excessive memory allocation issue reported in #12116 is fixed. Would be great to have a unit-test to avoid future regressions.

@haojin2
Copy link
Contributor

haojin2 commented Aug 16, 2018

@eric-haibin-lin @piiswrong This PR should be good to merge?

@piiswrong piiswrong merged commit cd9f9c8 into apache:master Aug 17, 2018
marcoabreu added a commit that referenced this pull request Aug 17, 2018
piyushghai added a commit to piyushghai/incubator-mxnet that referenced this pull request Aug 17, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* fix a bug.

* address comments.

* retrigger

* address comments.
@zheng-da zheng-da deleted the fix_cachedop2 branch September 29, 2018 21:32
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants