Skip to content
This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Fix JNI custom op code from deregistering the operator fixes #10438 #11885

Merged
merged 1 commit into from
Aug 21, 2018

Conversation

andrewfayres
Copy link
Contributor

Description

This PR fixes a bug where custom operators get deregistered while upgrading a model which was trained in a previous version of mxnet. During the upgrade process the engine calls the custom operator method opPropDel every time it encounters a customer operator which it's already seen. The code had a counter that did nothing until the 3rd time called then deregistered the operator. This caused an whenever the same custom operator was encountered 3 times (either by loading in the model multiple times or loading a model which used the operator multiple times).

The fix is to basically do nothing during the delete callback. The purpose of the callback seems to be to clean up resources that were allocated during the creator function and our creator function isn't allocating any new resources but is instead reinitializing existing resources which were created during the Operator.register call.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@andrewfayres
Copy link
Contributor Author

@lanking520

@nswamy
Copy link
Member

nswamy commented Jul 27, 2018

@andrewfayres is it possible to add some testing to this?

@andrewfayres
Copy link
Contributor Author

It's more a question of exactly what you want to test. We've got tests for custom operators already and there's already work going on to verify model backward compatibility.

* instead reinitializing the object which was created when register was called. This means
* that there doesn't seem to be anything to clean up here (previous efforts were actually
* deregistering the operator).
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you mentioned in here, have you observe any memory leaks in here if we got the counter part removed?

@andrewfayres
Copy link
Contributor Author

Adding a bit of detail based on our conversations this morning

@lanking520 I've seen no memory leaks resulting from the counter being removed. The reason is because the code that would eventually run after getting passed the counter would clean up very little. It'd remove the entry from the globalOperator map and delete the reference in the jvm. This effectively deregistered the operator in the JNI side while it could still be in scope in the scala side. The memory that leaks here is gets allocated during the Operator.register scala call. This is the same before and after this fix (we should add a dispose method and/or a Operator.deregister method but that's a bit beyond the scope of fixing this bug).

@nswamy I do not believe that there are currently any existing unit tests for our custom operator jni code. The whole JNI file should be tested via unit tests and I will create an issue for us to follow through on that. We do have a custom operator example that gets run during the CI and provides a minimal amount of testing. My manual testing mostly involved running a modified version of our example multiple times. Verifying that I could reproduce the bug and that the crash stopped happening after digging into the code and make the change. Then I verified that both my modified version and the original still worked after the change. The code that was being run after the count check was guaranteed to make an operator unusable in the native side and would crash whenever the jvm tried to use that operator.

@lanking520
Copy link
Member

It's been better to solve customer problems, though it brings a memory leak issues. @andrewfayres please add the problem here as an issue in Scala.

@andrewfayres
Copy link
Contributor Author

#10438 thought I tagged it in the title but that seems to of not worked.

@nswamy
Copy link
Member

nswamy commented Aug 8, 2018

@mxnet-label-bot please add [Scala]

@anirudh2290 anirudh2290 added Scala pr-awaiting-review PR is waiting for code review labels Aug 9, 2018
@lupesko
Copy link
Contributor

lupesko commented Aug 21, 2018

@nswamy bouncing this up for review

Copy link
Member

@lanking520 lanking520 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we already have an comment line in here indicating the change that has been made, I think it should be fine now.

@nswamy
Copy link
Member

nswamy commented Aug 21, 2018

@andrewfayres could you please add how you tested and test results before and after the change to understand the impact, also create an issue to add tests to JNI code link here please.

@andrewfayres
Copy link
Contributor Author

Issue to add testing to JNI code.

I'll expand on what I said earlier for testing:
Reproducing the issue was fairly straight forward. Using the existing custom operator as a template, I trained and saved a model with a custom operator. Then I attempted to load the model multiple times. On the 3rd attempt a fatal "Operator not found error" was thrown. I then trained another model with 3 of the same custom operators, saved it, then attempted to load it and received the same error.

In both cases, the error was being thrown whenever the upgrade model process called the opPropDel for the 3rd time. The best that I can tell the opPropDel method is only called by the upgradeJSON method when loading a model trained on an older version of MXNet. In this case, there was a counter in the JNI code that on the 3rd time opPropDel was called the custom operator would be deregistered in the native code making all instances of the custom operator unusable and producing a fatal error on the next attempt.

My change was to remove the code that did the deregister and make opPropDel method simply return success (which is what it was already doing up until it hit the counter threshold).

After making this change I loaded the same models that I had trained previously and this time they loaded without issue.

@nswamy
Copy link
Member

nswamy commented Aug 21, 2018

Thanks @andrewfayres for the detailed explanation. Next time please add the results also in the PR. What makes such PR hard is that we are basing off of manual tests which are not reproducible unless another person does the exact same manual test to verify. The concern I have in such cases is the possibility of introducing regressions somewhere else.

@nswamy nswamy merged commit bf1edaf into apache:master Aug 21, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request Sep 19, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review Scala
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants