-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Fix JNI custom op code from deregistering the operator fixes #10438 #11885
Conversation
@andrewfayres is it possible to add some testing to this? |
It's more a question of exactly what you want to test. We've got tests for custom operators already and there's already work going on to verify model backward compatibility. |
* instead reinitializing the object which was created when register was called. This means | ||
* that there doesn't seem to be anything to clean up here (previous efforts were actually | ||
* deregistering the operator). | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you mentioned in here, have you observe any memory leaks in here if we got the counter part removed?
Adding a bit of detail based on our conversations this morning @lanking520 I've seen no memory leaks resulting from the counter being removed. The reason is because the code that would eventually run after getting passed the counter would clean up very little. It'd remove the entry from the globalOperator map and delete the reference in the jvm. This effectively deregistered the operator in the JNI side while it could still be in scope in the scala side. The memory that leaks here is gets allocated during the Operator.register scala call. This is the same before and after this fix (we should add a dispose method and/or a Operator.deregister method but that's a bit beyond the scope of fixing this bug). @nswamy I do not believe that there are currently any existing unit tests for our custom operator jni code. The whole JNI file should be tested via unit tests and I will create an issue for us to follow through on that. We do have a custom operator example that gets run during the CI and provides a minimal amount of testing. My manual testing mostly involved running a modified version of our example multiple times. Verifying that I could reproduce the bug and that the crash stopped happening after digging into the code and make the change. Then I verified that both my modified version and the original still worked after the change. The code that was being run after the count check was guaranteed to make an operator unusable in the native side and would crash whenever the jvm tried to use that operator. |
It's been better to solve customer problems, though it brings a memory leak issues. @andrewfayres please add the problem here as an issue in Scala. |
#10438 thought I tagged it in the title but that seems to of not worked. |
@mxnet-label-bot please add [Scala] |
@nswamy bouncing this up for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we already have an comment line in here indicating the change that has been made, I think it should be fine now.
@andrewfayres could you please add how you tested and test results before and after the change to understand the impact, also create an issue to add tests to JNI code link here please. |
Issue to add testing to JNI code. I'll expand on what I said earlier for testing: In both cases, the error was being thrown whenever the upgrade model process called the opPropDel for the 3rd time. The best that I can tell the opPropDel method is only called by the upgradeJSON method when loading a model trained on an older version of MXNet. In this case, there was a counter in the JNI code that on the 3rd time opPropDel was called the custom operator would be deregistered in the native code making all instances of the custom operator unusable and producing a fatal error on the next attempt. My change was to remove the code that did the deregister and make opPropDel method simply return success (which is what it was already doing up until it hit the counter threshold). After making this change I loaded the same models that I had trained previously and this time they loaded without issue. |
Thanks @andrewfayres for the detailed explanation. Next time please add the results also in the PR. What makes such PR hard is that we are basing off of manual tests which are not reproducible unless another person does the exact same manual test to verify. The concern I have in such cases is the possibility of introducing regressions somewhere else. |
Description
This PR fixes a bug where custom operators get deregistered while upgrading a model which was trained in a previous version of mxnet. During the upgrade process the engine calls the custom operator method opPropDel every time it encounters a customer operator which it's already seen. The code had a counter that did nothing until the 3rd time called then deregistered the operator. This caused an whenever the same custom operator was encountered 3 times (either by loading in the model multiple times or loading a model which used the operator multiple times).
The fix is to basically do nothing during the delete callback. The purpose of the callback seems to be to clean up resources that were allocated during the creator function and our creator function isn't allocating any new resources but is instead reinitializing existing resources which were created during the Operator.register call.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments