Skip to content
This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Loading an older model with a custom operator from Java/Scala more than twice causes MXNet to crash #10438

Closed
milandesai opened this issue Apr 6, 2018 · 2 comments

Comments

@milandesai
Copy link
Contributor

Version

MXNet 1.1.0 with Scala

Description

Our Scala application uses MxNet and includes several tests. Many of these tests load the same model files, so as a result the same process ends up loading a model multiple times. This normally works fine, but when upgrading from 0.11.0 to 1.1.0 we discovered an odd issue causing our tests to fail. We discovered that if you have a model that uses a custom operator and was built from an mxnet version older than the current one (thus prompting the "Attempting to upgrade..." message), and you load it more than twice in the same process, then on the third reload it crashes.

Analysis

Based on my investigation, the relevant lines appear to be:

  • legacy_json_util.cc#UpgradeJSON_FixParsing (line 30): This method is run whenever a model older than the current version is loaded. It invokes the attribute parser for each op.
  • legacy_json_util.cc#UpgradeJSON_Parse (line 92): This method is always run. It also invokes the attribute parser for each op
  • custom.cc#AttrParser (line 80): This line resets the shared pointer to the custom operator's callback list. Note that it also sets up a deleter function that unregisters the op by invoking the function in the next bullet.
  • ml_dmlc_mxnet_native_c_api.cc#opPropDel (line 2426): This function unregisters the custom operator. However, there is logic here that tracks how many times this function has been invoked. It only deregisters the custom op on the third invocation.

Suppose we are running MXNet 1.1.0 and load a model built on 0.11.0. Because of the version mismatch, both of the UpgradeJSON stages are run, each one running the attribute parser for each op. When the custom op's attribute parser is run the second time, the shared pointer to the operator's callback list is reset, triggering the unregister function to be invoked. This function, for now, simply updates a counter. The counter now has value 1.

Now suppose during the same process, we load the same model again. The same scenario as above happens again, except this time the counter is incremented to 2.

The third time the model is loaded, again the same scenario reoccurs, except this time the counter is already at value 2, so the unregister function actually unregisters the custom op rather than simply updating the counter. Our custom operator is no longer registered, so shortly thereafter we get an "Operator [op-name] is not registered" fatal error.

@andrewfayres
Copy link
Contributor

andrewfayres commented Jul 24, 2018

I've been looking into this bug. It seems to occur whenever the upgradeJSON stage encounters the same custom operator 3 times. In your case it happens because you're reloading the same model but it can also occur if the operator is in the same model 3 times.

Best I can tell there isn't actually anything to be done in opPropDel. Obviously deregistering the operator isn't correct and there doesn't appear to be any native memory which should be freed. I'm working on a PR which should fix this.

@nswamy
Copy link
Member

nswamy commented Aug 24, 2018

Hi Milan, this issue is resolved, we are about release 1.3 and will make it available through Maven. I closing this issue as resolved, please open a new issue if you still find issues.

@nswamy nswamy closed this as completed Aug 24, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants