Loading an older model with a custom operator from Java/Scala more than twice causes MXNet to crash #10438

milandesai · 2018-04-06T09:26:25Z

Version

MXNet 1.1.0 with Scala

Description

Our Scala application uses MxNet and includes several tests. Many of these tests load the same model files, so as a result the same process ends up loading a model multiple times. This normally works fine, but when upgrading from 0.11.0 to 1.1.0 we discovered an odd issue causing our tests to fail. We discovered that if you have a model that uses a custom operator and was built from an mxnet version older than the current one (thus prompting the "Attempting to upgrade..." message), and you load it more than twice in the same process, then on the third reload it crashes.

Analysis

Based on my investigation, the relevant lines appear to be:

legacy_json_util.cc#UpgradeJSON_FixParsing (line 30): This method is run whenever a model older than the current version is loaded. It invokes the attribute parser for each op.
legacy_json_util.cc#UpgradeJSON_Parse (line 92): This method is always run. It also invokes the attribute parser for each op
custom.cc#AttrParser (line 80): This line resets the shared pointer to the custom operator's callback list. Note that it also sets up a deleter function that unregisters the op by invoking the function in the next bullet.
ml_dmlc_mxnet_native_c_api.cc#opPropDel (line 2426): This function unregisters the custom operator. However, there is logic here that tracks how many times this function has been invoked. It only deregisters the custom op on the third invocation.

Suppose we are running MXNet 1.1.0 and load a model built on 0.11.0. Because of the version mismatch, both of the UpgradeJSON stages are run, each one running the attribute parser for each op. When the custom op's attribute parser is run the second time, the shared pointer to the operator's callback list is reset, triggering the unregister function to be invoked. This function, for now, simply updates a counter. The counter now has value 1.

Now suppose during the same process, we load the same model again. The same scenario as above happens again, except this time the counter is incremented to 2.

The third time the model is loaded, again the same scenario reoccurs, except this time the counter is already at value 2, so the unregister function actually unregisters the custom op rather than simply updating the counter. Our custom operator is no longer registered, so shortly thereafter we get an "Operator [op-name] is not registered" fatal error.

The text was updated successfully, but these errors were encountered:

andrewfayres · 2018-07-24T22:00:47Z

I've been looking into this bug. It seems to occur whenever the upgradeJSON stage encounters the same custom operator 3 times. In your case it happens because you're reloading the same model but it can also occur if the operator is in the same model 3 times.

Best I can tell there isn't actually anything to be done in opPropDel. Obviously deregistering the operator isn't correct and there doesn't appear to be any native memory which should be freed. I'm working on a PR which should fix this.

nswamy · 2018-08-24T21:26:14Z

Hi Milan, this issue is resolved, we are about release 1.3 and will make it available through Maven. I closing this issue as resolved, please open a new issue if you still find issues.

nswamy added Bug Scala labels Apr 6, 2018

andrewfayres mentioned this issue Jul 31, 2018

Fix JNI custom op code from deregistering the operator fixes #10438 #11885

Merged

7 tasks

nswamy closed this as completed Aug 24, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Loading an older model with a custom operator from Java/Scala more than twice causes MXNet to crash #10438

Loading an older model with a custom operator from Java/Scala more than twice causes MXNet to crash #10438

milandesai commented Apr 6, 2018

andrewfayres commented Jul 24, 2018 •

edited

Loading

nswamy commented Aug 24, 2018

Loading an older model with a custom operator from Java/Scala more than twice causes MXNet to crash #10438

Loading an older model with a custom operator from Java/Scala more than twice causes MXNet to crash #10438

Comments

milandesai commented Apr 6, 2018

Version

Description

Analysis

andrewfayres commented Jul 24, 2018 • edited Loading

nswamy commented Aug 24, 2018

andrewfayres commented Jul 24, 2018 •

edited

Loading