You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository was archived by the owner on Nov 17, 2023. It is now read-only.
Our Scala application uses MxNet and includes several tests. Many of these tests load the same model files, so as a result the same process ends up loading a model multiple times. This normally works fine, but when upgrading from 0.11.0 to 1.1.0 we discovered an odd issue causing our tests to fail. We discovered that if you have a model that uses a custom operator and was built from an mxnet version older than the current one (thus prompting the "Attempting to upgrade..." message), and you load it more than twice in the same process, then on the third reload it crashes.
Analysis
Based on my investigation, the relevant lines appear to be:
legacy_json_util.cc#UpgradeJSON_FixParsing (line 30): This method is run whenever a model older than the current version is loaded. It invokes the attribute parser for each op.
legacy_json_util.cc#UpgradeJSON_Parse (line 92): This method is always run. It also invokes the attribute parser for each op
custom.cc#AttrParser (line 80): This line resets the shared pointer to the custom operator's callback list. Note that it also sets up a deleter function that unregisters the op by invoking the function in the next bullet.
ml_dmlc_mxnet_native_c_api.cc#opPropDel (line 2426): This function unregisters the custom operator. However, there is logic here that tracks how many times this function has been invoked. It only deregisters the custom op on the third invocation.
Suppose we are running MXNet 1.1.0 and load a model built on 0.11.0. Because of the version mismatch, both of the UpgradeJSON stages are run, each one running the attribute parser for each op. When the custom op's attribute parser is run the second time, the shared pointer to the operator's callback list is reset, triggering the unregister function to be invoked. This function, for now, simply updates a counter. The counter now has value 1.
Now suppose during the same process, we load the same model again. The same scenario as above happens again, except this time the counter is incremented to 2.
The third time the model is loaded, again the same scenario reoccurs, except this time the counter is already at value 2, so the unregister function actually unregisters the custom op rather than simply updating the counter. Our custom operator is no longer registered, so shortly thereafter we get an "Operator [op-name] is not registered" fatal error.
The text was updated successfully, but these errors were encountered:
I've been looking into this bug. It seems to occur whenever the upgradeJSON stage encounters the same custom operator 3 times. In your case it happens because you're reloading the same model but it can also occur if the operator is in the same model 3 times.
Best I can tell there isn't actually anything to be done in opPropDel. Obviously deregistering the operator isn't correct and there doesn't appear to be any native memory which should be freed. I'm working on a PR which should fix this.
Hi Milan, this issue is resolved, we are about release 1.3 and will make it available through Maven. I closing this issue as resolved, please open a new issue if you still find issues.
Version
MXNet 1.1.0 with Scala
Description
Our Scala application uses MxNet and includes several tests. Many of these tests load the same model files, so as a result the same process ends up loading a model multiple times. This normally works fine, but when upgrading from 0.11.0 to 1.1.0 we discovered an odd issue causing our tests to fail. We discovered that if you have a model that uses a custom operator and was built from an mxnet version older than the current one (thus prompting the "Attempting to upgrade..." message), and you load it more than twice in the same process, then on the third reload it crashes.
Analysis
Based on my investigation, the relevant lines appear to be:
legacy_json_util.cc#UpgradeJSON_FixParsing (line 30)
: This method is run whenever a model older than the current version is loaded. It invokes the attribute parser for each op.legacy_json_util.cc#UpgradeJSON_Parse (line 92)
: This method is always run. It also invokes the attribute parser for each opcustom.cc#AttrParser (line 80)
: This line resets the shared pointer to the custom operator's callback list. Note that it also sets up a deleter function that unregisters the op by invoking the function in the next bullet.ml_dmlc_mxnet_native_c_api.cc#opPropDel (line 2426)
: This function unregisters the custom operator. However, there is logic here that tracks how many times this function has been invoked. It only deregisters the custom op on the third invocation.Suppose we are running MXNet 1.1.0 and load a model built on 0.11.0. Because of the version mismatch, both of the UpgradeJSON stages are run, each one running the attribute parser for each op. When the custom op's attribute parser is run the second time, the shared pointer to the operator's callback list is reset, triggering the unregister function to be invoked. This function, for now, simply updates a counter. The counter now has value 1.
Now suppose during the same process, we load the same model again. The same scenario as above happens again, except this time the counter is incremented to 2.
The third time the model is loaded, again the same scenario reoccurs, except this time the counter is already at value 2, so the unregister function actually unregisters the custom op rather than simply updating the counter. Our custom operator is no longer registered, so shortly thereafter we get an "Operator [op-name] is not registered" fatal error.
The text was updated successfully, but these errors were encountered: