-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Problem with mxnet on CC8 #30432
Comments
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
A new Issue was created by @Dr15Jones Chris Jones. @Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
FYI @vlimant @gkasieczka @riga (I don't know Mia's GitHub handle) |
There might be related |
We could probably drop MXNet in the near future -- currently only the ParticleNet tagger is using it, but I have managed to move it to ONNXRuntime recently. |
@hqucms Thanks for the update. Does that imply that we could fully remove MXNet, or that we're good until someone wants to add a module using MXNet for inference? |
adding Mia @mialiu149 too |
@makortel I think there is a high chance that we could fully remove MXNet as most of the MXNet models can be exported to ONNX either using the existing conversion tool (that's what happened for DeepAK8) or by hand (that's what we did for ParticleNet). Exceptions may exist, but I am not sure if we have to worry about them also given the very limited usage of MXNet (up to now DeepAK8/ParticleNet has been the only use cases in CMSSW). |
@hqucms Thanks for the clarifications. Do you have any idea how soon the ParticleNet could be moved to ONNXRuntime? |
@makortel I can probably make a PR this week or next. |
I tried to debug it a bit and it fails here |
@smuzaffar if you want to debug any further, I'd suggest putting a break point there and seeing if the same object is called multiple times. Another possible issue is the shared library containing the code is unloaded before the destructor is called? That seems unlikely but could happen. |
Does that mean that here |
@makortel , I check and no we are not using the |
What is the full traceback of the crash? |
@Dr15Jones stack trace is same as you have mentioned |
How about doing an strace on the job to see if the shared library is unloaded early? |
ok @Dr15Jones I will try that I am not sure if one can unload a library for linux OS. |
Ok, thanks. I'm puzzled how the |
you are right @makortel , we are ended up in
what I did was to use build [a] in cmssw env and runing this shows the condition is true
[a]
|
OK problem is that mxnet CMAKE forces to disable DMLC_MODERN_THREAD_LOCAL ( https://github.com/cms-externals/incubator-mxnet/blob/1.6.0/CMakeLists.txt#L120 ) this is why https://github.com/cms-externals/incubator-mxnet/blob/1.6.0/src/resource.cc#L458 ends up in |
Thanks @smuzaffar. I'm not sure what to take from the comment preceding the
|
I am also not sure, they explain it a bit here apache/mxnet#16526 . |
cms-sw/cms-bot#1348 should enable threaded relvals for cc8 and gcc10 IBs |
Thanks, let's see how it goes. |
@slava77 @perrotta @riga @mialiu149 @hqucms |
I'm fine to drop it based on the current needs in the release itself. |
We should think carefully what exactly "needed" means here. As things are now, we can't generally use mxnet for inference. |
this was mostly a speculative statement from me. I think that the ML group should clarify if there are objections to remove mxnet before we actually do it. |
@riga @mialiu149 @makortel |
I think the main question is about the maintenance effort. Let me think out loud. Removing Besides that, I'd assume the mxnet external itself to require some maintenance effort (like updates). Removing the external would help to reduce that. On the other hand, the maintenance effort should be weighed with the benefits of keeping the external. Is there a use case for keeping the mxnet external? |
Has anything changed within a year? Are there any use cases for keeping mxnet external? Or could we just remove both |
(perhaps I've misheard) I think that mxnet was mentioned yesterday in the context of DRN implementation. @lgray @SohamBhattacharya |
I might have misspoken. But @lgray can confirm if the implementation is in mxnet. |
Given that nothing has happened in two years, I think it would be safe to remove the |
@smuzaffar Should we just go ahead and remove mxnet? |
@makortel , I am all in favor of dropping it. I will open cmssw/cmsdist PRs to remove |
@Dr15Jones (et al.) after the merge of #41377 this can be probably considered fixed, and closed |
+core |
@cmsbuild, please close |
This issue is fully signed and ready to be closed. |
Running workflows on CC8 which usie mxnet where the jobs use multilple threads leads to a crash at the end of the job. This can sometimes be reproduced when running under the gdb which yields the following traceback
One question, has the libmxnet.so shared library already been unloaded from the process before this happened?
The text was updated successfully, but these errors were encountered: