Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with mxnet on CC8 #30432

Closed
Dr15Jones opened this issue Jun 26, 2020 · 49 comments
Closed

Problem with mxnet on CC8 #30432

Dr15Jones opened this issue Jun 26, 2020 · 49 comments

Comments

@Dr15Jones
Copy link
Contributor

Running workflows on CC8 which usie mxnet where the jobs use multilple threads leads to a crash at the end of the job. This can sometimes be reproduced when running under the gdb which yields the following traceback

#0  0x000000003442bf80 in ?? ()
#1  0x00007fff9f0f0472 in mxnet::resource::ResourceManagerImpl::~ResourceManagerImpl() () from /cvmfs/cms-ib.cern.ch/week0/cc8_amd64_gcc8/cms/cmssw-patch/CMSSW_11_2_X_2020-06-26-1100/external/cc8_amd64_gcc8/lib/libmxnet.so
#2  0x00007fff9f0f0ca5 in dmlc::ThreadLocalStore<mxnet::resource::ResourceManagerImpl>::~ThreadLocalStore() ()
   from /cvmfs/cms-ib.cern.ch/week0/cc8_amd64_gcc8/cms/cmssw-patch/CMSSW_11_2_X_2020-06-26-1100/external/cc8_amd64_gcc8/lib/libmxnet.so
#3  0x00007ffff552406c in __run_exit_handlers () from /lib64/libc.so.6
#4  0x00007ffff55241a0 in exit () from /lib64/libc.so.6
#5  0x00007ffff550d87a in __libc_start_main () from /lib64/libc.so.6
#6  0x000000000041145e in _start ()

One question, has the libmxnet.so shared library already been unloaded from the process before this happened?

@Dr15Jones
Copy link
Contributor Author

assign core

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

A new Issue was created by @Dr15Jones Chris Jones.

@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

FYI @vlimant @gkasieczka @riga (I don't know Mia's GitHub handle)

@slava77
Copy link
Contributor

slava77 commented Jun 26, 2020

@hqucms

@mrodozov
Copy link
Contributor

There might be related
apache/mxnet#16431
and also this dmlc/dmlc-core#571
found in the mxnet issue

@hqucms
Copy link
Contributor

hqucms commented Jun 26, 2020

We could probably drop MXNet in the near future -- currently only the ParticleNet tagger is using it, but I have managed to move it to ONNXRuntime recently.

@makortel
Copy link
Contributor

@hqucms Thanks for the update. Does that imply that we could fully remove MXNet, or that we're good until someone wants to add a module using MXNet for inference?

@vlimant
Copy link
Contributor

vlimant commented Jun 29, 2020

adding Mia @mialiu149 too

@hqucms
Copy link
Contributor

hqucms commented Jun 29, 2020

@hqucms Thanks for the update. Does that imply that we could fully remove MXNet, or that we're good until someone wants to add a module using MXNet for inference?

@makortel I think there is a high chance that we could fully remove MXNet as most of the MXNet models can be exported to ONNX either using the existing conversion tool (that's what happened for DeepAK8) or by hand (that's what we did for ParticleNet). Exceptions may exist, but I am not sure if we have to worry about them also given the very limited usage of MXNet (up to now DeepAK8/ParticleNet has been the only use cases in CMSSW).

@makortel
Copy link
Contributor

@hqucms Thanks for the clarifications. Do you have any idea how soon the ParticleNet could be moved to ONNXRuntime?

@hqucms
Copy link
Contributor

hqucms commented Jun 30, 2020

@makortel I can probably make a PR this week or next.

@smuzaffar
Copy link
Contributor

I tried to debug it a bit and it fails here
https://github.com/dmlc/dmlc-core/blob/0e13243556d7de754a923c8f97998721d694f29b/include/dmlc/thread_local.h#L58
during the end of job. Note that is is failing for CentOS8/AMD64 , SLC7/PowerPC and SLC7/AARCH64. On SLC7/AMD64 (with gcc8 , 9 and 10) its does not crash.

@Dr15Jones
Copy link
Contributor Author

@smuzaffar if you want to debug any further, I'd suggest putting a break point there and seeing if the same object is called multiple times. Another possible issue is the shared library containing the code is unloaded before the destructor is called? That seems unlikely but could happen.

@makortel
Copy link
Contributor

makortel commented Jul 6, 2020

Does that mean that here
https://github.com/dmlc/dmlc-core/blob/0e13243556d7de754a923c8f97998721d694f29b/include/dmlc/thread_local.h#L39-L49
we use the #else block? (IIUC the use of #if block would leave data_ to be empty) If that is really the case, would we have a way to enable DMLC_CXX11_THREAD_LOCAL && DMLC_MODERN_THREAD_LOCAL == 1?

@smuzaffar
Copy link
Contributor

@makortel , I check and no we are not using the else block.

@Dr15Jones
Copy link
Contributor Author

What is the full traceback of the crash?

@smuzaffar
Copy link
Contributor

@Dr15Jones stack trace is same as you have mentioned

@Dr15Jones
Copy link
Contributor Author

How about doing an strace on the job to see if the shared library is unloaded early?

@smuzaffar
Copy link
Contributor

ok @Dr15Jones I will try that I am not sure if one can unload a library for linux OS.

@makortel
Copy link
Contributor

makortel commented Jul 6, 2020

I check and no we are not using the else block.

Ok, thanks. I'm puzzled how the ThreadLocalStorage::data_ can have any elements then. Would you be able to check how many elements data_ has in this case?

@smuzaffar
Copy link
Contributor

you are right @makortel , we are ended up in else block, gdb show the following

Thread 1 "cmsRun" hit Breakpoint 6, dmlc::ThreadLocalStore<MXAPIThreadLocalEntry<int> >::Get ()
    at /build/muz/x/w/BUILD/cc8_amd64_gcc8/external/mxnet-predict/1.6.0-cms/mxnet-predict-1.6.0/include/dmlc/./thread_local.h:46
46            Singleton()->RegisterDelete(ptr);

Thread 1 "cmsRun" hit Breakpoint 6, dmlc::ThreadLocalStore<mxnet::resource::ResourceManagerImpl>::Get ()
    at /build/muz/x/w/BUILD/cc8_amd64_gcc8/external/mxnet-predict/1.6.0-cms/mxnet-predict-1.6.0/include/dmlc/thread_local.h:46
46            Singleton()->RegisterDelete(ptr);

what I did was to use build [a] in cmssw env and runing this shows the condition is true

Singularity> ./test/cc8_amd64_gcc8/TestMxnet
OK

[a]

#include <iostream>
#include <dmlc/thread_local.h>
int main()
{
#if DMLC_CXX11_THREAD_LOCAL && DMLC_MODERN_THREAD_LOCAL == 1
  std::cout <<"OK"<<std::endl;
#else
  std::cout <<"NOT"<<std::endl;
#endif
  return 0;
}

@smuzaffar
Copy link
Contributor

OK problem is that mxnet CMAKE forces to disable DMLC_MODERN_THREAD_LOCAL ( https://github.com/cms-externals/incubator-mxnet/blob/1.6.0/CMakeLists.txt#L120 ) this is why https://github.com/cms-externals/incubator-mxnet/blob/1.6.0/src/resource.cc#L458 ends up in else block here https://github.com/dmlc/dmlc-core/blob/0e13243556d7de754a923c8f97998721d694f29b/include/dmlc/thread_local.h#L39 . Updating mxnet cmake to set DMLC_MODERN_THREAD_LOCAL=1 makes cmsRun without segfault ( 3 runs)

@makortel
Copy link
Contributor

makortel commented Jul 6, 2020

Thanks @smuzaffar. I'm not sure what to take from the comment preceding the DMLC_MODERN_THREAD_LOCAL definition
https://github.com/cms-externals/incubator-mxnet/blob/1.6.0/CMakeLists.txt#L119
that points to
dmlc/dmlc-core#571 (comment)
specifically

Projects like MXNet depend on the lifetime of the thread_local data extending beyond the threads lifetime.

@smuzaffar
Copy link
Contributor

I am also not sure, they explain it a bit here apache/mxnet#16526 .

@makortel
Copy link
Contributor

I fully support #30667, but I think the crashes were in the L1 code and have been fixed (so were actually found thanks to glibc-malloc).

I'm fine to wait for #30667 for arm and ppc though, so how about enabling multithreaded tests on cc8-amd64?

@smuzaffar
Copy link
Contributor

cms-sw/cms-bot#1348 should enable threaded relvals for cc8 and gcc10 IBs

@makortel
Copy link
Contributor

Thanks, let's see how it goes.

@makortel
Copy link
Contributor

makortel commented Aug 6, 2020

@slava77 @perrotta @riga @mialiu149 @hqucms
Can mxnet be removed altogether?

@slava77
Copy link
Contributor

slava77 commented Aug 7, 2020

@slava77 @perrotta @riga @mialiu149 @hqucms
Can mxnet be removed altogether?

I'm fine to drop it based on the current needs in the release itself.
I wouldn't be surprised if it's needed again at a later time.

@makortel
Copy link
Contributor

makortel commented Aug 7, 2020

I wouldn't be surprised if it's needed again at a later time.

We should think carefully what exactly "needed" means here. As things are now, we can't generally use mxnet for inference.

@slava77
Copy link
Contributor

slava77 commented Aug 7, 2020

I wouldn't be surprised if it's needed again at a later time.

We should think carefully what exactly "needed" means here. As things are now, we can't generally use mxnet for inference.

this was mostly a speculative statement from me. I think that the ML group should clarify if there are objections to remove mxnet before we actually do it.

@slava77
Copy link
Contributor

slava77 commented Aug 25, 2020

@riga @mialiu149
based on lack of feedback, it sounds like we can remove mxnet.

@makortel
would removal of PhysicsTools/MXNet be enough, or will this need a removal of the external as well?

@makortel
Copy link
Contributor

would removal of PhysicsTools/MXNet be enough, or will this need a removal of the external as well?

I think the main question is about the maintenance effort. Let me think out loud.

Removing PhysicsTools/MXNet would very likely prevent anyone from using mxnet for inference inside CMSSW, and thus prevent from us getting back to the failures mentioned in the issue description.

Besides that, I'd assume the mxnet external itself to require some maintenance effort (like updates). Removing the external would help to reduce that. On the other hand, the maintenance effort should be weighed with the benefits of keeping the external.

Is there a use case for keeping the mxnet external?

@makortel
Copy link
Contributor

Has anything changed within a year? Are there any use cases for keeping mxnet external? Or could we just remove both PhysicsTools/MXNet and the external?

@slava77
Copy link
Contributor

slava77 commented Sep 16, 2021

(perhaps I've misheard) I think that mxnet was mentioned yesterday in the context of DRN implementation.
OTOH, that network was also mentioned to be deployed via Sonic/Triton. So, the PhysicsTools/MXNet Predictor may be irrelevant.

@lgray @SohamBhattacharya
please clarify.

@SohamBhattacharya
Copy link
Contributor

(perhaps I've misheard) I think that mxnet was mentioned yesterday in the context of DRN implementation.
OTOH, that network was also mentioned to be deployed via Sonic/Triton. So, the PhysicsTools/MXNet Predictor may be irrelevant.

@lgray @SohamBhattacharya
please clarify.

I might have misspoken. But @lgray can confirm if the implementation is in mxnet.

@makortel
Copy link
Contributor

Given that nothing has happened in two years, I think it would be safe to remove the mxnet-predict external and PhysicsTools/MXNet package. Any objections? @cms-sw/reconstruction-l2 @gkasieczka (I don't know the GitHub handles of other ML folks)

@makortel
Copy link
Contributor

@smuzaffar Should we just go ahead and remove mxnet?

@smuzaffar
Copy link
Contributor

@makortel , I am all in favor of dropping it. I will open cmssw/cmsdist PRs to remove PhysicsTools/MXNet and external

iarspider added a commit to cms-sw/cmsdist that referenced this issue Apr 20, 2023
iarspider added a commit to iarspider-cmssw/cmssw that referenced this issue Apr 20, 2023
cmsbuild added a commit that referenced this issue Apr 25, 2023
@perrotta
Copy link
Contributor

@Dr15Jones (et al.) after the merge of #41377 this can be probably considered fixed, and closed

@makortel
Copy link
Contributor

+core

@makortel
Copy link
Contributor

@cmsbuild, please close

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

smuzaffar added a commit to cms-sw/cmsdist that referenced this issue Apr 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants