TritonService throwing multiple exceptions #38260

Closed
Dr15Jones opened this issue Jun 6, 2022 · 13 comments

@Dr15Jones
Contributor

In the CMSSW_12_5_X_2022-06-05-2300 IB, step 3 of workflow 10805.31 aborted because an exception was thrown while another exception was being unwound. The C++ runtime reports the exception type as TritonException.

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc10/CMSSW_12_5_X_2022-06-05-2300/pyRelValMatrixLogs/run/10805.31_SingleGammaPt35+2018_photonDRN+SingleGammaPt35_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano/step3_SingleGammaPt35+2018_photonDRN+SingleGammaPt35_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano.log#/824-824
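For readers unfamiliar with the failure mode described above, here is a minimal standalone sketch of how a destructor can tell that it is running while another exception is propagating. It is not CMSSW code: `UnwindProbe`, `destroyedDuringUnwind`, `destroyedNormally` and `checkUnwindDetection` are hypothetical names made up for illustration. If a destructor lets an exception escape in that state, the C++ runtime calls `std::terminate`. Since C++11 destructors are also implicitly `noexcept`, so an escaping exception terminates the program even when no unwinding is in progress.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical probe: records whether its destructor ran while another
// exception was already propagating (stack unwinding in progress).
class UnwindProbe {
public:
  explicit UnwindProbe(bool& sawUnwind) : sawUnwind_(sawUnwind) {}
  ~UnwindProbe() {
    // std::uncaught_exceptions() (C++17) counts exceptions that have been
    // thrown but whose handler has not been entered yet.
    sawUnwind_ = std::uncaught_exceptions() > 0;
  }

private:
  bool& sawUnwind_;
};

// The probe is destroyed by the unwinding that the throw triggers.
bool destroyedDuringUnwind() {
  bool sawUnwind = false;
  try {
    UnwindProbe probe(sawUnwind);
    throw std::runtime_error("first failure");
  } catch (const std::runtime_error&) {
    // Swallowed on purpose: only the flag set by the probe matters here.
  }
  return sawUnwind;
}

// The probe is destroyed at the end of a normal scope, with no exception in flight.
bool destroyedNormally() {
  bool sawUnwind = true;
  {
    UnwindProbe probe(sawUnwind);
  }
  return sawUnwind;
}

// Usage: both checks are expected to hold.
void checkUnwindDetection() {
  assert(destroyedDuringUnwind());
  assert(!destroyedNormally());
}
```

Throwing from `~UnwindProbe()` in the first case would abort the process rather than reach the `catch` block, which is the kind of abort reported here.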

@cmsbuild
Contributor

cmsbuild commented Jun 6, 2022

A new Issue was created by @Dr15Jones Chris Jones.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor Author

The relevant information appears to be

terminate called after throwing an instance of 'TritonException'
terminate called recursively

with stack trace

Thread 12 (Thread 0x2ac690a00700 (LWP 11534) "grpc_global_tim"):
#0  0x00002ac5e1afc8cd in syscall () from /lib64/libc.so.6
#1  0x00002ac5ebfde656 in absl::lts_20210324::synchronization_internal::Waiter::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#2  0x00002ac5ebfde592 in AbslInternalPerThreadSemWait_lts_20210324 () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#3  0x00002ac5ebfe1000 in absl::lts_20210324::CondVar::WaitCommon(absl::lts_20210324::Mutex*, absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#4  0x00002ac5ec0468d7 in gpr_cv_wait () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgpr.so.14
#5  0x00002ac5ebdf2aeb in timer_thread(void*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpc.so.14
#6  0x00002ac5ec049049 in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgpr.so.14
#7  0x00002ac5e18ab1cf in start_thread () from /lib64/libpthread.so.0
#8  0x00002ac5e1afcd83 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x2ac690200700 (LWP 11531) "grpc_global_tim"):
#0  0x00002ac5e1afc8cd in syscall () from /lib64/libc.so.6
#1  0x00002ac5ebfde6ba in absl::lts_20210324::synchronization_internal::Waiter::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#2  0x00002ac5ebfde592 in AbslInternalPerThreadSemWait_lts_20210324 () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#3  0x00002ac5ebfe1000 in absl::lts_20210324::CondVar::WaitCommon(absl::lts_20210324::Mutex*, absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#4  0x00002ac5ec0468b4 in gpr_cv_wait () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgpr.so.14
#5  0x00002ac5ebdf2aeb in timer_thread(void*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpc.so.14
#6  0x00002ac5ec049049 in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgpr.so.14
#7  0x00002ac5e18ab1cf in start_thread () from /lib64/libpthread.so.0
#8  0x00002ac5e1afcd83 in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x2ac68fb2c700 (LWP 11530) "resolver-execut"):
#0  0x00002ac5e1afc8cd in syscall () from /lib64/libc.so.6
#1  0x00002ac5ebfde656 in absl::lts_20210324::synchronization_internal::Waiter::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#2  0x00002ac5ebfde592 in AbslInternalPerThreadSemWait_lts_20210324 () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#3  0x00002ac5ebfe1000 in absl::lts_20210324::CondVar::WaitCommon(absl::lts_20210324::Mutex*, absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#4  0x00002ac5ec0468d7 in gpr_cv_wait () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgpr.so.14
#5  0x00002ac5ebddaecd in grpc_core::Executor::ThreadMain(void*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpc.so.14
#6  0x00002ac5ec049049 in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgpr.so.14
#7  0x00002ac5e18ab1cf in start_thread () from /lib64/libpthread.so.0
#8  0x00002ac5e1afcd83 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x2ac66fc00700 (LWP 11529) "default-executo"):
#0  0x00002ac5e1afc8cd in syscall () from /lib64/libc.so.6
#1  0x00002ac5ebfde656 in absl::lts_20210324::synchronization_internal::Waiter::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#2  0x00002ac5ebfde592 in AbslInternalPerThreadSemWait_lts_20210324 () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#3  0x00002ac5ebfe1000 in absl::lts_20210324::CondVar::WaitCommon(absl::lts_20210324::Mutex*, absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#4  0x00002ac5ec0468d7 in gpr_cv_wait () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgpr.so.14
#5  0x00002ac5ebddaecd in grpc_core::Executor::ThreadMain(void*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpc.so.14
#6  0x00002ac5ec049049 in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgpr.so.14
#7  0x00002ac5e18ab1cf in start_thread () from /lib64/libpthread.so.0
#8  0x00002ac5e1afcd83 in clone () from /lib64/libc.so.6
[cut]

Thread 5 (Thread 0x2ac62f200700 (LWP 9419) "cmsRun"):
[cut]
#3  <signal handler called>
#4  0x00002ac5e1bf2a27 in epoll_wait () from /lib64/libc.so.6
#5  0x00002ac5ebdd627a in pollset_work(grpc_pollset*, grpc_pollset_worker**, long) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpc.so.14
#6  0x00002ac5ebe6821e in cq_pluck(grpc_completion_queue*, void*, gpr_timespec, void*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpc.so.14
#7  0x00002ac5eae4c6be in grpc::internal::BlockingUnaryCallImpl<google::protobuf::MessageLite, google::protobuf::MessageLite>::BlockingUnaryCallImpl(grpc::ChannelInterface*, grpc::internal::RpcMethod const&, grpc::ClientContext*, google::protobuf::MessageLite const&, google::protobuf::MessageLite*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpcclient.so
#8  0x00002ac5eae33994 in inference::GRPCInferenceService::Stub::SystemSharedMemoryUnregister(grpc::ClientContext*, inference::SystemSharedMemoryUnregisterRequest const&, inference::SystemSharedMemoryUnregisterResponse*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpcclient.so
#9  0x00002ac5eaee2dc7 in triton::client::InferenceServerGrpcClient::UnregisterSystemSharedMemory(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpcclient.so
#10 0x00002ac5ead1075a in TritonCpuShmResource<triton::client::InferInput>::close() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#11 0x00002ac5ead10a87 in TritonCpuShmResource<triton::client::InferInput>::~TritonCpuShmResource() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#12 0x00002ac5ead00efa in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#13 0x00002ac5eacfa7f6 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#14 0x00002ac5eacfaa19 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#15 0x00002ac5df4ac0c1 in edm::Worker::endStream(edm::StreamID, edm::StreamContext&) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libFWCoreFramework.so
[cut]

Thread 4 (Thread 0x2ac62e28c700 (LWP 9418) "cmsRun"):
[cut]
#3  <signal handler called>
#4  0x00002ac5e1bf2a27 in epoll_wait () from /lib64/libc.so.6
#5  0x00002ac5ebdd627a in pollset_work(grpc_pollset*, grpc_pollset_worker**, long) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpc.so.14
#6  0x00002ac5ebe6821e in cq_pluck(grpc_completion_queue*, void*, gpr_timespec, void*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpc.so.14
#7  0x00002ac5eae4c6be in grpc::internal::BlockingUnaryCallImpl<google::protobuf::MessageLite, google::protobuf::MessageLite>::BlockingUnaryCallImpl(grpc::ChannelInterface*, grpc::internal::RpcMethod const&, grpc::ClientContext*, google::protobuf::MessageLite const&, google::protobuf::MessageLite*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpcclient.so
#8  0x00002ac5eae33994 in inference::GRPCInferenceService::Stub::SystemSharedMemoryUnregister(grpc::ClientContext*, inference::SystemSharedMemoryUnregisterRequest const&, inference::SystemSharedMemoryUnregisterResponse*) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpcclient.so
#9  0x00002ac5eaee2dc7 in triton::client::InferenceServerGrpcClient::UnregisterSystemSharedMemory(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/external/el8_amd64_gcc10/lib/libgrpcclient.so
#10 0x00002ac5ead1075a in TritonCpuShmResource<triton::client::InferInput>::close() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#11 0x00002ac5ead10a87 in TritonCpuShmResource<triton::client::InferInput>::~TritonCpuShmResource() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#12 0x00002ac5ead00efa in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#13 0x00002ac5eacfa7f6 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#14 0x00002ac5eacfaa19 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#15 0x00002ac5df4ac0c1 in edm::Worker::endStream(edm::StreamID, edm::StreamContext&) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libFWCoreFramework.so
[cut]

Thread 3 (Thread 0x2ac62d88b700 (LWP 9417) "cmsRun"):
[cut]
#3  <signal handler called>
#4  parse_lsda_header (context=context@entry=0x2ac62d883040, p=p@entry=0x2ac5e14eb29c "\377\233\035\001\f2\035r", info=info@entry=0x2ac62d882ee0) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:58
#5  0x00002ac5e13dd070 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=1, exception_class=5138137972254386944, ue_header=0x2ac74bcadc60, context=0x2ac62d883040) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:454
#6  0x00002ac5e1899bc6 in _Unwind_RaiseException (exc=0x2ac74bcadc60) at ../../../libgcc/unwind.inc:118
#7  0x00002ac5e189a145 in _Unwind_Resume_or_Rethrow (exc=exc@entry=0x2ac74bcadc60) at ../../../libgcc/unwind.inc:264
#8  0x00002ac5e13ddafc in __cxxabiv1::__cxa_rethrow () at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:126
#9  0x00002ac5e13d27e9 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:80
#10 0x00002ac5e13dd7b6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#11 0x00002ac5e13dc899 in __cxa_call_terminate (ue_header=ue_header@entry=0x2ac74bcadc60) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#12 0x00002ac5e13dd1d1 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=0x2ac74bcadc60, context=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:685
#13 0x00002ac5e189970f in _Unwind_RaiseException_Phase2 (exc=0x2ac74bcadc60, context=0x2ac62d883950, frames_p=0x2ac62d883858) at ../../../libgcc/unwind.inc:64
#14 0x00002ac5e189a0b6 in _Unwind_Resume (exc=0x2ac74bcadc60) at ../../../libgcc/unwind.inc:241
#15 0x00002ac5ead10941 in TritonCpuShmResource<triton::client::InferInput>::close() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#16 0x00002ac5ead10a87 in TritonCpuShmResource<triton::client::InferInput>::~TritonCpuShmResource() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#17 0x00002ac5ead00efa in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#18 0x00002ac5eacfa7f6 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#19 0x00002ac5eacfaa19 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#20 0x00002ac5df4ac0c1 in edm::Worker::endStream(edm::StreamID, edm::StreamContext&) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libFWCoreFramework.so
#21 0x00002ac5df4ad6c1 in edm::WorkerManager::endStream(edm::StreamID, edm::StreamContext&) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libFWCoreFramework.so
[cut]

Thread 1 (Thread 0x2ac5e291c040 (LWP 8685) "cmsRun"):
[cut]
#4  <signal handler called>
#5  0x00002ac5e1b11a4f in raise () from /lib64/libc.so.6
#6  0x00002ac5e1ae4db5 in abort () from /lib64/libc.so.6
#7  0x00002ac5e13df6b2 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:50
#8  0x00002ac5e13dd7b6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9  0x00002ac5e13dc899 in __cxa_call_terminate (ue_header=ue_header@entry=0x2ac62b9773e0) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#10 0x00002ac5e13dd1d1 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=0x2ac62b9773e0, context=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:685
#11 0x00002ac5e189970f in _Unwind_RaiseException_Phase2 (exc=0x2ac62b9773e0, context=0x7ffc4f011c50, frames_p=0x7ffc4f011b58) at ../../../libgcc/unwind.inc:64
#12 0x00002ac5e189a0b6 in _Unwind_Resume (exc=0x2ac62b9773e0) at ../../../libgcc/unwind.inc:241
#13 0x00002ac5ead10941 in TritonCpuShmResource<triton::client::InferInput>::close() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#14 0x00002ac5ead10a87 in TritonCpuShmResource<triton::client::InferInput>::~TritonCpuShmResource() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#15 0x00002ac5ead00efa in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#16 0x00002ac5eacfa7f6 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#17 0x00002ac5eacfaa19 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#18 0x00002ac5df4ac0c1 in edm::Worker::endStream(edm::StreamID, edm::StreamContext&) () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libFWCoreFramework.so
[cut]
#26 0x00002ac5df382826 in edm::EventProcessor::endJob() () from /cvmfs/cms-ib.cern.ch/week0/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-05-2300/lib/el8_amd64_gcc10/libFWCoreFramework.so
[stuff cut]

Current Modules:

Module: PatPhotonDRNCorrectionProducer:patPhotonsDRN (crashed)
Module: PatPhotonDRNCorrectionProducer:patPhotonsDRN
Module: PatPhotonDRNCorrectionProducer:patPhotonsDRN
Module: PatPhotonDRNCorrectionProducer:patPhotonsDRN
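
The traces above show `close()` being reached from `~TritonCpuShmResource()`, itself called from `~TritonClient()` during `endStream`. One common way to keep such a cleanup call from aborting the process is to catch the exception inside the destructor and report it instead of propagating it. The following is only a sketch under that assumption, not the actual CMSSW code and not necessarily the fix that was adopted; `ShmResourceSketch`, its `Unregister` callback and `destroyWithFailingServer` are hypothetical stand-ins.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a shared-memory resource; not the CMSSW class.
class ShmResourceSketch {
public:
  using Unregister = std::function<void(const std::string&)>;

  ShmResourceSketch(std::string name, Unregister unregister, std::vector<std::string>& errors)
      : name_(std::move(name)), unregister_(std::move(unregister)), errors_(errors) {
    assert(static_cast<bool>(unregister_));  // a callable must be provided
  }

  // Explicit close: may throw, so callers can handle the failure at a point
  // where propagating an exception is safe.
  void close() {
    if (closed_)
      return;
    closed_ = true;  // mark first so the destructor does not retry after a failure
    unregister_(name_);
  }

  // Destructor: catches anything close() throws and records the message
  // instead of propagating it.
  ~ShmResourceSketch() {
    try {
      close();
    } catch (const std::exception& e) {
      errors_.push_back(e.what());
    } catch (...) {
      errors_.push_back("unknown error while closing " + name_);
    }
  }

private:
  std::string name_;
  Unregister unregister_;
  std::vector<std::string>& errors_;
  bool closed_ = false;
};

// Usage: destroys a resource whose unregister call always fails and returns
// the messages recorded by the destructor.
std::vector<std::string> destroyWithFailingServer() {
  std::vector<std::string> errors;
  {
    ShmResourceSketch resource(
        "region_0",
        [](const std::string& name) {
          throw std::runtime_error("unable to unregister shared memory region: " + name);
        },
        errors);
  }
  return errors;
}
```

The trade-off in this design is that a failed unregister is only logged when it happens in the destructor; code that needs to react to the failure would call `close()` explicitly before the object goes out of scope.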

@Dr15Jones
Contributor Author

@kpedro88 FYI

@Dr15Jones
Contributor Author

We may have seen the same problem, this time with more detailed reporting, in the CMSSW_12_5_X_2022-06-14-1100 IB:

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc10/CMSSW_12_5_X_2022-06-14-1100/pyRelValMatrixLogs/run/10805.31_SingleGammaPt35+2018_photonDRN+SingleGammaPt35_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano/step3_SingleGammaPt35+2018_photonDRN+SingleGammaPt35_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano.log#/

The relevant bits appear to be

terminate called after throwing an instance of 'TritonException'
  what():  An exception of category 'TritonFailure' occurred.
Exception Message:
unable to unregister shared memory region: 15133_input14: Transport closed



A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

Tue Jun 14 13:05:17 CEST 2022
Thread 12 (Thread 0x2b832b800700 (LWP 25151) "grpc_global_tim"):
#0  0x00002b827b5dfe29 in syscall () from /lib64/libc.so.6
#1  0x00002b8284ef3656 in absl::lts_20210324::synchronization_internal::Waiter::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#2  0x00002b8284ef3592 in AbslInternalPerThreadSemWait_lts_20210324 () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#3  0x00002b8284ef6000 in absl::lts_20210324::CondVar::WaitCommon(absl::lts_20210324::Mutex*, absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#4  0x00002b82853ed8d7 in gpr_cv_wait () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libgpr.so.14
#5  0x00002b828467466b in timer_thread(void*) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libgrpc.so.14
#6  0x00002b82853f0049 in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libgpr.so.14
#7  0x00002b827b2d2ea5 in start_thread () from /lib64/libpthread.so.0
#8  0x00002b827b5e5b0d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x2b832adde700 (LWP 25146) "grpc_global_tim"):
#0  0x00002b827b5dfe29 in syscall () from /lib64/libc.so.6
#1  0x00002b8284ef36ba in absl::lts_20210324::synchronization_internal::Waiter::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#2  0x00002b8284ef3592 in AbslInternalPerThreadSemWait_lts_20210324 () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#3  0x00002b8284ef6000 in absl::lts_20210324::CondVar::WaitCommon(absl::lts_20210324::Mutex*, absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#4  0x00002b82853ed8b4 in gpr_cv_wait () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libgpr.so.14
#5  0x00002b828467466b in timer_thread(void*) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libgrpc.so.14
#6  0x00002b82853f0049 in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libgpr.so.14
#7  0x00002b827b2d2ea5 in start_thread () from /lib64/libpthread.so.0
#8  0x00002b827b5e5b0d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x2b832abbd700 (LWP 25145) "resolver-execut"):
#0  0x00002b827b5dfe29 in syscall () from /lib64/libc.so.6
#1  0x00002b8284ef3656 in absl::lts_20210324::synchronization_internal::Waiter::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#2  0x00002b8284ef3592 in AbslInternalPerThreadSemWait_lts_20210324 () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#3  0x00002b8284ef6000 in absl::lts_20210324::CondVar::WaitCommon(absl::lts_20210324::Mutex*, absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#4  0x00002b82853ed8d7 in gpr_cv_wait () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libgpr.so.14
#5  0x00002b828465d5ad in grpc_core::Executor::ThreadMain(void*) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libgrpc.so.14
#6  0x00002b82853f0049 in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libgpr.so.14
#7  0x00002b827b2d2ea5 in start_thread () from /lib64/libpthread.so.0
#8  0x00002b827b5e5b0d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x2b8309a00700 (LWP 25144) "default-executo"):
#0  0x00002b827b5dfe29 in syscall () from /lib64/libc.so.6
#1  0x00002b8284ef3656 in absl::lts_20210324::synchronization_internal::Waiter::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#2  0x00002b8284ef3592 in AbslInternalPerThreadSemWait_lts_20210324 () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#3  0x00002b8284ef6000 in absl::lts_20210324::CondVar::WaitCommon(absl::lts_20210324::Mutex*, absl::lts_20210324::synchronization_internal::KernelTimeout) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libabsl_synchronization.so.2103.0.1
#4  0x00002b82853ed8d7 in gpr_cv_wait () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libgpr.so.14
#5  0x00002b828465d5ad in grpc_core::Executor::ThreadMain(void*) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libgrpc.so.14
#6  0x00002b82853f0049 in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_X_2022-06-14-1100/external/slc7_amd64_gcc10/lib/libgpr.so.14
#7  0x00002b827b2d2ea5 in start_thread () from /lib64/libpthread.so.0

Thread 5 (Thread 0x2b82c8e00700 (LWP 22255) "cmsRun"):
[cut]
#3  <signal handler called>
#4  0x00002b827b2d954d in __lll_lock_wait () from /lib64/libpthread.so.0
#5  0x00002b827b2d4eb6 in _L_lock_941 () from /lib64/libpthread.so.0
#6  0x00002b827b2d4daf in pthread_mutex_lock () from /lib64/libpthread.so.0
#7  0x00002b827b62432f in dl_iterate_phdr () from /lib64/libc.so.6
[cut]
#11 0x00002b827b2c1aa5 in _Unwind_RaiseException (exc=exc@entry=0x2b83fba2c760) at ../../../libgcc/unwind.inc:93
#12 0x00002b827ae85cc8 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x2b82832642c0 <typeinfo for TritonException>, dest=0x2b8283235c20 <TritonException::~TritonException()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:90
#13 0x00002b82832478cd in TritonCpuShmResource<triton::client::InferInput>::close() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#14 0x00002b8283247a97 in TritonCpuShmResource<triton::client::InferInput>::~TritonCpuShmResource() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#15 0x00002b8283237f0a in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#16 0x00002b8283231806 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#17 0x00002b8283231a29 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#18 0x00002b8278f3c0d1 in edm::Worker::endStream(edm::StreamID, edm::StreamContext&) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libFWCoreFramework.so
[cut]

Thread 4 (Thread 0x2b82c7f15700 (LWP 22254) "cmsRun"):
[cut]
#3  0x00002b82819b1a1b in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002b827b51d387 in raise () from /lib64/libc.so.6
#6  0x00002b827b51ea78 in abort () from /lib64/libc.so.6
#7  0x00002b827ae7a7dc in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#8  0x00002b827ae859d6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9  0x00002b827ae84ab9 in __cxa_call_terminate (ue_header=ue_header@entry=0x2b83fba2bd60) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
[cut]
#11 0x00002b827b2c170f in _Unwind_RaiseException_Phase2 (exc=0x2b83fba2bd60, context=0x2b82c7f0d870, frames_p=0x2b82c7f0d778) at ../../../libgcc/unwind.inc:64
#12 0x00002b827b2c20b6 in _Unwind_Resume (exc=0x2b83fba2bd60) at ../../../libgcc/unwind.inc:241
#13 0x00002b8283247951 in TritonCpuShmResource<triton::client::InferInput>::close() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#14 0x00002b8283247a97 in TritonCpuShmResource<triton::client::InferInput>::~TritonCpuShmResource() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#15 0x00002b8283237f0a in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#16 0x00002b8283231806 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#17 0x00002b8283231a29 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#18 0x00002b8278f3c0d1 in edm::Worker::endStream(edm::StreamID, edm::StreamContext&) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libFWCoreFramework.so


Thread 3 (Thread 0x2b82c7514700 (LWP 22253) "cmsRun"):
#3  <signal handler called>
#4  0x00002b827b5ca917 in sched_yield () from /lib64/libc.so.6
#5  0x00002b827d78eba5 in tbb::detail::d2::concurrent_queue<edm::ErrorObj*, tbb::detail::d1::cache_aligned_allocator<edm::ErrorObj*> >::internal_try_pop(void*) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libFWCoreMessageService.so
#6  0x00002b827d78394f in edm::service::ThreadSafeLogMessageLoggerScribe::log(edm::ErrorObj*) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libFWCoreMessageService.so
#7  0x00002b827d78b2a3 in edm::service::ThreadSafeLogMessageLoggerScribe::runCommand(edm::MessageLoggerQ::OpCode, void*) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libFWCoreMessageService.so
#8  0x00002b8278bdd149 in edm::MessageSender::ErrorObjDeleter::operator()(edm::ErrorObj*) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libFWCoreMessageLogger.so
#9  0x00002b8278be0001 in std::_Sp_counted_deleter<edm::ErrorObj*, edm::MessageSender::ErrorObjDeleter, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libFWCoreMessageLogger.so
#10 0x00002b8278bdda5a in edm::MessageSender::~MessageSender() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libFWCoreMessageLogger.so
#11 0x00002b82fd35aedb in Multi5x5SuperClusterProducer::endStream() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/pluginRecoEcalEgammaClusterProducers.so


Thread 1 (Thread 0x2b827d669c40 (LWP 15133) "cmsRun"):
[cut]
#3  <signal handler called>
#4  0x00002b827b2d954d in __lll_lock_wait () from /lib64/libpthread.so.0
#5  0x00002b827b2d4eb6 in _L_lock_941 () from /lib64/libpthread.so.0
#6  0x00002b827b2d4daf in pthread_mutex_lock () from /lib64/libpthread.so.0
#7  0x00002b827b62432f in dl_iterate_phdr () from /lib64/libc.so.6
[cut]
#11 0x00002b827b2c1aa5 in _Unwind_RaiseException (exc=exc@entry=0x2b82c59c6a60) at ../../../libgcc/unwind.inc:93
#12 0x00002b827ae85cc8 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x2b82832642c0 <typeinfo for TritonException>, dest=0x2b8283235c20 <TritonException::~TritonException()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:90
#13 0x00002b82832478cd in TritonCpuShmResource<triton::client::InferInput>::close() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#14 0x00002b8283247a97 in TritonCpuShmResource<triton::client::InferInput>::~TritonCpuShmResource() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#15 0x00002b8283237f0a in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, TritonData<triton::client::InferInput> > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#16 0x00002b8283231806 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#17 0x00002b8283231a29 in TritonClient::~TritonClient() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libHeterogeneousCoreSonicTriton.so
#18 0x00002b8278f3c0d1 in edm::Worker::endStream(edm::StreamID, edm::StreamContext&) () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libFWCoreFramework.so
[cut]
#26 0x00002b8278e12836 in edm::EventProcessor::endJob() () from /cvmfs/cms-ib.cern.ch/nweek-02737/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-13-2300/lib/slc7_amd64_gcc10/libFWCoreFramework.so
[cut]
#30 0x000000000040971c in main ()

Current Modules:
terminate called recursively

@kpedro88
Contributor

There are two issues here:

  1. "Transport closed" means that the server connection was already closed before the unregister request was sent. Right now, the unregister request is sent from the producer's endStream() function, and the fallback server shutdown command is issued in the TritonService postEndJob() function. Is it guaranteed that postEndJob() happens after endStream()? If so, then we need to figure out why the server would shut down prematurely.
  2. Multiple exceptions are thrown because each producer's client sets up its own shared memory region. If the server (shared by all streams) is closed prematurely, every client hits the same error when it tries to unregister its shared memory. Is there a recommended way to deal with an exception being repeated by each stream?

@Dr15Jones
Contributor Author

> Is it guaranteed that postEndJob() happens after endStream()?

Yes, end job happens after all end streams.

@kpedro88
Contributor

Since this exception only happens near the end of the job, it could just be converted to a warning, unless there's a better way to handle repeated exceptions like this.
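
To illustrate the "convert to a warning" option, here is a minimal sketch, not the actual CMSSW code. `ShmResourceSketch`, its `serverAlive` flag, and the `warnings()` counter are hypothetical stand-ins for a shared-memory resource whose `close()` can fail when the server is already gone. It assumes the failure surfaces in a destructor, as the stack traces above suggest; destructors are `noexcept` by default, so an exception escaping one ends in `std::terminate`.

```cpp
#include <atomic>
#include <cassert>
#include <iostream>
#include <stdexcept>

// Hypothetical stand-in for a shared-memory resource; not the CMSSW class.
class ShmResourceSketch {
public:
  explicit ShmResourceSketch(bool serverAlive) : serverAlive_(serverAlive) {}

  // May throw, like an unregister request sent to a server that is gone.
  void close() {
    if (closed_)
      return;
    closed_ = true;
    if (!serverAlive_)
      throw std::runtime_error("unregister failed: transport closed");
  }

  // Destructors are noexcept by default, so an exception escaping here
  // would call std::terminate. Catch it and downgrade it to a warning.
  ~ShmResourceSketch() {
    try {
      close();
    } catch (const std::exception& e) {
      ++warnings();
      std::cerr << "warning: " << e.what() << '\n';
    }
  }

  // Counts emitted warnings; exists only so the example can be checked.
  static std::atomic<int>& warnings() {
    static std::atomic<int> count{0};
    return count;
  }

private:
  bool serverAlive_;
  bool closed_ = false;
};
```

With this shape, destroying several resources after the server has gone away produces one warning per resource instead of aborting the job. Callers that want the error can still call `close()` explicitly before destruction.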

If we do decide we want to track it and issue an exception, we could try to do it through TritonService somehow, but this may require additional synchronization.
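
One possible shape for the tracking option is sketched below, under the assumption that a single shared object collects the failures from all streams and reports them once at the end of the job. `FirstErrorCollector` and its methods are hypothetical names for illustration and are not part of TritonService; the mutex is the "additional synchronization" mentioned above.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical collector; name and interface are illustrative only.
class FirstErrorCollector {
public:
  // Called from any stream: keeps the first message and counts repeats.
  void report(const std::string& message) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (count_ == 0)
      first_ = message;
    ++count_;
  }

  // Intended to be read once, after all streams have ended.
  bool hasError() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return count_ > 0;
  }

  // Returns the first reported message, or an empty string if none.
  std::string first() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return first_;
  }

  // Total number of reports, including repeats of the same failure.
  int count() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return count_;
  }

private:
  mutable std::mutex mutex_;
  std::string first_;
  int count_ = 0;
};
```

Each stream would call `report()` instead of throwing, and a single end-of-job hook could then decide whether to throw one exception or log one message that includes the repeat count.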

@makortel
Contributor

makortel commented Mar 1, 2024

assign heterogeneous

@makortel
Contributor

makortel commented Mar 1, 2024

Was addressed in #43814

@cmsbuild
Contributor

cmsbuild commented Mar 1, 2024

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

makortel commented Mar 1, 2024

+heterogeneous

@cmsbuild
Contributor

cmsbuild commented Mar 1, 2024

This issue is fully signed and ready to be closed.

@makortel
Contributor

makortel commented Mar 1, 2024

@cmsbuild, please close
