Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assertion failures in edm::service::Timing #33107

Closed
makortel opened this issue Mar 9, 2021 · 37 comments
Closed

Assertion failures in edm::service::Timing #33107

makortel opened this issue Mar 9, 2021 · 37 comments

Comments

@makortel
Copy link
Contributor

makortel commented Mar 9, 2021

Many workflows (but not all) in CMSSW_11_3_X_2021-03-08-1800 are failing along

cmsRun: /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/6d143f7bbaf77910d1f5a061987aa3a1/opt/cmssw/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/Services/plugins/Timing.cc:192: double edm::service::popStack(): Assertion `!modStack.empty()' failed.

Thread 10 (Thread 0x2ac651000700 (LWP 3130)):
#2  0x00002ac622d1cfd0 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002ac617f24d84 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#5  0x00002ac617f2c9ea in _dl_runtime_resolve_xsave () from /lib64/ld-linux-x86-64.so.2
#6  0x00002ac66192cb5f in EcalTrigPrimProducer::beginRun(edm::Run const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginSimCalorimetryEcalTrigPrimProducersPlugins.so
#7  0x00002ac61842fd8d in edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::doStreamBeginRun(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#8  0x00002ac6184031a0 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDoStreamBegin(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#9  0x00002ac618304df2 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#10 0x00002ac61830503f in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#11 0x00002ac6183051c0 in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00002ac61830529d in tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so

Thread 9 (Thread 0x2ac650001700 (LWP 3129)):
#3  0x00002ac622d1eed2 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002ac61ab69387 in raise () from /lib64/libc.so.6
#6  0x00002ac61ab6aa78 in abort () from /lib64/libc.so.6
#7  0x00002ac61ab621a6 in __assert_fail_base () from /lib64/libc.so.6
#8  0x00002ac61ab62252 in __assert_fail () from /lib64/libc.so.6
#9  0x00002ac622d48dcb in edm::service::Timing::postCommon() const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#10 0x00002ac618304e23 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#11 0x00002ac61830503f in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00002ac6183051c0 in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#13 0x00002ac61830529d in tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so

Thread 8 (Thread 0x2ac64f600700 (LWP 3128)):
#2  0x00002ac622d1cfd0 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002ac617f1f7c5 in do_lookup_x () from /lib64/ld-linux-x86-64.so.2
#5  0x00002ac617f2009f in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#6  0x00002ac617f24dee in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#7  0x00002ac617f2c9ea in _dl_runtime_resolve_xsave () from /lib64/ld-linux-x86-64.so.2
#8  0x00002ac66192cbd4 in EcalTrigPrimProducer::beginRun(edm::Run const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginSimCalorimetryEcalTrigPrimProducersPlugins.so
#9  0x00002ac61842fd8d in edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::doStreamBeginRun(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#10 0x00002ac6184031a0 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDoStreamBegin(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#11 0x00002ac618304df2 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00002ac61830503f in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#13 0x00002ac6183051c0 in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#14 0x00002ac61830529d in tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so

Thread 1 (Thread 0x2ac61c82f240 (LWP 2190)):
#2  0x00002ac622d1cfd0 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002ac61ac26c3d in poll () from /lib64/libc.so.6
#5  0x00002ac646e3e50a in frontier_read () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#6  0x00002ac646e3b7a9 in http_read () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#7  0x00002ac646e3b7f3 in read_char () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#8  0x00002ac646e3b8d1 in read_line () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#9  0x00002ac646e3c1e1 in read_connection () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#10 0x00002ac646e3c62f in frontierHttpClnt_post () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#11 0x00002ac646e3c53d in frontierHttpClnt_get () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#12 0x00002ac646e48a44 in get_data () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#13 0x00002ac646e4926e in frontier_postRawData () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#14 0x00002ac646e48edc in frontier_getRawData () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#15 0x00002ac646e40a15 in frontier::Session::getData(std::vector<frontier::Request const*, std::allocator<frontier::Request const*> > const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#16 0x00002ac646ddbca4 in coral::FrontierAccess::Statement::execute(coral::AttributeList const&, int) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/liblcg_FrontierAccess.so
#17 0x00002ac646dce4b9 in coral::FrontierAccess::Query::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/liblcg_FrontierAccess.so
#18 0x00002ac63e31d535 in cond::persistency::PAYLOAD::Table::select(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, cond::Binary&, cond::Binary&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libCondCoreCondDB.so
#19 0x00002ac63e3322db in cond::persistency::KeyList::loadFromDB[abi:cxx11](unsigned long long) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libCondCoreCondDB.so
#20 0x00002ac640db5768 in DTKeyedConfigCache::get(cond::persistency::KeyList const&, int, DTKeyedConfig const*&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libCondToolsDT.so
#21 0x00002ac640db5da8 in DTKeyedConfigCache::getData(cond::persistency::KeyList const&, int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libCondToolsDT.so
#22 0x00002ac640d1d7b4 in DTConfigDBProducer::readDTCCBConfig(DTConfigManagerRcd const&, DTConfigManager&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginL1TriggerConfigDTTPGConfigProducers.so
#23 0x00002ac640d21afe in DTConfigDBProducer::produce(DTConfigManagerRcd const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginL1TriggerConfigDTTPGConfigProducers.so
#24 0x00002ac640d2f9d5 in decltype ({parm#1}()) edm::convertException::wrap<edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}>(edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginL1TriggerConfigDTTPGConfigProducers.so
#25 0x00002ac640d2fce6 in edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginL1TriggerConfigDTTPGConfigProducers.so
#26 0x00002ac640d30adf in void edm::SerialTaskQueueChain::actionToRun<edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&>(edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginL1TriggerConfigDTTPGConfigProducers.so
#27 0x00002ac640d30b51 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}>(tbb::task_group&, edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginL1TriggerConfigDTTPGConfigProducers.so
#28 0x00002ac6180a4f29 in tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreConcurrency.so

Current Modules:
Module: none (crashed)
Module: L1TCaloLayer1:simCaloStage2Layer1Digis
Module: EcalTrigPrimProducer:DMEcalTriggerPrimitiveDigis
Module: EcalTrigPrimProducer:DMEcalTriggerPrimitiveDigis

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_11_3_X_2021-03-08-1800/pyRelValMatrixLogs/run/250202.0_TTbar_13+TTbar_13INPUT+DIGIPRMXUP15_PU25+RECOPRMXUP15_PU25+HARVESTUP15_PU25/step2_TTbar_13+TTbar_13INPUT+DIGIPRMXUP15_PU25+RECOPRMXUP15_PU25+HARVESTUP15_PU25.log#/80-80

@makortel
Copy link
Contributor Author

makortel commented Mar 9, 2021

assign core

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 9, 2021

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 9, 2021

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

makortel commented Mar 9, 2021

Presumably caused by (or related to) the migration to tbb::task_group in #32804.

@silviodonato
Copy link
Contributor

The errors seem random. In CMSSW_11_3_X_2021-03-08-1800 we had 34 errors, while in CMSSW_11_3_X_2021-03-08-2300 we are getting at least 49 errors. For example, 136.885501 crashed only in CMSSW_11_3_X_2021-03-08-1800, while 136.885511 crashes only in CMSSW_11_3_X_2021-03-08-2300.

@dan131riley
Copy link

Here's the assertion failure:

cmsRun: /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/6d143f7bbaf77910d1f5a061987aa3a1/opt/cmssw/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/Services/plugins/Timing.cc:192: 
double edm::service::popStack(): Assertion `!modStack.empty()' failed.

which is here:

//NOTE: We use a per thread stack for module times since unscheduled
// exectuion or tbb task spawning can cause a module to run on the
// same thread as an already running module
static std::vector<double>& moduleTimeStack() {
static thread_local std::vector<double> s_stack;
return s_stack;
}
static double popStack() {
auto& modStack = moduleTimeStack();
assert(!modStack.empty());
double curr_module_time = modStack.back();
modStack.pop_back();
double t = getTime() - curr_module_time;
return t;
}

which would seem to imply that a module finished on a different thread than it started, or there's something funny with thread locals. Since this is a threading-related bug it's non-deterministic.

@Dr15Jones
Copy link
Contributor

I’ll put my effort into this today.

@Dr15Jones
Copy link
Contributor

So there are 'begin' and 'end' callbacks in the ServiceRegistry system. We normally surround the calls they are announcing with an object whose constructor calls the 'begin' to the Services and whose destructor calls 'end'. That isn't always the case. So I'm guessing one of my changes is causing the 'begin' to either not be called or to be called on possibly a different thread than the 'end'.

@dan131riley
Copy link

I'm still getting setup to look at the segfaults on exit. wrt the assertion failures, running 250206.181 so far I've got an exception:

----- Begin Fatal Exception 09-Mar-2021 09:46:15 EST-----------------------
An exception of category 'NotFound' occurred while
   [0] Calling EventProcessor::runToCompletion (which does almost everything after beginJob and before endJob)
Exception Message:
Service no ServiceRegistry has been set for this thread 
   Additional Info:
      [a] Another exception was caught while trying to clean up files after the primary fatal exception.
----- End Fatal Exception -------------------------------------------------

and a different exception with a stack trace:

terminate called after throwing an instance of 'edm::Exception'
  what():  An exception of category 'LogicError' occurred.
Exception Message:
ModuleCallingContext::getStreamContext() called in context not linked to a StreamContext

There's only one active thread, with this backtrace:

Thread 5 (Thread 0x7fffc23ff700 (LWP 29702)):
#8  0x00007ffff5997358 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7ffff7853eb8 <typeinfo for edm::Exception>, dest=0x7ffff7811cf0 <edm::Exception::~Exception()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:90
#9  0x00007ffff7f51e1f in edm::ModuleCallingContext::getStreamContext() const [clone .cold] () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreServiceRegistry.so
#10 0x00007fffeee312e0 in edm::service::MessageLogger::unEstablishModule(edm::ModuleCallingContext const&, char const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreMessageService.so
#11 0x00007ffff7be8343 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2> >(edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2> >(edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00007ffff7be855f in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2> >(edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2>::Context const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#13 0x00007ffff7be86e0 in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#14 0x00007ffff7be87bd in tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)2>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#15 0x00007ffff640b26d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7ffff20af900, context_guard=..., t=0x7ffff20ba640, isolation=isolation@entry=0) at ../../include/tbb/task.h:1003
#16 0x00007ffff640b756 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7ffff20af900, parent=..., child=<optimized out>) at ../../src/tbb/scheduler.h:650
#17 0x00007ffff6402a27 in tbb::internal::arena::process (this=0x7ffff214ed00, s=...) at ../../src/tbb/arena.cpp:196
#18 0x00007ffff64013f0 in tbb::internal::market::process (this=0x7ffff214f580, j=...) at ../../src/tbb/market.cpp:667
#19 0x00007ffff63fdbdc in tbb::internal::rml::private_worker::run (this=0x7fffed1be000) at ../../src/tbb/private_server.cpp:266
#20 0x00007ffff63fddd9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#21 0x00007ffff53bfea5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007ffff50e89fd in clone () from /lib64/libc.so.6

@Dr15Jones
Copy link
Contributor

oooh, that exception is really helpful!

@Dr15Jones
Copy link
Contributor

I checked and all calls to modules are safely wrapped to get the 'begin' and 'end' calls made. However, the Timing service uses the stack for non-module related activities as well and maybe one of those is now the problem.

@Dr15Jones
Copy link
Contributor

Are all the jobs failing with the assert running PreMix?

@makortel
Copy link
Contributor Author

makortel commented Mar 9, 2021

Are all the jobs failing with the assert running PreMix?

By quick look over several IB (flavors) it indeed looks like the assertion fires only in PreMix workflows.

@dan131riley
Copy link

Looks like it, all the ones I checked are. The one I'm looking at certainly is:

#8  0x00007ffff5808358 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7ffff77b0e60 <typeinfo for edm::Exception>, dest=0x7ffff7729fc2 <edm::Exception::~Exception()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:90
#9  0x00007ffff7f304b2 in edm::ParentContext::streamContext (this=0x7fffa79d2938) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/ServiceRegistry/src/ParentContext.cc:47
#10 0x00007ffff7f300d0 in edm::ModuleCallingContext::streamContext (this=0x7fffa79d2928) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/ServiceRegistry/interface/ModuleCallingContext.h:56
#11 0x00007ffff7f2fbce in edm::ModuleCallingContext::getStreamContext (this=0x7fffa79d2928) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/ServiceRegistry/src/ModuleCallingContext.cc:40
#12 0x00007fffeebd4c86 in edm::service::MessageLogger::unEstablishModule (this=0x7fffd3ef7040, mod=..., state=0x7fffeec2e03d "AfterModStreamBeginRun") at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/MessageService/src/MessageLogger.cc:440
#13 0x00007fffeebd5486 in edm::service::MessageLogger::postModuleStreamBeginRun (this=0x7fffd3ef7040, stream=..., mod=...) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/MessageService/src/MessageLogger.cc:547
#14 0x00007fffeebf92a9 in std::__invoke_impl<void, void (edm::service::MessageLogger::* const&)(edm::StreamContext const&, edm::ModuleCallingContext const&), edm::service::MessageLogger*&, edm::StreamContext const&, edm::ModuleCallingContext const&> (__f=@0x7fffdbe8ae20: (void (edm::service::MessageLogger::*)(class edm::service::MessageLogger * const, const class edm::StreamContext &, const class edm::ModuleCallingContext &)) 0x7fffeebd5458 <edm::service::MessageLogger::postModuleStreamBeginRun(edm::StreamContext const&, edm::ModuleCallingContext const&)>, __t=@0x7fffdbe8ae30: 0x7fffd3ef7040) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/invoke.h:73
#15 0x00007fffeebf858b in std::__invoke<void (edm::service::MessageLogger::* const&)(edm::StreamContext const&, edm::ModuleCallingContext const&), edm::service::MessageLogger*&, edm::StreamContext const&, edm::ModuleCallingContext const&> (__fn=@0x7fffdbe8ae20: (void (edm::service::MessageLogger::*)(class edm::service::MessageLogger * const, const class edm::StreamContext &, const class edm::ModuleCallingContext &)) 0x7fffeebd5458 <edm::service::MessageLogger::postModuleStreamBeginRun(edm::StreamContext const&, edm::ModuleCallingContext const&)>) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/invoke.h:95
#16 0x00007fffeebf7728 in std::_Mem_fn_base<void (edm::service::MessageLogger::*)(edm::StreamContext const&, edm::ModuleCallingContext const&), true>::operator()<edm::service::MessageLogger*&, edm::StreamContext const&, edm::ModuleCallingContext const&> (this=0x7fffdbe8ae20) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/functional:114
#17 0x00007fffeebf538a in std::__invoke_impl<void, std::_Mem_fn<void (edm::service::MessageLogger::*)(edm::StreamContext const&, edm::ModuleCallingContext const&)>&, edm::service::MessageLogger*&, edm::StreamContext const&, edm::ModuleCallingContext const&> (__f=...) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/invoke.h:60
#18 0x00007fffeebf29ad in std::__invoke<std::_Mem_fn<void (edm::service::MessageLogger::*)(edm::StreamContext const&, edm::ModuleCallingContext const&)>&, edm::service::MessageLogger*&, edm::StreamContext const&, edm::ModuleCallingContext const&> (__fn=...) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/invoke.h:95
#19 0x00007fffeebef5fc in std::_Bind<std::_Mem_fn<void (edm::service::MessageLogger::*)(edm::StreamContext const&, edm::ModuleCallingContext const&)> (edm::service::MessageLogger*, std::_Placeholder<1>, std::_Placeholder<2>)>::__call<void, edm::StreamContext const&, edm::ModuleCallingContext const&, 0ul, 1ul, 2ul>(std::tuple<edm::StreamContext const&, edm::ModuleCallingContext const&>&&, std::_Index_tuple<0ul, 1ul, 2ul>) (this=0x7fffdbe8ae20, __args=...) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/functional:400
#20 0x00007fffeebeb484 in std::_Bind<std::_Mem_fn<void (edm::service::MessageLogger::*)(edm::StreamContext const&, edm::ModuleCallingContext const&)> (edm::service::MessageLogger*, std::_Placeholder<1>, std::_Placeholder<2>)>::operator()<edm::StreamContext const&, edm::ModuleCallingContext const&, void>(edm::StreamContext const&, edm::ModuleCallingContext const&) (this=0x7fffdbe8ae20) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/functional:484
#21 0x00007fffeebe5b4d in std::_Function_handler<void (edm::StreamContext const&, edm::ModuleCallingContext const&), std::_Bind<std::_Mem_fn<void (edm::service::MessageLogger::*)(edm::StreamContext const&, edm::ModuleCallingContext const&)> (edm::service::MessageLogger*, std::_Placeholder<1>, std::_Placeholder<2>)> >::_M_invoke(std::_Any_data const&, edm::StreamContext const&, edm::ModuleCallingContext const&) (__functor=..., __args#0=..., __args#1=...) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:300
#22 0x00007ffff7be9e23 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#23 0x00007ffff7bea03f in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#24 0x00007ffff7bea1c0 in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#25 0x00007ffff7bea29d in tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#26 0x00007ffff628426d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7ffff1f4ab00, context_guard=..., t=0x7ffff1f45540, isolation=isolation@entry=0) at ../../include/tbb/task.h:1003
#27 0x00007ffff6284756 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7ffff1f4ab00, parent=..., child=<optimized out>) at ../../src/tbb/scheduler.h:650
#28 0x00007ffff62828f4 in tbb::internal::co_local_wait_for_all (arg=0x7ffff1f4ab00) at ../../src/tbb/scheduler.h:948
#29 0x00007ffff4ea31e0 in ?? () from /lib64/libc.so.6
#30 0x0000000000000000 in ?? ()
(gdb) up 9
#9  0x00007ffff7f304b2 in edm::ParentContext::streamContext (this=0x7fffa79d2938) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/ServiceRegistry/src/ParentContext.cc:47
47	      throw Exception(errors::LogicError) << "ParentContext::streamContext called for incorrect type of context";
(gdb) p type_
$1 = edm::ParentContext::Type::kInvalid
(gdb) p *this
$2 = {
  type_ = edm::ParentContext::Type::kInvalid,
  parent_ = {
    global = 0x0,
    internal = 0x0,
    module = 0x0,
    placeInPath = 0x0,
    stream = 0x0
  }
}
(gdb) up
#10 0x00007ffff7f300d0 in edm::ModuleCallingContext::streamContext (this=0x7fffa79d2928) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/ServiceRegistry/interface/ModuleCallingContext.h:56
56	    StreamContext const* streamContext() const { return parent_.streamContext(); }
(gdb) p *this
$3 = {
  previousModuleOnThread_ = 0x0,
  moduleDescription_ = 0x7fffb1303b48,
  parent_ = {
    type_ = edm::ParentContext::Type::kInvalid,
    parent_ = {
      global = 0x0,
      internal = 0x0,
      module = 0x0,
      placeInPath = 0x0,
      stream = 0x0
    }
  },
  state_ = edm::ModuleCallingContext::State::kInvalid
}
(gdb) p *this.moduleDescription_
$4 = {
  parameterSetID_ = {
    hash_ = "\240\245\030\025\017\002\ny\216q\326M\032\275B]"
  },
  moduleName_ = "PreMixingModule",
  moduleLabel_ = "mixData",
  processConfigurationPtr_ = 0x7fffd3ef6f90,
  id_ = 71
}
(gdb) up
#11 0x00007ffff7f2fbce in edm::ModuleCallingContext::getStreamContext (this=0x7fffa79d2928) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/ServiceRegistry/src/ModuleCallingContext.cc:40
40	    return mcc->streamContext();
(gdb) p *this
$5 = {
  previousModuleOnThread_ = 0x0,
  moduleDescription_ = 0x7fffb1303b48,
  parent_ = {
    type_ = edm::ParentContext::Type::kInvalid,
    parent_ = {
      global = 0x0,
      internal = 0x0,
      module = 0x0,
      placeInPath = 0x0,
      stream = 0x0
    }
  },
  state_ = edm::ModuleCallingContext::State::kInvalid
}
(gdb) up
#12 0x00007fffeebd4c86 in edm::service::MessageLogger::unEstablishModule (this=0x7fffd3ef7040, mod=..., state=0x7fffeec2e03d "AfterModStreamBeginRun") at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/MessageService/src/MessageLogger.cc:440
440	          auto stream = previous->getStreamContext();
(gdb) 

@Dr15Jones
Copy link
Contributor

So I see a problem here:

std::exception_ptr exceptPtr = edm::syncWait([&](edm::WaitingTaskHolder&& iHolder) {
manager.processOneOccurrenceAsync<T, U>(
std::move(iHolder), info, edm::ServiceRegistry::instance().presentToken(), streamID, topContext, context);
});

the call to edm::ServiceRegistry::instance().presentToken() is calling a thread local but the code in the lambda is not guaranteed to run in the same thread as the syncWait. The intent was to grab the token from the syncWait thread.

@Dr15Jones
Copy link
Contributor

@dan131riley do you have the traceback for the other threads in the problem you are tracing?

@dan131riley
Copy link

(gdb) thread apply all bt

Thread 5 (Thread 0x7fffc1fff700 (LWP 47252)):
#0  0x00007ffff523754d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007ffff5232eb6 in _L_lock_941 () from /lib64/libpthread.so.0
#2  0x00007ffff5232daf in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffd0571963 in cond::persistency::SessionImpl::startTransaction(bool) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libCondCoreCondDB.so
#4  0x00007fffd0558256 in cond::persistency::KeyList::loadFromDB[abi:cxx11](unsigned long long) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libCondCoreCondDB.so
#5  0x00007fffcda86768 in DTKeyedConfigCache::get(cond::persistency::KeyList const&, int, DTKeyedConfig const*&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libCondToolsDT.so
#6  0x00007fffcda86da8 in DTKeyedConfigCache::getData(cond::persistency::KeyList const&, int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libCondToolsDT.so
#7  0x00007fffcdb0e7b4 in DTConfigDBProducer::readDTCCBConfig(DTConfigManagerRcd const&, DTConfigManager&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginL1TriggerConfigDTTPGConfigProducers.so
#8  0x00007fffcdb12afe in DTConfigDBProducer::produce(DTConfigManagerRcd const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginL1TriggerConfigDTTPGConfigProducers.so
#9  0x00007fffcdb209d5 in decltype ({parm#1}()) edm::convertException::wrap<edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}>(edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginL1TriggerConfigDTTPGConfigProducers.so
#10 0x00007fffcdb20ce6 in edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginL1TriggerConfigDTTPGConfigProducers.so
#11 0x00007fffcdb21adf in void edm::SerialTaskQueueChain::actionToRun<edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&>(edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginL1TriggerConfigDTTPGConfigProducers.so
#12 0x00007fffcdb21b51 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}>(tbb::task_group&, edm::eventsetup::Callback<DTConfigDBProducer, std::unique_ptr<DTConfigManager, std::default_delete<DTConfigManager> >, DTConfigManagerRcd, edm::eventsetup::CallbackSimpleDecorator<DTConfigManagerRcd> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginL1TriggerConfigDTTPGConfigProducers.so
#13 0x00007ffff7819f29 in tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreConcurrency.so
#14 0x00007ffff628426d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7ffff1ebf200, context_guard=..., t=0x7ffff1dcba40, isolation=isolation@entry=0) at ../../include/tbb/task.h:1003
#15 0x00007ffff6284756 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7ffff1ebf200, parent=..., child=<optimized out>) at ../../src/tbb/scheduler.h:650
#16 0x00007ffff62828f4 in tbb::internal::co_local_wait_for_all (arg=0x7ffff1ebf200) at ../../src/tbb/scheduler.h:948
#17 0x00007ffff4ea31e0 in ?? () from /lib64/libc.so.6
#18 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7fffc2ea0700 (LWP 47251)):
#0  0x00007ffff4fb64cd in __memmove_ssse3_back () from /lib64/libc.so.6
#1  0x00007fffd071ca91 in CaloSubdetectorGeometry::addValidID(DetId const&) () from /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_X_2021-03-08-1100/lib/slc7_amd64_gcc900/libGeometryCaloGeometry.so
#2  0x00007fffd08e45d0 in HcalGeometry::newCell(Point3DBase<float, GlobalTag> const&, Point3DBase<float, GlobalTag> const&, Point3DBase<float, GlobalTag> const&, float const*, DetId const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libGeometryHcalTowerAlgo.so
#3  0x00007fffcded3a7b in CaloGeometryDBEP<HcalGeometry, CaloGeometryDBReader>::produceAligned(HcalGeometryRecord const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginGeometryHcalEventSetup.so
#4  0x00007fffcdee04a5 in decltype ({parm#1}()) edm::convertException::wrap<edm::eventsetup::Callback<CaloGeometryDBEP<HcalGeometry, CaloGeometryDBReader>, std::unique_ptr<CaloSubdetectorGeometry, std::default_delete<CaloSubdetectorGeometry> >, HcalGeometryRecord, edm::eventsetup::CallbackSimpleDecorator<HcalGeometryRecord> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}>(edm::eventsetup::Callback<CaloGeometryDBEP<HcalGeometry, CaloGeometryDBReader>, std::unique_ptr<CaloSubdetectorGeometry, std::default_delete<CaloSubdetectorGeometry> >, HcalGeometryRecord, edm::eventsetup::CallbackSimpleDecorator<HcalGeometryRecord> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginGeometryHcalEventSetup.so
#5  0x00007fffcdee0786 in edm::eventsetup::Callback<CaloGeometryDBEP<HcalGeometry, CaloGeometryDBReader>, std::unique_ptr<CaloSubdetectorGeometry, std::default_delete<CaloSubdetectorGeometry> >, HcalGeometryRecord, edm::eventsetup::CallbackSimpleDecorator<HcalGeometryRecord> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginGeometryHcalEventSetup.so
#6  0x00007fffcdee2a4f in void edm::SerialTaskQueueChain::actionToRun<edm::eventsetup::Callback<CaloGeometryDBEP<HcalGeometry, CaloGeometryDBReader>, std::unique_ptr<CaloSubdetectorGeometry, std::default_delete<CaloSubdetectorGeometry> >, HcalGeometryRecord, edm::eventsetup::CallbackSimpleDecorator<HcalGeometryRecord> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&>(edm::eventsetup::Callback<CaloGeometryDBEP<HcalGeometry, CaloGeometryDBReader>, std::unique_ptr<CaloSubdetectorGeometry, std::default_delete<CaloSubdetectorGeometry> >, HcalGeometryRecord, edm::eventsetup::CallbackSimpleDecorator<HcalGeometryRecord> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginGeometryHcalEventSetup.so
#7  0x00007fffcdee2ab1 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::eventsetup::Callback<CaloGeometryDBEP<HcalGeometry, CaloGeometryDBReader>, std::unique_ptr<CaloSubdetectorGeometry, std::default_delete<CaloSubdetectorGeometry> >, HcalGeometryRecord, edm::eventsetup::CallbackSimpleDecorator<HcalGeometryRecord> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}>(tbb::task_group&, edm::eventsetup::Callback<CaloGeometryDBEP<HcalGeometry, CaloGeometryDBReader>, std::unique_ptr<CaloSubdetectorGeometry, std::default_delete<CaloSubdetectorGeometry> >, HcalGeometryRecord, edm::eventsetup::CallbackSimpleDecorator<HcalGeometryRecord> >::runProducerAsync(tbb::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginGeometryHcalEventSetup.so
#8  0x00007ffff7819f29 in tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreConcurrency.so
#9  0x00007ffff628426d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7ffff1f4b200, context_guard=..., t=0x7ffff1dcb340, isolation=isolation@entry=0) at ../../include/tbb/task.h:1003
#10 0x00007ffff6284756 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7ffff1f4b200, parent=..., child=<optimized out>) at ../../src/tbb/scheduler.h:650
#11 0x00007ffff62828f4 in tbb::internal::co_local_wait_for_all (arg=0x7ffff1f4b200) at ../../src/tbb/scheduler.h:948
#12 0x00007ffff4ea31e0 in ?? () from /lib64/libc.so.6
#13 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7fffc38a1700 (LWP 47250)):
#0  0x00007ffff6215f67 in ?? () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libz.so.1
#1  0x00007ffff6217097 in inflate () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libz.so.1
#2  0x00007fffc7f756a1 in fn_gunzip_update () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#3  0x00007fffc7f83303 in frontierMemData_append () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#4  0x00007fffc7f83668 in frontierMemData_b64append () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#5  0x00007fffc7f83bcf in frontierPayload_append () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#6  0x00007fffc7f84237 in xml_cdata () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#7  0x00007fffc7f33ebc in doContent (parser=parser@entry=0x7fffc2051800, startTagLevel=startTagLevel@entry=0, enc=<optimized out>, s=<optimized out>, end=<optimized out>, nextPtr=0x7fffc2051830, haveMore=1 '\001') at lib/xmlparse.c:2653
#8  0x00007fffc7f34bec in contentProcessor (parser=0x7fffc2051800, start=<optimized out>, end=<optimized out>, endPtr=<optimized out>) at lib/xmlparse.c:2105
#9  0x00007fffc7f36e43 in XML_ParseBuffer (isFinal=0, len=<optimized out>, parser=0x7fffc2051800) at lib/xmlparse.c:1651
#10 XML_ParseBuffer (parser=0x7fffc2051800, len=<optimized out>, isFinal=0) at lib/xmlparse.c:1623
#11 0x00007fffc7f84d04 in FrontierResponse_append () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#12 0x00007fffc7f7d97a in write_data () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#13 0x00007fffc7f7eb7a in get_data () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#14 0x00007fffc7f7f26e in frontier_postRawData () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#15 0x00007fffc7f7eedc in frontier_getRawData () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#16 0x00007fffc7f76a15 in frontier::Session::getData(std::vector<frontier::Request const*, std::allocator<frontier::Request const*> > const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/libfrontier_client.so.2
#17 0x00007fffc7fefca4 in coral::FrontierAccess::Statement::execute(coral::AttributeList const&, int) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/liblcg_FrontierAccess.so
#18 0x00007fffc7fe24b9 in coral::FrontierAccess::Query::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/external/slc7_amd64_gcc900/lib/liblcg_FrontierAccess.so
#19 0x00007fffd0543535 in cond::persistency::PAYLOAD::Table::select(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, cond::Binary&, cond::Binary&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libCondCoreCondDB.so
#20 0x00007fffc6deef59 in std::unique_ptr<PCaloGeometry, std::default_delete<PCaloGeometry> > cond::persistency::Session::fetchPayload<PCaloGeometry>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginCondCoreGeometryPluginsPlugins.so
#21 0x00007fffc6def36b in cond::persistency::PayloadProxy<PCaloGeometry>::loadPayload() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginCondCoreGeometryPluginsPlugins.so
#22 0x00007fffc6de80b3 in DataProxy<PEcalEndcapRcd, PCaloGeometry, cond::DefaultInitializer<PCaloGeometry> >::prefetch(edm::eventsetup::DataKey const&, edm::EventSetupRecordDetails) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/pluginCondCoreGeometryPluginsPlugins.so
#23 0x00007ffff7bb29cb in edm::SerialTaskQueue::QueuedTask<edm::eventsetup::ESSourceDataProxyBase::prefetchAsyncImpl(edm::WaitingTaskHolder, edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, edm::EventSetupImpl const*, edm::ServiceToken const&, edm::ESParentContext const&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#24 0x00007ffff7819f29 in tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreConcurrency.so
#25 0x00007ffff628426d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7ffff1eb7200, context_guard=..., t=0x7ffff1dcb240, isolation=isolation@entry=0) at ../../include/tbb/task.h:1003
#26 0x00007ffff6284756 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7ffff1eb7200, parent=..., child=<optimized out>) at ../../src/tbb/scheduler.h:650
#27 0x00007ffff62828f4 in tbb::internal::co_local_wait_for_all (arg=0x7ffff1eb7200) at ../../src/tbb/scheduler.h:948
#28 0x00007ffff4ea31e0 in ?? () from /lib64/libc.so.6
#29 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7fffd3a37700 (LWP 47214)):
#0  0x00007ffff523775d in read () from /lib64/libpthread.so.0
#1  0x00007fffec3acf2b in (anonymous namespace)::full_read (fd=8, stencyE=0x7fffd3a30042 "", len=1, y_order=-1) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/Services/plugins/InitRootHandlers.cc:399
#2  0x00007fffec3ad8ab in edm::service::InitRootHandlers::stacktraceHelperThread () at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/Services/plugins/InitRootHandlers.cc:646
#3  0x00007fffec3b89f5 in std::__invoke_impl<void, void (*)()> (__f=@0x7fffd3bdb4c8: 0x7fffec3ad86e <edm::service::InitRootHandlers::stacktraceHelperThread()>) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/invoke.h:60
#4  0x00007fffec3b8805 in std::__invoke<void (*)()> (wprintf=@0x7fffd3bdb4c8: 0x7fffec3ad86e <edm::service::InitRootHandlers::stacktraceHelperThread()>) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/invoke.h:95
#5  0x00007fffec3b833c in std::thread::_Invoker<std::tuple<void (*)()> >::_M_invoke<0ul> (OT2_=0x7fffd3bdb4c8) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/thread:244
#6  0x00007fffec3b7f3c in std::thread::_Invoker<std::tuple<void (*)()> >::operator() (OT2_=0x7fffd3bdb4c8) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/thread:251
#7  0x00007fffec3b7c18 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)()> > >::_M_run (OT2_=0x7fffd3bdb4c0) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/thread:195
#8  0x00007ffff5831af0 in std::execute_native_thread_routine (__p=0x7fffd3bdb4c0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#9  0x00007ffff5230ea5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff4f599fd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7ffff3532540 (LWP 47162)):
#0  0x00007ffff4e913d7 in raise () from /lib64/libc.so.6
#1  0x00007ffff4e92ac8 in abort () from /lib64/libc.so.6
#2  0x00007ffff57fc683 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007ffff58080a6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x00007ffff58071a9 in __cxa_call_terminate (ue_header=ue_header@entry=0x7fff9f076360) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#5  0x00007ffff5807ad4 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=0x7fff9f076360, context=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:677
#6  0x00007ffff54567b3 in _Unwind_RaiseException_Phase2 (exc=0x7fff9f076360, context=0x7fff96de8290, frames_p=0x7fff96de8380) at ../../../libgcc/unwind.inc:64
#7  0x00007ffff5456cf1 in _Unwind_RaiseException (exc=exc@entry=0x7fff9f076360) at ../../../libgcc/unwind.inc:136
#8  0x00007ffff5808358 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7ffff77b0e60 <typeinfo for edm::Exception>, dest=0x7ffff7729fc2 <edm::Exception::~Exception()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:90
#9  0x00007ffff7f304b2 in edm::ParentContext::streamContext (this=0x7fffa79d2938) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/ServiceRegistry/src/ParentContext.cc:47
#10 0x00007ffff7f300d0 in edm::ModuleCallingContext::streamContext (this=0x7fffa79d2928) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/ServiceRegistry/interface/ModuleCallingContext.h:56
#11 0x00007ffff7f2fbce in edm::ModuleCallingContext::getStreamContext (this=0x7fffa79d2928) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/ServiceRegistry/src/ModuleCallingContext.cc:40
#12 0x00007fffeebd4c86 in edm::service::MessageLogger::unEstablishModule (this=0x7fffd3ef7040, mod=..., state=0x7fffeec2e03d "AfterModStreamBeginRun") at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/MessageService/src/MessageLogger.cc:440
#13 0x00007fffeebd5486 in edm::service::MessageLogger::postModuleStreamBeginRun (this=0x7fffd3ef7040, stream=..., mod=...) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/MessageService/src/MessageLogger.cc:547
#14 0x00007fffeebf92a9 in std::__invoke_impl<void, void (edm::service::MessageLogger::* const&)(edm::StreamContext const&, edm::ModuleCallingContext const&), edm::service::MessageLogger*&, edm::StreamContext const&, edm::ModuleCallingContext const&> (__f=@0x7fffdbe8ae20: (void (edm::service::MessageLogger::*)(edm::service::MessageLogger * const, const edm::StreamContext &, const edm::ModuleCallingContext &)) 0x7fffeebd5458 <edm::service::MessageLogger::postModuleStreamBeginRun(edm::StreamContext const&, edm::ModuleCallingContext const&)>, __t=@0x7fffdbe8ae30: 0x7fffd3ef7040) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/invoke.h:73
#15 0x00007fffeebf858b in std::__invoke<void (edm::service::MessageLogger::* const&)(edm::StreamContext const&, edm::ModuleCallingContext const&), edm::service::MessageLogger*&, edm::StreamContext const&, edm::ModuleCallingContext const&> (__fn=@0x7fffdbe8ae20: (void (edm::service::MessageLogger::*)(edm::service::MessageLogger * const, const edm::StreamContext &, const edm::ModuleCallingContext &)) 0x7fffeebd5458 <edm::service::MessageLogger::postModuleStreamBeginRun(edm::StreamContext const&, edm::ModuleCallingContext const&)>) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/invoke.h:95
#16 0x00007fffeebf7728 in std::_Mem_fn_base<void (edm::service::MessageLogger::*)(edm::StreamContext const&, edm::ModuleCallingContext const&), true>::operator()<edm::service::MessageLogger*&, edm::StreamContext const&, edm::ModuleCallingContext const&> (this=0x7fffdbe8ae20) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/functional:114
#17 0x00007fffeebf538a in std::__invoke_impl<void, std::_Mem_fn<void (edm::service::MessageLogger::*)(edm::StreamContext const&, edm::ModuleCallingContext const&)>&, edm::service::MessageLogger*&, edm::StreamContext const&, edm::ModuleCallingContext const&> (__f=...) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/invoke.h:60
#18 0x00007fffeebf29ad in std::__invoke<std::_Mem_fn<void (edm::service::MessageLogger::*)(edm::StreamContext const&, edm::ModuleCallingContext const&)>&, edm::service::MessageLogger*&, edm::StreamContext const&, edm::ModuleCallingContext const&> (__fn=...) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/invoke.h:95
#19 0x00007fffeebef5fc in std::_Bind<std::_Mem_fn<void (edm::service::MessageLogger::*)(edm::StreamContext const&, edm::ModuleCallingContext const&)> (edm::service::MessageLogger*, std::_Placeholder<1>, std::_Placeholder<2>)>::__call<void, edm::StreamContext const&, edm::ModuleCallingContext const&, 0ul, 1ul, 2ul>(std::tuple<edm::StreamContext const&, edm::ModuleCallingContext const&>&&, std::_Index_tuple<0ul, 1ul, 2ul>) (this=0x7fffdbe8ae20, __args=...) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/functional:400
#20 0x00007fffeebeb484 in std::_Bind<std::_Mem_fn<void (edm::service::MessageLogger::*)(edm::StreamContext const&, edm::ModuleCallingContext const&)> (edm::service::MessageLogger*, std::_Placeholder<1>, std::_Placeholder<2>)>::operator()<edm::StreamContext const&, edm::ModuleCallingContext const&, void>(edm::StreamContext const&, edm::ModuleCallingContext const&) (this=0x7fffdbe8ae20) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/functional:484
#21 0x00007fffeebe5b4d in std::_Function_handler<void (edm::StreamContext const&, edm::ModuleCallingContext const&), std::_Bind<std::_Mem_fn<void (edm::service::MessageLogger::*)(edm::StreamContext const&, edm::ModuleCallingContext const&)> (edm::service::MessageLogger*, std::_Placeholder<1>, std::_Placeholder<2>)> >::_M_invoke(std::_Any_data const&, edm::StreamContext const&, edm::ModuleCallingContext const&) (__functor=..., __args#0=..., __args#1=...) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:300
#22 0x00007ffff7be9e23 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#23 0x00007ffff7bea03f in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#24 0x00007ffff7bea1c0 in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#25 0x00007ffff7bea29d in tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-08-1800/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#26 0x00007ffff628426d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7ffff1f4ab00, context_guard=..., t=0x7ffff1f45540, isolation=isolation@entry=0) at ../../include/tbb/task.h:1003
#27 0x00007ffff6284756 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7ffff1f4ab00, parent=..., child=<optimized out>) at ../../src/tbb/scheduler.h:650
#28 0x00007ffff62828f4 in tbb::internal::co_local_wait_for_all (arg=0x7ffff1f4ab00) at ../../src/tbb/scheduler.h:948
#29 0x00007ffff4ea31e0 in ?? () from /lib64/libc.so.6
#30 0x0000000000000000 in ?? ()
(gdb) 

@Dr15Jones
Copy link
Contributor

So I fixed the SecondaryEventProvider call but I am still getting the crash in service::MessageLogger::unEstablishModule

@Dr15Jones
Copy link
Contributor

From the crash I see, the problem seem to be that ModuleCallingContext::previousModuleOnThread is returning a value other than nullptr. Looking at the traceback, there is no other module.

@Dr15Jones
Copy link
Contributor

So for debugging purposes I added an assert(previousOnThread_ == nullptr) everywhere in ModuleCallingContext that sets the value to see where it happens. The job aborted on a thread that didn't have any other modules on it.

@Dr15Jones
Copy link
Contributor

Here is something really weird I've noticed about some of the jobs (including 1 I ran in the debugger). Thread 1 does not have main in it. It looks like a dynamically spawned thread

Thread 1 (Thread 0x7fdd11e5b540 (LWP 16624)):
#0  0x00007fdd1384987d in nanosleep () from /lib64/libc.so.6
#1  0x00007fdd13849714 in sleep () from /lib64/libc.so.6
#2  0x00007fdd06c39fd0 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_X_2021-03-08-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fdbbb71e5c0 in PreMixingHcalWorker::beginRun(edm::Run const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_X_2021-03-08-2300/lib/slc7_amd64_gcc900/pluginSimCalorimetryHcalSimProducersPlugins.so
#5  0x00007fdc5496433c in edm::PreMixingModule::beginRun(edm::Run const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_X_2021-03-08-2300/lib/slc7_amd64_gcc900/pluginSimGeneralPreMixingModulePlugins.so
#6  0x00007fdd16bcf900 in edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::doStreamBeginRun (this=0x7fdc3a578700, id=..., info=..., mcc=0x7fdc469842e8) at /uscms_data/d2/cdj/build/temp/fix/task_group/CMSSW_11_3_X_2021-03-08-2300/src/FWCore/Framework/src/stream/ProducingModuleAdaptorBase.cc:200
#7  0x00007fdd16b8c9fb in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::callWorkerStreamBegin<int> (this=0x7fdc469842c0, id=..., info=..., mcc=0x7fdc469842e8) at /uscms_data/d2/cdj/build/temp/fix/task_group/CMSSW_11_3_X_2021-03-08-2300/src/FWCore/Framework/src/WorkerT.cc:347
#8  0x00007fdd16b8a336 in edm::workerimpl::DoStreamBeginTrans<edm::stream::EDProducerAdaptorBase, edm::RunTransitionInfo const>::operator() (this=0x7fdbb9df7cff, iWorker=0x7fdc469842c0, id=..., info=..., mcc=0x7fdc469842e8) at /uscms_data/d2/cdj/build/temp/fix/task_group/CMSSW_11_3_X_2021-03-08-2300/src/FWCore/Framework/src/WorkerT.cc:97
#9  0x00007fdd16b81eb5 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDoStreamBegin (this=0x7fdc469842c0, id=..., info=..., mcc=0x7fdc469842e8) at /uscms_data/d2/cdj/build/temp/fix/task_group/CMSSW_11_3_X_2021-03-08-2300/src/FWCore/Framework/src/WorkerT.cc:364
#10 0x00007fdd1697f7a1 in edm::workerhelper::CallImpl<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >::call (iWorker=0x7fdc469842c0, id=..., info=..., actReg=0x7fdd09958c10, mcc=0x7fdc469842e8, context=0x7fdc45de2810) at /uscms_data/d2/cdj/build/temp/fix/task_group/CMSSW_11_3_X_2021-03-08-2300/src/FWCore/Framework/src/Worker.h:677
#11 0x00007fdd169b6f8b in edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const (this=0x7fdc469842c0) at /uscms_data/d2/cdj/build/temp/fix/task_group/CMSSW_11_3_X_2021-03-08-2300/src/FWCore/Framework/src/Worker.h:1099
#12 0x00007fdd169bb011 in edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) (iFunc=...) at /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_X_2021-03-08-2300/src/FWCore/Utilities/interface/ConvertException.h:21
#13 0x00007fdd169b705a in edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> > (this=0x7fdc469842c0, transitionInfo=..., streamID=..., parentContext=..., context=0x7fdc45de2810) at /uscms_data/d2/cdj/build/temp/fix/task_group/CMSSW_11_3_X_2021-03-08-2300/src/FWCore/Framework/src/Worker.h:1098
#14 0x00007fdd169b2b15 in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const (this=0x7fdc469842c0) at /uscms_data/d2/cdj/build/temp/fix/task_group/CMSSW_11_3_X_2021-03-08-2300/src/FWCore/Framework/src/Worker.h:1048
#15 0x00007fdd169ca752 in tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() (this=0x7fdba8fb7d40) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/tbb/2020_U3-ljfedo/include/tbb/task.h:1059
#16 0x00007fdd14ba626d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7fdc98163900, context_guard=..., t=0x7fdba8fb7d40, isolation=isolation@entry=0) at ../../include/tbb/task.h:1003
#17 0x00007fdd14ba6756 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7fdc98163900, parent=..., child=<optimized out>) at ../../src/tbb/scheduler.h:650
#18 0x00007fdd14b9da27 in tbb::internal::arena::process (this=0x7fdd1112ed00, s=...) at ../../src/tbb/arena.cpp:196
#19 0x00007fdd14b9c3f0 in tbb::internal::market::process (this=0x7fdd1112f580, j=...) at ../../src/tbb/market.cpp:667
#20 0x00007fdd14b98bdc in tbb::internal::rml::private_worker::run (this=0x7fdd0726e080) at ../../src/tbb/private_server.cpp:266
#21 0x00007fdd14b98dd9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#22 0x00007fdd13b5aea5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007fdd1388298d in clone () from /lib64/libc.so.6

In fact, thread 8 which is where tha abort happens looks more like what I would expect to be the main thread

Thread 8 (Thread 0x7fdbb9dff700 (LWP 17124)):
#0  0x00007fdd13877c5d in poll () from /lib64/libc.so.6
#1  0x00007fdd06c3a587 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_X_2021-03-08-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2  0x00007fdd06c3adec in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_X_2021-03-08-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  0x00007fdd06c3bed2 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_X_2021-03-08-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fdd137ba387 in raise () from /lib64/libc.so.6
#6  0x00007fdd137bba78 in abort () from /lib64/libc.so.6
#7  0x00007fdd137b31a6 in __assert_fail_base () from /lib64/libc.so.6
#8  0x00007fdd137b3252 in __assert_fail () from /lib64/libc.so.6
#9  0x00007fdd16f62b41 in edm::ModuleCallingContext::setContext (this=0x7fdc46984228, state=edm::ModuleCallingContext::State::kRunning, parent=..., previousOnThread=0x7fdc469842e8) at /uscms_data/d2/cdj/build/temp/fix/task_group/CMSSW_11_3_X_2021-03-08-2300/src/FWCore/ServiceRegistry/src/ModuleCallingContext.cc:31
#10 0x00007fdd16938284 in edm::ModuleContextSentry::ModuleContextSentry (this=0x7fdb9ec24d88, moduleCallingContext=0x7fdc46984228, parentContext=...) at /uscms_data/d2/cdj/build/temp/fix/task_group/CMSSW_11_3_X_2021-03-08-2300/src/FWCore/Framework/interface/ModuleContextSentry.h:15
#11 0x00007fdd169b7010 in edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> > (this=0x7fdc46984200, transitionInfo=..., streamID=..., parentContext=..., context=0x7fdc45de2810) at /uscms_data/d2/cdj/build/temp/fix/task_group/CMSSW_11_3_X_2021-03-08-2300/src/FWCore/Framework/src/Worker.h:1091
#12 0x00007fdd169b2b15 in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const (this=0x7fdc46984200) at /uscms_data/d2/cdj/build/temp/fix/task_group/CMSSW_11_3_X_2021-03-08-2300/src/FWCore/Framework/src/Worker.h:1048
#13 0x00007fdd169ca752 in tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() (this=0x7fdba8fb7e40) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/tbb/2020_U3-ljfedo/include/tbb/task.h:1059
#14 0x00007fdd14ba626d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7fdc98162400, context_guard=..., t=0x7fdba8fb7e40, isolation=isolation@entry=0) at ../../include/tbb/task.h:1003
#15 0x00007fdd14ba6756 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7fdc98162400, parent=..., child=<optimized out>) at ../../src/tbb/scheduler.h:650
#16 0x00007fdd14ba48f4 in tbb::internal::co_local_wait_for_all (arg=0x7fdc98162400) at ../../src/tbb/scheduler.h:948
#17 0x00007fdd137cc190 in ?? () from /lib64/libc.so.6
#18 0x0000000000000000 in ?? ()

@dan131riley
Copy link

The segfaults at exit are do_notify_exit_observers() again, and I'm seeing the same oddity--in the stack trace (from a core dump), thread 1 was started from clone() and thread 4 is the one running _start():

(gdb) thread apply all bt

Thread 5 (Thread 0x7f2fd8e36700 (LWP 113834)):
#0  0x00007f3000ee475d in read () from /lib64/libpthread.so.0
#1  0x00007f2fee0cdf0b in (anonymous namespace)::full_read (fd=8, inbuf=0x7f2fd8e2f042 "", len=1, timeout_s=-1) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/Services/plugins/InitRootHandlers.cc:402
#2  0x00007f2fee0ce884 in edm::service::stacktraceHelperThread () at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/Services/plugins/InitRootHandlers.cc:671
#3  0x00007f2fee0d9989 in std::__invoke_impl<void, void (*)()> (__f=@0x7f2ff0ea3948: 0x7f2fee0ce84e <edm::service::stacktraceHelperThread()>) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/invoke.h:60
#4  0x00007f2fee0d9799 in std::__invoke<void (*)()> (__fn=@0x7f2ff0ea3948: 0x7f2fee0ce84e <edm::service::stacktraceHelperThread()>) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/bits/invoke.h:95
#5  0x00007f2fee0d92d0 in std::thread::_Invoker<std::tuple<void (*)()> >::_M_invoke<0ul> (this=0x7f2ff0ea3948) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/thread:244
#6  0x00007f2fee0d8ed0 in std::thread::_Invoker<std::tuple<void (*)()> >::operator() (this=0x7f2ff0ea3948) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/thread:251
#7  0x00007f2fee0d8bac in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)()> > >::_M_run (this=0x7f2ff0ea3940) at /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/include/c++/9.3.0/thread:195
#8  0x00007f30014deaf0 in std::execute_native_thread_routine (__p=0x7f2ff0ea3940) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#9  0x00007f3000eddea5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f3000c069fd in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f2fff1df540 (LWP 113754)):
#0  0x00007f3001106480 in _fini () from /cvmfs/cms-ib.cern.ch/nweek-02671/slc7_amd64_gcc900/external/gcc/9.3.0/lib64/libgcc_s.so.1
#1  0x00007f300426d0a8 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#2  0x00007f3000b41d39 in __run_exit_handlers () from /lib64/libc.so.6
#3  0x00007f3000b41d87 in exit () from /lib64/libc.so.6
#4  0x00007f3000b2a55c in __libc_start_main () from /lib64/libc.so.6
#5  0x0000000000437999 in _start ()

Thread 3 (Thread 0x7f2fb5dfe700 (LWP 113860)):
#0  tbb::internal::observer_list::do_notify_exit_observers (this=0x7f3001f4c800 <tbb::internal::the_global_observer_list>, last=0x7f2ff10a6f20, worker=worker@entry=true) at ../../src/tbb/observer_proxy.cpp:273
#1  0x00007f3001f2d2b3 in tbb::internal::observer_list::notify_exit_observers (worker=true, last=@0x7f2ffda37950: 0x7f2ff10a6f20, this=<optimized out>) at ../../src/tbb/observer_proxy.h:157
#2  tbb::internal::generic_scheduler::cleanup_worker (arg=0x7f2ffda37900, worker=<optimized out>) at ../../src/tbb/scheduler.cpp:1336
#3  0x00007f3001f23bf0 in tbb::internal::rml::private_worker::run (this=0x7f2ff6490100) at ../../src/tbb/private_server.cpp:282
#4  0x00007f3001f23dd9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#5  0x00007f3000eddea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f3000c069fd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f2fb49ff700 (LWP 113861)):
#0  0x00007f3000bcd8ed in nanosleep () from /lib64/libc.so.6
#1  0x00007f3000bcd784 in sleep () from /lib64/libc.so.6
#2  0x00007f2fee0ce849 in (anonymous namespace)::sig_abort (sig=11) at /mnt/data1/dsr/tmp/CMSSW_11_3_X_2021-03-08-1800/src/FWCore/Services/plugins/InitRootHandlers.cc:647
#3  <signal handler called>
#4  tbb::internal::observer_list::do_notify_exit_observers (this=0x7f3001f4c800 <tbb::internal::the_global_observer_list>, last=0x7f2ff10a6f20, worker=worker@entry=true) at ../../src/tbb/observer_proxy.cpp:273
#5  0x00007f3001f2d2b3 in tbb::internal::observer_list::notify_exit_observers (worker=true, last=@0x7f2ffda1f950: 0x7f2ff10a6f20, this=<optimized out>) at ../../src/tbb/observer_proxy.h:157
#6  tbb::internal::generic_scheduler::cleanup_worker (arg=0x7f2ffda1f900, worker=<optimized out>) at ../../src/tbb/scheduler.cpp:1336
#7  0x00007f3001f23bf0 in tbb::internal::rml::private_worker::run (this=0x7f2ff6490000) at ../../src/tbb/private_server.cpp:282
#8  0x00007f3001f23dd9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#9  0x00007f3000eddea5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f3000c069fd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f2fb67ff700 (LWP 113859)):
#0  tbb::internal::observer_list::do_notify_exit_observers (this=0x7f3001f4c800 <tbb::internal::the_global_observer_list>, last=0x7f2ff10a6f20, worker=worker@entry=true) at ../../src/tbb/observer_proxy.cpp:273
#1  0x00007f3001f2d2b3 in tbb::internal::observer_list::notify_exit_observers (worker=true, last=@0x7f2ffda27950: 0x7f2ff10a6f20, this=<optimized out>) at ../../src/tbb/observer_proxy.h:157
#2  tbb::internal::generic_scheduler::cleanup_worker (arg=0x7f2ffda27900, worker=<optimized out>) at ../../src/tbb/scheduler.cpp:1336
#3  0x00007f3001f23bf0 in tbb::internal::rml::private_worker::run (this=0x7f2ff6490080) at ../../src/tbb/private_server.cpp:282
#4  0x00007f3001f23dd9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#5  0x00007f3000eddea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f3000c069fd in clone () from /lib64/libc.so.6
(gdb) 

@Dr15Jones
Copy link
Contributor

So the thread which says it has a previous module running on it, the context information says the previous module is the 'PreMixingModule'.

@dan131riley
Copy link

Apparently my workaround for the notify_exit_observers() crash only works on some workflows. On my aarch64 VM with debug TBB the wf I was testing is still fine, but with wf 136.885501 I get an assertion failure, and I get crashes on amd64 even after completely disabling the InitRootHandlers task_scheduler_observer. Feels like a TBB bug.

@Dr15Jones
Copy link
Contributor

Dr15Jones commented Mar 9, 2021

So the previous module on thread is supposed to be non-nullptr inside the PreMixingModule because of this

ModuleContextSentry moduleContextSentry(&moduleCallingContext, parentContext);

However, when I was seeing the problem the previous module on thread was going off in BEGIN RUN which is not a place that PreMixingModule will be called from.

@Dr15Jones
Copy link
Contributor

So I did spot one behavioral difference between the old implementation and the new one. In

processOneOccurrence<OccurrenceTraits<RunPrincipal, BranchActionStreamEnd> >(
workerManager_, info, sContext.streamID(), &sContext, mcc);
processOneOccurrence<OccurrenceTraits<RunPrincipal, BranchActionGlobalEnd> >(
workerManager_, info, StreamID::invalidStreamID(), nullptr, mcc);

previously the processOneOccurenceAsync was actually a no-op if no workers are associated with the provider. Now the implementation uses syncWait which must always create a tbb task (by calling tbb::task_group::run) where that task could have been stolen by another thread and then the tbb::task_group::wait done in the syncWait call could itself steal another task and run a module on top of the PreMixingModule. Such behavior would cause the module previously on thread to be set.

@Dr15Jones
Copy link
Contributor

So if I short circuit SecondaryEventProvider calls so that no call to edm::syncWait happens if there are not modules associated with the provider then it appears that the problem goes away. Although this avoids the symptom we see I can't fully explain why the module previously on thread is not behaving properly.

I'll be making a pull request shortly.

@Dr15Jones
Copy link
Contributor

See #33125

@Dr15Jones
Copy link
Contributor

@dan131riley if you get a chance, could you see if #33125 also fixes the case you were looking at?

@dan131riley
Copy link

@dan131riley if you get a chance, could you see if #33125 also fixes the case you were looking at?

I'll give it a try but I doubt it will make a difference. I noticed earlier that my initial attempts to replicate the problem didn't succeed until I added the TimeMemorySummary customization, and now my latest assertion failure has me looking at the FastTimerService:

#3  tbb::assertion_failure (filename=filename@entry=0xfffff64cbc08 "../../src/tbb/observer_proxy.cpp", line=line@entry=340, expression=expression@entry=0xfffff64cc078 "proxy->my_ref_count >= 1", comment=comment@entry=0xfffff64cc058 "reference for observer missing") at ../../src/tbb/tbb_assert_impl.h:56
#4  0x0000fffff64c0018 in tbb::internal::task_scheduler_observer_v3::observe (this=0xffffa170b200, enable=<optimized out>) at ../../src/tbb/observer_proxy.cpp:323
#5  0x0000fffff19f4500 in FastTimerService::~FastTimerService() () from /cvmfs/cms-ib.cern.ch/week1/cc8_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-03-08-2300/lib/cc8_aarch64_gcc9/pluginHLTriggerTimerPlugins.so
#6  0x0000fffff19f45c0 in edm::serviceregistry::ServiceWrapper<FastTimerService>::~ServiceWrapper() () from /cvmfs/cms-ib.cern.ch/week1/cc8_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-03-08-2300/lib/cc8_aarch64_gcc9/pluginHLTriggerTimerPlugins.so
#7  0x0000fffff7b5867c in edm::serviceregistry::ServicesManager::~ServicesManager() () from /cvmfs/cms-ib.cern.ch/week1/cc8_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-03-08-2300/lib/cc8_aarch64_gcc9/libFWCoreServiceRegistry.so
#8  0x0000fffff7b58774 in edm::serviceregistry::ServicesManager::~ServicesManager() () from /cvmfs/cms-ib.cern.ch/week1/cc8_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-03-08-2300/lib/cc8_aarch64_gcc9/libFWCoreServiceRegistry.so
#9  0x0000fffff7b58774 in edm::serviceregistry::ServicesManager::~ServicesManager() () from /cvmfs/cms-ib.cern.ch/week1/cc8_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-03-08-2300/lib/cc8_aarch64_gcc9/libFWCoreServiceRegistry.so
#10 0x00000000004149a8 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
#11 0x0000fffff7da623c in edm::EventProcessor::~EventProcessor() () from /cvmfs/cms-ib.cern.ch/week1/cc8_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-03-08-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#12 0x000000000040f288 in (anonymous namespace)::EventProcessorWithSentry::~EventProcessorWithSentry() ()
#13 0x000000000040ee78 in main ()

@dan131riley
Copy link

FastTimerService needed the same treatment as the task_scheduler_observer in InitRootHandlers, see #33126.

@Dr15Jones
Copy link
Contributor

Although I found a fix to stop the symptom we saw (i.e. the crashing) I do not fully understand the underlying cause.

So what is being seen is the edm::CurrentModuleOnThread::getCurrentModuleOnThread() is returning a non nullptr and the information associated to that ModuleCallingContext is all invalid, except for the ModuleDescription which says it is from "PreMixingModule".

When I run the Tracer service in the job, I see that all stream instances of the PreMixingModule have already been run for the stream end Run transition before the problem happens so non of them are on the 'current thread'.

Only ModuleContextSentry manipulates the values in edm::CurrentModuleOnThread, setting it in its constructor and destructor. When I added an assert to the destructor

~ModuleContextSentry() {
  assert(moduleCallingContext_->previousModuleOnThread() == nullptr);
  CurrentModuleOnThread::setCurrentModuleOnThread(moduleCallingContext_->previousModuleOnThread());
  moduleCallingContext_->setContext(ModuleCallingContext::State::kInvalid, ParentContext(), nullptr);
}

That never went off even though the assert I added to the constructor

ModuleContextSentry(ModuleCallingContext* moduleCallingContext, ParentContext const& parentContext)
 : moduleCallingContext_(moduleCallingContext) {
   assert(nullptr == CurrentModuleOnThread::getCurrentModuleOnThread());
   moduleCallingContext_->setContext(
       ModuleCallingContext::State::kRunning, parentContext, CurrentModuleOnThread::getCurrentModuleOnThread());
   CurrentModuleOnThread::setCurrentModuleOnThread(moduleCallingContext_);
}

Does go off, saying that PreMixingModule is already on the current thread, even though I know from a gdb backtrace AND the Tracer service telling me that is not true.

So it seems to behave as if ~ModlueContextSentry is never called even though the object itself is on the stack.

@Dr15Jones
Copy link
Contributor

Alice, we have passed beyond the looking glass. Given that I'm seeing crazy behavior, I decided to pursue crazy. In the constructor of ModuleContextSentry I record (as member data of the class) the address of the thread local variable &CurrentModuleOnThread::currentModuleOnThread_. In ~ModuleContextSentry I compare the address recorded in the constructor to the one thread local address again. I am seeing jobs where they are NOT EQUAL.

My hypothesis is calling tbb::task::suspend is allowing the entire stack to move to another thread!

@dan131riley
Copy link

OneAPI docs say:

"There are no guarantees that the same thread that called tbb::task::suspend continues execution after the suspended point. However, these guarantees are provided for the outermost blocking oneTBB calls (such as tbb::parallel_for and tbb::flow::graph::wait_for_all) and tbb::task_arena::execute calls."

@Dr15Jones
Copy link
Contributor

So our other uses work because I have a task_arena. Good to know.

@makortel
Copy link
Contributor Author

+1

No further signs of these failures

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants