Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infinite loop or deadlock, possibly in SiStripDigitizerAlgorithm::accumulateSimHits #15376

Closed
dan131riley opened this issue Aug 5, 2016 · 14 comments

Comments

@dan131riley
Copy link

dan131riley commented Aug 5, 2016

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc530/CMSSW_8_1_THREADED_X_2016-08-03-1100/pyRelValMatrixLogs/run/202.0_TTbar+TTbarINPUT+DIGIPU1+RECOPU1+HARVEST/step2_TTbar+TTbarINPUT+DIGIPU1+RECOPU1+HARVEST.log

Killed by an external termination signal, with three threads idle. Possibly an infinite loop in SiStripDigitizerAlgorithm::accumulateSimHits?

%MSG-w MemoryCheck:  PoolOutputModule:FEVTDEBUGHLToutput  03-Aug-2016 21:47:08 CEST Run: 1 Event: 110
MemoryCheck: module PoolOutputModule:FEVTDEBUGHLToutput VSIZE 51726.2 16 RSS 1256.51 -35.8203
%MSG
[2016-08-03 23:29:36.767610 +0200][Error  ][XRootD            ] [p05799459y31761.cern.ch:1095] Impossible to send message kXR_read (handle: 0x01000000, offset: 1176773446, size: 527401). Trying to recover.
[2016-08-03 23:29:38.671294 +0200][Error  ][XRootD            ] [p05799459y31761.cern.ch:1095] Unable to get the response to request kXR_read (handle: 0x01000000, offset: 1176773446, size: 527401)


A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Thread 10 (Thread 0x7f66b317f700 (LWP 32352)):
#0  0x0000003f0160f37d in waitpid () from /lib64/libpthread.so.0
#1  0x00007f66bffd96b7 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginFWCoreServicesPlugins.so
#2  0x00007f66bffd9f62 in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginFWCoreServicesPlugins.so
#3  0x00007f66c2e88a50 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:84
#4  0x0000003f01607aa1 in start_thread () from /lib64/libpthread.so.0
#5  0x0000003f012e8aad in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7f66ab367700 (LWP 32353)):
#0  0x0000003f012e90a3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f66ab4343ba in XrdSys::IOEvents::PollE::Begin (this=0x7f66ac028a60, syncsem=<optimized out>, retcode=<optimized out>, eTxt=<optimized out>) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/./XrdSys/XrdSysIOEventsPollE.icc:213
#2  0x00007f66ab430a8d in XrdSys::IOEvents::BootStrap::Start (parg=0x7ffcbbbb91e0) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/XrdSys/XrdSysIOEvents.cc:131
#3  0x00007f66ab42f647 in XrdSysThread_Xeq (myargs=0x7f66ab5fbb20) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/XrdSys/XrdSysPthread.cc:86
#4  0x0000003f01607aa1 in start_thread () from /lib64/libpthread.so.0
#5  0x0000003f012e8aad in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f66aa966700 (LWP 32354)):
#0  0x0000003f0160f00d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f66ab42fdeb in XrdSysTimer::Wait (mills=<optimized out>) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/XrdSys/XrdSysTimer.cc:239
#2  0x00007f66ab53acaa in XrdCl::TaskManager::RunTasks (this=0x7f66abdd3880) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/XrdCl/XrdClTaskManager.cc:244
#3  0x00007f66ab53ae29 in RunRunnerThread (arg=<optimized out>) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/XrdCl/XrdClTaskManager.cc:37
#4  0x0000003f01607aa1 in start_thread () from /lib64/libpthread.so.0
#5  0x0000003f012e8aad in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f66a9f65700 (LWP 32355)):
#0  0x0000003f012e5309 in syscall () from /lib64/libc.so.6
#1  0x00007f66ab5964e1 in XrdSys::LinuxSemaphore::Wait (this=0x7f66ac343600) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/./XrdSys/XrdSysLinuxSemaphore.hh:161
#2  XrdCl::SyncQueue<XrdCl::JobManager::JobHelper>::Get (this=0x7f66ac028998) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/./XrdCl/XrdClSyncQueue.hh:67
#3  XrdCl::JobManager::RunJobs (this=0x7f66ac028980) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/XrdCl/XrdClJobManager.cc:146
#4  0x00007f66ab596679 in RunRunnerThread (arg=<optimized out>) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/XrdCl/XrdClJobManager.cc:33
#5  0x0000003f01607aa1 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003f012e8aad in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f66a9564700 (LWP 32356)):
#0  0x0000003f012e5309 in syscall () from /lib64/libc.so.6
#1  0x00007f66ab5964e1 in XrdSys::LinuxSemaphore::Wait (this=0x7f66ac343600) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/./XrdSys/XrdSysLinuxSemaphore.hh:161
#2  XrdCl::SyncQueue<XrdCl::JobManager::JobHelper>::Get (this=0x7f66ac028998) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/./XrdCl/XrdClSyncQueue.hh:67
#3  XrdCl::JobManager::RunJobs (this=0x7f66ac028980) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/XrdCl/XrdClJobManager.cc:146
#4  0x00007f66ab596679 in RunRunnerThread (arg=<optimized out>) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/XrdCl/XrdClJobManager.cc:33
#5  0x0000003f01607aa1 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003f012e8aad in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f66a8b63700 (LWP 32357)):
#0  0x0000003f012e5309 in syscall () from /lib64/libc.so.6
#1  0x00007f66ab5964e1 in XrdSys::LinuxSemaphore::Wait (this=0x7f66ac343600) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/./XrdSys/XrdSysLinuxSemaphore.hh:161
#2  XrdCl::SyncQueue<XrdCl::JobManager::JobHelper>::Get (this=0x7f66ac028998) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/./XrdCl/XrdClSyncQueue.hh:67
#3  XrdCl::JobManager::RunJobs (this=0x7f66ac028980) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/XrdCl/XrdClJobManager.cc:146
#4  0x00007f66ab596679 in RunRunnerThread (arg=<optimized out>) at /build/cmsbuild/auto-builds/CMSSW_8_1_0_pre6-slc6_amd64_gcc530/build/CMSSW_8_1_0_pre6-build/BUILD/slc6_amd64_gcc530/external/xrootd/4.3.0/xrootd-4.3.0/src/XrdCl/XrdClJobManager.cc:33
#5  0x0000003f01607aa1 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003f012e8aad in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f665193f700 (LWP 32406)):
#0  0x0000003f012e5309 in syscall () from /lib64/libc.so.6
#1  0x00007f66c31b9962 in tbb::internal::futex_wait (comparand=2, futex=0x7f66c1affa2c) at ../../include/tbb/machine/linux_common.h:60
#2  tbb::internal::binary_semaphore::P (this=0x7f66c1affa2c) at ../../src/tbb/semaphore.h:206
#3  rml::internal::thread_monitor::commit_wait (c=<synthetic pointer>, this=0x7f66c1affa20) at ../../src/rml/include/../server/thread_monitor.h:259
#4  tbb::internal::rml::private_worker::run (this=0x7f66c1affa00) at ../../src/tbb/private_server.cpp:282
#5  0x00007f66c31b9989 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:228
#6  0x0000003f01607aa1 in start_thread () from /lib64/libpthread.so.0
#7  0x0000003f012e8aad in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f6650f3e700 (LWP 32407)):
#0  0x0000003f012e5309 in syscall () from /lib64/libc.so.6
#1  0x00007f66c31b9962 in tbb::internal::futex_wait (comparand=2, futex=0x7f66c1affaac) at ../../include/tbb/machine/linux_common.h:60
#2  tbb::internal::binary_semaphore::P (this=0x7f66c1affaac) at ../../src/tbb/semaphore.h:206
#3  rml::internal::thread_monitor::commit_wait (c=<synthetic pointer>, this=0x7f66c1affaa0) at ../../src/rml/include/../server/thread_monitor.h:259
#4  tbb::internal::rml::private_worker::run (this=0x7f66c1affa80) at ../../src/tbb/private_server.cpp:282
#5  0x00007f66c31b9989 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:228
#6  0x0000003f01607aa1 in start_thread () from /lib64/libpthread.so.0
#7  0x0000003f012e8aad in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f665047f700 (LWP 32408)):
#0  0x00007f669d437d29 in SiStripDigitizerAlgorithm::accumulateSimHits(__gnu_cxx::__normal_iterator<PSimHit const*, std::vector<PSimHit, std::allocator<PSimHit> > >, __gnu_cxx::__normal_iterator<PSimHit const*, std::vector<PSimHit, std::allocator<PSimHit> > >, unsigned long, unsigned int, StripGeomDetUnit const*, Vector3DBase<float, GlobalTag> const&, TrackerTopology const*, CLHEP::HepRandomEngine*) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginSimTrackerSiStripDigitizerPlugins.so
#1  0x00007f669d442bf2 in SiStripDigitizer::accumulateStripHits(edm::Handle<std::vector<PSimHit, std::allocator<PSimHit> > >, TrackerTopology const*, unsigned long, unsigned int, CLHEP::HepRandomEngine*) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginSimTrackerSiStripDigitizerPlugins.so
#2  0x00007f669d4430a2 in SiStripDigitizer::accumulate(PileUpEventPrincipal const&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginSimTrackerSiStripDigitizerPlugins.so
#3  0x00007f669ee51f0e in edm::MixingModule::accumulateEvent(PileUpEventPrincipal const&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginSimGeneralMixingModulePlugins.so
#4  0x00007f669ee5206e in edm::MixingModule::pileAllWorkers(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginSimGeneralMixingModulePlugins.so
#5  0x00007f669ee5d7e3 in void edm::PileUp::readPileUp<std::_Bind<std::_Mem_fn<void (edm::MixingModule::*)(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&)> (std::reference_wrapper<edm::MixingModule>, std::_Placeholder<1>, edm::ModuleCallingContext const*, int, std::_Placeholder<2>, int, std::reference_wrapper<edm::EventSetup const>, edm::StreamID)> >(edm::EventID const&, std::vector<edm::SecondaryEventIDAndFileInfo, std::allocator<edm::SecondaryEventIDAndFileInfo> >&, std::_Bind<std::_Mem_fn<void (edm::MixingModule::*)(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&)> (std::reference_wrapper<edm::MixingModule>, std::_Placeholder<1>, edm::ModuleCallingContext const*, int, std::_Placeholder<2>, int, std::reference_wrapper<edm::EventSetup const>, edm::StreamID)>, int, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginSimGeneralMixingModulePlugins.so
#6  0x00007f669ee57fb1 in edm::MixingModule::doPileUp(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginSimGeneralMixingModulePlugins.so
#7  0x00007f669ed9098a in edm::BMixingModule::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libMixingBase.so
#8  0x00007f66c44d2062 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventPrincipal const&, edm::EventSetup const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#9  0x00007f66c441eedf in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#10 0x00007f66c43d9f0c in decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::doWork<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(bool edm::Worker::doWork<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#11 0x00007f66c43da0d2 in bool edm::Worker::doWork<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#12 0x00007f66c441184e in decltype ({parm#1}()) edm::convertException::wrap<void edm::Path::processOneOccurrence<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(void edm::Path::processOneOccurrence<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#13 0x00007f66c4411b02 in void edm::Path::processOneOccurrence<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#14 0x00007f66c4411f25 in void edm::StreamSchedule::processOneEvent<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal&, edm::EventSetup const&, bool)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#15 0x00007f66c441221b in decltype ({parm#1}()) edm::convertException::wrap<void edm::StreamSchedule::processOneEvent<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal&, edm::EventSetup const&, bool)::{lambda()#1}>(void edm::StreamSchedule::processOneEvent<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal&, edm::EventSetup const&, bool)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#16 0x00007f66c441248f in void edm::StreamSchedule::processOneEvent<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal&, edm::EventSetup const&, bool) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#17 0x00007f66c44785d2 in edm::EventProcessor::processEvent(unsigned int) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#18 0x00007f66c4478e4a in edm::EventProcessor::processEventsForStreamAsync(unsigned int, std::atomic<bool>*) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#19 0x00007f66c447f5b4 in edm::StreamProcessingTask::execute() () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#20 0x00007f66c31c49b5 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f66c1a27e00, parent=..., child=<optimized out>) at ../../src/tbb/custom_scheduler.h:463
#21 0x00007f66c31be1a4 in tbb::internal::arena::process (this=this@entry=0x7f66c1adf780, s=...) at ../../src/tbb/arena.cpp:156
#22 0x00007f66c31bd264 in tbb::internal::market::process (this=0x7f66c1b07e00, j=...) at ../../src/tbb/market.cpp:502
#23 0x00007f66c31b9736 in tbb::internal::rml::private_worker::run (this=0x7f66c1aff980) at ../../src/tbb/private_server.cpp:275
#24 0x00007f66c31b9989 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:228
#25 0x0000003f01607aa1 in start_thread () from /lib64/libpthread.so.0
#26 0x0000003f012e8aad in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f66c2b21bc0 (LWP 32335)):
#0  0x0000003f012df283 in poll () from /lib64/libc.so.6
#1  0x00007f66bffd9dc4 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginFWCoreServicesPlugins.so
#2  0x00007f66bffda02a in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginFWCoreServicesPlugins.so
#3  0x00007f66bffda114 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000003f012cf437 in sched_yield () from /lib64/libc.so.6
#6  0x00007f66c31c2c97 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x7f66c1b06400, completion_ref_count=@0x7f66c1afb628: 2) at ../../src/tbb/custom_scheduler.h:273
#7  0x00007f66c31c480d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f66c1b06400, parent=..., child=<optimized out>) at ../../src/tbb/custom_scheduler.h:581
#8  0x00007f66c4478b5e in edm::EventProcessor::readAndProcessEvent() () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#9  0x00007f66c43e726a in statemachine::HandleEvent::readAndProcessEvent() () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#10 0x00007f66c43e87f8 in statemachine::HandleEvent::HandleEvent(boost::statechart::state<statemachine::HandleEvent, statemachine::HandleLumis, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::n

Waiting for stacktrace completion failed: timed out waiting for GDB to complete.

A fatal system signal has occurred: external termination request
a, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#11 0x00007f66c43f057e in boost::statechart::state<statemachine::HandleEvent, statemachine::HandleLumis, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<statemachine::HandleLumis> const&, boost::statechart::state_machine<statemachine::Machine, statemachine::Starting, std::allocator<void>, boost::statechart::null_exception_translator>&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#12 0x00007f66c43f074a in boost::statechart::simple_state<statemachine::FirstLumi, statemachine::HandleLumis, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#13 0x00007f66c447f068 in boost::statechart::state_machine<statemachine::Machine, statemachine::Starting, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#14 0x00007f66c447228d in edm::EventProcessor::runToCompletion() () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#15 0x000000000040c814 in main::{lambda()#1}::operator()() const ()
#16 0x000000000040ac94 in main ()
@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 5, 2016

A new Issue was created by @dan131riley Dan Riley.

@davidlange6, @smuzaffar, @davidlt, @Dr15Jones can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Copy link
Contributor

I thought the problem was in the timeout from the xrootd request. I got this conclusion since the first timestamp in the log is

03-Aug-2016 21:45:25

then the last timestamp before the xrootd message is

%MSG-w MemoryCheck:  PoolOutputModule:FEVTDEBUGHLToutput  03-Aug-2016 21:47:08 CEST Run: 1 Event: 110

and finally the xrootd message timestamp is

[2016-08-03 23:29:36.767610 +0200][Error  ][XRootD            ]

which is approaching 2 hours after the begin of the job.

@Dr15Jones
Copy link
Contributor

@bbockelm what do you think?

@Dr15Jones
Copy link
Contributor

assign core

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 5, 2016

New categories assigned: core

@Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@dan131riley
Copy link
Author

dan131riley commented Aug 5, 2016

My first thought was also a timeout or deadlock in xrootd; I eventually rejected that theory because none of threads are waiting on xrootd--three of them are idle in the tbb scheduler, and thread 2 isn't waiting at all, which is what lead me to an infinite loop:

Thread 2 (Thread 0x7f665047f700 (LWP 32408)):
#0  0x00007f669d437d29 in SiStripDigitizerAlgorithm::accumulateSimHits(__gnu_cxx::__normal_iterator<PSimHit const*, std::vector<PSimHit, std::allocator<PSimHit> > >, __gnu_cxx::__normal_iterator<PSimHit const*, std::vector<PSimHit, std::allocator<PSimHit> > >, unsigned long, unsigned int, StripGeomDetUnit const*, Vector3DBase<float, GlobalTag> const&, TrackerTopology const*, CLHEP::HepRandomEngine*) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginSimTrackerSiStripDigitizerPlugins.so
#1  0x00007f669d442bf2 in SiStripDigitizer::accumulateStripHits(edm::Handle<std::vector<PSimHit, std::allocator<PSimHit> > >, TrackerTopology const*, unsigned long, unsigned int, CLHEP::HepRandomEngine*) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginSimTrackerSiStripDigitizerPlugins.so
#2  0x00007f669d4430a2 in SiStripDigitizer::accumulate(PileUpEventPrincipal const&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginSimTrackerSiStripDigitizerPlugins.so
#3  0x00007f669ee51f0e in edm::MixingModule::accumulateEvent(PileUpEventPrincipal const&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginSimGeneralMixingModulePlugins.so
#4  0x00007f669ee5206e in edm::MixingModule::pileAllWorkers(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week0/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-08-03-1100/lib/slc6_amd64_gcc530/pluginSimGeneralMixingModulePlugins.so

@bbockelm
Copy link
Contributor

bbockelm commented Aug 5, 2016

@dan131riley - could you reformat the original post to used a fixed-sized font (wrap it in three backticks)? My brain hurts trying to parse the stack traces.

@bbockelm
Copy link
Contributor

bbockelm commented Aug 5, 2016

From what I can parse, I don't see anything actively waiting on Xrootd. It's possible the responses from EOS were quite slow - instead of a deadlock, just hit a timeout.

How reliable is the issue?

@dan131riley
Copy link
Author

Not reliable at all--I don't see any previous issues that look like it.

@Dr15Jones
Copy link
Contributor

I've seen jobs which timed out before and a number of them I attributed (rightly or wrongly) to EOS being very slow in responding based on the timestamps in the log before and after xrootd messages.

@bbockelm
Copy link
Contributor

bbockelm commented Aug 5, 2016

Yeah - it's worth noting that the storage timeouts are pretty dumb: a timeout of, say, 30s gets reset for each individual xrootd operation. That means anytime the TTreeCache is ineffective, we may trigger lots of individual xrootd operations and really extend the timeout.

Unfortunately, the storage layer has no concept of the relationship between successive reads - that's probably what we'd need to have a more intelligent timeout infrastructure.

@dan131riley
Copy link
Author

This is the MixingModule, where every thread has its own input file, so there could have been a problem with one EOS connection. However, these are very fast events, so it is fairly unlikely that we'd just happen to get the stack trace while thread 2 wasn't waiting. Is it plausible that the xrootd thread got an EINTR when the timeout signal was delivered and somehow managed to recover in time to give the MixingModule thread something to do before the stack trace completed?

-dan

@bbockelm
Copy link
Contributor

bbockelm commented Aug 5, 2016

Well, certainly the hope is that EINTR doesn't affect any of the observable behavior of the system. I have no way to prove that though! I'm more confident that a EINTR resulting in a short read will throw a C++ exception as opposed to returning insufficient data.

@dan131riley - thinking out loud: Java's signal handlers do a lot more than just stack traces. We could take some inspiration from them:

  • What would it take to get the StorageFactory to cough up IO statistics to stdout for a limited set of signals (i.e., maybe this isn't a great idea for SIGSEGV but OK for SIGQUIT?)
  • And/or print out simple statistics such as memory & CPU usage.

This would allow us to differentiate between extremely slow IO and CPU usage in cases like this.

@smuzaffar
Copy link
Contributor

We do not see such issue any more. So closing it, please open a new open if need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants