Segmentation violation in all DQM online clients during HI run #40110
Comments
A new Issue was created by @rvenditti . @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
more info provided at #40111 |
assign core
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
You could try to run the job under valgrind (valgrind cmsRun yourConfig_cfg.py). Be forewarned, it will be very slow (roughly 60x slower than running cmsRun normally).
valgrind latest output:
The valgrind output is saying that memory is being shared between two Event data products, which is not something that we allow. The question now is how that happened. To figure that out, it would help to know the lookup info for the products (i.e. module label, product instance label and process name, as well as the C++ type). What are the commands needed to reproduce?
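As an aside, the lookup info mentioned above is usually visible in the product's branch name. A minimal sketch, assuming the common `Type_moduleLabel_instanceLabel_processName.` layout and that none of the four fields contains an underscore; the branch name used here is purely illustrative and is not taken from the failing job's logs:

```python
from typing import NamedTuple


class ProductLookup(NamedTuple):
    friendly_type: str   # "friendly" form of the C++ type
    module_label: str
    instance_label: str  # frequently empty
    process_name: str


def parse_branch_name(branch: str) -> ProductLookup:
    """Split 'Type_module_instance_process.' into its four fields."""
    parts = branch.rstrip(".").split("_")
    if len(parts) != 4:
        raise ValueError(f"unexpected branch name: {branch!r}")
    return ProductLookup(*parts)


# Hypothetical branch name, chosen only to show the layout.
example = parse_branch_name("HBHERecHitsSorted_hltHbhereco__HLT.")
print(example)
```

An empty instance label shows up as two consecutive underscores, which is why the split keeps empty fields instead of discarding them.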
Were there any other warnings issued by valgrind? |
Hi @Dr15Jones, valgrind produced a lot of output. Please check the following logs (limited by my console buffer size): To reproduce:
Put process.source.minEventsPerLumi = 99999999 to the csc_dqm_sourceclient-live_cfg.py and run:
Running the following on architecture el8_amd64_gcc10:
but with MessageLogger cout threshold = 'DEBUG', reproduces the crash and produces the following log file: There are a few warnings emitted early in the process:
I've been told these warnings are unusual. The following info messages are also apparently unusual:
The data product having problems is
Given that this comes from the DQMStreamerReader, which is not code maintained by Core, it would be best to inform DQM.
assign dqm |
New categories assigned: dqm @jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Given this is a non-ROOT file, is it possible that there was a schema change in the data stored in the file relative to the code that is checked out? If so, it is highly probable that this is the cause, as only the ROOT I/O based systems can handle any sort of schema change.
We are asking the HLT expert to remove this product hltHbherecoFromGPU from the standard DQM stream. |
@Dr15Jones Now with the new run, the DQM clients are running without crashes. Thank you for your help!! |
For the record, these are not unusual. It's just the Strip unpacker reporting a FED is out of DAQ, and that happens all the time. |
So I put some print statements in the source. The event with the crash is the first event which contains a |
In the HLT process, is either hltHbhereco or hltHbherecoFromGPU an EDAlias of the other? If so, this could explain the problem. In PoolOutputModule one can't write a data product and its alias into the same output, as the pointers are the same. It is possible the Streamer output module doesn't prohibit this, which would cause ROOT to share the pointed-to objects between the two data products.
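To make the suspected failure mode concrete, here is a toy model, not CMSSW code: two products are handed the same underlying object, each believes it is the sole owner, and clearing the event releases that object twice. The class names and the `rechits` payload are invented for illustration. In C++ the second release would be undefined behavior (for instance a segmentation violation) rather than a clean exception.

```python
from typing import List


class Payload:
    """Stand-in for the heap object a data product points to."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.released = False

    def release(self) -> None:
        if self.released:
            raise RuntimeError(f"{self.name}: released twice")
        self.released = True


class Product:
    """Toy data product that assumes it exclusively owns its payload."""

    def __init__(self, label: str, payload: Payload) -> None:
        self.label = label
        self.payload = payload

    def reset(self) -> None:
        self.payload.release()


def clear_event(products: List[Product]) -> None:
    """Reset every product, as happens when an event is cleared."""
    for product in products:
        product.reset()


# Both labels end up pointing at the same object (the assumed aliasing).
shared = Payload("rechits")
event = [Product("hltHbhereco", shared), Product("hltHbherecoFromGPU", shared)]

try:
    clear_event(event)
    outcome = "ok"
except RuntimeError as err:
    outcome = str(err)
print(outcome)
```

The failure only appears at cleanup time, which is consistent with a crash reported with no module running.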
..and in fact, the collection Back to the issue, I don't know what's causing it. It doesn't look to me like DataFormats changed. We do write the product The only thing that caught my eye is that, in the failed configuration, the event content (of the cmsRun /afs/cern.ch/work/m/missirol/public/fog/cmssw40110/repack_cfg.py /afs/cern.ch/work/m/missirol/public/fog/cmssw40110/data/run362294/run362294_ls0050_streamDQM_sm-c2a11-43-01.dat tmp.root |
Just read #40110 (comment), and would say that the answer is yes. This config is representative of the menu that was running online (modulo the fact that online we also had the extra |
@Dr15Jones @makortel how does the @smorovic it's possible we may need to improve the DAQ output modules to behave in the same way. |
assign daq |
It throws an exception at the beginning of the job. |
Hi @Dr15Jones , where is the exception condition, is it related to
The exception seems to be generated here: cmssw/FWCore/Framework/src/ProductSelector.cc Lines 77 to 84 in e0e820a
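For readers without the source at hand, the kind of check being discussed can be sketched as follows. This is a simplified model, not the actual ProductSelector logic: the function name, the exception type, the message text and the direction of the alias are all assumptions made for illustration.

```python
from typing import Dict, Iterable


class DuplicateSelection(Exception):
    """Raised when two selected labels resolve to the same product."""


def check_selection(selected: Iterable[str], alias_of: Dict[str, str]) -> None:
    """Reject an output selection that keeps a product and its alias."""
    seen: Dict[str, str] = {}
    for label in selected:
        # Resolve an alias to the label of the product it points at.
        original = alias_of.get(label, label)
        if original in seen:
            raise DuplicateSelection(
                f"{seen[original]!r} and {label!r} refer to the same product; drop one"
            )
        seen[original] = label


# Assumed alias direction, for illustration only.
alias_of = {"hltHbhereco": "hltHbherecoFromGPU"}

check_selection(["hltHbhereco"], alias_of)  # a single label is accepted

try:
    check_selection(["hltHbhereco", "hltHbherecoFromGPU"], alias_of)
    message = ""
except DuplicateSelection as err:
    message = str(err)
print(message)
```

Running such a check when the output module is configured is what makes the job fail at startup instead of crashing later while events are being processed.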
It is possible to ask for the |
As far as I can tell, the |
I copied in [1] the output of [2]. This should be very close to what was used online [*]. [1] |
I also tried to create a version that doesn't throw an exception very early due to missing parameters:
Thanks @missirol, I was able to reproduce the issue with your configuration (even without input it ran enough). I see now the check in I should be able to come up with a fix shortly. Given the situation with the data taking, is it useful to be backported to 12_4_X? |
IMHO we can live with 12.4.x as it is, and have the fix only in master. |
Thanks for checking, Matti. Fwiw, I agree with Andrea. |
The fix is in #40136. |
+core |
We are observing DQM Online clients crashing with the following segmentation violation (this log is from the csc client, https://cmsweb.cern.ch/dqm/dqm-square/tmp/tmp/content_parser_productionPARSER_run362293PARSER_job28.log, but all other clients show the same behavior):
Fri Nov 18 16:45:38 CET 2022
Thread 3 (Thread 0x7fbb53f1f700 (LWP 43397) "cmsRun"):
#0 0x00007fbb76b96b43 in select () from /lib64/libc.so.6
#1 0x00007fbb6f4f9a92 in lat::IOSelectSelect::wait(long) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_2_patch1/external/slc7_amd64_gcc10/lib/libclasslib.so
#2 0x00007fbb6f4fa55e in lat::IOSelector::wait(long) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_2_patch1/external/slc7_amd64_gcc10/lib/libclasslib.so
#3 0x00007fbb6f4fa318 in lat::IOSelector::dispatch(long) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_2_patch1/external/slc7_amd64_gcc10/lib/libclasslib.so
#4 0x00007fbb71379626 in DQMNet::run() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libDQMServicesCore.so
#5 0x00007fbb7137a0bc in communicate(void*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libDQMServicesCore.so
#6 0x00007fbb76e76ea5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007fbb76b9fb0d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fbb549f5700 (LWP 43390) "cmsRun"):
#0 0x00007fbb76e7e1d9 in waitpid () from /lib64/libpthread.so.0
#1 0x00007fbb6fcd3567 in edm::service::cmssw_stacktrace_fork() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007fbb6fcd40da in edm::service::InitRootHandlers::stacktraceHelperThread() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007fbb77476f90 in std::execute_native_thread_routine (__p=0x7fbb704d0030) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4 0x00007fbb76e76ea5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fbb76b9fb0d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fbb74cf1740 (LWP 43345) "cmsRun"):
#0 0x00007fbb76b94ddd in poll () from /lib64/libc.so.6
#1 0x00007fbb6fcd381f in full_read.constprop () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007fbb6fcd41ac in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007fbb6fcd6afb in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4
#5 0x0000000000000000 in ?? ()
#6 0x00007fbb794a8347 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::M_release() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#7 0x00007fbb795863d6 in edm::DataManagingProductResolver::resetProductData(bool) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#8 0x00007fbb795720a8 in edm::Principal::clearPrincipal() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007fbb794bcdcd in edm::EventPrincipal::clearEventPrincipal() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00007fbb794c425e in edm::FunctorWaitingTask<edm::waiting_task::detail::WaitingTaskChain<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#4}>, edm::waiting_task::detail::Conditional<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#3}> >, edm::waiting_task::detail::Conditional<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#2}> >, edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#1}> >::runLast(edm::WaitingTaskHolder)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00007fbb794a293f in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00007fbb77c78bec in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7fbb74362d00, this=0x7fbb7438f700) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-slc7_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/task_dispatcher.h:322
#13 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7fbb7438f700) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-slc7_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/task_dispatcher.h:463
#14 tbb::detail::r1::task_dispatcher::execute_and_wait (t=, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-slc7_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/task_dispatcher.cpp:168
#15 0x00007fbb794e1ffd in edm::FinalWaitingTask::wait() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#16 0x00007fbb794cd98c in edm::EventProcessor::processLumis(std::shared_ptr const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#17 0x00007fbb794d8d53 in edm::EventProcessor::runToCompletion() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#19 0x00007fbb77c670eb in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-slc7_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/arena.cpp:698
#20 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#21 0x000000000040971c in main ()
Current Modules:
Module: none (crashed)
A fatal system signal has occurred: segmentation violation
-- process exit: -11 --
The crash seems to be due to DQMNet, but we have no clue how to debug it.
Could the CMSSW experts have a look and give us suggestions if possible?