-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Assertion failures in edm::service::Timing #33107
Comments
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
A new Issue was created by @makortel Matti Kortelainen. @Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Presumably caused by (or related to) the migration to |
The errors seem random. In |
Here's the assertion failure:
which is here: cmssw/FWCore/Services/plugins/Timing.cc Lines 182 to 197 in 9be9e4c
which would seem to imply that a module finished on a different thread than it started, or there's something funny with thread locals. Since this is a threading-related bug it's non-deterministic. |
I’ll put my effort into this today. |
So there are 'begin' and 'end' callbacks in the ServiceRegistry system. We normally surround the calls they are announcing with an object whose constructor calls the 'begin' to the Services and whose destructor calls 'end'. That isn't always the case. So I'm guessing one of my changes is causing the 'begin' to either not be called or to be called on possibly a different thread than the 'end'. |
I'm still getting setup to look at the segfaults on exit. wrt the assertion failures, running 250206.181 so far I've got an exception:
and a different exception with a stack trace:
There's only one active thread, with this backtrace:
|
oooh, that exception is really helpful! |
I checked and all calls to modules are safely wrapped to get the 'begin' and 'end' calls made. However, the Timing service uses the stack for non-module related activities as well and maybe one of those is now the problem. |
Are all the jobs failing with the assert running PreMix? |
By quick look over several IB (flavors) it indeed looks like the assertion fires only in PreMix workflows. |
Looks like it, all the ones I checked are. The one I'm looking at certainly is:
|
So I see a problem here: cmssw/Mixing/Base/src/SecondaryEventProvider.cc Lines 20 to 23 in 6029d3d
the call to |
@dan131riley do you have the traceback for the other threads in the problem you are tracing? |
|
So I fixed the SecondaryEventProvider call but I am still getting the crash in |
From the crash I see, the problem seem to be that ModuleCallingContext::previousModuleOnThread is returning a value other than |
So for debugging purposes I added an |
Here is something really weird I've noticed about some of the jobs (including 1 I ran in the debugger). Thread 1 does not have
In fact, thread 8 which is where tha abort happens looks more like what I would expect to be the main thread
|
The segfaults at exit are
|
So the thread which says it has a previous module running on it, the context information says the previous module is the 'PreMixingModule'. |
Apparently my workaround for the |
So the previous module on thread is supposed to be non-nullptr inside the PreMixingModule because of this
However, when I was seeing the problem the previous module on thread was going off in BEGIN RUN which is not a place that PreMixingModule will be called from. |
So I did spot one behavioral difference between the old implementation and the new one. In cmssw/Mixing/Base/src/SecondaryEventProvider.cc Lines 93 to 96 in 6029d3d
previously the |
So if I short circuit SecondaryEventProvider calls so that no call to I'll be making a pull request shortly. |
See #33125 |
@dan131riley if you get a chance, could you see if #33125 also fixes the case you were looking at? |
I'll give it a try but I doubt it will make a difference. I noticed earlier that my initial attempts to replicate the problem didn't succeed until I added the
|
FastTimerService needed the same treatment as the |
Although I found a fix to stop the symptom we saw (i.e. the crashing) I do not fully understand the underlying cause. So what is being seen is the When I run the Tracer service in the job, I see that all stream instances of the Only ~ModuleContextSentry() {
assert(moduleCallingContext_->previousModuleOnThread() == nullptr);
CurrentModuleOnThread::setCurrentModuleOnThread(moduleCallingContext_->previousModuleOnThread());
moduleCallingContext_->setContext(ModuleCallingContext::State::kInvalid, ParentContext(), nullptr);
} That never went off even though the assert I added to the constructor ModuleContextSentry(ModuleCallingContext* moduleCallingContext, ParentContext const& parentContext)
: moduleCallingContext_(moduleCallingContext) {
assert(nullptr == CurrentModuleOnThread::getCurrentModuleOnThread());
moduleCallingContext_->setContext(
ModuleCallingContext::State::kRunning, parentContext, CurrentModuleOnThread::getCurrentModuleOnThread());
CurrentModuleOnThread::setCurrentModuleOnThread(moduleCallingContext_);
} Does go off, saying that So it seems to behave as if |
Alice, we have passed beyond the looking glass. Given that I'm seeing crazy behavior, I decided to pursue crazy. In the constructor of My hypothesis is calling |
"There are no guarantees that the same thread that called tbb::task::suspend continues execution after the suspended point. However, these guarantees are provided for the outermost blocking oneTBB calls (such as tbb::parallel_for and tbb::flow::graph::wait_for_all) and tbb::task_arena::execute calls." |
So our other uses work because I have a task_arena. Good to know. |
+1 No further signs of these failures |
This issue is fully signed and ready to be closed. |
Many workflows (but not all) in CMSSW_11_3_X_2021-03-08-1800 are failing along
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_11_3_X_2021-03-08-1800/pyRelValMatrixLogs/run/250202.0_TTbar_13+TTbar_13INPUT+DIGIPRMXUP15_PU25+RECOPRMXUP15_PU25+HARVESTUP15_PU25/step2_TTbar_13+TTbar_13INPUT+DIGIPRMXUP15_PU25+RECOPRMXUP15_PU25+HARVESTUP15_PU25.log#/80-80
The text was updated successfully, but these errors were encountered: