Infinite loop or deadlock, possibly in SiStripDigitizerAlgorithm::accumulateSimHits #15376
Comments
A new Issue was created by @dan131riley Dan Riley. @davidlange6, @smuzaffar, @davidlt, @Dr15Jones can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
I thought the problem was a timeout in the xrootd request. I reached this conclusion because the first timestamp in the log is
then the last timestamp before the xrootd message is
and finally the xrootd message timestamp is
which is approaching 2 hours after the start of the job. |
@bbockelm what do you think? |
assign core |
New categories assigned: core. @Dr15Jones, @smuzaffar: you have been requested to review this Pull request/Issue and eventually sign. Thanks |
My first thought was also a timeout or deadlock in xrootd; I eventually rejected that theory because none of the threads are waiting on xrootd--three of them are idle in the tbb scheduler, and thread 2 isn't waiting at all, which is what led me to suspect an infinite loop:
|
@dan131riley - could you reformat the original post to use a fixed-width font (wrap it in three backticks)? My brain hurts trying to parse the stack traces. |
From what I can parse, I don't see anything actively waiting on Xrootd. It's possible the responses from EOS were quite slow--in that case we may have just hit a timeout rather than a deadlock. How reproducible is the issue? |
Not reproducible at all--I don't see any previous failures that look like it. |
I've seen jobs time out before, and a number of them I attributed (rightly or wrongly) to EOS being very slow to respond, based on the timestamps in the log before and after the xrootd messages. |
Yeah - it's worth noting that the storage timeouts are pretty dumb: a timeout of, say, 30s gets reset for each individual xrootd operation. That means that whenever the TTreeCache is ineffective, we may trigger lots of individual xrootd operations and effectively extend the overall timeout far beyond the nominal limit. Unfortunately, the storage layer has no concept of the relationship between successive reads - that's probably what we'd need to have a more intelligent timeout infrastructure. |
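The per-operation reset described above can be sketched as follows. This is an illustration with hypothetical class names, not the actual CMSSW/xrootd storage layer code:

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical per-operation timeout: the deadline is reset before every
// individual read, so many small reads (e.g. when TTreeCache is
// ineffective) can each stay under the limit while total wall time grows
// without bound.
class PerOpTimeout {
    std::chrono::seconds limit_;
    Clock::time_point deadline_;
public:
    explicit PerOpTimeout(std::chrono::seconds limit) : limit_(limit) {}
    void beginOperation() { deadline_ = Clock::now() + limit_; }  // reset each read
    bool expired() const { return Clock::now() > deadline_; }
};

// A cumulative deadline, by contrast, bounds the whole sequence of reads,
// which is closer to the "relationship between successive reads" idea.
class CumulativeTimeout {
    Clock::time_point deadline_;
public:
    explicit CumulativeTimeout(std::chrono::seconds limit)
        : deadline_(Clock::now() + limit) {}
    bool expired() const { return Clock::now() > deadline_; }
};
```

With `PerOpTimeout`, a thousand 29-second reads never trip a 30-second limit; with `CumulativeTimeout`, the second one would.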
This is the MixingModule, where every thread has its own input file, so there could have been a problem with one EOS connection. However, these are very fast events, so it is fairly unlikely that we'd just happen to get the stack trace while thread 2 wasn't waiting. Is it plausible that the xrootd thread got an EINTR when the timeout signal was delivered and somehow managed to recover in time to give the MixingModule thread something to do before the stack trace completed? -dan |
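For context on the EINTR question: the usual POSIX idiom is to retry a read that was interrupted by a signal (such as the one used to trigger the stack trace) and to continue looping on short reads. This is a generic sketch of that idiom, not the actual xrootd client code:

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Read exactly `count` bytes unless EOF or a real error occurs.
// EINTR (interrupted by a signal) is retried; a short read is
// completed by continuing the loop.
ssize_t readFull(int fd, char* buf, size_t count) {
    size_t done = 0;
    while (done < count) {
        ssize_t n = read(fd, buf + done, count - done);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted: retry the read
            return -1;                     // genuine I/O error
        }
        if (n == 0) break;                 // EOF: return what we have
        done += static_cast<size_t>(n);
    }
    return static_cast<ssize_t>(done);
}
```

If the client follows this pattern, an EINTR from the stack-trace signal would be invisible to callers, which is consistent with the hope expressed in the next comment.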
Well, certainly the hope is that EINTR doesn't affect any of the observable behavior of the system. I have no way to prove that, though! I'm more confident that an EINTR resulting in a short read will throw a C++ exception rather than silently returning insufficient data. @dan131riley - thinking out loud: Java's signal handlers do a lot more than just stack traces. We could take some inspiration from them:
This would allow us to differentiate between extremely slow IO and CPU usage in cases like this. |
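One way to realize that idea: have the handler report each thread's consumed CPU time alongside its stack. A thread stuck in slow IO accumulates almost no CPU time between samples, while a thread in an infinite loop burns CPU continuously. A minimal sketch using the POSIX per-thread clock (an illustration, not what the CMSSW signal handler actually does):

```cpp
#include <ctime>

// Returns the CPU time consumed by the calling thread, in seconds.
// Sampling this twice, a short interval apart, distinguishes a thread
// blocked on IO (no CPU growth) from one spinning in a loop (steady growth).
double threadCpuSeconds() {
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return static_cast<double>(ts.tv_sec) + ts.tv_nsec / 1e9;
}
```

Note that `clock_gettime` is async-signal-safe on Linux, so this could plausibly be called from the stack-trace handler itself.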
We do not see this issue any more, so I am closing it; please open a new issue if needed. |
https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc530/CMSSW_8_1_THREADED_X_2016-08-03-1100/pyRelValMatrixLogs/run/202.0_TTbar+TTbarINPUT+DIGIPU1+RECOPU1+HARVEST/step2_TTbar+TTbarINPUT+DIGIPU1+RECOPU1+HARVEST.log
Killed by an external termination signal, with three threads idle. Possibly an infinite loop in SiStripDigitizerAlgorithm::accumulateSimHits?