-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DQM: Fix potential deadlock in setBinLabel. #28916
DQM: Fix potential deadlock in setBinLabel. #28916
Conversation
This one only triggers in the error path, because getFullname tries to take a lock that we already hold. However, there is another opportunity for a deadlock via edm::Log* (which probably takes a lock), and the per ME-locks which could be taken before or after edm::Log* is called. I reviewed all usages of edm::Log* in DQMServices/Core and did not see anything problematic. Note that this is the only usage of edm::Log* in MonitorElement, and it now avoids taking multiple locks at the same time, so there should not be any more problems of this type.
@cmsbuild ping ? |
The code-checks are being triggered in jenkins. |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28916/13709
|
A new Pull Request was created by @schneiml (Marcel Schneider) for master. It involves the following packages: DQMServices/Core @andrius-k, @kmaeshima, @schneiml, @cmsbuild, @jfernan2, @fioriNTU can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test 136.8861, 250202.183 No changes expected, except that these two workflows now don't time out. |
AFAIK the MessageLogger does not take any locks. |
To understand where the deadlock happens
This will give you the stack trace for all threads If you post the results here we can help. |
@Dr15Jones the actual deadlock here is trivial, it is The |
please test workflow 136.8861, 250202.183 |
The tests are being triggered in jenkins.
|
-1 Tested at: e2a9965 CMSSW: CMSSW_11_1_X_2020-02-10-2300 I found follow errors while testing this PR Failed tests: RelVals
When I ran the RelVals I found an error in the following workflows: runTheMatrix-results/250202.183_TTbar_13UP18_RD_test+TTbar_13UP18_RD_test+DIGIPRMXUP18_PU25_RD_test+RECOPRMXUP18_PU25_RD_test+HARVESTUP18_PU25/step3_TTbar_13UP18_RD_test+TTbar_13UP18_RD_test+DIGIPRMXUP18_PU25_RD_test+RECOPRMXUP18_PU25_RD_test+HARVESTUP18_PU25.log |
Comparison not run due to runTheMatrix errors (RelVals and Igprof tests were also skipped) |
Pull request #28916 was updated. @SiewYan, @andrius-k, @mkirsano, @kmaeshima, @schneiml, @efeyazgan, @cmsbuild, @jfernan2, @agrohsje, @fioriNTU, @alberto-sanchez, @qliphy can you please check and sign again. |
please test workflow 136.8861, 250202.183 Hope I got the command right now. And that 250202.183 actually passes. An output comparison for that wf would be interesting, it triggers a lot of rare code paths (related to processing multiple runs). |
The tests are being triggered in jenkins.
|
+1 |
Comparison job queued. |
Comparison is ready @slava77 comparisons for the following workflows were not done due to missing matrix map:
Comparison Summary:
|
Thanks @schneiml. This PR fixes 250202.183,136.8861. The log is partially changed, but was i expected? |
@silviodonato "Expected" is maybe a bit too confident, but it is within the range of changes that #28622 and the PRs that came before caused in general. Specifcally, that The new |
merge |
@schneiml 136.8861 and 250202.183 are ok in CMSSW_11_1_X_2020-02-12-1100, thanks. |
Run-dependent MC, see #28214. |
@makortel thanks for the link, I was guessing it is sth. in that direction. What matters is that this wf processes multiple runs in a single reco job (sth. we technically support, though I was not aware we actually do that), followed by multi-run harvesting with a rather broken configuration: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-99170b/4610/runTheMatrix-results/250202.183_TTbar_13UP18_RD_test+TTbar_13UP18_RD_test+DIGIPRMXUP18_PU25_RD_test+RECOPRMXUP18_PU25_RD_test+HARVESTUP18_PU25/DQM_V0001_R000316083__Global__CMSSW_X_Y_Z__RECO.root With the old I guess I am fine with all of that, though potential users of this WF might want to be aware. Also, the output files of such jobs must not be uploaded to production DQMGUIs -- the correct run number |
@schneiml we have to start running this Run-dependent MC Relvals. What would you recommend about harvest step? Who should I get in touch to fix this for long term usage? |
@chayanit I think the right thing to do would be to use the Regarding run number 1 vs. 999999, I think we (DQM/PPD) need to make a definition, but to me 999999 like for any other multi-run harvesting seems easier. |
PR description:
Fixes the IB test failure in 136.8861 (and maybe others). The deadlock was triggered by
LogMessageMonitor:ClusterizerLogMessageMonCommon
inbookHistograms
. Reported in #28622.This one only triggers in the error path, because getFullname tries to
take a lock that we already hold.
However, there is another opportunity for a deadlock via edm::Log*
(which probably takes a lock), and the per ME-locks which could be taken
before or after edm::Log* is called. I reviewed all usages of edm::Log*
in DQMServices/Core and did not see anything problematic.
Note that this is the only usage of edm::Log* in MonitorElement, and it
now avoids taking multiple locks at the same time, so there should not
be any more problems of this type.
This PR now also includes some commits from #28813 related to
Sumw2
, to fix a completely unrelated problem with 250202.183 which happens after the previous deadlock.PR validation:
Runs 136.8861 without deadlock.