CTPPS related issue in several IB workflows #35928
A new Issue was created by @qliphy Qiang Li. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign dqm, db, alca |
New categories assigned: dqm,db,alca @jfernan2, @ahmad3213, @yuanchao, @rvenditti, @emanueleusai, @ggovi, @francescobrivio, @pbo0, @malbouis, @tvami, @pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@jan-kaspar can you please have a look? |
In addition to those two above, also wf 136.8642 was tested in #35914 without crashes. |
If the problem is a threading issue (e.g. a race condition), it would not show up in PR tests (unless multithreaded tests were enabled explicitly), but it would show up in IB tests. |
Just to make it more evident there is another crash (spotted by @mmusich) in the same IB and still related to CTPPS:
I have reported it in #35936 |
Thanks @makortel for the suggestion, I believe @malbouis is testing this. |
As a comment (I'm not sure how useful it is), I have run workflows 136.8311 and 136.796 with runTheMatrix on the latest IB and observed no crashes. I will try one of them with --nThreads > 1 to see whether I can reproduce the error. |
Multi-threading in PR tests would effectively destroy bitwise reproducibility of simulated workflows, e.g. because different GEN events could be simulated in different EDM streams leading to different random number sequences between runs. So far race conditions have been rare enough to continue with the current setup. |
I have run |
I guess testing the multithreaded option is not as straightforward as I thought.
%MSG-i ThreadStreamSetup: (NoModuleName) 01-Nov-2021 13:07:29 CET pre-events
The situation can be fixed by either
|
Instead of just running step3 of wf 136.8311, I tried to run locally with 4 threads: |
From a very quick look at the PPS code I found cmssw/CondFormats/PPSObjects/interface/PPSAssociationCuts.h (lines 55 to 56 in 6681a5f),
where a mutable member is likely to be the cause of thread-safety problems. |
Indeed, it looks like those vectors are filled from cmssw/CondFormats/PPSObjects/src/PPSAssociationCuts.cc (lines 60 to 63 in 6681a5f).
This is not thread-safe. |
NOTE: the DB does have a mechanism to modify the object right after read from DB but before it is put out into the EventSetup. That allows one to avoid using mutables and provides a thread-safe way to update objects coming out of storage. |
Thanks @Dr15Jones ! Could you please give me a pointer to this mechanism? I can open a fix PR shortly then. |
@jan-kaspar I can try to find it but this is really the domain for @cms-sw/db-l2 |
Thanks @Dr15Jones ! Anyone's help appreciated. I will try googling in the meantime. |
So it looks like you must call
with cmssw/CondCore/SiPixelPlugins/plugins/plugin.cc (lines 51 to 54 in 6d2f660)
|
Thanks again, I will give it a try! |
In general, one should never use mutable data for data products for the Run, LuminosityBlock, Event or the EventSetup. |
Hopefully, here's a fix: #35941 |
After merging #35941 new IB tests look good. |
+1 |
Although the issues mentioned in #35927 should have been mostly fixed by #35766,
there appear to be several CTPPS-related issues in the IB:
https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/relVal/CMSSW_12_2/2021-10-31-2300?selectedArchs=slc7_amd64_gcc900&selectedFlavors=X&selectedStatus=failed
For example, workflow 136.8311
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_12_2_X_2021-10-31-2300/pyRelValMatrixLogs/run/136.8311_RunJetHT2017F_reminiaod+RunJetHT2017F_reminiaod+REMINIAOD_data2017+HARVEST2017_REMINIAOD_data2017/step2_RunJetHT2017F_reminiaod+RunJetHT2017F_reminiaod+REMINIAOD_data2017+HARVEST2017_REMINIAOD_data2017.log#/115-115
----- Begin Fatal Exception 01-Nov-2021 02:51:10 CET-----------------------
An exception of category 'FatalRootError' occurred while
[0] Processing Event run: 305064 lumi: 36 event: 55020723 stream: 1
[1] Running path 'MINIAODoutput_step'
[2] Prefetching for module PoolOutputModule/'MINIAODoutput'
[3] Calling method for module CTPPSProtonProducer/'ctppsProtons'
Additional Info:
[a] Fatal Root Error: @sub=TFormula::Eval
Formula is invalid and not ready to execute
and workflow 136.796
Module: CTPPSProtonProducer:ctppsProtons (crashed)
Module: StandAloneMuonProducer:displacedStandAloneMuons
Module: LXXXCorrectorProducer:ak4CaloResidualCorrector
Module: PreshowerPhiClusterProducer:multi5x5SuperClustersWithPreshower
A fatal system signal has occurred: segmentation violation