DQM: Disable assertLegacySafe when concurrent lumis are enabled. #28920
Conversation
Note that this does not actually mean it is safe to use concurrent lumis; we just disable the check if concurrent lumis are actually requested. In fact, we know that assertLegacySafe fails even when there are legacy modules that should block concurrent lumis.
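For context, a hedged sketch of what the generated configuration fragment ends up looking like (reconstructed from the ConfigBuilder diff further down in this thread; the value 2 is illustrative, and the hasattr guard is the corrected form from the follow-up fix discussed below):

    # Hedged sketch of the fragment cmsDriver emits when >1 concurrent lumi is
    # requested (reconstructed from the ConfigBuilder diff below; the value 2 is
    # illustrative, and the hasattr guard is the corrected follow-up form).
    import FWCore.ParameterSet.Config as cms

    process = cms.Process("DEMO")  # hypothetical process

    process.options.numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(2)
    if hasattr(process, 'DQMStore'):
        process.DQMStore.assertLegacySafe = cms.untracked.bool(False)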
The code-checks are being triggered in jenkins.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28920/13715
A new Pull Request was created by @schneiml (Marcel Schneider) for master. It involves the following packages: Configuration/Applications. @silviodonato, @kpedro88, @cmsbuild, @franzoni, @fabiocos, @davidlange6 can you please review it and eventually sign? Thanks. cms-bot commands are listed here.
please test 1361.181
please test workflow 1361.181
The tests are being triggered in jenkins.
@schneiml I'm afraid you have the wrong mental model for how this works. Concurrent LumiBlocks are not prevented; instead, the framework still starts the new LuminosityBlock before the last ends, while still guaranteeing that legacy and edm::one modules will see the previous end LuminosityBlock call before the new begin LuminosityBlock call. So it isn't that the legacy modules prevent concurrent lumis; it is more that they prevent processing the events in the next lumi before all events in the previous lumi finish. Sorry we gave you the wrong impression of what is happening.
@Dr15Jones yes, I pretty much inferred that this is what's going on, and it is basically fine, except for the problem explained above: the new DQMStore still sees a new lumisection start before the previous one is saved. So as written above, we now have three options: make EDM not use concurrent lumisections at all when legacy modules are present, remove all legacy modules from jobs using concurrent lumis, or keep the unsafe behaviour.
I'd prefer the first or second, but the third is obviously much easier.
+1
Comparison job queued.
Comparison is ready
@slava77 comparisons for the following workflows were not done due to missing matrix map:
Comparison Summary:
urgent
A further test reveals that the …
With #29027 we removed …
Now that full matrix non-harvesting steps should be clean from legacy modules, in theory this configuration flag should no longer be needed (in full matrix). I'm left to wonder why the assert fires also for … (is it perhaps not detecting legacy modules properly?)
@Dr15Jones reminded me that my question was essentially answered in #28920 (comment). Can …
@makortel the key point is that this assert does not detect the presence of legacy modules, but simply watches the behavior of the DQMStore.
Ideally the assert would check if legacy modules are present in the job, but that is quite hard, since the very specific property of legacy modules is that they might spontaneously, at any point in time, get the DQMStore.
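To illustrate why a config-level check falls short, here is a hedged Python sketch (not part of this PR) that lists all EDAnalyzer-style modules in a process: the Python layer only knows each plugin's class name, not whether its C++ base is the legacy edm::EDAnalyzer, which is exactly why the assert watches runtime DQMStore behaviour instead.

    # Hedged sketch, not from this PR: enumerate EDAnalyzer-type modules in a
    # fully built cms.Process. This can list candidates, but it cannot tell
    # legacy edm::EDAnalyzers apart from DQMEDAnalyzers, since the Python
    # configuration only carries the plugin class name.
    import FWCore.ParameterSet.Config as cms

    def list_analyzers(process):
        """Return {module label: C++ plugin type} for all analyzers in the process."""
        return {label: module.type_() for label, module in process.analyzers_().items()}

    # Hypothetical usage on some configured process:
    #   for label, cpptype in sorted(list_analyzers(process).items()):
    #       print(label, cpptype)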
Ok, so the change in this PR should cover all jobs whose configuration is generated with cmsDriver. Does that propagate to T0 configurations automatically or not? (I'm a bit afraid of the latter)
The T0 uses ConfigBuilder, so it should be good. Can be checked via the unit tests in Configuration/DataProcessing.
@schneiml have you tried to use the tool … ?
@silviodonato I am at it. It gave three hot candidates: https://cmssdt.cern.ch/dxr/CMSSW/source/Validation/RecoTau/plugins/DQMHistNormalizer.cc , https://cmssdt.cern.ch/dxr/CMSSW/source/DQM/TrigXMonitorClient/src/L1ScalersClient.cc , and https://cmssdt.cern.ch/dxr/CMSSW/source/Validation/RecoTau/plugins/DQMFileLoader.cc . Though, I think these are false positives, since …
The new combined tool [1] indicates that there are no legacy DQM modules on any DQM or VALIDATION sequences defined in … . There might be legacy modules hiding in other corners of the configuration parameter space, but chances are not very high.
@schneiml, we can try to run your tests master...schneiml:dqm-dqmdumpsequence in a special IB and check if this is sustainable or not.
@silviodonato as you might have seen, I just updated the test with @smuzaffar's comments. Now it should not interfere with other tests, but it might take ~2h [1] to complete. Splitting it into more pieces would be better, but then we'd need to arbitrarily split the workflow numbers, and I'd prefer to avoid that complexity.
[1] Extrapolating from the fact that it took 15 min at 16 threads. Actually, I ran it single-threaded on …
+operations
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @silviodonato, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)
+1
@@ -2232,6 +2232,8 @@ def prepare(self, doChecking = False):
         self.pythonCfgCode +="process.options.numberOfThreads=cms.untracked.uint32("+self._options.nThreads+")\n"
         self.pythonCfgCode +="process.options.numberOfStreams=cms.untracked.uint32("+self._options.nStreams+")\n"
         self.pythonCfgCode +="process.options.numberOfConcurrentLuminosityBlocks=cms.untracked.uint32("+self._options.nConcurrentLumis+")\n"
+        if self._options.nConcurrentLumis > 1:
+            self.pythonCfgCode +="if process.DQMStore: process.DQMStore.assertLegacySafe=cms.untracked.bool(False)\n"
This should have probably been something like
self.pythonCfgCode +="if hasattr(process, 'DQMStore'): process.DQMStore.assertLegacySafe=cms.untracked.bool(False)\n"
The latest IB (CMSSW_11_1/2020-03-16-1100) shows failures like
LHE input from article 18334
Note: this tool is DEPRECATED, use xrdfs instead.
customising the process with customiseWithTimeMemorySummary from Validation/Performance/TimeMemorySummary
Starting /data/cmsbld/jenkins/workspace/ib-run-relvals/cms-bot/monitor_workflow.py timeout --signal SIGTERM 9000 cmsRun -j JobReport1.xml step1_NONE.py
----- Begin Fatal Exception 16-Mar-2020 12:51:11 CET-----------------------
An exception of category 'ConfigFileReadError' occurred while
[0] Processing the python configuration file named step1_NONE.py
Exception Message:
unknown python problem occurred.
AttributeError: 'Process' object has no attribute 'DQMStore'
At:
step1_NONE.py(92): <module>
----- End Fatal Exception -------------------------------------------------
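A minimal sketch of the failure and the guarded fix, assuming a bare process that simply never loads a DQMStore service (as in the step1_NONE.py GEN step above):

    # Minimal sketch of the failure mode and the fix, assuming a process
    # that loads no DQMStore service (like the step1_NONE.py step above).
    import FWCore.ParameterSet.Config as cms

    process = cms.Process("NONE")  # hypothetical bare process, no DQMStore

    # The line this PR generates dereferences the attribute unconditionally:
    #   if process.DQMStore: ...
    # which raises: AttributeError: 'Process' object has no attribute 'DQMStore'

    # The guarded form from the review comment above is safe either way:
    if hasattr(process, 'DQMStore'):
        process.DQMStore.assertLegacySafe = cms.untracked.bool(False)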
Use hasattr to check process.DQMStore in ConfigBuilder.py (Fix of #28920)
Fix Python3 problems with #28920
PR description:
This prevents the crashes in 1361.181 reported in #28622. Prevents, not fixes.
I'd like to summon @Dr15Jones and @makortel to this discussion: what happens here is that a job with VALIDATION enabled (which I am pretty sure does contain some edm::EDAnalyzers -- I have not checked, though) is requested to run with concurrent lumisections. EDM should prevent this, since it is not safe to have concurrent lumisections with legacy modules [1]. However, the new DQMStore still detects that a new lumisection starts before the previous one is saved, and that consequently it needs to copy MEs (this triggers the assertLegacySafe assertion by default, unless it is explicitly turned off). The problem with that is that edm::EDAnalyzer-based DQM code could hold pointers that get free'd by the DQMStore later (as the first lumisection ends). For this reason, it is only safe to disable assertLegacySafe when there are no edm::EDAnalyzer-based DQMStore users in the process.

So, with this PR, 1361.181 runs but is unsafe. What we should do instead is either make sure that EDM actually does not use concurrent lumisections at all [2] when there are legacy modules (then we can keep the assertion enabled and it will not fire), or remove all legacy modules from the jobs using concurrent lumis (that is really what we need to do, but it is much harder than it sounds).

We can of course also just keep the unsafe behaviour. It seems to work for now (maybe because the legacy modules involved don't do anything dangerous [3]), but I can't make any guarantees about that.

[1] Actually, the majority of DQM currently runs in edm::one::EDProducers watching lumis, which should cause the same effect.

[2] I think EDM might overlap writing the current lumi and processing the next even when there are modules blocking concurrent lumis -- that would explain the behaviour.

[3] (Edit:) Since this entire story is about lumis, but legacy modules by default deal with per-job MEs, we should actually be safe as long as the legacy modules don't explicitly use Scope::LUMI. But then the (meaningful) legacy plugins do actually set a scope different from JOB manually (else they would not produce any output in the reco step), and technically the same problem exists with RUN MEs. Except, we don't do anything close to concurrent runs currently. But we have reasons to believe that there is no risk of use-after-free in this workflow today.

PR validation:
Passes. Note the -t 4; bare 1361.181 did not trigger the problem. See concerns above.
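As a side note on that -t 4 observation, here is a hedged sketch of the threading options a multi-threaded job ends up with (option names taken from the ConfigBuilder diff above; values illustrative):

    # Hedged sketch of the options behind a multi-threaded run like "-t 4"
    # (option names from the ConfigBuilder diff above; values illustrative).
    import FWCore.ParameterSet.Config as cms

    process = cms.Process("DEMO")  # hypothetical process

    process.options.numberOfThreads = cms.untracked.uint32(4)
    process.options.numberOfStreams = cms.untracked.uint32(0)  # 0: one stream per thread
    # Concurrent lumis stay at 1 unless explicitly requested (the nConcurrentLumis
    # option in ConfigBuilder), which is what gates the assertLegacySafe change here.
    process.options.numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(1)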