DQM: Disable assertLegacySafe when concurrent lumis are enabled. #28920
Conversation
Note that this does not actually mean it is safe to use concurrent lumis; we just disable the check if concurrent lumis are actually requested. In fact, we know that assertLegacySafe fails even when there are legacy modules that should block concurrent lumis.
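For context, a hedged sketch of what the generated configuration fragment ends up looking like (reconstructed from the ConfigBuilder diff further down in this thread; the value 2 is illustrative, and the hasattr guard is the corrected form from the follow-up fix discussed below):

    # Hedged sketch of the fragment cmsDriver emits when >1 concurrent lumi is
    # requested (reconstructed from the ConfigBuilder diff below; the value 2 is
    # illustrative, and the hasattr guard is the corrected follow-up form).
    import FWCore.ParameterSet.Config as cms

    process = cms.Process("DEMO")  # hypothetical process

    process.options.numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(2)
    if hasattr(process, 'DQMStore'):
        process.DQMStore.assertLegacySafe = cms.untracked.bool(False)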
The code-checks are being triggered in jenkins.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28920/13715
A new Pull Request was created by @schneiml (Marcel Schneider) for master. It involves the following packages: Configuration/Applications. @silviodonato, @kpedro88, @cmsbuild, @franzoni, @fabiocos, @davidlange6 can you please review it and eventually sign? Thanks. cms-bot commands are listed here.
please test 1361.181
please test workflow 1361.181
The tests are being triggered in jenkins.
@schneiml I'm afraid you have the wrong mental model for how this works. Concurrent LumiBlocks are not prevented; instead, the framework still starts the new LuminosityBlock before the last ends, while still guaranteeing that legacy and edm::one modules will see the previous end LuminosityBlock call before the new begin LuminosityBlock call. So it isn't that the legacy modules prevent concurrent lumis; it is more that they prevent processing the events in the next lumi before all events in the previous lumi finish. Sorry we gave you the wrong impression of what is happening.
@Dr15Jones yes, I pretty much inferred that this is what's going on, and it is basically fine, except for the problem explained above: the new DQMStore still sees a new lumisection start before the previous one is saved. So as written above, we now have three options: make EDM not use concurrent lumisections at all when legacy modules are present, remove all legacy modules from jobs using concurrent lumis, or keep the unsafe behaviour.
I'd prefer the first or second, but the third is obviously much easier.
+1
Comparison job queued.
Comparison is ready
@slava77 comparisons for the following workflows were not done due to missing matrix map:
Comparison Summary:
urgent
A further test reveals that the …
With #29027 we removed …
Now that full matrix non-harvesting steps should be clean from legacy modules, in theory this configuration flag should no longer be needed (in full matrix). I'm left to wonder why the assert fires also for … (is it perhaps not detecting legacy modules properly?)
@Dr15Jones reminded me that my question was essentially answered in #28920 (comment). Can …
@makortel the key point is that this assert does not detect the presence of legacy modules, but simply watches the behavior of the DQMStore.
Ideally the assert would check if legacy modules are present in the job, but that is quite hard, since the very specific property of legacy modules is that they might spontaneously, at any point in time, get the DQMStore.
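To illustrate why a config-level check falls short, here is a hedged Python sketch (not part of this PR) that lists all EDAnalyzer-style modules in a process: the Python layer only knows each plugin's class name, not whether its C++ base is the legacy edm::EDAnalyzer, which is exactly why the assert watches runtime DQMStore behaviour instead.

    # Hedged sketch, not from this PR: enumerate EDAnalyzer-type modules in a
    # fully built cms.Process. This can list candidates, but it cannot tell
    # legacy edm::EDAnalyzers apart from DQMEDAnalyzers, since the Python
    # configuration only carries the plugin class name.
    import FWCore.ParameterSet.Config as cms

    def list_analyzers(process):
        """Return {module label: C++ plugin type} for all analyzers in the process."""
        return {label: module.type_() for label, module in process.analyzers_().items()}

    # Hypothetical usage on some configured process:
    #   for label, cpptype in sorted(list_analyzers(process).items()):
    #       print(label, cpptype)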
Ok, so the change in this PR should cover all jobs whose configuration is generated with cmsDriver. Does that propagate to T0 configurations automatically or not? (I'm a bit afraid of the latter)
The T0 uses ConfigBuilder, so it should be good. Can be checked via the unit tests in Configuration/DataProcessing.
@schneiml have you tried to use the tool … ?
@silviodonato I am at it. It gave three hot candidates: https://cmssdt.cern.ch/dxr/CMSSW/source/Validation/RecoTau/plugins/DQMHistNormalizer.cc , https://cmssdt.cern.ch/dxr/CMSSW/source/DQM/TrigXMonitorClient/src/L1ScalersClient.cc , and https://cmssdt.cern.ch/dxr/CMSSW/source/Validation/RecoTau/plugins/DQMFileLoader.cc . Though, I think these are false positives, since …
The new combined tool [1] indicates that there are no legacy DQM modules on any DQM or VALIDATION sequences defined in … . There might be legacy modules hiding in other corners of the configuration parameter space, but chances are not very high.
@schneiml, we can try to run your tests master...schneiml:dqm-dqmdumpsequence in a special IB and check if this is sustainable or not.
@silviodonato as you might have seen, I just updated the test with @smuzaffar's comments. Now it should not interfere with other tests, but it might take ~2h [1] to complete. Splitting it into more pieces would be better, but then we'd need to arbitrarily split the workflow numbers, and I'd prefer to avoid that complexity.
[1] Extrapolating from the fact that it took 15 min at 16 threads. Actually, I ran it single-threaded on …
+operations
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @silviodonato, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)
+1
@@ -2232,6 +2232,8 @@ def prepare(self, doChecking = False):
         self.pythonCfgCode +="process.options.numberOfThreads=cms.untracked.uint32("+self._options.nThreads+")\n"
         self.pythonCfgCode +="process.options.numberOfStreams=cms.untracked.uint32("+self._options.nStreams+")\n"
         self.pythonCfgCode +="process.options.numberOfConcurrentLuminosityBlocks=cms.untracked.uint32("+self._options.nConcurrentLumis+")\n"
+        if self._options.nConcurrentLumis > 1:
+            self.pythonCfgCode +="if process.DQMStore: process.DQMStore.assertLegacySafe=cms.untracked.bool(False)\n"
This should have probably been something like
self.pythonCfgCode +="if hasattr(process, 'DQMStore'): process.DQMStore.assertLegacySafe=cms.untracked.bool(False)\n"
The latest IB (CMSSW_11_1/2020-03-16-1100) shows failures like
LHE input from article 18334
Note: this tool is DEPRECATED, use xrdfs instead.
customising the process with customiseWithTimeMemorySummary from Validation/Performance/TimeMemorySummary
Starting /data/cmsbld/jenkins/workspace/ib-run-relvals/cms-bot/monitor_workflow.py timeout --signal SIGTERM 9000 cmsRun -j JobReport1.xml step1_NONE.py
----- Begin Fatal Exception 16-Mar-2020 12:51:11 CET-----------------------
An exception of category 'ConfigFileReadError' occurred while
[0] Processing the python configuration file named step1_NONE.py
Exception Message:
unknown python problem occurred.
AttributeError: 'Process' object has no attribute 'DQMStore'
At:
step1_NONE.py(92): <module>
----- End Fatal Exception -------------------------------------------------
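A minimal sketch of the failure and the guarded fix, assuming a bare process that simply never loads a DQMStore service (as in the step1_NONE.py GEN step above):

    # Minimal sketch of the failure mode and the fix, assuming a process
    # that loads no DQMStore service (like the step1_NONE.py step above).
    import FWCore.ParameterSet.Config as cms

    process = cms.Process("NONE")  # hypothetical bare process, no DQMStore

    # The line this PR generates dereferences the attribute unconditionally:
    #   if process.DQMStore: ...
    # which raises: AttributeError: 'Process' object has no attribute 'DQMStore'

    # The guarded form from the review comment above is safe either way:
    if hasattr(process, 'DQMStore'):
        process.DQMStore.assertLegacySafe = cms.untracked.bool(False)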
Use hasattr to check process.DQMStore in ConfigBuilder.py (Fix of #28920)
Fix Python3 problems with #28920
PR description:
This prevents the crashes in 1361.181 reported in #28622. Prevents, not fixes.
I'd like to summon @Dr15Jones and @makortel to this discussion: what happens here is that a job with VALIDATION enabled (which I am pretty sure does contain some edm::EDAnalyzers -- I have not checked, though) is requested to run with concurrent lumisections. EDM should prevent this, since it is not safe to have concurrent lumisections with legacy modules [1]. However, the new DQMStore still detects that a new lumisection starts before the previous one is saved, and that consequently it needs to copy MEs (this triggers the assertLegacySafe assertion by default, unless it is explicitly turned off). The problem with that is that edm::EDAnalyzer-based DQM code could hold pointers that get free'd by the DQMStore later (as the first lumisection ends). For this reason, it is only safe to disable assertLegacySafe when there are no edm::EDAnalyzer-based DQMStore users in the process.

So, with this PR, 1361.181 runs but is unsafe. What we should do instead is either make sure that EDM actually does not use concurrent lumisections at all [2] when there are legacy modules (then we can keep the assertion enabled and it will not fire), or remove all legacy modules from the jobs using concurrent lumis (that is really what we need to do, but it is much harder than it sounds).

We can of course also just keep the unsafe behaviour. It seems to work for now (maybe because the legacy modules involved don't do anything dangerous [3]), but I can't make any guarantees about that.

[1] Actually, the majority of DQM currently runs in edm::one::EDProducers watching lumis, which should cause the same effect.

[2] I think EDM might overlap writing the current lumi and processing the next even when there are modules blocking concurrent lumis -- that would explain the behaviour.

[3] (Edit:) Since this entire story is about lumis, but legacy modules by default deal with per-job MEs, we should actually be safe as long as the legacy modules don't explicitly use Scope::LUMI. But then the (meaningful) legacy plugins do actually set a scope different from JOB manually (else they would not produce any output in the reco step), and technically the same problem exists with RUN MEs. Except, we don't do anything close to concurrent runs currently. But we have reasons to believe that there is no risk of use-after-free in this workflow today.

PR validation:
Passes. Note the -t 4; bare 1361.181 did not trigger the problem. See concerns above.
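As a side note on that -t 4 observation, here is a hedged sketch of the threading options a multi-threaded job ends up with (option names taken from the ConfigBuilder diff above; values illustrative):

    # Hedged sketch of the options behind a multi-threaded run like "-t 4"
    # (option names from the ConfigBuilder diff above; values illustrative).
    import FWCore.ParameterSet.Config as cms

    process = cms.Process("DEMO")  # hypothetical process

    process.options.numberOfThreads = cms.untracked.uint32(4)
    process.options.numberOfStreams = cms.untracked.uint32(0)  # 0: one stream per thread
    # Concurrent lumis stay at 1 unless explicitly requested (the nConcurrentLumis
    # option in ConfigBuilder), which is what gates the assertLegacySafe change here.
    process.options.numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(1)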