Avoid unnecessary Framework delays closing luminosity blocks (simple version) #43263
Conversation
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43263/37653
please test
A new Pull Request was created by @wddgit (W. David Dagenhart) for master. It involves the following packages:
@cmsbuild, @makortel, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks. cms-bot commands are listed here
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2ccf76/35779/summary.html Comparison Summary:
Comparison failures are related to #39803
+core
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @rappoccio, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
PR description:
This pull request fixes a problem with luminosity blocks being closed later than necessary in some cases. It was noticed online in heavy ion runs and should only be a problem online, not offline. The initial report and a detailed description of the problem are in Issue #42931. Two items are listed in the opening comment of that issue; this PR addresses only the first, "Lumisections that no longer receive events are closed late", and does not help with the second.
The changes should not affect the behavior for the first stream that notices that a luminosity block has ended. Other streams, though, will notice the lumi has ended and immediately call streamEndLumiAsync instead of waiting for it to be called by a task in the source serial queue; the delay occurs when that queue is blocked.
This PR should not affect the output of modules or anything else; it only affects the order in which things run. There are already many unit tests that verify the correct things are running and producing proper output, and those did not need modification. One unit test (see FWCore/Integration/test/testLateLumiClosure_cfg.py) is designed to demonstrate the problem; one can see the behavior in its log output, and comparing the output with and without this PR shows that the PR fixes the problem. It is difficult to make this a pass-or-fail test because timing can vary from one execution to the next (for example, when the operating system pauses a thread due to contention), and we usually do not want a unit test that occasionally fails. I ran that test manually and verified that the problem was fixed (every time in my runs).
(Note that this is my second try at this. The first version was too complex. This version is simpler, but there is a small imperfection in it. The following two lines of code in
EventProcessor::handleNextEventForStreamAsync
do not occur atomically.
When the condition is false, and if in the very short window between them another stream ends the lumi, reads the next lumi, and progresses all the way to the point where it is asking the input source what is next and is waiting there, the delayed-lumi problem can still occur. After a few hours of discussion we decided the probability of this is small enough to ignore, and that the consequences of the delay occurring very rarely are not dire, so the simpler version is good enough. In the original version of the PR (see #43240), the code avoids this problem entirely, but at the cost of about 10 extra lines with complicated synchronization. We decided the improvement that code provides was not worth the complexity. That code is still in GitHub, though; if in the future the delay occurs often enough to cause problems, we could still use it. This seems highly unlikely: it probably never occurs, and if it does on rare occasions, it is likely no one will notice.)
PR validation:
Existing Core unit tests pass. testLateLumiClosure_cfg.py exhibits the expected improvement in closing lumis.