Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid unnecessary Framework delays closing luminosity blocks (simple version) #43263

Merged
merged 1 commit into from
Nov 20, 2023

Conversation

wddgit
Copy link
Contributor

@wddgit wddgit commented Nov 13, 2023

PR description:

This pull request fixes a problem related to luminosity blocks being closed later than necessary in some cases. It was noticed online in heavy ion runs. It should only be a problem online and not occur offline. The initial report and detailed description of the problem are in Issue #42931. There are two items listed in the opening comment of that issue. This PR only addresses the first of those two items. "Lumisections that no longer receive events are closed late". It does not help with the second item.

The changes should not affect the behavior for the first stream that notices that a luminosity block has ended. Other streams though will notice the stream has ended and immediately call streamEndLumiAsync instead of waiting for it to be called by a task in the source serial queue. The delay is caused when that queue is blocked.

This PR should not affect the output of modules or anything else. It only affects the order in which things run. There are already many unit tests that verify the correct things are running and producing proper output. Those did not need modification. One unit test (see FWCore/Integration/test/testLateLumiClosure_cfg.py) is designed to demonstrate the problem. One can see the behavior in the log output of that test. Comparing the output with and without this PR shows that this PR fixes the problem. It is difficult to make this into a pass or fail type of test because timing can vary from one execution to the next. But I manually ran that test and verified that most of the time the problem is fixed by this PR (every time when I manually ran it). We usually do not want a unit test that occasionally fails. For example, behavior can be affected when the operating system pauses a thread due to contention issues.

(Note that this is my second try at this. The first version was too complex. This version is simpler, but there is a small imperfection in it. The following two lines of code in EventProcessor::handleNextEventForStreamAsync do not occur atomically.

    if (streamLumiStatus_[iStreamIndex]->haveStartedNextLumiOrEndedRun()) {

and

    sourceResourcesAcquirer_.serialQueueChain().push(*group, [this, iTask = std::move(iTask), iStreamIndex]() mutable {

When the condition is false and if in the very short time between them another stream ends the lumi, then reads the next lumi and progresses all the way to the point it is asking the input source what is next and that is waiting, the delayed lumi problem can still occur. After a few hours discussing this we decided the probability of this is small enough to ignore and also that the consequences of the delay occurring very rarely are not dire and so the simpler version is good enough. In the original version of the PR (see #43240), the code 100% avoids this problem. But there are about 10 extra lines of code with complicated synchronization. We decided the improvement that code provides was not worth the complexity. That code is still in GitHub though. If in the future the delay occurs often enough to cause problems, then we could still use it. This seems highly unlikely. It probably never occurs or if it does on rare occasions it is likely no one will notice.)

PR validation:

Existing Core unit tests pass. testLateLumiClosure_cfg.py exhibits the expected improvement in closing lumis

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43263/37653

  • This PR adds an extra 40KB to repository

@wddgit
Copy link
Contributor Author

wddgit commented Nov 13, 2023

please test

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @wddgit (W. David Dagenhart) for master.

It involves the following packages:

  • FWCore/Framework (core)
  • FWCore/Integration (core)

@cmsbuild, @makortel, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
@makortel, @missirol this is something you requested to watch as well.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2ccf76/35779/summary.html
COMMIT: 1f75574
CMSSW: CMSSW_14_0_X_2023-11-13-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/43263/35779/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 277 lines to the logs
  • Reco comparison results: 138 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3363028
  • DQMHistoTests: Total failures: 1788
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3361218
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 214 log files, 167 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@makortel
Copy link
Contributor

Comparison failures are related to #39803

@makortel
Copy link
Contributor

+core

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @rappoccio, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@rappoccio
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 9c1afec into cms-sw:master Nov 20, 2023
11 checks passed
@wddgit wddgit deleted the fixDelayedLumiClosureSimple branch March 11, 2024 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants