Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve IndexIntoFile for concurrent lumis/runs #37532

Merged
merged 12 commits into from
May 5, 2022

Conversation

wddgit
Copy link
Contributor

@wddgit wddgit commented Apr 11, 2022

PR description:

Modifies the class IndexIntoFile, so that it can handle concurrent runs and better handle concurrent lumis. This PR does not include the implementation of concurrent runs. That will be submitted in a separate PR in the future. There should not be any further changes needed in IndexIntoFile.h or IndexIntoFile.cc in that future PR.

This PR includes heavy edits to the code that supports the "noRunLumiSort" ordering in IndexIntoFile (mostly in the nested class IndexIntoFile::IndexIntoFileItrEntryOrder). That ordering is currently used in some contexts to support file merging with files that have been created with concurrent lumis. That ordering allows fast cloning even when concurrent lumis have scrambled the event order in the files at lumi boundaries. Before concurrent lumis, events from a luminosity block would all be written contiguously into the output file. With concurrent lumis, events from a following lumi can be written before all the events from the preceding lumi are written. This is because the time to process an event can vary. The same thing will also happen when concurrent runs are implemented at run boundaries.

This also modifies the code in IndexIntoFile used by output modules to build the new IndexIntoFile written into a new output file.

There are some minimal changes outside IndexIntoFile needed to deal with the changes in the IndexIntoFile interface and behavior.

One might or might not consider this to be a bug fix. The existing version of the code should be working properly in the contexts where it is currently used. No one has reported any problems and the problem should be obvious if it occurs. On the other hand, it is currently possible to create files where the events in a run are not contiguous by file merging. This is similar to what will also happen with concurrent runs. If one is reading with "noRunLumiSort" mode, then these noncontiguous events from different runs can cause an assert failure in the Framework. We might consider backporting this change to 12_2_X and 12_3_X. Before those release series, "noRunLumiSort" mode did not exist and there was not a problem. One might hesitate to backport this PR because it touches a significant number of lines of critical code. The potential for a new bug is a risk. My inclination would be to not backport it unless problems actually occur, although I will backport it if asked. I am not aware of any conflicts between this PR and those earlier releases.

PR validation:

New unit tests are added in FWCore/Integration/test and DataFormats/Provenance/test to cover these changes.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37532/29244

  • This PR adds an extra 148KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @wddgit (W. David Dagenhart) for master.

It involves the following packages:

  • DataFormats/Provenance (core)
  • FWCore/Framework (core)
  • FWCore/Integration (core)
  • IOPool/Input (core)

@cmsbuild, @smuzaffar, @Dr15Jones, @makortel can you please review it and eventually sign? Thanks.
@makortel, @rovere this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@wddgit
Copy link
Contributor Author

wddgit commented Apr 11, 2022

please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-37b6b7/23827/summary.html
COMMIT: ec0a81d
CMSSW: CMSSW_12_4_X_2022-04-11-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37532/23827/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3593039
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3593009
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 200 log files, 45 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

Copy link
Contributor

@Dr15Jones Dr15Jones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made ti through the first two files (stopped before the first tests)

DataFormats/Provenance/interface/IndexIntoFile.h Outdated Show resolved Hide resolved
DataFormats/Provenance/interface/IndexIntoFile.h Outdated Show resolved Hide resolved
/// 3. All runs and lumis associated with a run should be processed
/// when the last contiguous sequence of events for that run is processed.
/// If a run has no events, then it is interspersed within that sequence
/// of runs according to its run TTree entry number.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is a source supposed to tell the framework that all parts of a Lumi or a Run have now been read from the file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Framework will know all parts were read if shouldProcessRun() returned true for any Run entry in a contiguous sequence of Run entries for the same Run and then a different Run is encountered (or the end of all input). For a single input file, the run entries that should be read will always be contiguous (by design) so they can all be read and merged in one contiguous pass. Earlier entries can occur, but only with shouldProcessRun() returning false.

Most of the time, there will be a single contiguous sequence of Run entries where shouldProcessRun() returns true for all of them and only for those Run entries. There is one special case when crossing a file boundary that allows shouldProcessRun to return false just after a file boundary. In that case, the RunPrincipal remembers it has products merged into it that still need to be written, but does not merge the products from that Run entry (or entries) after the file boundary.

I hope that is clear. I know this is complicated.

DataFormats/Provenance/interface/IndexIntoFile.h Outdated Show resolved Hide resolved
DataFormats/Provenance/interface/IndexIntoFile.h Outdated Show resolved Hide resolved
DataFormats/Provenance/src/IndexIntoFile.cc Outdated Show resolved Hide resolved
DataFormats/Provenance/src/IndexIntoFile.cc Outdated Show resolved Hide resolved
DataFormats/Provenance/src/IndexIntoFile.cc Outdated Show resolved Hide resolved
DataFormats/Provenance/src/IndexIntoFile.cc Show resolved Hide resolved
DataFormats/Provenance/src/IndexIntoFile.cc Outdated Show resolved Hide resolved
@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-37b6b7/24345/summary.html
COMMIT: 70bb9c4
CMSSW: CMSSW_12_4_X_2022-04-29-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37532/24345/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 7 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3695434
  • DQMHistoTests: Total failures: 13
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3695398
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 48 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 205 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@makortel
Copy link
Contributor

makortel commented May 3, 2022

+1

@cmsbuild
Copy link
Contributor

cmsbuild commented May 3, 2022

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Copy link
Contributor

perrotta commented May 4, 2022

@wddgit @makortel @Dr15Jones is any effect on performance expected?
If so, should we run the tests with the profiler in order to quantify it (even just for reference)?

@wddgit
Copy link
Contributor Author

wddgit commented May 4, 2022

I am not expecting any significant performance change. My first inclination was that profiling was not necessary.

I have not run a profiler with CMS code in a while. Does igprof still work or are we using something else these days? For me it would take some time to figure out how to do this. Is there an established procedure documented somewhere? If you want me to do this, I will do it. If there is already an automated way to run the profiler and compare or an expert willing to test it, that would be great.

Most of the code changes are only used when noRunLumiSort mode is enabled. I am not aware of any relVal or runTheMatrix type tests that use that mode. I added some unit tests that run it. noRunLumiSort mode is currently used when merging files when we want fast cloning to occur and concurrent lumis might have scrambled the event order at luminosity block boundaries. Also one would expect any performance issues to be more likely to show up in a file created by skimming that included a large number of Runs and Lumis. That kind of input file would challenge this new code the most. But even in those cases, I would guess the effect would be small. Most of the work occurs once per input file. It's not in a tight loop that occurs many times.

In the normal iteration modes, the changes are small. There are some changes when filling IndexIntoFile, but those might even make it faster by a tiny amount. It would probably indicate a bug somewhere if the performance changed significantly, because it was a design goal to perturb the other iteration modes as little as possible.

@makortel
Copy link
Contributor

makortel commented May 4, 2022

enable profiling

@makortel
Copy link
Contributor

makortel commented May 4, 2022

@cmsbuild, please test

@makortel
Copy link
Contributor

makortel commented May 4, 2022

If there is already an automated way to run the profiler and compare or an expert willing to test it, that would be great.

I just launched automated profiling, let's see what it gives (although I agree with David that the impact on regular workflows should be negligible).

@perrotta
Copy link
Contributor

perrotta commented May 4, 2022

Thank you @wddgit and @makortel
I would have even stick on your judgement (while I am not able to evaluate myself the possible effects on performance of this PR, that's why I asked): since we're now running with the profiler, let see the outcome, and merge this right after

@cmsbuild
Copy link
Contributor

cmsbuild commented May 5, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-37b6b7/24453/summary.html
COMMIT: 70bb9c4
CMSSW: CMSSW_12_4_X_2022-05-04-1100/slc7_amd64_gcc10
Additional Tests: PROFILING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/37532/24453/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

There are some workflows for which there are errors in the baseline:
39434.501 step 3
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals.The error does NOT come from this pull request

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3700548
  • DQMHistoTests: Total failures: 14
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3700512
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 205 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@perrotta
Copy link
Contributor

perrotta commented May 5, 2022

Needless to say, @wddgit and @makortel your expectations were correct: profile outputs do not show any visible effect on performance for the tested workflows with this PR merged. Also references to IndexIntoFile show quite similar scores in this PR tests and in the IB.. Good to have verified it.

@perrotta
Copy link
Contributor

perrotta commented May 5, 2022

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants