Improve IndexIntoFile for concurrent lumis/runs #37532

wddgit · 2022-04-11T18:52:15Z

PR description:

Modifies the class IndexIntoFile, so that it can handle concurrent runs and better handle concurrent lumis. This PR does not include the implementation of concurrent runs. That will be submitted in a separate PR in the future. There should not be any further changes needed in IndexIntoFile.h or IndexIntoFile.cc in that future PR.

This PR includes heavy edits to the code that supports the "noRunLumiSort" ordering in IndexIntoFile (mostly in the nested class IndexIntoFile::IndexIntoFileItrEntryOrder). That ordering is currently used in some contexts to support file merging with files that have been created with concurrent lumis. That ordering allows fast cloning even when concurrent lumis have scrambled the event order in the files at lumi boundaries. Before concurrent lumis, events from a luminosity block would all be written contiguously into the output file. With concurrent lumis, events from a following lumi can be written before all the events from the preceding lumi are written. This is because the time to process an event can vary. The same thing will also happen when concurrent runs are implemented at run boundaries.

This also modifies the code in IndexIntoFile used by output modules to build the new IndexIntoFile written into a new output file.

There are some minimal changes outside IndexIntoFile needed to deal with the changes in the IndexIntoFile interface and behavior.

One might or might not consider this to be a bug fix. The existing version of the code should be working properly in the contexts where it is currently used. No one has reported any problems and the problem should be obvious if it occurs. On the other hand, it is currently possible to create files where the events in a run are not contiguous by file merging. This is similar to what will also happen with concurrent runs. If one is reading with "noRunLumiSort" mode, then these noncontiguous events from different runs can cause an assert failure in the Framework. We might consider backporting this change to 12_2_X and 12_3_X. Before those release series, "noRunLumiSort" mode did not exist and there was not a problem. One might hesitate to backport this PR because it touches a significant number of lines of critical code. The potential for a new bug is a risk. My inclination would be to not backport it unless problems actually occur, although I will backport it if asked. I am not aware of any conflicts between this PR and those earlier releases.

PR validation:

New unit tests are added in FWCore/Integration/test and DataFormats/Provenance/test to cover these changes.

cmsbuild · 2022-04-11T18:59:17Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37532/29244

This PR adds an extra 148KB to repository

cmsbuild · 2022-04-11T18:59:42Z

A new Pull Request was created by @wddgit (W. David Dagenhart) for master.

It involves the following packages:

DataFormats/Provenance (core)
FWCore/Framework (core)
FWCore/Integration (core)
IOPool/Input (core)

@cmsbuild, @smuzaffar, @Dr15Jones, @makortel can you please review it and eventually sign? Thanks.
@makortel, @rovere this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

wddgit · 2022-04-11T19:00:34Z

please test

cmsbuild · 2022-04-11T23:54:39Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-37b6b7/23827/summary.html
COMMIT: ec0a81d
CMSSW: CMSSW_12_4_X_2022-04-11-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37532/23827/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 2 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3593039
DQMHistoTests: Total failures: 8
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3593009
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 200 log files, 45 edm output root files, 48 DQM output files
TriggerResults: no differences found

Dr15Jones

I made ti through the first two files (stopped before the first tests)

DataFormats/Provenance/interface/IndexIntoFile.h

Dr15Jones · 2022-04-14T15:52:36Z

DataFormats/Provenance/interface/IndexIntoFile.h

+    ///   3. All runs and lumis associated with a run should be processed
+    ///   when the last contiguous sequence of events for that run is processed.
+    ///   If a run has no events, then it is interspersed within that sequence
+    ///   of runs according to its run TTree entry number.


How is a source supposed to tell the framework that all parts of a Lumi or a Run have now been read from the file?

The Framework will know all parts were read if shouldProcessRun() returned true for any Run entry in a contiguous sequence of Run entries for the same Run and then a different Run is encountered (or the end of all input). For a single input file, the run entries that should be read will always be contiguous (by design) so they can all be read and merged in one contiguous pass. Earlier entries can occur, but only with shouldProcessRun() returning false.

Most of the time, there will be a single contiguous sequence of Run entries where shouldProcessRun() returns true for all of them and only for those Run entries. There is one special case when crossing a file boundary that allows shouldProcessRun to return false just after a file boundary. In that case, the RunPrincipal remembers it has products merged into it that still need to be written, but does not merge the products from that Run entry (or entries) after the file boundary.

I hope that is clear. I know this is complicated.

DataFormats/Provenance/interface/IndexIntoFile.h

DataFormats/Provenance/src/IndexIntoFile.cc

Trying to clarify so the comment is easier to understand.

Move some into Info class, make const, or change to return output value

cmsbuild · 2022-04-30T01:32:30Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-37b6b7/24345/summary.html
COMMIT: 70bb9c4
CMSSW: CMSSW_12_4_X_2022-04-29-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37532/24345/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 7 differences found in the comparisons
DQMHistoTests: Total files compared: 49
DQMHistoTests: Total histograms compared: 3695434
DQMHistoTests: Total failures: 13
DQMHistoTests: Total nulls: 1
DQMHistoTests: Total successes: 3695398
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: -0.004 KiB( 48 files compared)
DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
Checked 205 log files, 45 edm output root files, 49 DQM output files
TriggerResults: no differences found

makortel · 2022-05-03T00:00:32Z

+1

cmsbuild · 2022-05-03T00:00:51Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

perrotta · 2022-05-04T10:48:13Z

@wddgit @makortel @Dr15Jones is any effect on performance expected?
If so, should we run the tests with the profiler in order to quantify it (even just for reference)?

wddgit · 2022-05-04T14:38:07Z

I am not expecting any significant performance change. My first inclination was that profiling was not necessary.

I have not run a profiler with CMS code in a while. Does igprof still work or are we using something else these days? For me it would take some time to figure out how to do this. Is there an established procedure documented somewhere? If you want me to do this, I will do it. If there is already an automated way to run the profiler and compare or an expert willing to test it, that would be great.

Most of the code changes are only used when noRunLumiSort mode is enabled. I am not aware of any relVal or runTheMatrix type tests that use that mode. I added some unit tests that run it. noRunLumiSort mode is currently used when merging files when we want fast cloning to occur and concurrent lumis might have scrambled the event order at luminosity block boundaries. Also one would expect any performance issues to be more likely to show up in a file created by skimming that included a large number of Runs and Lumis. That kind of input file would challenge this new code the most. But even in those cases, I would guess the effect would be small. Most of the work occurs once per input file. It's not in a tight loop that occurs many times.

In the normal iteration modes, the changes are small. There are some changes when filling IndexIntoFile, but those might even make it faster by a tiny amount. It would probably indicate a bug somewhere if the performance changed significantly, because it was a design goal to perturb the other iteration modes as little as possible.

makortel · 2022-05-04T14:55:05Z

enable profiling

makortel · 2022-05-04T14:55:12Z

@cmsbuild, please test

makortel · 2022-05-04T15:00:29Z

If there is already an automated way to run the profiler and compare or an expert willing to test it, that would be great.

I just launched automated profiling, let's see what it gives (although I agree with David that the impact on regular workflows should be negligible).

perrotta · 2022-05-04T15:24:10Z

Thank you @wddgit and @makortel
I would have even stick on your judgement (while I am not able to evaluate myself the possible effects on performance of this PR, that's why I asked): since we're now running with the profiler, let see the outcome, and merge this right after

cmsbuild · 2022-05-05T01:45:15Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-37b6b7/24453/summary.html
COMMIT: 70bb9c4
CMSSW: CMSSW_12_4_X_2022-05-04-1100/slc7_amd64_gcc10
Additional Tests: PROFILING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/37532/24453/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

There are some workflows for which there are errors in the baseline:
39434.501 step 3
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals.The error does NOT come from this pull request

Summary:

No significant changes to the logs found
Reco comparison results: 8 differences found in the comparisons
DQMHistoTests: Total files compared: 49
DQMHistoTests: Total histograms compared: 3700548
DQMHistoTests: Total failures: 14
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3700512
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
Checked 205 log files, 45 edm output root files, 49 DQM output files
TriggerResults: no differences found

perrotta · 2022-05-05T08:32:56Z

Needless to say, @wddgit and @makortel your expectations were correct: profile outputs do not show any visible effect on performance for the tested workflows with this PR merged. Also references to IndexIntoFile show quite similar scores in this PR tests and in the IB.. Good to have verified it.

perrotta · 2022-05-05T08:33:02Z

+1

cmsbuild added this to the CMSSW_12_4_X milestone Apr 11, 2022

cmsbuild added code-checks-pending core-pending orp-pending pending-signatures tests-pending labels Apr 11, 2022

wddgit mentioned this pull request Apr 11, 2022

Concurrent Lumi mode with noRunLumiSort option, bug #36967

Closed

cmsbuild added code-checks-approved and removed code-checks-pending labels Apr 11, 2022

cmsbuild added tests-started and removed tests-pending labels Apr 11, 2022

cmsbuild added tests-approved and removed tests-started labels Apr 11, 2022

Dr15Jones reviewed Apr 14, 2022

View reviewed changes

wddgit added 8 commits April 19, 2022 18:06

Improve IndexIntoFile for concurrent lumis/runs

fd12984

Edit comment about EntryOrder iteration

766902f

Trying to clarify so the comment is easier to understand.

Change override->final

356c929

Add comment related to indexedSize()

ea0e58e

Parenthesis to curly braces

6467d8a

Add useful information to exception message

c44d7dc

Use indexedSize() function instead of data member

73b76b1

Refactor functions initializing iterator

087e4ea

Move some into Info class, make const, or change to return output value

wddgit force-pushed the concurrentIndexIntoFile2 branch from ec0a81d to 087e4ea Compare April 20, 2022 21:15

cmsbuild added code-checks-pending and removed tests-approved code-checks-approved labels Apr 20, 2022

cmsbuild added tests-approved and removed tests-started labels Apr 30, 2022

cmsbuild mentioned this pull request May 2, 2022

Added IOV synchronization callbacks to ActivityRegistry #37771

Merged

cmsbuild added core-approved fully-signed and removed core-pending pending-signatures labels May 3, 2022

cmsbuild added tests-started and removed tests-approved labels May 4, 2022

cmsbuild added tests-approved and removed tests-started labels May 5, 2022

cmsbuild added orp-approved and removed orp-pending labels May 5, 2022

cmsbuild merged commit 53a15fb into cms-sw:master May 5, 2022

This was referenced May 5, 2022

[master] Drop references to legacy generators #37813

Merged

[master] Drop legacy MC generators cms-sw/cmsdist#7844

Merged

wddgit mentioned this pull request May 6, 2022

Minor cleanup in IndexIntoFile unit tests #37850

Merged

wddgit deleted the concurrentIndexIntoFile2 branch October 25, 2022 14:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve IndexIntoFile for concurrent lumis/runs #37532

Improve IndexIntoFile for concurrent lumis/runs #37532

wddgit commented Apr 11, 2022 •

edited

Loading

cmsbuild commented Apr 11, 2022

cmsbuild commented Apr 11, 2022

wddgit commented Apr 11, 2022

cmsbuild commented Apr 11, 2022

Dr15Jones left a comment

Dr15Jones Apr 14, 2022

wddgit Apr 15, 2022

cmsbuild commented Apr 30, 2022

makortel commented May 3, 2022

cmsbuild commented May 3, 2022

perrotta commented May 4, 2022

wddgit commented May 4, 2022

makortel commented May 4, 2022

makortel commented May 4, 2022

makortel commented May 4, 2022

perrotta commented May 4, 2022

cmsbuild commented May 5, 2022

perrotta commented May 5, 2022 •

edited

Loading

perrotta commented May 5, 2022

Improve IndexIntoFile for concurrent lumis/runs #37532

Improve IndexIntoFile for concurrent lumis/runs #37532

Conversation

wddgit commented Apr 11, 2022 • edited Loading

PR description:

PR validation:

cmsbuild commented Apr 11, 2022

cmsbuild commented Apr 11, 2022

wddgit commented Apr 11, 2022

cmsbuild commented Apr 11, 2022

Comparison Summary

Dr15Jones left a comment

Choose a reason for hiding this comment

Dr15Jones Apr 14, 2022

Choose a reason for hiding this comment

wddgit Apr 15, 2022

Choose a reason for hiding this comment

cmsbuild commented Apr 30, 2022

Comparison Summary

makortel commented May 3, 2022

cmsbuild commented May 3, 2022

perrotta commented May 4, 2022

wddgit commented May 4, 2022

makortel commented May 4, 2022

makortel commented May 4, 2022

makortel commented May 4, 2022

perrotta commented May 4, 2022

cmsbuild commented May 5, 2022

Comparison Summary

perrotta commented May 5, 2022 • edited Loading

perrotta commented May 5, 2022

wddgit commented Apr 11, 2022 •

edited

Loading

perrotta commented May 5, 2022 •

edited

Loading