Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In Concurrent*Filters, upgrade to allow a stream to skip a lumi #43816

Merged
merged 1 commit into from
Feb 18, 2024

Conversation

wddgit
Copy link
Contributor

@wddgit wddgit commented Jan 30, 2024

PR description:

Improve ConcurrentHadronizerFilter and ConcurrentGeneratorFilter to handle the new Framework behavior where a stream can skip a lumi if before that stream starts at least one other stream already processed the lumi and other streams already started processing all events in the lumi.

PR #43522 implements the new Framework behavior. It is waiting for this and other pull requests that prepare modules outside the Framework for this behavior to be merged.

PR validation:

Relies on existing tests. The behavior of the modules should not change. The unit test TestGeneratorInterfacePythia8ConcurrentGeneratorFilter is most relevant because it has multiple lumis. It passes and I stepped through that one with a debugger and the correct things seem to be happening. These are used in a lot of runTheMatrix tests also (although usually with 1 lumi, which isn't as useful as a test). The changes are almost identical in the two modules.

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 30, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43816/38628

  • This PR adds an extra 24KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @wddgit (W. David Dagenhart) for master.

It involves the following packages:

  • GeneratorInterface/Core (generators)

@bbilin, @menglu21, @GurpreetSinghChahal, @SiewYan, @cmsbuild, @alberto-sanchez, @mkirsano can you please review it and eventually sign? Thanks.
@alberto-sanchez, @mkirsano this is something you requested to watch as well.
@sextonkennedy, @rappoccio, @antoniovilela you are the release manager for this.

cms-bot commands are listed here

@wddgit
Copy link
Contributor Author

wddgit commented Jan 30, 2024

enable threading

@wddgit
Copy link
Contributor Author

wddgit commented Jan 30, 2024

please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-644838/37118/summary.html
COMMIT: e1ace0a
CMSSW: CMSSW_14_0_X_2024-01-30-1100/el8_amd64_gcc12
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/43816/37118/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 88 lines to the logs
  • Reco comparison results: 39 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3248614
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3248589
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 200 log files, 161 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@wddgit
Copy link
Contributor Author

wddgit commented Feb 2, 2024

@makortel @Dr15Jones Since this is more a technical concurrency change than a real generator issue, it might be worth one or both of you reviewing this one in addition to the generator expert. It involves only about 20 line of code...

@@ -158,12 +158,13 @@ namespace edm {
void initLumi(gen::GenStreamCache<HAD, DEC>* cache, LuminosityBlock const& index, EventSetup const& es) const;
ParameterSet config_;
mutable std::atomic<gen::GenStreamCache<HAD, DEC>*> useInLumi_{nullptr};
mutable std::atomic<unsigned long long> greatestNStreamEndLumis_{0};
mutable std::atomic<unsigned long long> nextNGlobalBeginLumis_{1};
mutable std::atomic<bool> streamEndRunComplete_{true};
// The next two data members are thread safe and can be safely mutable because
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this become

Suggested change
// The next two data members are thread safe and can be safely mutable because
// The next three data members are thread safe and can be safely mutable because

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Fixed the comment in both files. Thanks. I will push the changes momentarily.

if (greatestNStreamEndLumis_.compare_exchange_strong(expected, streamCachePtr->nStreamEndLumis_)) {
unsigned long long expected = this->luminosityBlockCache(lumi.index())->nGlobalBeginLumis_;
unsigned long long nextValue = expected + 1;
if (nextNGlobalBeginLumis_.compare_exchange_strong(expected, nextValue)) {
streamEndRunComplete_ = false;
useInLumi_ = streamCachePtr;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I understood, is the idea that the first streamEndLuminosityBlock (of a Lumi) to arrive here would change the streamEndRunComplete_ and useInLumi_?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment at this point in both files.

@@ -158,12 +158,13 @@ namespace edm {
void initLumi(gen::GenStreamCache<HAD, DEC>* cache, LuminosityBlock const& index, EventSetup const& es) const;
ParameterSet config_;
mutable std::atomic<gen::GenStreamCache<HAD, DEC>*> useInLumi_{nullptr};
mutable std::atomic<unsigned long long> greatestNStreamEndLumis_{0};
mutable std::atomic<unsigned long long> nextNGlobalBeginLumis_{1};
mutable std::atomic<bool> streamEndRunComplete_{true};
// The next two data members are thread safe and can be safely mutable because
// they are only modified/read in globalBeginRun and globalBeginLuminosityBlock.
Copy link
Contributor

@Dr15Jones Dr15Jones Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm beginning to wonder if this is actually a safe assumption going forward. Previously, because every stream had to see every luminosity block, it meant we could never have concurrent global concurrent LuminosityBlock and/or Runs as the global transition had to fully finish and then start the stream transitions before we'd ask the Source to see what is the next transition.

But now since we allow skipping a stream begin LuminosityBlock transition, it might be possible to change the scheduler to start the global begin LuminosityBlock transition, check the source and see there are no Events in that LuminosityBlock and then start another global begin LuminosityBlock transition concurrently.

I came to this conclusion after re-reading our global module documentation
https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkGlobalModuleInterface#edm_LuminosityBlockCacheT

where no guarantees are made that concurrent calls to global* transitions will not happen (except globalBegin always finish before streamBegin would happen, and the reverse for End transitions as well as some guarantees related to Summary calls).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first thought on this comment is that we should go ahead with this PR as is. This PR is relatively small and accomplishes the immediate goal with minimal changes.

If we decide to add the ability to handle concurrent global begin transitions, then we should handle that as a separate effort. For this module to support that additional concurrency would require significantly more changes than the 20 or so lines changed here. For example, the useInLumi_ pointer would not work as is because multiple global lumi transitions could be using it in concurrently. So it would be significantly more development work to allow this module to support concurrent global begin lumi transitions.

And I think that assumption is likely built into other things as well. I know as I was working on EventProcessor implementing run concurrency I was taking that as an assumption I could rely on. Just figuring out what we would need to modify would take some non-trivial effort.

I also think we would need to consider whether that additional concurrency gives us a performance improvement that is significant enough to justify the effort to develop it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the least I think the code should document the assumptions about what it expects the frameworks behavior to be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment on top of that section of six member variables in a row and removed the 2 line comment that started this discussion. Let me know if there is a wording that you would like better. Pushed and squashed.

Note that this module forces globalBeginLuminosityBlock to wait until one stream has proceeded all the way to the end stream lumi transition for the previous lumi before it starts...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also there is a long comment at the top of the file describing the unusual behavior and why it is necessary and a couple alternatives.

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 2, 2024

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43816/38688

  • This PR adds an extra 24KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 2, 2024

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43816/38690

  • This PR adds an extra 20KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 2, 2024

Pull request #43816 was updated. @mkirsano, @SiewYan, @GurpreetSinghChahal, @menglu21, @bbilin, @cmsbuild, @alberto-sanchez can you please check and sign again.

@wddgit
Copy link
Contributor Author

wddgit commented Feb 2, 2024

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 3, 2024

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-644838/37194/summary.html
COMMIT: 86bd5d6
CMSSW: CMSSW_14_0_X_2024-02-02-1100/el8_amd64_gcc12
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/43816/37194/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 6, 2024

Milestone for this pull request has been moved to CMSSW_14_1_X. Please open a backport if it should also go in to CMSSW_14_0_X.

@cmsbuild cmsbuild modified the milestones: CMSSW_14_0_X, CMSSW_14_1_X Feb 6, 2024
@makortel
Copy link
Contributor

makortel commented Feb 8, 2024

From core point of view this PR looks good now. @cms-sw/generators-l2 could you review and sign? Thanks!

@cmsbuild cmsbuild modified the milestones: CMSSW_14_0_X, CMSSW_14_1_X Feb 8, 2024
@menglu21
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @antoniovilela, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@antoniovilela
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit ff5ccb1 into cms-sw:master Feb 18, 2024
12 checks passed
@wddgit wddgit deleted the streamSkipLumiConcurrentFilters branch March 11, 2024 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants