Improve MessageLogger context handling #34557

Merged: 1 commit, Jul 21, 2021

Contributor

@wddgit wddgit commented Jul 19, 2021

PR description:

Remove the assertions recently added in pull request #34506, which caused the problems documented in issue #34520. Even if the asserts uncovered a real problem, using the MessageLogger context system to test for it is probably not the best design, and it was not my original intent. I thought the asserts were just paranoid protection against something that couldn't happen.

Instead of asserting, the MessageLogger context will go into an "unknown state". The context line will print "unknown context" in the spot where the context normally goes.

This only affects the context printed in a MessageLogger message, and only in the unusual case where a module ends, a previous module was running or waiting when it started, and the context for that previous module is not in one of the normally expected states.
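A minimal sketch of the idea, not the actual FWCore/MessageService code: the enum, the labels, and both function names below are hypothetical and exist only to illustrate replacing an assert with an "unknown context" fallback when the previous module's context is restored.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for the transitions a saved context might refer to.
enum class Transition { kEvent, kBeginRun, kBeginLumi, kWriteLumi };

// Returns a label only for transitions the context system keeps track of.
std::optional<std::string> labelFor(Transition t) {
  switch (t) {
    case Transition::kEvent:
      return std::string{"Event"};
    case Transition::kBeginRun:
      return std::string{"BeginRun"};
    case Transition::kBeginLumi:
      return std::string{"BeginLumi"};
    case Transition::kWriteLumi:
      return std::nullopt;  // assumption: no context is ever set for this one
  }
  return std::nullopt;
}

// Called when a module ends and the previous module's context is restored.
std::string restoredContext(Transition previous) {
  // Previously (conceptually): assert(labelFor(previous).has_value());
  // Now an unexpected state degrades to a placeholder instead of aborting.
  return labelFor(previous).value_or("unknown context");
}
```

With this shape, an unexpected state only changes the text of the context line; the job keeps running.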

This is a limited fix intended to address only the recent assert failure referenced above. Practically speaking, it is probably good enough, but while implementing it I noticed other issues in this part of the code that I did not try to fix. We might want to follow this up with more improvements.

MessageLogger uses thread locals and the ActivityRegistry to keep track of which module is currently running. Concurrent tasks and waits inside module-level transition functions could be problematic for this design, which worked perfectly before concurrency. We've discussed similar issues before; I think Chris has brought this up more than once.
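To make the concern concrete, here is a simplified sketch of thread-local module tracking. It is an illustration under assumptions, not the MessageLogger implementation: `t_moduleStack`, `preModule`, `postModule`, and `currentModule` are hypothetical names, and the two callbacks merely stand in for signals that a service would watch through the ActivityRegistry.

```cpp
#include <string>
#include <vector>

// Hypothetical per-thread record of the modules active on this thread.
thread_local std::vector<std::string> t_moduleStack;

// Stand-in for a "module is about to run" callback.
void preModule(const std::string& label) { t_moduleStack.push_back(label); }

// Stand-in for a "module has finished" callback: drop the finished module so
// that the previously running module (if any) becomes current again.
void postModule() {
  if (!t_moduleStack.empty()) {
    t_moduleStack.pop_back();
  }
}

// Label of the module currently running on this thread, or empty if none.
std::string currentModule() {
  return t_moduleStack.empty() ? std::string{} : t_moduleStack.back();
}
```

Because the record is per thread, work that starts in a module on one thread and continues on another (after a wait, or in a concurrent task) would find an empty or unrelated record on the second thread. That is the kind of mismatch described above.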

We do not set, and never have set, the context for the module transitions writeLumi, writeRun, and writeProcessBlock, and I have not added support for that here. The specific case where the assert failed was related to writeLumi. This is something we could fix in the future, although I've never noticed MessageLogger messages being printed in those contexts...

The other two GlobalContext states that might have caused those asserts to fail are kBeginJob and kEndJob. Those are handled in a different way, which could also be problematic if those methods ever gain concurrency or waits inside their transition functions.

PR validation:

Relies on existing unit tests. This change only affects the response to behavior that should really not be happening.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34557/24088

  • This PR adds an extra 16KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @wddgit (W. David Dagenhart) for master.

It involves the following packages:

  • FWCore/MessageService (core)

@makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please review it and eventually sign? Thanks.
@makortel this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy, @perrotta you are the release manager for this.

cms-bot commands are listed here

@wddgit
Contributor Author

wddgit commented Jul 19, 2021

please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ff2bc4/16993/summary.html
COMMIT: 93d0b72
CMSSW: CMSSW_12_0_X_2021-07-19-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/34557/16993/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /build/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-ff2bc4/11634.912_TTbar_14TeV+2021_DD4hepDB+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST+ALCA

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 2996268
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2996245
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 38 files compared)
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor

Hmh, workflow 1325.81 shows small differences in reco comparisons, all_OldVSNew_TTbar13nanoEDM106Xv1in2017wf1325p81c_nanoaodFlatTable_fatJetTable__DQM_obj_floats__particleNetMD_QCD_100.png and all_OldVSNew_TTbar13nanoEDM106Xv1in2017wf1325p81c_nanoaodFlatTable_tauTable__DQM_obj_floats__rawDeepTau2017v2p1VSmu_442.png. There is no way this PR could cause those, but let's nevertheless test again. Probably some level of non-reproducibility in these neural networks (e.g. different vectorization levels in reference and test?)

@makortel
Contributor

@cmsbuild, please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ff2bc4/17038/summary.html
COMMIT: 93d0b72
CMSSW: CMSSW_12_0_X_2021-07-20-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/34557/17038/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 2996268
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2996245
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 38 files compared)
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy, @perrotta (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Contributor

qliphy commented Jul 21, 2021

+1
