Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSC Unpacker fix for handling of rare CSC data corruption #41934

Merged

Conversation

barvic
Copy link
Contributor

@barvic barvic commented Jun 12, 2023

CSC Unpacker fix for handling of rare data corruption, related to invalid CSC DMB header/trailer data

PR description:

  • related to HLT crash in run-367770 from CSCDCCUnpacker issue HLT crash in run-367770 from CSCDCCUnpacker #41747 in CMSSW_13_0_6
  • observed single HLT crash case so far
  • was reproduced with supplied by HLT script and data file
  • added safeguard against invalid nullptr data, caused by rare case of CSC hardware readout corruption

PR validation:

Ran PR validation runTheMatrix basic battery of tests.
Notes:
Supplied by the HLT test script for 13_0_6 doesn't work with 13_2_X and 13_1_X (doesn't look like CSC related issue),
but the script passed without crashes with this code modification for 13_0_X.
For 13_2_X and 13_1_X tests was using custom minimal CSC test script to run only affected muonCSCDigis CSC unpacking sequence with corrupted data file.
Tests passed without crashes with proposed modification.

This PR would need backport for 13_1_X and 13_0_X.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41934/35888

  • This PR adds an extra 16KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @barvic (Victor Barashko) for master.

It involves the following packages:

  • EventFilter/CSCRawToDigi (reconstruction)

@cmsbuild, @mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @missirol, @ptcox this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@fwyzard
Copy link
Contributor

fwyzard commented Jun 12, 2023

Is it understood what causes csc_itr->second to be nullptr in the first place ?

@barvic
Copy link
Contributor Author

barvic commented Jun 12, 2023

@fwyzard
Current understanding is that CSC readout electronics board (single DMB in this particular case) was in failed state and sent corrupted data chunk (basically garbage).
That data chunk was containing corrupted data words combinations, which by coincidence were similar to signature data words (in this case DMB trailer words), which are used to mark and locate data chunks of particular readout boards in the event raw data buffer.
So this chunk was still considered data format-wise as a chamber data candidate for unpacking recovery attempt and error reporting,
and then was passed for further unpacking, but with NULL pointer to that invalid chamber data chunk position in the FED event data buffer.
That chunk is unrecoverable and should be rejected from the unpacking.
Added nullptr check should let skip unpacking of such corrupted data.

General comment:
Current CSC unpacking policy is to try to extract as much as possible data for multiple chambers from single FED event data, especially in cases of partial readout data corruption, caused by individual chambers/boards. It lets us to unpack CSC data more efficiently, but rarely we could still observe cases of new previously unknown data corruptions, which could potentially cause unpacking issues, like this one and which are handled on case by case basis.
(As far as I recall previous case of unknown severe data corruption, which was causing CSC unpacking issues was back in 2015.)

@fwyzard
Copy link
Contributor

fwyzard commented Jun 13, 2023

@barvic thank you for the explanation!

Comment on lines 199 to 200
<< "skip unpacking of CSC " << cscid << " due format errors (NULL pointer to chamber data)"
<< std::dec;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<< "skip unpacking of CSC " << cscid << " due format errors (NULL pointer to chamber data)"
<< std::dec;
<< "skip unpacking of CSC " << cscid << " due to format errors (NULL pointer to chamber data)"
<< std::dec;

Just a small typo. One question: is the idea to print cscid as "hex", and then switch to std::dec ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@missirol thanks! (cut-n-paste legacy typos from other log message few lines below). Fixed both.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41934/35901

  • This PR adds an extra 16KB to repository

@cmsbuild
Copy link
Contributor

Pull request #41934 was updated. @cmsbuild, @mandrenguyen, @clacaputo can you please check and sign again.

@missirol
Copy link
Contributor

please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-98f315/33131/summary.html
COMMIT: 316383c
CMSSW: CMSSW_13_2_X_2023-06-13-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/41934/33131/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

There are some workflows for which there are errors in the baseline:
25034.114 step 2
25034.21 step 2
25034.9921 step 2
25034.999 step 2
25034.99 step 2
25061.97 step 1
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals.The error does NOT come from this pull request

Summary:

  • You potentially added 11 lines to the logs
  • Reco comparison results: 9 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3190910
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3190882
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@missirol
Copy link
Contributor

type bugfix

@cms-sw/reconstruction-l2 , could you please review this simple fix ? It addresses the HLT issue in #41747.

@barvic, could you please backport this PR to CMSSW_13_1_X and CMSSW_13_0_X ? (if it helps, I can also take care of the backports, just let me know)

@mandrenguyen
Copy link
Contributor

+reconstruction

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Copy link
Contributor

type muon

@cmsbuild cmsbuild added the muon label Jun 14, 2023
@perrotta
Copy link
Contributor

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants