Fix Rare Undefined Behavior in PixelThresholdClusterizer #35275

Merged
merged 3 commits into from
Sep 29, 2021

Contributor

@OzAmram OzAmram commented Sep 14, 2021

This is a small fix to address the undefined behavior reported in issue #35036. The change simply checks that the range is sensible before initializing an array.
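As an illustration of why an empty range can be dangerous before array setup, here is a minimal sketch. It is not the CMSSW code and does not reproduce the exact operation flagged in issue #35036: `Digi` and `spanned_rows` are hypothetical names, and the min/max scan is only an assumed example of a preamble that misbehaves when no digis are present.

```cpp
#include <climits>
#include <vector>

// Hypothetical stand-in for a pixel digi; not the CMSSW PixelDigi class.
struct Digi {
  int row;
  int col;
};

using DigiIterator = std::vector<Digi>::const_iterator;

// Returns the number of rows spanned by [begin, end), or 0 for an empty range.
int spanned_rows(DigiIterator begin, DigiIterator end) {
  // Guard first: with no digis, minRow and maxRow would keep their sentinel
  // values, and maxRow - minRow + 1 would overflow a signed int (undefined behavior).
  if (end <= begin)
    return 0;

  int minRow = INT_MAX;
  int maxRow = INT_MIN;
  for (DigiIterator it = begin; it != end; ++it) {
    if (it->row < minRow)
      minRow = it->row;
    if (it->row > maxRow)
      maxRow = it->row;
  }
  return maxRow - minRow + 1;
}
```

With the guard in place, an empty module yields a well-defined result and non-empty modules are unaffected, which is consistent with the statement that no output changes are expected.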

No changes in output are expected.

@mmusich @ferencek @tsusa @tvami @czangela

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35275/25270

  • This PR adds an extra 20KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @OzAmram (Oz Amram) for master.

It involves the following packages:

  • RecoLocalTracker/SiPixelClusterizer (reconstruction)

@jpata, @cmsbuild, @slava77 can you please review it and eventually sign? Thanks.
@mtosi, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @OzAmram, @ferencek, @dkotlins, @gpetruc, @mmusich, @threus, @tvami this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@tvami
Contributor

tvami commented Sep 14, 2021

@cmsbuild , please test

@tvami
Contributor

tvami commented Sep 14, 2021

type bug-fix

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-110411/18609/summary.html
COMMIT: ffdea77
CMSSW: CMSSW_12_1_X_2021-09-14-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35275/18609/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 3000833
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3000805
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 38 files compared)
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@mmusich
Contributor

mmusich commented Sep 15, 2021

please test workflow 134.706 for CMSSW_12_1_UBSAN_X

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-110411/18617/summary.html
COMMIT: ffdea77
CMSSW: CMSSW_12_1_UBSAN_X_2021-09-13-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35275/18617/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

The workflows 140.53 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-110411/134.706_RunMuonEG2015B+RunMuonEG2015B+HLTDR2_50ns+RECODR2_50nsreHLT_HIPM+HARVESTDR2

Summary:

  • You potentially added 21172 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 63006 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 3001001
  • DQMHistoTests: Total failures: 396848
  • DQMHistoTests: Total nulls: 38
  • DQMHistoTests: Total successes: 2604093
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -45.455 KiB( 38 files compared)
  • DQMHistoSizes: changed ( 136.731,... ): 0.004 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 140.53 ): -44.531 KiB Hcal/DigiRunHarvesting
  • DQMHistoSizes: changed ( 140.53 ): -1.172 KiB RPC/DCSInfo
  • DQMHistoSizes: changed ( 250202.181 ): -0.064 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 25202.0 ): 0.308 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: found differences in 13 / 38 workflows

@@ -222,6 +222,11 @@ void PixelThresholdClusterizer::copy_to_buffer(DigiIterator begin, DigiIterator
// std::cout << (doMissCalibrate ? "VI from db" : "VI linear") << std::endl;
}
#endif

//avoid undefined behavior
if (end <= begin)
Contributor

@jpata jpata Sep 16, 2021

wouldn't it be better to essentially assert/crash here, and ensure that copy_to_buffer is not called with incorrect inputs (by fixing the calling code)?

Contributor Author

I don't personally have a strong preference on this. But I am not an expert in the clusterizer code so someone else would have to try to understand why copy_to_buffer is being called with these incorrect inputs and come up with a fix (maybe @ferencek or @czangela ?)

Contributor

No strong preference here either but calling copy_to_buffer in https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre3/RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc#L151 only if end>begin perhaps would be a more elegant solution.
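A minimal sketch of that caller-side alternative, under stated assumptions: `Digi`, `DetSet`, `copy_to_buffer`, and `clusterize_non_empty` below are simplified hypothetical stand-ins, not the CMSSW types or the code at the linked line.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; not the CMSSW PixelDigi or edm::DetSet types.
struct Digi {
  int adc;
};
using DetSet = std::vector<Digi>;

// Placeholder for the real buffer-filling routine; assumes a non-empty range.
void copy_to_buffer(DetSet::const_iterator begin, DetSet::const_iterator end) {
  (void)begin;
  (void)end;
}

// Returns how many modules were actually passed on to copy_to_buffer.
std::size_t clusterize_non_empty(const std::vector<DetSet>& modules) {
  std::size_t processed = 0;
  for (const DetSet& digis : modules) {
    if (digis.empty())
      continue;  // caller-side guard: never hand an empty range to copy_to_buffer
    copy_to_buffer(digis.begin(), digis.end());
    ++processed;
  }
  return processed;
}
```

The trade-off is that every caller must remember the check, whereas a guard inside the function protects all call sites at once.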

Contributor

On the other hand, I think it makes sense for a function to protect itself against inputs that would lead to undefined behavior rather than leave the checking to the caller. So in the end I think I like the current fix more.

In general, I am not a big fan of asserts in production code. For anything unexpected or undefined, isn't it better to deal with it gracefully and report a LogError? In this particular case it looks like we are encountering a situation where a pixel module has zero digis produced and an empty vector (DetSet) is passed on to the clustering routine. So I would say this is nothing particularly alarming and probably does not even require issuing a LogError. The following commented-out line https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre3/RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc#L135 seems to suggest that this scenario can indeed occur. However, what I am a bit confused about is why these empty DetSets are not simply dropped from the digi collection. Either way, the clusterizer code should be able to handle such cases without any trouble.

Contributor

Does this empty case correspond to end == begin, or is it a case of end before begin?

Contributor

@slava77, end == begin

@OzAmram, I had a quick chat with @tsusa today precisely about this issue of end == begin, which suggests that the digi collection stores an empty vector of digis for a particular detId. The question then is why this empty vector is not dropped from the collection in the first place, so it would be good to check how that happens. On the other hand, even if some issue in the digi producer can lead to such situations, the clusterizer should be immune to them, which is what this PR achieves.

Contributor Author

Yeah, I tend to agree that it would be good if the clusterizer did not fail in such a case. Maybe we add a LogWarning message to this PR and follow up later to try to track down the upstream issue?

Contributor

@jpata jpata Sep 17, 2021

In my view, the code comments should at least make it reasonably clear why a function is expected to sometimes get incorrect inputs. Something like "avoid undefined behaviour" may be quite mysterious to a later reader.

So if fixing this at the source is out of scope, how about:

  //In rare cases, this function gets called with an empty DetSet.
  //This is not expected to be a problem because of XYZ
  if (end <= begin)
    return;

Contributor

begin == end is a reasonable case for a generic caller, and it makes sense to me for the method to skip any preamble computation when it is clearly not needed.
However, the end < begin case looks bad and had better be resolved in the caller.

Contributor

I agree with @slava77. The end < begin case looks bad, but by construction it is not possible here: in the calling code, begin and end are iterators from the same vector, so in the worst case end == begin. For such cases the method could issue a LogWarning and return.
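A rough sketch of the warn-and-return behavior discussed in this thread, under stated assumptions: `copy_to_buffer_or_skip` is a hypothetical name, `std::cerr` stands in for the framework's LogWarning, and plain `int` stands in for the digi type.

```cpp
#include <iostream>
#include <vector>

using DigiIterator = std::vector<int>::const_iterator;

// Returns true if the range was processed, false if it was skipped.
bool copy_to_buffer_or_skip(DigiIterator begin, DigiIterator end) {
  // Both iterators come from the same vector, so end < begin cannot occur;
  // end == begin simply means the module has no digis.
  if (end <= begin) {
    std::cerr << "copy_to_buffer called with an empty range; skipping module\n";
    return false;
  }
  // ... the real routine would fill its buffer from [begin, end) here ...
  return true;
}
```

For iterators taken from one vector, `end - begin` equals the vector's size, so the guard fires only for an empty container.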

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35275/25399

  • This PR adds an extra 20KB to repository

@smuzaffar
Contributor

please test

@smuzaffar
Contributor

@OzAmram, sorry, I force-pushed a change here in order to get a new commit. The previous commit had reached the maximum of 1000 commit statuses, which was causing the bot to fail with an error message like

Validation Failed
This SHA and context has reached the maximum number of statuses.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35275/25574

  • This PR adds an extra 20KB to repository

@cmsbuild
Contributor

Pull request #35275 was updated. @jpata, @slava77 can you please check and sign again.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-110411/19166/summary.html
COMMIT: 866cfe7
CMSSW: CMSSW_12_1_X_2021-09-27-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35275/19166/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-110411/134.706_RunMuonEG2015B+RunMuonEG2015B+HLTDR2_50ns+RECODR2_50nsreHLT_HIPM+HARVESTDR2

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 40
  • DQMHistoTests: Total histograms compared: 3211080
  • DQMHistoTests: Total failures: 5
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3211052
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 39 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 169 log files, 37 edm output root files, 40 DQM output files
  • TriggerResults: no differences found

@jpata
Contributor

jpata commented Sep 28, 2021

+reconstruction

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@smuzaffar
Contributor

please test workflow 134.706 for CMSSW_12_1_UBSAN_X

Let's test based on the latest UBSAN IB.

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-110411/19198/summary.html
COMMIT: 866cfe7
CMSSW: CMSSW_12_1_UBSAN_X_2021-09-27-2300/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35275/19198/install.sh to create a dev area with all the needed externals and cmssw changes.

Found compilation warnings

Unit Tests

I found errors in the following unit tests:

---> test EcnaCalculationsExample had ERRORS
---> test testUCTUnpacker had ERRORS

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-110411/134.706_RunMuonEG2015B+RunMuonEG2015B+HLTDR2_50ns+RECODR2_50nsreHLT_HIPM+HARVESTDR2

Summary:

  • You potentially added 21448 lines to the logs
  • Reco comparison results: 66379 differences found in the comparisons
  • DQMHistoTests: Total files compared: 40
  • DQMHistoTests: Total histograms compared: 3211080
  • DQMHistoTests: Total failures: 415441
  • DQMHistoTests: Total nulls: 14
  • DQMHistoTests: Total successes: 2795603
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.162 KiB( 39 files compared)
  • DQMHistoSizes: changed ( 10224.0 ): 0.117 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 136.731,... ): 0.004 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 250202.181 ): -0.533 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 25202.0 ): 0.246 KiB SiStrip/MechanicalView
  • Checked 169 log files, 37 edm output root files, 40 DQM output files
  • TriggerResults: found differences in 14 / 39 workflows

@perrotta
Contributor

+1

  • Bug fix for a rare but not impossible case
  • The unit tests still failing in UBSAN are unrelated
