Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a race condition in splitVertices #45656

Merged
merged 1 commit into from
Aug 8, 2024

Conversation

fwyzard
Copy link
Contributor

@fwyzard fwyzard commented Aug 6, 2024

PR description:

Add

alpaka::syncBlockThreads(acc);

at the end of the loop on the vertices to ensure that all threads are properly synchronised before resetting the shared memory.

Clean up the kernel to use the SoA accessors and the cms::alpakatools utilities.

PR validation:

Running the HLT with compute-sanitizer --tool=racecheck without this PR warns about a potential race condition in splitVertices().
This PR fixes the warning: see #44923 (comment).

Running the current online HLT menu over 20k events on top of CMSSW 14.0.13-patch2 does not result in any changes to the HLT results (see #45656 (comment)) and performance on GPU (see #45656 (comment)).

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

To be backport to 14.0.x for data taking.

Add

    alpaka::syncBlockThreads(acc);

at the end of the loop on the vertices to ensure that all threads are
properly synchronised before resetting the shared memory.

Clean up the kernel to use the SoA accessors and the cms::alpakatools
utilities.
@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 6, 2024

cms-bot internal usage

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 6, 2024

type bugfix

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 6, 2024

enable gpu

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 6, 2024

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 6, 2024

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 6, 2024

A new Pull Request was created by @fwyzard for master.

It involves the following packages:

  • RecoTracker/PixelVertexFinding (reconstruction)

@jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @JanFSchulte, @VinInn, @VourMa, @dgulhan, @felicepantaleo, @gpetruc, @missirol, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

Comment on lines -28 to -35
auto& __restrict__ data = pdata;
auto& __restrict__ ws = pws;
auto nt = ws.ntrks();
float const* __restrict__ zt = ws.zt();
float const* __restrict__ ezt2 = ws.ezt2();
float* __restrict__ zv = data.zv();
float* __restrict__ wv = data.wv();
float const* __restrict__ chi2 = data.chi2();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should no longer be needed, because the SoA accessors include the __restrict__ qualifier.

Comment on lines -41 to -44
ALPAKA_ASSERT_ACC(zt);
ALPAKA_ASSERT_ACC(wv);
ALPAKA_ASSERT_ACC(chi2);
ALPAKA_ASSERT_ACC(nn);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are no longer needed, because the SoA View should guarantee that these are non-null.

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 6, 2024

I think all changes should be fine.

Before merging this PR I would like to

  • check the impact on the HLT results to make sure there is none;
  • check the impact on the module's performance to make sure it's not affected (or slightly improved).

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 6, 2024

+1

Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-86968d/40811/summary.html
COMMIT: e746fdf
CMSSW: CMSSW_14_1_X_2024-08-06-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/45656/40811/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 1 lines from the logs
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 46
  • DQMHistoTests: Total histograms compared: 3530500
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3530474
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 45 files compared)
  • Checked 200 log files, 170 edm output root files, 46 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 6
  • DQMHistoTests: Total histograms compared: 37022
  • DQMHistoTests: Total failures: 1310
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 35712
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 5 files compared)
  • Checked 20 log files, 25 edm output root files, 6 DQM output files
  • TriggerResults: no differences found

@missirol
Copy link
Contributor

missirol commented Aug 7, 2024

@fwyzard , thanks for the fix !

Based on the check below, the HLT throughput is basically unaffected (same settings as in #45631 (comment), but a different Hilton node [*]).

  • Pre-PR: 614.5 +/- 2.1 events/sec (json).
  • Post-PR: 614.7 +/- 1.5 events/sec (json).

[*]

  • Input data: run-383631, LSs 476-479, ~40k events (PU ~64).
  • HLT menu: /cdaq/physics/Run2024/2e34/v1.4.3/HLT/V2 (current pp online menu).
  • Release: CMSSW_14_0_13_patch1_MULTIARCHS (with and without this PR).
  • Node: hilton-c2b02-44-01 (same hardware as a 2022/23 HLT node) (CPUs: 2 AMD EPYC 7763 64-Core; GPUs: 2 NVIDIA Tesla T4).
  • 8 jobs, 32 threads and 24 streams per job.
  • NVIDIA MPS enabled.
  • Each of the two measurements reported above is the average of 4 repetitions.

mmusich added a commit to mmusich/hltScripts that referenced this pull request Aug 7, 2024
@mmusich
Copy link
Contributor

mmusich commented Aug 7, 2024

check the impact on the HLT results to make sure there is none;

I tried to evaluate this with this script (which makes use of some of the error stream input files from #44923 (comment)) with and without this PR and I am seeing some (small) changes on CPU only:

$ hltDiff -o output_1.root -n ../../../new/CMSSW_14_0_X_2024-08-06-2300/src/output_1.root
Processed events: 0 out of 8200 (0%)
Processed events: 820 out of 8200 (10%)
Processed events: 1640 out of 8200 (20%)
Processed events: 2460 out of 8200 (30%)
Processed events: 3280 out of 8200 (40%)
Processed events: 4100 out of 8200 (50%)
Processed events: 4920 out of 8200 (60%)
Processed events: 5740 out of 8200 (70%)
Processed events: 6560 out of 8200 (80%)
Processed events: 7380 out of 8200 (90%)
Found 8200 matching events, out of which 82 have different HLT results

      Events    Accepted      Gained        Lost       Other  Trigger
        8200        4742           -          -1           -  AlCa_PFJet40_v31
        8200        4742           -          -1           -  AlCa_PFJet40_CPUOnly_v10
        8200        6902          +1           -           -  AlCa_AK8PFJet40_v26
        8200         397           -          -1           -  DST_PFScouting_DoubleMuon_v5
        8200        1489          +2           -           -  DST_PFScouting_DatasetMuon_v5
        8200           0           -           -          ~1  HLT_DoubleEle8_CaloIdM_TrackIdM_Mass8_DZ_PFHT350_v31
        8200           0           -           -          ~1  HLT_DoubleEle8_CaloIdM_TrackIdM_Mass8_PFHT350_v31
        8200         676          +1           -           -  HLT_Mu0_L1DoubleMu_v12
        8200         468          +1           -           -  HLT_Mu4_L1DoubleMu_v12
        8200         687          +2           -           -  HLT_Mu3_PFJet40_v27
        8200        2218          +2           -           -  HLT_Mu3_L1SingleMu5orSingleMu7_v12
        8200          22           -          -1          ~1  HLT_Ele30_WPTight_Gsf_v10
        8200          20           -          -1          ~1  HLT_Ele32_WPTight_Gsf_v24
        8200          17           -           -          ~1  HLT_Ele35_WPTight_Gsf_v18
        8200          12           -           -          ~1  HLT_Ele38_WPTight_Gsf_v18
        8200          11           -           -          ~1  HLT_Ele40_WPTight_Gsf_v18
        8200          20           -          -1          ~1  HLT_Ele32_WPTight_Gsf_L1DoubleEG_v18
        8200        6902          +1           -           -  HLT_AK8PFJet40_v27
        8200        4290          +2          -1           -  HLT_AK8PFJet60_v26
        8200        4742           -          -1           -  HLT_PFJet40_v32
        8200         916           -          -1           -  HLT_PFJetFwd40_v30
        8200        3852          +2          -1           -  HLT_AK8PFJetFwd40_v26
        8200         947          +1           -           -  HLT_AK8PFJetFwd60_v25
        8200         681          +1           -           -  HLT_PFHT250_v28
        8200           2           -           -          ~1  HLT_Mu12_DoublePFJets100_PNetBTag_0p11_v5
        8200           3           -           -          ~1  HLT_Tau3Mu_Mu7_Mu1_TkMu1_Tau15_v15
        8200           3           -           -          ~1  HLT_Tau3Mu_Mu7_Mu1_TkMu1_Tau15_Charge1_v15
        8200           3           -           -          ~1  HLT_Tau3Mu_Mu7_Mu1_TkMu1_IsoTau15_v15
        8200           3           -           -          ~1  HLT_Tau3Mu_Mu7_Mu1_TkMu1_IsoTau15_Charge1_v15
        8200           0           -           -          ~1  HLT_Mu43NoFiltersNoVtx_Photon43_CaloIdL_v14
        8200           0           -           -          ~1  HLT_Mu48NoFiltersNoVtx_Photon48_CaloIdL_v14
        8200           0           -           -          ~1  HLT_Mu38NoFiltersNoVtxDisplaced_Photon38_CaloIdL_v10
        8200           0           -           -          ~1  HLT_Mu43NoFiltersNoVtxDisplaced_Photon43_CaloIdL_v10
        8200           0           -           -          ~3  HLT_Ele28_HighEta_SC20_Mass55_v22
        8200           1           -          -1           -  HLT_TrkMu12_DoubleTrkMu5NoFiltersNoVtx_v16
        8200         221          +3          -3          ~7  HLT_Ele8_CaloIdL_TrackIdL_IsoVL_PFJet30_v27
        8200         177          +2          -2          ~7  HLT_Ele12_CaloIdL_TrackIdL_IsoVL_PFJet30_v29
        8200           4           -           -          ~4  HLT_Ele14_eta2p5_IsoVVVL_Gsf_PFHT200_PNetBTag0p53_v4
        8200         137          +1          -1          ~4  HLT_Ele23_CaloIdL_TrackIdL_IsoVL_PFJet30_v29
        8200         250          +4          -3          ~9  HLT_Ele8_CaloIdM_TrackIdM_PFJet30_v29
        8200         139          +1           -          ~3  HLT_Ele17_CaloIdM_TrackIdM_PFJet30_v27
        8200         114          +1          -1          ~3  HLT_Ele23_CaloIdM_TrackIdM_PFJet30_v29
        8200           2           -           -          ~1  HLT_Ele50_CaloIdVT_GsfTrkIdT_PFJet165_v29
        8200           1           -           -          ~1  HLT_Ele50_CaloIdVT_GsfTrkIdT_AK8PFJet220_SoftDropMass40_v11
        8200           1           -           -          ~1  HLT_Ele50_CaloIdVT_GsfTrkIdT_AK8PFJet220_SoftDropMass40_PNetBB0p06_v8
        8200           1           -           -          ~1  HLT_Ele50_CaloIdVT_GsfTrkIdT_AK8PFJet230_SoftDropMass40_v11
        8200           1           -           -          ~1  HLT_Ele50_CaloIdVT_GsfTrkIdT_AK8PFJet230_SoftDropMass40_PNetBB0p06_v8
        8200           0           -           -          ~1  HLT_Ele50_CaloIdVT_GsfTrkIdT_AK8PFJet230_SoftDropMass40_PNetBB0p10_v8
        8200         803          +2           -           -  HLT_L3Mu10NoVtx_v9
        8200         450          +2           -           -  HLT_L3Mu10NoVtx_DxyMin0p01cm_v9
        8200          23           -          -1          ~1  HLT_SingleEle8_v8
        8200         795          +5          -7          ~7  HLT_SingleEle8_SingleEGL1_v8
        8200        6911          +1           -           -  Dataset_AlCaLowPtJet
        8200        2386          +1          -1           -  Dataset_EGamma0
        8200        2386          +1          -1           -  Dataset_EGamma1
        8200           7          +4          -6           -  Dataset_EventDisplay
        8200          66          +6          -8           -  Dataset_HLTMonitor
        8200        7115          +1           -           -  Dataset_JetMET0
        8200        7115          +1           -           -  Dataset_JetMET1
        8200        1489          +2           -           -  Dataset_Muon0
        8200        1489          +2           -           -  Dataset_Muon1
        8200         503          +1           -           -  Dataset_ParkingDoubleMuonLowMass0
        8200         503          +1           -           -  Dataset_ParkingDoubleMuonLowMass1
        8200         503          +1           -           -  Dataset_ParkingDoubleMuonLowMass2
        8200         503          +1           -           -  Dataset_ParkingDoubleMuonLowMass3
        8200         503          +1           -           -  Dataset_ParkingDoubleMuonLowMass4
        8200         503          +1           -           -  Dataset_ParkingDoubleMuonLowMass5
        8200         503          +1           -           -  Dataset_ParkingDoubleMuonLowMass6
        8200         503          +1           -           -  Dataset_ParkingDoubleMuonLowMass7

when running on GPU differences seem to be more contained:

$ hltDiff -o output_1.root -n ../../../new/CMSSW_14_0_X_2024-08-06-2300/src/output_1.root
Processed events: 0 out of 8200 (0%)
Processed events: 820 out of 8200 (10%)
Processed events: 1640 out of 8200 (20%)
Processed events: 2460 out of 8200 (30%)
Processed events: 3280 out of 8200 (40%)
Processed events: 4100 out of 8200 (50%)
Processed events: 4920 out of 8200 (60%)
Processed events: 5740 out of 8200 (70%)
Processed events: 6560 out of 8200 (80%)
Processed events: 7380 out of 8200 (90%)
Found 8200 matching events, out of which 44 have different HLT results

      Events    Accepted      Gained        Lost       Other  Trigger
        8200        6900          +1          -1           -  AlCa_AK8PFJet40_v26
        8200           6           -           -          ~1  HLT_DiPFJetAve180_PPSMatch_Xi0p3_QuadJet_Max2ProtPerRP_v5
        8200        6900          +1          -1           -  HLT_AK8PFJet40_v27
        8200        3856          +3          -3           -  HLT_AK8PFJetFwd40_v26
        8200         949           -          -1           -  HLT_AK8PFJetFwd60_v25
        8200         108           -          -1           -  HLT_AK8PFJetFwd140_v25
        8200          83          +1           -           -  HLT_PFHT430_v28
        8200         120           -           -          ~1  HLT_QuadPFJet103_88_75_15_v16
        8200           2           -           -          ~2  HLT_DoubleMediumDeepTauPFTauHPS35_L2NN_eta2p1_v11
        8200           3           -           -          ~1  HLT_QuadPFJet103_88_75_15_PNetBTag_0p4_VBF2_v5
        8200           1           -           -          ~1  HLT_QuadPFJet103_88_75_15_PNet2BTag_0p4_0p12_VBF1_v5
        8200          80           -           -          ~1  HLT_QuadPFJet105_88_75_30_v8
        8200           2           -           -          ~1  HLT_QuadPFJet105_88_75_30_PNet1CvsAll0p5_VBF3Tight_v9
        8200           0           -           -          ~1  HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v11
        8200           0           -           -          ~1  HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet75_v11
        8200           7           -           -          ~2  HLT_DoublePNetTauhPFJet30_Medium_L2NN_eta2p3_v5
        8200           6           -           -          ~2  HLT_DoublePNetTauhPFJet30_Tight_L2NN_eta2p3_v5
        8200           1           -           -          ~1  HLT_DoublePNetTauhPFJet26_L2NN_eta2p3_PFJet60_v5
        8200           1           -           -          ~1  HLT_DoublePNetTauhPFJet26_L2NN_eta2p3_PFJet75_v5
        8200        6909          +1          -1           -  Dataset_AlCaLowPtJet
        8200          10          +8         -10           -  Dataset_EventDisplay
        8200          61          +7          -5           -  Dataset_HLTMonitor
        8200        7115          +1          -1           -  Dataset_JetMET0
        8200        7115          +1          -1           -  Dataset_JetMET1

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 7, 2024

Interesting... I did no expect any changes to the CPU results, since on CPU the code runs single threaded 🤔

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 7, 2024

I've run a comparison of the HLT results using

  • CMSSW_14_0_13_patch2_MULTIARCHS
  • /cdaq/physics/Run2024/2e34/v1.4.3/HLT/V2 (from run 384202), unprescaled
  • 10k events from run 383631

and comparing

In all cases, I did not observe any differences in the HLT results running on CPU, and only small discrepancies running on GPU consistent with the usual discrepancies.

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 7, 2024

syncBlockThreads vs reference:

$ hltDiff -o reference_cpu1.root -n syncBlockThreads_cpu1.root 
Found 10000 matching events, out of which 75 have different HLT results

      Events    Accepted      Gained        Lost       Other  Trigger
       10000          10         +11          -8           -  Dataset_EventDisplay
       10000         177         +22         -24           -  Dataset_ExpressPhysics
       10000          49         +10          -5           -  Dataset_HLTMonitor
       10000           5          +4          -5           -  Dataset_ScoutingPFMonitor

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 7, 2024

#45655 vs reference:

$ hltDiff -o reference_cpu1.root -n fix_splitVertices_140x_cpu1.root
Found 10000 matching events, out of which 82 have different HLT results

      Events    Accepted      Gained        Lost       Other  Trigger
       10000          10          +9          -8           -  Dataset_EventDisplay
       10000         177         +24         -26           -  Dataset_ExpressPhysics
       10000          49         +10          -8           -  Dataset_HLTMonitor
       10000           5          +2          -5           -  Dataset_ScoutingPFMonitor

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 7, 2024

@cms-sw/reconstruction-l2 could you review and sign this PR and its backport, #45655 ?

@mandrenguyen
Copy link
Contributor

@cms-sw/reconstruction-l2 could you review and sign this PR and its backport, #45655 ?

Comparisons are clean except for this one:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisonsGPU/CMSSW_14_1_X_2024-08-06-1100+86968d/63913/validateJR/12434.402_TTbar_14TeV+2023_Patatrack_PixelOnlyAlpaka/all_RECO_step3/c_recoTracks_pixelTracks__RECO_obj_qualityMask.png

Is that expected/understood?

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 7, 2024

Mhm, I do not expect any differences from these changes (other than preventing a rare crash induced by the race condition being fixed), only the usual GPU vs GPU lack of reproducibility.

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 7, 2024

I also do not observe any impact on the HLT timing, using a subset of the HLT menu that runs the Pixel, ECAL and HCAL+PF reconstruction on GPUs on every event.

reference

Running 4 times over 20300 events with 16 jobs, each with 32 threads, 24 streams and 1 GPUs
  2129.5 ±   0.1 ev/s (20000 events, 99.0% overlap)
  2127.9 ±   0.1 ev/s (20000 events, 99.0% overlap)
  2129.3 ±   0.1 ev/s (20000 events, 98.8% overlap)
  2128.2 ±   0.1 ev/s (20000 events, 99.3% overlap)
 --------------------
  2128.7 ±   0.8 ev/s

#45655

Running 4 times over 20300 events with 16 jobs, each with 32 threads, 24 streams and 1 GPUs
  2129.5 ±   0.1 ev/s (20000 events, 99.0% overlap)
  2128.5 ±   0.1 ev/s (20000 events, 99.3% overlap)
  2129.6 ±   0.1 ev/s (20000 events, 98.7% overlap)
  2128.3 ±   0.1 ev/s (20000 events, 98.2% overlap)
 --------------------
  2129.0 ±   0.7 ev/s

@mandrenguyen
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 8, 2024

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will be automatically merged.

@cmsbuild cmsbuild merged commit e0cb38f into cms-sw:master Aug 8, 2024
14 checks passed
@fwyzard fwyzard deleted the fix_splitVertices_141x branch August 12, 2024 07:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants