reproducibility of GPU vs CPU results at HLT #35376

Open

fwyzard opened this issue Sep 23, 2021 · 133 comments

@fwyzard
Contributor

fwyzard commented Sep 23, 2021

Tests done in multiple recent releases have shown that the HLT results are not consistent when running on GPU vs on CPU.

Here are the instructions to reproduce the issue, using

  • CMSSW_12_1_0_pre3
  • the HLT menu included in that release, /dev/CMSSW_12_1_0/GRun/V1
  • a recent Run-3 relval sample of TTbar, e.g.
    • from CMSSW_12_0_0_pre6: /RelValTTbar_14TeV/CMSSW_12_0_0_pre6-PU_120X_mcRun3_2021_realistic_v4_JIRA_129-v1/GEN-SIM-DIGI-RAW
    • from CMSSW_12_1_0_pre3: /store/relval/CMSSW_12_1_0_pre3/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_121X_mcRun3_2021_realistic_v2-v1/10000/0eb14c4a-e363-424a-9c0c-2688c7d32c74.root
  • the "auto" Run-3 global tag from the release, auto:phase1_2021_realistic; running on a previous release using the same global tag as the sample itself (120X_mcRun3_2021_realistic_v4) shows a similar behaviour.

set up a CMSSW working area

cmsrel CMSSW_12_1_0_pre3
cd CMSSW_12_1_0_pre3/   
cmsenv
mkdir run
cd run

extract the HLT configuration for running on GPU using the Run3 era

hltGetConfiguration /dev/CMSSW_12_1_0/GRun/V1 \
    --eras Run3 \
    --globaltag auto:phase1_2021_realistic \
    --mc \
    --unprescale \
    --output minimal \
    --customise HLTrigger/Configuration/customizeHLTforPatatrack.customizeHLTforPatatrack \
    --input /store/relval/CMSSW_12_0_0_pre6/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_120X_mcRun3_2021_realistic_v4_JIRA_129-v1/00000/79c06ed5-929b-4a57-a4f2-1ae90e6b38c5.root \
    > hlt.py

run the HLT menu on a GPU-equipped machine

# run the HLT menu without any GPUs
CUDA_VISIBLE_DEVICES= cmsRun hlt.py
mv output.root output-cpu.root

# use available GPUs
cmsRun hlt.py
mv output.root output-gpu.root

compare the results

hltDiff -o output-cpu.root -n output-gpu.root
...
Found 100 matching events, out of which 100 have different HLT results

      Events    Accepted      Gained        Lost       Other  Trigger
         100           0           -           -          ~1  HLT_AK8PFHT750_TrimMass50_v12
         100           0           -           -          ~1  HLT_DoubleEle8_CaloIdM_TrackIdM_Mass8_DZ_PFHT350_v20
         100           0           -           -          ~1  HLT_DoubleEle8_CaloIdM_TrackIdM_Mass8_PFHT350_v20
         100           2          +2           -           -  HLT_DoubleMu4_Mass3p8_DZ_PFHT350_v8
         100          17           -           -          ~2  HLT_Ele15_WPLoose_Gsf_v3
         100          17           -           -          ~2  HLT_Ele17_WPLoose_Gsf_v3
         100          17           -           -          ~1  HLT_Ele20_WPLoose_Gsf_v6
         100          17           -           -          ~1  HLT_Ele20_eta2p1_WPLoose_Gsf_v6
         100          10          +1           -           -  HLT_HT450_Beamspot_v11
         100          42          +2           -           -  HLT_HT300_Beamspot_v11
         100           4           -          -1           -  HLT_IsoMu20_eta2p1_LooseChargedIsoPFTauHPS27_eta2p1_CrossL1_v4
         100           4           -          -1           -  HLT_IsoMu20_eta2p1_MediumChargedIsoPFTauHPS27_eta2p1_CrossL1_v1
         100           4           -          -1           -  HLT_IsoMu20_eta2p1_TightChargedIsoPFTauHPS27_eta2p1_CrossL1_v1
         100           4           -          -1           -  HLT_IsoMu20_eta2p1_LooseChargedIsoPFTauHPS27_eta2p1_TightID_CrossL1_v1
         100           4           -          -1           -  HLT_IsoMu20_eta2p1_MediumChargedIsoPFTauHPS27_eta2p1_TightID_CrossL1_v1
         100           4           -          -1           -  HLT_IsoMu20_eta2p1_TightChargedIsoPFTauHPS27_eta2p1_TightID_CrossL1_v1
         100          81          +1           -           -  HLT_DiPFJet15_FBEta3_NoCaloMatched_v17
         100          98           -          -1           -  HLT_DiPFJetAve40_v14
         100          63           -           -          ~1  HLT_DiPFJetAve80_v13
         100           8          +1           -           -  HLT_DiPFJetAve140_v13
         100           0           -           -          ~1  HLT_DiPFJetAve220_HFJEC_v16
         100           7           -           -          ~1  HLT_AK8PFJet200_v15
         100          73          +1           -           -  HLT_PFJet80_v20
         100          68           -          -4           -  HLT_PFJetFwd25_v3
         100          74          +5          -1           -  HLT_AK8PFJetFwd40_v15
         100          15           -          -1           -  HLT_AK8PFJetFwd60_v14
         100           4          +1           -           -  HLT_AK8PFJetFwd80_v14
         100          56           -          -1           -  HLT_PFHT250_v17
         100          16           -           -          ~1  HLT_PFHT370_v17
         100           8          +1           -           -  HLT_PFHT430_v17
         100           4           -           -          ~1  HLT_PFHT510_v17
         100           4           -           -          ~1  HLT_PFHT590_v17
         100           0          +1           -           -  HLT_PFHT890_v17
         100           0           -           -          ~4  HLT_PFHT500_PFMET100_PFMHT100_IDTight_v12
         100           0           -           -          ~2  HLT_PFHT500_PFMET110_PFMHT110_IDTight_v12
         100           0           -           -          ~3  HLT_PFHT700_PFMET95_PFMHT95_IDTight_v12
         100           6           -          -1           -  HLT_PFMET120_PFMHT120_IDTight_v20
         100           2          +1           -          ~1  HLT_PFMET140_PFMHT140_IDTight_v20
         100           5          +1           -           -  HLT_PFMET100_PFMHT100_IDTight_CaloBTagDeepCSV_3p1_v8
         100           5           -          -1           -  HLT_PFMET120_PFMHT120_IDTight_CaloBTagDeepCSV_3p1_v8
         100           1          +1           -          ~1  HLT_PFMET140_PFMHT140_IDTight_CaloBTagDeepCSV_3p1_v8
         100           6           -          -1          ~1  HLT_PFMET120_PFMHT120_IDTight_PFHT60_v9
         100           8           -           -          ~1  HLT_PFMETNoMu120_PFMHTNoMu120_IDTight_PFHT60_v9
         100           6           -           -          ~1  HLT_PFMETTypeOne120_PFMHT120_IDTight_PFHT60_v9
         100           4           -           -          ~1  HLT_PFMETTypeOne140_PFMHT140_IDTight_v11
         100           8           -           -          ~1  HLT_PFMETNoMu110_PFMHTNoMu110_IDTight_v20
         100          11          +1           -           -  HLT_CaloMHT90_v4
         100           2          +1           -           -  HLT_Mu12_DoublePFJets54MaxDeta1p6_DoubleCaloBTagDeepCSV_p71_v2
         100          73           -          -1           -  HLT_DoublePFJets40_CaloBTagDeepCSV_p71_v2
         100          16           -          -1           -  HLT_DoublePFJets100_CaloBTagDeepCSV_p71_v2
         100           2          +1           -           -  HLT_DoublePFJets200_CaloBTagDeepCSV_p71_v2
         100           7           -           -          ~1  HLT_BTagMu_AK4DiJet40_Mu5_v13
         100          16          +1           -           -  HLT_BTagMu_AK4DiJet40_Mu5_noalgo_v13
         100          12          +2           -           -  HLT_HT425_v9
         100           0           -           -          ~1  HLT_HT430_DisplacedDijet40_DisplacedTrack_v13
         100           0           -           -          ~1  HLT_HT430_DisplacedDijet60_DisplacedTrack_v13
         100           0           -           -          ~2  HLT_HT400_DisplacedDijet40_DisplacedTrack_v13
         100           0           -           -          ~1  HLT_DiJet110_35_Mjj650_PFMET110_v9
         100           0           -           -          ~1  HLT_DiJet110_35_Mjj650_PFMET120_v9
         100           0           -           -          ~1  HLT_DiJet110_35_Mjj650_PFMET130_v9
         100           0           -           -          ~1  HLT_TripleJet110_35_35_Mjj650_PFMET110_v9
         100           0           -           -          ~1  HLT_TripleJet110_35_35_Mjj650_PFMET120_v9
         100           0           -           -          ~1  HLT_TripleJet110_35_35_Mjj650_PFMET130_v9
         100           1           -           -          ~1  HLT_Ele15_IsoVVVL_PFHT450_CaloBTagDeepCSV_4p5_v8
         100           0           -           -          ~1  HLT_Ele15_IsoVVVL_PFHT450_PFMET50_v16
         100           1           -           -          ~1  HLT_Ele15_IsoVVVL_PFHT450_v16
         100           1           -           -          ~1  HLT_Ele15_IsoVVVL_PFHT600_v20
         100           1          +1           -           -  HLT_Mu4_TrkIsoVVL_DiPFJet90_40_DEta3p5_MJJ750_HTT300_PFMETNoMu60_v15
         100           1          +1           -           -  HLT_Mu8_TrkIsoVVL_DiPFJet40_DEta3p5_MJJ750_HTT300_PFMETNoMu60_v16
         100           1          +1           -           -  HLT_Mu10_TrkIsoVVL_DiPFJet40_DEta3p5_MJJ750_HTT350_PFMETNoMu60_v15
         100           1           -          -1           -  HLT_Mu15_IsoVVVL_PFHT450_PFMET50_v15
         100           3           -          -1          ~1  HLT_Mu3er1p5_PFJet100er2p5_PFMET70_PFMHT70_IDTight_v2
         100           3           -          -1           -  HLT_Mu3er1p5_PFJet100er2p5_PFMET80_PFMHT80_IDTight_v2
         100           3           -           -          ~1  HLT_Mu3er1p5_PFJet100er2p5_PFMETNoMu70_PFMHTNoMu70_IDTight_v2
         100          23           -           -          ~1  HLT_Ele8_CaloIdM_TrackIdM_PFJet30_v18
         100          18           -           -          ~1  HLT_Ele17_CaloIdM_TrackIdM_PFJet30_v16
         100           5          +1          -1          ~1  HLT_PFHT330PT30_QuadPFJet_75_60_45_40_TriplePFBTagDeepCSV_4p5_v3
         100          18          +2           -           -  HLT_PFHT330PT30_QuadPFJet_75_60_45_40_v9
         100           3           -           -          ~2  HLT_PFHT400_SixPFJet32_DoublePFBTagDeepCSV_2p94_v8
         100           4           -           -          ~2  HLT_PFHT400_SixPFJet32_v8
         100           5           -           -          ~3  HLT_PFHT400_FivePFJet_100_100_60_30_30_v8
         100           3           -           -          ~3  HLT_PFHT400_FivePFJet_100_100_60_30_30_DoublePFBTagDeepCSV_4p5_v8
         100           3           -           -          ~1  HLT_PFHT400_FivePFJet_120_120_60_30_30_DoublePFBTagDeepCSV_4p5_v8
         100          18          +4           -           -  HLT_PFHT350_v19
         100          20          +1           -           -  HLT_PFHT350MinPFJet15_v9
         100           2           -           -          ~1  HLT_DiSC30_18_EIso_AND_HE_Mass70_v13
         100          84          +1          -1           -  HLT_AK4CaloJet80_v10
         100          60          +1           -           -  HLT_AK4CaloJet100_v10
         100          73          +1           -           -  HLT_AK4PFJet80_v19
         100          49          +1           -           -  HLT_AK4PFJet100_v19
         100          33          +1           -           -  HLT_AK4PFJet120_v18
         100          89          +2           -           -  MC_PFBTagDeepCSV_v10
         100          18           -           -          ~1  MC_Ele5_WPTight_Gsf_v8
         100           0           -           -          ~1  HLT_MediumChargedIsoPFTau50_Trk30_eta2p1_1pr_MET130_v8
         100           1           -           -          ~1  HLT_MediumChargedIsoPFTau180HighPtRelaxedIso_Trk50_eta2p1_1pr_v11
         100           1           -           -          ~1  HLT_MediumChargedIsoPFTau180HighPtRelaxedIso_Trk50_eta2p1_v12
         100           0          +1           -           -  HLT_Rsq0p35_v15
         100           0          +1           -           -  HLT_Rsq0p40_v15
         100           1          +1           -          ~1  HLT_RsqMR300_Rsq0p09_MR200_v15
         100           0          +1           -          ~1  HLT_RsqMR320_Rsq0p09_MR200_v15
         100           2           -           -          ~1  HLT_RsqMR300_Rsq0p09_MR200_4jet_v15
         100           1           -           -          ~1  HLT_RsqMR320_Rsq0p09_MR200_4jet_v15
         100           1           -           -          ~1  HLT_DoubleMediumChargedIsoPFTau35_Trk1_eta2p1_Reg_v12
         100           1           -           -          ~1  HLT_DoubleMediumChargedIsoPFTau35_Trk1_TightID_eta2p1_Reg_v12
         100           1           -           -          ~1  HLT_DoubleTightChargedIsoPFTau35_Trk1_eta2p1_Reg_v12
         100           1           -           -          ~1  HLT_DoubleTightChargedIsoPFTau35_Trk1_TightID_eta2p1_Reg_v12
         100           1           -           -          ~1  HLT_DoubleTightChargedIsoPFTauHPS35_Trk1_eta2p1_Reg_v1
         100           1           -           -          ~1  HLT_DoubleMediumChargedIsoPFTauHPS35_Trk1_TightID_eta2p1_Reg_v1
         100           1           -           -          ~1  HLT_DoubleMediumChargedIsoPFTauHPS35_Trk1_eta2p1_Reg_v4
         100           1           -           -          ~1  HLT_DoubleTightChargedIsoPFTauHPS35_Trk1_TightID_eta2p1_Reg_v1
         100           0           -           -          ~1  HLT_VBF_DoubleLooseChargedIsoPFTauHPS20_Trk1_eta2p1_v1
         100           0           -           -          ~1  HLT_VBF_DoubleMediumChargedIsoPFTauHPS20_Trk1_eta2p1_v1
         100           0           -           -          ~1  HLT_VBF_DoubleTightChargedIsoPFTauHPS20_Trk1_eta2p1_v1
         100           9          +1           -           -  HLT_PFMETNoMu100_PFMHTNoMu100_IDTight_PFHT60_v9
         100           8           -          -1           -  HLT_PFMETTypeOne100_PFMHT100_IDTight_PFHT60_v9
         100           3           -           -          ~2  HLT_QuadPFJet98_83_71_15_DoublePFBTagDeepCSV_1p3_7p7_VBF1_v8
         100           3           -           -          ~2  HLT_QuadPFJet103_88_75_15_DoublePFBTagDeepCSV_1p3_7p7_VBF1_v8
         100           3           -           -          ~3  HLT_QuadPFJet111_90_80_15_DoublePFBTagDeepCSV_1p3_7p7_VBF1_v8
         100           0           -           -          ~2  HLT_QuadPFJet98_83_71_15_PFBTagDeepCSV_1p3_VBF2_v8
         100           0           -           -          ~2  HLT_QuadPFJet103_88_75_15_PFBTagDeepCSV_1p3_VBF2_v8
         100           0           -           -          ~2  HLT_QuadPFJet105_88_76_15_PFBTagDeepCSV_1p3_VBF2_v8
         100           0           -           -          ~3  HLT_QuadPFJet111_90_80_15_PFBTagDeepCSV_1p3_VBF2_v8
         100          19           -          -1          ~2  HLT_QuadPFJet98_83_71_15_v5
         100          11          +2           -          ~1  HLT_QuadPFJet103_88_75_15_v5
         100          11          +2           -          ~1  HLT_QuadPFJet105_88_76_15_v5
         100           8          +2          -1          ~1  HLT_QuadPFJet111_90_80_15_v5
         100           3           -           -          ~2  HLT_QuadPFJet105_88_76_15_DoublePFBTagDeepCSV_1p3_7p7_VBF1_v8
         100          29           -          -1           -  HLT_TrkMu6NoFiltersNoVtx_v1
         100         100           -        -100           -  Status_OnCPU
         100           0        +100           -           -  Status_OnGPU

To disentangle the various effects, one can use different customisations on top of the HLT menu, running each resulting configuration with a GPU and without a GPU (that is, fully on the CPU).
Replace the customisation at the bottom of the hlt.py file

#User-defined customization functions
from HLTrigger.Configuration.customizeHLTforPatatrack import customizeHLTforPatatrack
process = customizeHLTforPatatrack(process)

with a more fine-grained one, described below.

legacy configuration

Run the HLT menu unchanged, adding only the Status_OnGPU and Status_OnCPU paths, without actually offloading any reconstruction to GPU:

#User-defined customization functions
from HLTrigger.Configuration.customizeHLTforPatatrack import *
process = customiseCommon(process)

ECAL-only changes

To check the impact of running the ECAL reconstruction on GPU vs CPU, apply only the ECAL changes:

#User-defined customization functions
from HLTrigger.Configuration.customizeHLTforPatatrack import *
process = customiseCommon(process)
process = customiseEcalLocalReconstruction(process)

HCAL-only changes

To check the impact of running the HCAL reconstruction on GPU vs CPU, apply only the HCAL changes:

#User-defined customization functions
from HLTrigger.Configuration.customizeHLTforPatatrack import *
process = customiseCommon(process)
process = customiseHcalLocalReconstruction(process)

Pixel local reconstruction changes

To check the impact of running the Pixel local reconstruction on GPU vs CPU, apply only the Pixel changes:

#User-defined customization functions
from HLTrigger.Configuration.customizeHLTforPatatrack import *
process = customiseCommon(process)
process = customisePixelLocalReconstruction(process)

Pixel track reconstruction changes

To check the impact of running the Pixel track reconstruction on GPU vs CPU, apply only the Pixel and Tracking changes.
Clearly, for this comparison to be meaningful, the previous one needs to be understood first.

#User-defined customization functions
from HLTrigger.Configuration.customizeHLTforPatatrack import *
process = customiseCommon(process)
process = customisePixelLocalReconstruction(process)
process = customisePixelTrackReconstruction(process)

The ECAL-only comparison did not reveal significant differences.

The HCAL-only comparison showed significant differences in a few % of the events (order of 10% of the accepted events).

The Pixel local reconstruction comparison showed significant differences in a few % of the events (order of 10% of the accepted events), while affecting fewer paths than the HCAL one.

I think that looking at the Pixel track comparison makes sense only after fixing the local reconstruction one.

Updates

  • for running with recent IBs, please use "Fix the HLT GPU customisation for running on the CPU" (#35497).

  • the 12.1.0-pre3 relvals can also be used, for example /store/relval/CMSSW_12_1_0_pre3/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_121X_mcRun3_2021_realistic_v2-v1/10000/0eb14c4a-e363-424a-9c0c-2688c7d32c74.root .

@fwyzard
Contributor Author

fwyzard commented Sep 23, 2021

assign hlt, heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous,hlt

@fwyzard,@Martin-Grunewald,@makortel,@missirol you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

A new Issue was created by @fwyzard Andrea Bocci.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Sep 23, 2021

@silviodonato now there is a GitHub issue...

@fwyzard
Contributor Author

fwyzard commented Sep 23, 2021

The HCAL discrepancies are partially addressed by @mariadalfonso in #35355 / #35357

@mmusich
Contributor

mmusich commented Sep 23, 2021

assign trk-dpg

@mmusich
Contributor

mmusich commented Sep 23, 2021

assign hcal-dpg

@cmsbuild
Contributor

New categories assigned: hcal-dpg,trk-dpg

@mmusich,@georgia14,@tsusa,@mseidel42 you have been requested to review this Pull request/Issue and eventually sign? Thanks

@fwyzard
Contributor Author

fwyzard commented Sep 23, 2021

thanks

@VinInn
Contributor

VinInn commented Sep 23, 2021

what does Other mean?

@fwyzard
Contributor Author

fwyzard commented Sep 23, 2021

Other means that the EDFilter responsible for accepting/rejecting the event (i.e. the last module that runs on a path) is different in the two cases; it's usually safe to ignore.

@VinInn
Contributor

VinInn commented Sep 23, 2021

Then it would be better to run with 1k or even 10k events, just to understand which ones are seriously systematically affected.
IMHO the only ones worth a deeper analysis from the list above are the two HLT_DoubleMu4 events and whatever is going on with all those HLT_QuadPFJet paths and their various combinatorial thresholds...
(and the bizarre behaviour of HLT_PFHT350_v19 compared to all the other HLT_PFHTxyz_v17 paths: are there other v19s?)

@silviodonato
Contributor

Then it would be better to run with 1k or even 10k events, just to understand which ones are seriously systematically affected.
IMHO the only ones worth a deeper analysis from the list above are the two HLT_DoubleMu4 events and whatever is going on with all those HLT_QuadPFJet paths and their various combinatorial thresholds...
(and the bizarre behaviour of HLT_PFHT350_v19 compared to all the other HLT_PFHTxyz_v17 paths: are there other v19s?)

The difference between HLT_PFHT350_v19 and HLT_PFHT370_v17 is only in the L1seeds, caloHT cut (320 vs 300), and PFHT cut (370 vs 350). All the filters cut on the same objects (hltHtMhtJet30 and hltPFHTJet30).

@tsusa
Contributor

tsusa commented Sep 27, 2021

@fwyzard is there a similar comparison between gpu and cpu (non-legacy wf) ?

@fwyzard
Contributor Author

fwyzard commented Sep 27, 2021

Hi @tsusa,

the instructions above under Pixel local reconstruction changes should compare

  • the legacy pixel tracks (on CPU) running on top of the legacy pixel local reconstruction (on CPU)
  • the legacy pixel tracks (on CPU) running on top of the new pixel local reconstruction (on GPU)

while those under Pixel track reconstruction changes should compare

  • the Patatrack pixel tracks (on CPU) running on top of the legacy pixel local reconstruction (on CPU)
  • the Patatrack pixel tracks (on GPU) running on top of the new pixel local reconstruction (on GPU)

Are you looking for something else ?

@tsusa
Contributor

tsusa commented Sep 27, 2021

Hi @fwyzard,
yes, the comparison

  • the legacy pixel tracks (on CPU) running on top of the cpu pixel local reconstruction (on CPU)
  • the legacy pixel tracks (on CPU) running on top of the new pixel local reconstruction (on GPU)

@fwyzard
Contributor Author

fwyzard commented Sep 27, 2021

You can make this comparison by creating the configuration for the Pixel local reconstruction changes, running it on the CPU (e.g. setting CUDA_VISIBLE_DEVICES= or running on a machine without GPUs) and comparing the results with those of the legacy reconstruction.

@VinInn
Contributor

VinInn commented Sep 27, 2021

The legacy pixel local reconstruction is NOT supposed to be identical to the "new" pixel local reconstruction (and is not required to be).

@fwyzard
Contributor Author

fwyzard commented Sep 27, 2021

Well, this is news to me.
Why not ?
What is different ?

@mmusich
Contributor

mmusich commented Sep 27, 2021

@fwyzard
Contributor Author

fwyzard commented Sep 27, 2021

ah, yes - still, how much different do we expect it to be ?

(and anyway this is outside the scope of the discrepancy discussed in this issue, which should use the new reconstruction for both CPU and GPU workflows)

@mmusich
Contributor

mmusich commented Sep 27, 2021

ah, yes - still, how much different do we expect it to be ?

There are plots in the talk linked above. I will let you judge for yourself.
Mind that the situation has been improved by #34286

@VinInn
Contributor

VinInn commented Sep 30, 2021

Ran twice on GPU and got:

Found 100 matching events, out of which 0 have different HLT results

@silviodonato
Contributor

About scouting: 17 events out of 1268 have a GPU - CPU MET difference larger than 5 GeV.

treeCPU.Scan("double_hltScoutingPFPacker_pfMetPt_HLTCPU.obj : gpu.double_hltScoutingPFPacker_pfMetPt_HLTGPU.obj : gpu2.double_hltScoutingPFPacker_pfMetPt_HLTGPU2.obj ","abs(double_hltScoutingPFPacker_pfMetPt_HLTCPU.obj - gpu.double_hltScoutingPFPacker_pfMetPt_HLTGPU.obj)>5")
************************************************
*    Row   * double_hl *  gpu.doub *  gpu2.dou *
************************************************
*       80 * 26.076620 * 20.724564 * 20.727596 *
*       95 * 5.7123356 * 10.943406 * 10.943855 *
*      105 * 45.371674 * 63.077634 * 44.873158 *
*      276 * 28.489796 * 23.359113 * 23.370310 *
*      325 * 43.596063 * 61.206629 * 61.189337 *
*      452 * 47.897351 * 53.124322 * 53.121180 *
*      548 * 63.809485 * 57.314162 * 63.809362 *
*      553 * 16.567702 * 3.0462222 * 3.0462224 *
*      602 * 5.3528798 * 12.538358 * 12.539316 *
*      605 * 82.321715 * 72.832689 * 81.878138 *
*      660 * 3.3754865 * 9.5853813 * 9.5641984 *
*      662 * 32.959398 * 11.968470 * 11.992302 *
*      975 * 37.379707 * 24.026722 * 24.026722 *
*     1073 * 26.614738 *  21.42725 * 21.427250 *
*     1126 * 40.773425 * 54.061522 * 54.962562 *
*     1132 * 19.311899 * 12.621779 * 12.622841 *
*     1208 * 33.183377 * 27.939318 * 33.538982 *
************************************************
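For reference, a guess at how the trees may have been set up for the Scan above; the file names are placeholders, and the tree name Events and the friend aliases gpu / gpu2 are only inferred from the branch expressions, so treat this as a sketch rather than the actual macro used:

// scan_met.C -- hypothetical ROOT macro reproducing the Scan above
#include "TFile.h"
#include "TTree.h"

void scan_met() {
  TFile* fcpu = TFile::Open("output-cpu.root");  // placeholder file names
  TTree* treeCPU = (TTree*)fcpu->Get("Events");
  treeCPU->AddFriend("gpu = Events", "output-gpu.root");    // first GPU run
  treeCPU->AddFriend("gpu2 = Events", "output-gpu2.root");  // second GPU run
  treeCPU->Scan(
      "double_hltScoutingPFPacker_pfMetPt_HLTCPU.obj : "
      "gpu.double_hltScoutingPFPacker_pfMetPt_HLTGPU.obj : "
      "gpu2.double_hltScoutingPFPacker_pfMetPt_HLTGPU2.obj",
      "abs(double_hltScoutingPFPacker_pfMetPt_HLTCPU.obj - gpu.double_hltScoutingPFPacker_pfMetPt_HLTGPU.obj)>5");
}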

@fwyzard
Contributor Author

fwyzard commented Mar 14, 2022

The issue of duplicate pixels was already discussed in this HyperNews thread https://hypernews.cern.ch/HyperNews/CMS/get/pixelOfflineSW/1485.html.

On the HN Danek wrote that it is hard to know the reason without doing a more detailed study, and that he would try to monitor it to see if there are any systematic effects.
Do you know if there was any follow up ?

The bottom line is that there is no easy way to tell which is more correct, so one needs to make a choice and accept that it will not always be the right thing to do.

Unless some way to distinguish the good and bad pixel hits can be found, I would suggest one of the following:

  • if the downstream code can properly handle duplicate pixels, the unpacker could produce them all;
  • if the downstream code cannot (be made to) handle duplicate pixels, the unpacker (or an additional module) should sum the energy from the duplicate pixels.

Does anybody have a better suggestion ?
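For illustration, here is a minimal standalone sketch of the second option, merging duplicate pixels by summing their charge before clustering; the Digi struct and the (row, col) key are simplifications for this sketch, not the real CMSSW types:

#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Digi {
  uint16_t row;
  uint16_t col;
  uint16_t adc;
};

// Merge digis that share the same (row, col) by summing their charge.
std::vector<Digi> sumDuplicates(std::vector<Digi> const& digis) {
  std::map<std::pair<uint16_t, uint16_t>, uint32_t> charge;
  for (auto const& d : digis)
    charge[{d.row, d.col}] += d.adc;  // duplicates accumulate here

  std::vector<Digi> merged;
  merged.reserve(charge.size());
  for (auto const& [coords, adc] : charge) {
    // clamp to the 16-bit ADC field to avoid overflow
    auto clamped = static_cast<uint16_t>(std::min<uint32_t>(adc, 0xffff));
    merged.push_back({coords.first, coords.second, clamped});
  }
  return merged;
}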

@mmusich
Contributor

mmusich commented Mar 14, 2022

if the downstream code cannot (be made to) handle duplicate pixels, the unpacker (or an additional module) should sum the energy from the duplicate pixels.

While keeping either one of the duplicates has some (50%?) chance of doing the right thing, summing them is clearly wrong in all cases. Moreover, if one of the duplicate pixels ends up at the edge of the cluster, it might screw up the determination of the position in the CPE code.

@VinInn
Contributor

VinInn commented Mar 14, 2022

Is it possible that the problem of pixels not belonging to the event is not limited to duplicate pixels?
When they are duplicates we can detect them; if not, we still have spurious hits around (even in a legitimate cluster).

@ferencek
Contributor

ferencek commented Mar 14, 2022

I agree with @mmusich: summing up the charges or taking their average is almost certainly wrong, so taking only one at least has a good chance of being correct. Regarding Vincenzo's comment above, indeed, based on what Danek wrote in the HN thread, there could be spurious hits present in the event caused by the same mechanism, which we would never detect as problematic if they are not duplicated. In fact, if the underlying cause is a bit flip in the pixel address, as opposed to stale content in the data buffer, we possibly end up with two problems at the same time: a missing hit and a fake hit in the same event.

@VinInn
Contributor

VinInn commented Mar 14, 2022

would "a bit flip" make duplicate more likely? (a pixel in the same cluster with a bit-flipped may generate a duplicate?).
readout is not random if I remember well, so one can look to the pattern in the buffer to understand what is going on.
Could the duplicate rate used to globally estimate the effect?
Do we know if it occurs all the time and everywhere?

@fwyzard
Contributor Author

fwyzard commented Mar 14, 2022

I agree with @mmusich: summing up the charges or taking their average is almost certainly wrong, so taking only one at least has a good chance of being correct.

In the presence of something completely unexpected ("duplicate" pixels), how can you be certain that taking one or the other is correct ?

Also, the "right" one could be

  • always the first one
  • always the last one
  • one at random

If it is always the first one or always the last one, and we implement the other case in the unpacker, we are effectively always making the wrong choice -- which is not any different from using the sum or the average.

@mmusich
Contributor

mmusich commented Mar 14, 2022

In the presence of something completely unexpected ("duplicate" pixels), how can you be certain that taking one or the other is correct ?

I am not certain, but without knowing for sure (that requires more studies) summing pixels would be exchanging a possible mistake for a certain mistake.

@fwyzard
Contributor Author

fwyzard commented Mar 14, 2022

OK, I guess I see things differently, but in the end it doesn't matter: as long as we don't know, any algorithm that is fully reproducible should be an improvement over the present situation, and I'm fine with that.

The problem is that, as @VinInn pointed out, "first" and "last" are not easy to implement in a parallel algorithm.
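As an illustration of the difficulty (and of one possible way around it), here is a standalone sketch, not the CMSSW/CUDA code: rather than relying on the order in which threads happen to process the raw data, each pixel coordinate keeps the candidate with the smallest raw-data word index, which makes "first" well defined even when the words are unpacked in parallel:

#include <atomic>
#include <cstdint>
#include <limits>
#include <vector>

// One packed slot per pixel coordinate; slots must be initialised to kEmpty.
constexpr uint64_t kEmpty = std::numeric_limits<uint64_t>::max();

// Pack (wordIndex, adc) so that comparing packed values compares wordIndex first.
inline uint64_t pack(uint32_t wordIndex, uint16_t adc) { return (uint64_t(wordIndex) << 16) | adc; }

// Called concurrently from many threads, once per unpacked word.
void keepFirst(std::vector<std::atomic<uint64_t>>& slots, uint32_t pixelKey, uint32_t wordIndex, uint16_t adc) {
  uint64_t candidate = pack(wordIndex, adc);
  uint64_t current = slots[pixelKey].load();
  // atomic "min": install the candidate only while it has a smaller word index
  while (candidate < current && !slots[pixelKey].compare_exchange_weak(current, candidate)) {
  }
}

Keeping the last occurrence instead would use the largest word index; the extra pass and memory are, of course, part of the cost discussed later in this thread.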

@silviodonato
Contributor

The most significant effect is when we do not reconstruct a pixel cluster because of duplicates.
Example (see above: event 181451699, detId 303042568, x, y = 159, 292).

  • GPUs see two duplicated pixels with charge [22019, 100].
  • CPUs reconstruct no cluster, because they see only one pixel with charge 100, which is below the pixel cluster threshold.

My 2 cents: we should take the maximum of the N duplicated pixels.

Our top priority is to reconstruct the pixel clusters correctly 1) in signal events and 2) close to the signal objects (jets or leptons).
Assuming that the duplicated pixels come from a random bunch crossing, I think we should take the maximum of the two pixel charges, because the probability of having a pixel (in a region of interest) with a large charge from a random bunch crossing should be smaller than that from a signal event.
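For concreteness, a tiny standalone sketch (not CMSSW code, and with a purely illustrative threshold value) of how the three policies discussed in this thread behave on the example above, where the same pixel is seen twice with charges 22019 and 100:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  // charges reported for the same (row, col) by the duplicated digis
  std::vector<uint32_t> duplicates{22019, 100};
  // illustrative per-pixel charge threshold, not the actual HLT value
  const uint32_t threshold = 1000;

  const uint32_t keepLast = duplicates.back();                                        // current CPU behaviour
  const uint32_t summed = std::accumulate(duplicates.begin(), duplicates.end(), 0u);  // "sum the charges"
  const uint32_t largest = *std::max_element(duplicates.begin(), duplicates.end());   // "keep the maximum"

  auto report = [&](const char* name, uint32_t q) {
    std::cout << name << q << (q >= threshold ? "  -> pixel kept" : "  -> pixel dropped") << '\n';
  };
  report("keep last: ", keepLast);  // 100: dropped, no cluster is reconstructed
  report("sum:       ", summed);    // 22119: kept
  report("maximum:   ", largest);   // 22019: kept
}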

@VinInn
Contributor

VinInn commented Mar 15, 2022

On the other hand, duplicates may be due to a bit flip: most probably the original pixel was close by, even in the same cluster.
In such a case the sum looks more correct to me.

I think a detailed study is needed. We do not even know whether the duplicate pixels are contiguous in the raw data or not.
We do not know the rate, the location, the occupancy.
We do not know whether it is correlated with luminosity, fluence or even just "wall time".

(And CMS has ignored all this for 20 years.)

@fwyzard
Contributor Author

fwyzard commented Mar 15, 2022

@VinInn how does the clustering code behave in the presence of "duplicate" pixels ?

@VinInn
Contributor

VinInn commented Mar 15, 2022

I posted the relevant lines above:
  • on CPU, just load the latest;
  • on GPU, clusterize them independently: at CPE time they will be summed.
I have not checked what the template CPE does.

@silviodonato
Contributor

Sorry, I got confused by the code in the second copy_to_buffer (where pixels are added).

The line that fills the buffer later used by make_clusters is https://cmssdt.cern.ch/dxr/CMSSW/source/RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc#294, and indeed it is a set, not an add. So the last pixel in the raw data wins.

@VinInn is there a trivial way to sum the pixels also on the GPU side?
Just to avoid that having two pixels might create some differences (a second-order effect).

By the way, we finally found a bug in the L2 tau reco! Now the trigger with the largest difference should be HLT_PFHT400_FivePFJet_100_100_60_30_30_DoublePFBTagDeepCSV_4p5_v8

Total Accepted OLD Accepted NEW Gained Lost trigger
1461195 261 258 13 -16 HLT_PFHT400_FivePFJet_100_100_60_30_30_DoublePFBTagDeepCSV_4p5_v8

Ganesh will update the study, including the L2 tau NN workaround and using the sum of the duplicated pixels, in the next few days.

@VinInn
Contributor

VinInn commented Mar 15, 2022 via email

@mmusich
Contributor

mmusich commented Mar 16, 2022

In case people active in this thread want to contribute to the discussion, the issue about duplicate pixels is going to be tackled today at the Joint TRK DPG/POG meeting

@VinInn
Contributor

VinInn commented Mar 16, 2022

Following Danek's presentation I made a small test using the code below,
which detects (and removes) a duplicate pixel (the second one) in the digis.
It should make the GPU reproduce the current CPU behaviour.
It is trivial to remove both instead.

diff --git a/DataFormats/SiPixelDigi/interface/SiPixelDigiConstants.h b/DataFormats/SiPixelDigi/interface/SiPixelDigiConstants.h
index cfadce7cf45..9f066bfbc31 100644
--- a/DataFormats/SiPixelDigi/interface/SiPixelDigiConstants.h
+++ b/DataFormats/SiPixelDigi/interface/SiPixelDigiConstants.h
@@ -55,6 +55,7 @@ namespace sipixelconstants {
     inline constexpr uint32_t getRow(uint32_t ww) { return ((ww >> ROW_shift) & ROW_mask); }
     inline constexpr uint32_t getDCol(uint32_t ww) { return ((ww >> DCOL_shift) & DCOL_mask); }
     inline constexpr uint32_t getPxId(uint32_t ww) { return ((ww >> PXID_shift) & PXID_mask); }
+    inline constexpr uint32_t removeADC(uint32_t ww) { return (ww >> ADC_bits); } // ADC_shift ==0: let's keep it simple
   }  // namespace functions
 }  // namespace sipixelconstants

diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index e66e322ddff..2fd5b943906 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -440,6 +440,16 @@ namespace pixelgpudetails {
         }
       }

+     // remove duplicate pixels (for the time being keep the first)
+     auto noADC = sipixelconstants::removeADC(ww);
+     auto noADCm1 =  sipixelconstants::removeADC(word[gIndex-1]);  // yes reading one word before buffer
+     // auto noADCp1 =  sipixelconstants::removeADC(word[gIndex+1]);  // fine, we are reading one word beyond the end of the buffer
+     if (noADC==noADCm1) {
+       auto globalPix = frameConversion(barrel, side, layer, rocIdInDetUnit, localPix);
+       printf("dup pix at %d %d\n",globalPix.row,globalPix.col);
+       continue;
+     }
+
       pixelgpudetails::Pixel globalPix = frameConversion(barrel, side, layer, rocIdInDetUnit, localPix);
       xx[gIndex] = globalPix.row;  // origin shifting by 1 0-159
       yy[gIndex] = globalPix.col;  // origin shifting by 1 0-415

Ran 5000 events in Run 323775, Event 33310106, LumiSection 53,
and got:
[innocent@patatrack02 raw]$ grep dup dup.log | wc
8158 41289 153656
[innocent@patatrack02 raw]$ grep dup dup.log | sort -u | wc
3185 16424 61596

So they are there: about 1.5 per event, many in the same location (OK, I should have printed the module id as well).
Of course I cannot exclude that there are also duplicate pixels that are not contiguous.

@VinInn
Contributor

VinInn commented Mar 17, 2022

So, in Run 323775, Event 32123778, LumiSection 53,
module "1" is full of duplicate pixels and most of them are NOT contiguous;
see a (partial) dump below (the 0 0 entries are the removed duplicates).

Notice how many times 159 208 appears...

151 208
150 208
149 208
156 208
159 208
146 208
153 208
152 208
159 208
142 208
141 208
140 208
137 208
159 208
111 208
110 209
111 208
108 208
159 208
0 0
153 208
152 208
119 209
118 209
117 209
116 209
124 208
121 209
152 208
120 209
159 208
128 208
125 208
126 208
131 208
132 208
129 208
135 208
0 0
152 208
135 208
134 208
82 211
85 211
86 211
83 211
84 211
89 211
95 211
0 0
88 211
159 208
0 0
91 211
92 211
99 210
98 210
97 208
96 210

@ferencek
Contributor

ferencek commented Mar 17, 2022

Following Danek's presentation I made a small test using the code below, which detects (and removes) a duplicate pixel (the second one) in the digis. It should make the GPU reproduce the current CPU behaviour. It is trivial to remove both instead.

Isn't the CPU code keeping the last occurrence (https://github.com/cms-sw/cmssw/blob/CMSSW_12_2_0/RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc#L292), assuming the charge crosses the threshold, which is always the case for the ChannelThreshold of 10?

Based on Danek's investigation that the duplicate pixels seem to be coming from noisy ROCs, perhaps the best strategy would be to drop such pixels.

@VinInn
Contributor

VinInn commented Mar 17, 2022

Yes @ferencek, you are right.
OK, it is trivial to keep the second and remove the first. Still, the occurrences seem to be random (see above), so detecting contiguous duplicates is not enough.

@VinInn
Contributor

VinInn commented Mar 21, 2022

So I also implemented the N^2 duplicate finding at module level (in the clusterizer).
Here are the timing results.

(Summary: 789 Hz instead of 845.)

baseline

|Counters | nEvents | nHits | nCells | nTuples | nFitTacks  |  nLooseTracks  |  nGoodTracks | nUsedHits | nDupHits | nKilledCells | nUsedCells | nZeroTrackCells ||
Counters Raw 5000 81755950 857898799 32625252 14696728 5972444 26803352 37513969 24198090 3930759 55470593 116918358
Counters Norm 5000 ||  16351.2|  171579.8|  6525.1|  5360.7|  2939.3|  1194.5|  7502.8|  4839.6|  0.005|  0.065|  0.136||


Running 4 times over 5000 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
   848.0 ±   0.5 ev/s (4900 events)
   845.5 ±   0.7 ev/s (4900 events)
   843.0 ±   0.5 ev/s (4900 events)
   844.2 ±   0.5 ev/s (4900 events)
 --------------------
   845.2 ±   2.1 ev/s

 GPU activities:   11.74%  1.29890s      5000  259.78us  25.247us  982.72us  void gpuClustering::findClus<bool=0>(unsigned short const *, unsigned short const , unsigned short const , unsigned int const *, unsigned int*, unsigned int const$
                    9.00%  995.16ms      5000  199.03us  23.040us  1.2858ms  kernel_connect(cms::cuda::AtomicPairCounter*, cms::cuda::AtomicPairCounter*, TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::SimpleV$
                    0.66%  72.547ms      5000  14.509us  5.4080us  79.776us  pixelgpudetails::RawToDigi_kernel(SiPixelROCsStatusAndMapping const *, unsigned char const *, unsigned int, unsigned int const *, unsigned char const *, unsigned $
                    0.58%  64.291ms      5000  12.858us  3.4240us  57.663us  void gpuCalibPixel::calibDigis<bool=1>(unsigned short*, unsigned short const *, unsigned short const , gpuCalibPixel::calibDigis<bool=1>, SiPixelGainForHLTonGPU c$


remove just contiguous

Running 4 times over 5000 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
   840.5 ±   0.5 ev/s (4900 events)
   844.1 ±   0.5 ev/s (4900 events)
   840.8 ±   0.7 ev/s (4900 events)
   846.1 ±   0.6 ev/s (4900 events)
 --------------------
   842.9 ±   2.7 ev/s


 GPU activities:   11.66%  1.31490s      5000  262.98us  25.632us  1.1196ms  void gpuClustering::findClus<bool=0>(unsigned short const *, unsigned short const , unsigned short const , unsigned int const *, unsigned int*, unsigned int const$
                    8.89%  1.00227s      5000  200.45us  23.232us  1.2877ms  kernel_connect(cms::cuda::AtomicPairCounter*, cms::cuda::AtomicPairCounter*, TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::SimpleV$
                    0.68%  76.954ms      5000  15.390us  6.3360us  105.89us  pixelgpudetails::RawToDigi_kernel(SiPixelROCsStatusAndMapping const *, unsigned char const *, unsigned int, unsigned int const *, unsigned char const *, unsigned $
                    0.59%  66.322ms      5000  13.264us  2.7200us  65.920us  void gpuCalibPixel::calibDigis<bool=1>(unsigned short*, unsigned short const *, unsigned short const , gpuCalibPixel::calibDigis<bool=1>, SiPixelGainForHLTonGPU c$


keep only last in module (N^2 algo at module level)

||Counters | nEvents | nHits | nCells | nTuples | nFitTacks  |  nLooseTracks  |  nGoodTracks | nUsedHits | nDupHits | nKilledCells | nUsedCells | nZeroTrackCells ||
Counters Raw 5000 81754152 857886560 32624614 14696528 5972245 26802897 37513560 24197830 3930818 55469912 116915978
Counters Norm 5000 ||  16350.8|  171577.3|  6524.9|  5360.6|  2939.3|  1194.4|  7502.7|  4839.6|  0.005|  0.065|  0.136||


Running 4 times over 5000 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
   792.0 ±   0.5 ev/s (4900 events)
   789.4 ±   0.7 ev/s (4900 events)
   790.0 ±   0.7 ev/s (4900 events)
   785.2 ±   0.5 ev/s (4900 events)
 --------------------
   789.1 ±   2.9 ev/s


 GPU activities:   15.44%  1.86159s      5000  372.32us  31.968us  1.3527ms  void gpuClustering::findClus<bool=0>(unsigned int*, unsigned short*, unsigned short const *, unsigned short const , unsigned int const *, gpuClustering::findClus<$
                    8.34%  1.00613s      5000  201.23us  23.136us  1.2789ms  kernel_connect(cms::cuda::AtomicPairCounter*, cms::cuda::AtomicPairCounter*, TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::SimpleV$
                    0.65%  78.570ms      5000  15.714us  5.9520us  50.591us  pixelgpudetails::RawToDigi_kernel(SiPixelROCsStatusAndMapping const *, unsigned char const *, unsigned int, unsigned int const *, unsigned char const *, unsigned $
                    0.56%  67.428ms      5000  13.485us  3.4240us  76.895us  void gpuCalibPixel::calibDigis<bool=1>(unsigned short*, unsigned short const *, unsigned short const , gpuCalibPixel::calibDigis<bool=1>, SiPixelGainForHLTonGPU c$


@VinInn
Contributor

VinInn commented Mar 21, 2022

To those who wish to test the above, please run
git cms-merge-topic VinInn:DupPix
in today's (or a later) 12_4_X IB.

@fwyzard
Contributor Author

fwyzard commented Mar 21, 2022

@VinInn thank you very much for the detailed test !

As you suggested, removing only the nearby duplicates has a negligible impact, from 845 ± 2 ev/s to 843 ± 3, but doing the full scan has a huge cost, bringing the throughput down to 789 ± 3 ev/s.

And I understand that the cost would have to be paid in any case, not only in the hopefully rare runs and modules which are affected.

@VinInn
Contributor

VinInn commented Mar 25, 2022

I spent the last few days trying to understand why the Fishbone was not reproducible on GPU (even after doing the full combinatorics).

In the end it was this:

-          if (d[ic] != d[jc] && cos12 * cos12 >= 0.99999f * n[ic] * n[jc]) {
+          if (d[ic] != d[jc] && cos12 * cos12 >= 0.99999f * (n[ic] * n[jc]) ) {

I hope everybody appreciates the level of sophistication required to obtain bit-wise reproducibility in the presence of unordered collections.
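To spell out why the parentheses matter (a standalone sketch, not the fishbone code): n[ic] * n[jc] is exactly symmetric under swapping the two cells, so 0.99999f * (n[ic] * n[jc]) gives the same cut regardless of the order in which an unordered collection is traversed, whereas the unparenthesised form is evaluated as (0.99999f * n[ic]) * n[jc], which single-precision rounding can make order-dependent:

#include <cstdio>
#include <random>

int main() {
  std::mt19937 gen(12345);
  std::uniform_real_distribution<float> dist(0.1f, 10.f);
  const float k = 0.99999f;

  int asymmetric = 0;
  const int trials = 1000000;
  for (int i = 0; i < trials; ++i) {
    float a = dist(gen);  // plays the role of n[ic]
    float b = dist(gen);  // plays the role of n[jc]
    // left-to-right evaluation: (k * a) * b need not equal (k * b) * a
    if ((k * a * b) != (k * b * a))
      ++asymmetric;
    // forced grouping: a * b == b * a exactly, so this never differs
    if ((k * (a * b)) != (k * (b * a)))
      return 1;
  }
  std::printf("%d of %d pairs change the unparenthesised result when a and b are swapped\n", asymmetric, trials);
  return 0;
}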

@silviodonato
Contributor

I spent the last few days trying to understand why the Fishbone was not reproducible on GPU (even after doing the full combinatorics).

In the end it was this:

-          if (d[ic] != d[jc] && cos12 * cos12 >= 0.99999f * n[ic] * n[jc]) {
+          if (d[ic] != d[jc] && cos12 * cos12 >= 0.99999f * (n[ic] * n[jc]) ) {

I hope everybody appreciates the level of sophistication required to obtain bit-wise reproducibility in the presence of unordered collections.

Many thanks Vincenzo!

@VinInn
Contributor

VinInn commented Mar 25, 2022

I would like to point out that a similar combinatorial algorithm running on CPU using GPU output will NOT reproduce either (unless the collection is somehow sorted).
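A sketch of the usual remedy (illustrative standalone code, not CMSSW types): sort the input collection by a key that depends only on the hit content before running the order-sensitive combinatorial step, so the result no longer depends on how the producer happened to order it:

#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

struct Hit {
  uint32_t detId;
  float x, y, z;
};

// Impose a canonical order on the hits before any order-sensitive processing.
void canonicalOrder(std::vector<Hit>& hits) {
  std::sort(hits.begin(), hits.end(), [](Hit const& a, Hit const& b) {
    // any total order that depends only on the hit content will do
    return std::tie(a.detId, a.x, a.y, a.z) < std::tie(b.detId, b.x, b.y, b.z);
  });
}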

@VinInn
Contributor

VinInn commented Mar 29, 2022

Here is the PR on the fishbone:
#37398

@silviodonato
Contributor

#37359
