reproducibility of GPU vs CPU results at HLT #35376

Open

fwyzard opened this issue Sep 23, 2021 · 133 comments

@fwyzard
Contributor

fwyzard commented Sep 23, 2021

Tests done in multiple recent releases have shown that the HLT results are not consistent when running on GPU vs on CPU.

Here are the instructions to reproduce the issue, using

  • CMSSW_12_1_0_pre3
  • the HLT menu included in that release, /dev/CMSSW_12_1_0/GRun/V1
  • a recent Run-3 relval sample of TTbar, e.g.
    • from CMSSW_12_0_0_pre6: /RelValTTbar_14TeV/CMSSW_12_0_0_pre6-PU_120X_mcRun3_2021_realistic_v4_JIRA_129-v1/GEN-SIM-DIGI-RAW
    • from CMSSW_12_1_0_pre3: /store/relval/CMSSW_12_1_0_pre3/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_121X_mcRun3_2021_realistic_v2-v1/10000/0eb14c4a-e363-424a-9c0c-2688c7d32c74.root
  • the "auto" Run-3 global tag from the release, auto:phase1_2021_realistic; running on a previous release using the same global tag as the sample itself (120X_mcRun3_2021_realistic_v4) shows a similar behaviour.

set up a CMSSW working area

cmsrel CMSSW_12_1_0_pre3
cd CMSSW_12_1_0_pre3/   
cmsenv
mkdir run
cd run

extract the HLT configuration for running on GPU using the Run3 era

hltGetConfiguration /dev/CMSSW_12_1_0/GRun/V1 \
    --eras Run3 \
    --globaltag auto:phase1_2021_realistic \
    --mc \
    --unprescale \
    --output minimal \
    --customise HLTrigger/Configuration/customizeHLTforPatatrack.customizeHLTforPatatrack \
    --input /store/relval/CMSSW_12_0_0_pre6/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_120X_mcRun3_2021_realistic_v4_JIRA_129-v1/00000/79c06ed5-929b-4a57-a4f2-1ae90e6b38c5.root \
    > hlt.py

run the HLT menu on a GPU-equipped machine

# run the HLT menu without any GPUs
CUDA_VISIBLE_DEVICES= cmsRun hlt.py
mv output.root output-cpu.root

# use available GPUs
cmsRun hlt.py
mv output.root output-gpu.root

compare the results

hltDiff -o output-cpu.root -n output-gpu.root
...
Found 100 matching events, out of which 100 have different HLT results

      Events    Accepted      Gained        Lost       Other  Trigger
         100           0           -           -          ~1  HLT_AK8PFHT750_TrimMass50_v12
         100           0           -           -          ~1  HLT_DoubleEle8_CaloIdM_TrackIdM_Mass8_DZ_PFHT350_v20
         100           0           -           -          ~1  HLT_DoubleEle8_CaloIdM_TrackIdM_Mass8_PFHT350_v20
         100           2          +2           -           -  HLT_DoubleMu4_Mass3p8_DZ_PFHT350_v8
         100          17           -           -          ~2  HLT_Ele15_WPLoose_Gsf_v3
         100          17           -           -          ~2  HLT_Ele17_WPLoose_Gsf_v3
         100          17           -           -          ~1  HLT_Ele20_WPLoose_Gsf_v6
         100          17           -           -          ~1  HLT_Ele20_eta2p1_WPLoose_Gsf_v6
         100          10          +1           -           -  HLT_HT450_Beamspot_v11
         100          42          +2           -           -  HLT_HT300_Beamspot_v11
         100           4           -          -1           -  HLT_IsoMu20_eta2p1_LooseChargedIsoPFTauHPS27_eta2p1_CrossL1_v4
         100           4           -          -1           -  HLT_IsoMu20_eta2p1_MediumChargedIsoPFTauHPS27_eta2p1_CrossL1_v1
         100           4           -          -1           -  HLT_IsoMu20_eta2p1_TightChargedIsoPFTauHPS27_eta2p1_CrossL1_v1
         100           4           -          -1           -  HLT_IsoMu20_eta2p1_LooseChargedIsoPFTauHPS27_eta2p1_TightID_CrossL1_v1
         100           4           -          -1           -  HLT_IsoMu20_eta2p1_MediumChargedIsoPFTauHPS27_eta2p1_TightID_CrossL1_v1
         100           4           -          -1           -  HLT_IsoMu20_eta2p1_TightChargedIsoPFTauHPS27_eta2p1_TightID_CrossL1_v1
         100          81          +1           -           -  HLT_DiPFJet15_FBEta3_NoCaloMatched_v17
         100          98           -          -1           -  HLT_DiPFJetAve40_v14
         100          63           -           -          ~1  HLT_DiPFJetAve80_v13
         100           8          +1           -           -  HLT_DiPFJetAve140_v13
         100           0           -           -          ~1  HLT_DiPFJetAve220_HFJEC_v16
         100           7           -           -          ~1  HLT_AK8PFJet200_v15
         100          73          +1           -           -  HLT_PFJet80_v20
         100          68           -          -4           -  HLT_PFJetFwd25_v3
         100          74          +5          -1           -  HLT_AK8PFJetFwd40_v15
         100          15           -          -1           -  HLT_AK8PFJetFwd60_v14
         100           4          +1           -           -  HLT_AK8PFJetFwd80_v14
         100          56           -          -1           -  HLT_PFHT250_v17
         100          16           -           -          ~1  HLT_PFHT370_v17
         100           8          +1           -           -  HLT_PFHT430_v17
         100           4           -           -          ~1  HLT_PFHT510_v17
         100           4           -           -          ~1  HLT_PFHT590_v17
         100           0          +1           -           -  HLT_PFHT890_v17
         100           0           -           -          ~4  HLT_PFHT500_PFMET100_PFMHT100_IDTight_v12
         100           0           -           -          ~2  HLT_PFHT500_PFMET110_PFMHT110_IDTight_v12
         100           0           -           -          ~3  HLT_PFHT700_PFMET95_PFMHT95_IDTight_v12
         100           6           -          -1           -  HLT_PFMET120_PFMHT120_IDTight_v20
         100           2          +1           -          ~1  HLT_PFMET140_PFMHT140_IDTight_v20
         100           5          +1           -           -  HLT_PFMET100_PFMHT100_IDTight_CaloBTagDeepCSV_3p1_v8
         100           5           -          -1           -  HLT_PFMET120_PFMHT120_IDTight_CaloBTagDeepCSV_3p1_v8
         100           1          +1           -          ~1  HLT_PFMET140_PFMHT140_IDTight_CaloBTagDeepCSV_3p1_v8
         100           6           -          -1          ~1  HLT_PFMET120_PFMHT120_IDTight_PFHT60_v9
         100           8           -           -          ~1  HLT_PFMETNoMu120_PFMHTNoMu120_IDTight_PFHT60_v9
         100           6           -           -          ~1  HLT_PFMETTypeOne120_PFMHT120_IDTight_PFHT60_v9
         100           4           -           -          ~1  HLT_PFMETTypeOne140_PFMHT140_IDTight_v11
         100           8           -           -          ~1  HLT_PFMETNoMu110_PFMHTNoMu110_IDTight_v20
         100          11          +1           -           -  HLT_CaloMHT90_v4
         100           2          +1           -           -  HLT_Mu12_DoublePFJets54MaxDeta1p6_DoubleCaloBTagDeepCSV_p71_v2
         100          73           -          -1           -  HLT_DoublePFJets40_CaloBTagDeepCSV_p71_v2
         100          16           -          -1           -  HLT_DoublePFJets100_CaloBTagDeepCSV_p71_v2
         100           2          +1           -           -  HLT_DoublePFJets200_CaloBTagDeepCSV_p71_v2
         100           7           -           -          ~1  HLT_BTagMu_AK4DiJet40_Mu5_v13
         100          16          +1           -           -  HLT_BTagMu_AK4DiJet40_Mu5_noalgo_v13
         100          12          +2           -           -  HLT_HT425_v9
         100           0           -           -          ~1  HLT_HT430_DisplacedDijet40_DisplacedTrack_v13
         100           0           -           -          ~1  HLT_HT430_DisplacedDijet60_DisplacedTrack_v13
         100           0           -           -          ~2  HLT_HT400_DisplacedDijet40_DisplacedTrack_v13
         100           0           -           -          ~1  HLT_DiJet110_35_Mjj650_PFMET110_v9
         100           0           -           -          ~1  HLT_DiJet110_35_Mjj650_PFMET120_v9
         100           0           -           -          ~1  HLT_DiJet110_35_Mjj650_PFMET130_v9
         100           0           -           -          ~1  HLT_TripleJet110_35_35_Mjj650_PFMET110_v9
         100           0           -           -          ~1  HLT_TripleJet110_35_35_Mjj650_PFMET120_v9
         100           0           -           -          ~1  HLT_TripleJet110_35_35_Mjj650_PFMET130_v9
         100           1           -           -          ~1  HLT_Ele15_IsoVVVL_PFHT450_CaloBTagDeepCSV_4p5_v8
         100           0           -           -          ~1  HLT_Ele15_IsoVVVL_PFHT450_PFMET50_v16
         100           1           -           -          ~1  HLT_Ele15_IsoVVVL_PFHT450_v16
         100           1           -           -          ~1  HLT_Ele15_IsoVVVL_PFHT600_v20
         100           1          +1           -           -  HLT_Mu4_TrkIsoVVL_DiPFJet90_40_DEta3p5_MJJ750_HTT300_PFMETNoMu60_v15
         100           1          +1           -           -  HLT_Mu8_TrkIsoVVL_DiPFJet40_DEta3p5_MJJ750_HTT300_PFMETNoMu60_v16
         100           1          +1           -           -  HLT_Mu10_TrkIsoVVL_DiPFJet40_DEta3p5_MJJ750_HTT350_PFMETNoMu60_v15
         100           1           -          -1           -  HLT_Mu15_IsoVVVL_PFHT450_PFMET50_v15
         100           3           -          -1          ~1  HLT_Mu3er1p5_PFJet100er2p5_PFMET70_PFMHT70_IDTight_v2
         100           3           -          -1           -  HLT_Mu3er1p5_PFJet100er2p5_PFMET80_PFMHT80_IDTight_v2
         100           3           -           -          ~1  HLT_Mu3er1p5_PFJet100er2p5_PFMETNoMu70_PFMHTNoMu70_IDTight_v2
         100          23           -           -          ~1  HLT_Ele8_CaloIdM_TrackIdM_PFJet30_v18
         100          18           -           -          ~1  HLT_Ele17_CaloIdM_TrackIdM_PFJet30_v16
         100           5          +1          -1          ~1  HLT_PFHT330PT30_QuadPFJet_75_60_45_40_TriplePFBTagDeepCSV_4p5_v3
         100          18          +2           -           -  HLT_PFHT330PT30_QuadPFJet_75_60_45_40_v9
         100           3           -           -          ~2  HLT_PFHT400_SixPFJet32_DoublePFBTagDeepCSV_2p94_v8
         100           4           -           -          ~2  HLT_PFHT400_SixPFJet32_v8
         100           5           -           -          ~3  HLT_PFHT400_FivePFJet_100_100_60_30_30_v8
         100           3           -           -          ~3  HLT_PFHT400_FivePFJet_100_100_60_30_30_DoublePFBTagDeepCSV_4p5_v8
         100           3           -           -          ~1  HLT_PFHT400_FivePFJet_120_120_60_30_30_DoublePFBTagDeepCSV_4p5_v8
         100          18          +4           -           -  HLT_PFHT350_v19
         100          20          +1           -           -  HLT_PFHT350MinPFJet15_v9
         100           2           -           -          ~1  HLT_DiSC30_18_EIso_AND_HE_Mass70_v13
         100          84          +1          -1           -  HLT_AK4CaloJet80_v10
         100          60          +1           -           -  HLT_AK4CaloJet100_v10
         100          73          +1           -           -  HLT_AK4PFJet80_v19
         100          49          +1           -           -  HLT_AK4PFJet100_v19
         100          33          +1           -           -  HLT_AK4PFJet120_v18
         100          89          +2           -           -  MC_PFBTagDeepCSV_v10
         100          18           -           -          ~1  MC_Ele5_WPTight_Gsf_v8
         100           0           -           -          ~1  HLT_MediumChargedIsoPFTau50_Trk30_eta2p1_1pr_MET130_v8
         100           1           -           -          ~1  HLT_MediumChargedIsoPFTau180HighPtRelaxedIso_Trk50_eta2p1_1pr_v11
         100           1           -           -          ~1  HLT_MediumChargedIsoPFTau180HighPtRelaxedIso_Trk50_eta2p1_v12
         100           0          +1           -           -  HLT_Rsq0p35_v15
         100           0          +1           -           -  HLT_Rsq0p40_v15
         100           1          +1           -          ~1  HLT_RsqMR300_Rsq0p09_MR200_v15
         100           0          +1           -          ~1  HLT_RsqMR320_Rsq0p09_MR200_v15
         100           2           -           -          ~1  HLT_RsqMR300_Rsq0p09_MR200_4jet_v15
         100           1           -           -          ~1  HLT_RsqMR320_Rsq0p09_MR200_4jet_v15
         100           1           -           -          ~1  HLT_DoubleMediumChargedIsoPFTau35_Trk1_eta2p1_Reg_v12
         100           1           -           -          ~1  HLT_DoubleMediumChargedIsoPFTau35_Trk1_TightID_eta2p1_Reg_v12
         100           1           -           -          ~1  HLT_DoubleTightChargedIsoPFTau35_Trk1_eta2p1_Reg_v12
         100           1           -           -          ~1  HLT_DoubleTightChargedIsoPFTau35_Trk1_TightID_eta2p1_Reg_v12
         100           1           -           -          ~1  HLT_DoubleTightChargedIsoPFTauHPS35_Trk1_eta2p1_Reg_v1
         100           1           -           -          ~1  HLT_DoubleMediumChargedIsoPFTauHPS35_Trk1_TightID_eta2p1_Reg_v1
         100           1           -           -          ~1  HLT_DoubleMediumChargedIsoPFTauHPS35_Trk1_eta2p1_Reg_v4
         100           1           -           -          ~1  HLT_DoubleTightChargedIsoPFTauHPS35_Trk1_TightID_eta2p1_Reg_v1
         100           0           -           -          ~1  HLT_VBF_DoubleLooseChargedIsoPFTauHPS20_Trk1_eta2p1_v1
         100           0           -           -          ~1  HLT_VBF_DoubleMediumChargedIsoPFTauHPS20_Trk1_eta2p1_v1
         100           0           -           -          ~1  HLT_VBF_DoubleTightChargedIsoPFTauHPS20_Trk1_eta2p1_v1
         100           9          +1           -           -  HLT_PFMETNoMu100_PFMHTNoMu100_IDTight_PFHT60_v9
         100           8           -          -1           -  HLT_PFMETTypeOne100_PFMHT100_IDTight_PFHT60_v9
         100           3           -           -          ~2  HLT_QuadPFJet98_83_71_15_DoublePFBTagDeepCSV_1p3_7p7_VBF1_v8
         100           3           -           -          ~2  HLT_QuadPFJet103_88_75_15_DoublePFBTagDeepCSV_1p3_7p7_VBF1_v8
         100           3           -           -          ~3  HLT_QuadPFJet111_90_80_15_DoublePFBTagDeepCSV_1p3_7p7_VBF1_v8
         100           0           -           -          ~2  HLT_QuadPFJet98_83_71_15_PFBTagDeepCSV_1p3_VBF2_v8
         100           0           -           -          ~2  HLT_QuadPFJet103_88_75_15_PFBTagDeepCSV_1p3_VBF2_v8
         100           0           -           -          ~2  HLT_QuadPFJet105_88_76_15_PFBTagDeepCSV_1p3_VBF2_v8
         100           0           -           -          ~3  HLT_QuadPFJet111_90_80_15_PFBTagDeepCSV_1p3_VBF2_v8
         100          19           -          -1          ~2  HLT_QuadPFJet98_83_71_15_v5
         100          11          +2           -          ~1  HLT_QuadPFJet103_88_75_15_v5
         100          11          +2           -          ~1  HLT_QuadPFJet105_88_76_15_v5
         100           8          +2          -1          ~1  HLT_QuadPFJet111_90_80_15_v5
         100           3           -           -          ~2  HLT_QuadPFJet105_88_76_15_DoublePFBTagDeepCSV_1p3_7p7_VBF1_v8
         100          29           -          -1           -  HLT_TrkMu6NoFiltersNoVtx_v1
         100         100           -        -100           -  Status_OnCPU
         100           0        +100           -           -  Status_OnGPU

To disentangle the various effects, one can use different customisations on top of the HLT menu, running each resulting configuration with a GPU and without a GPU (that is, fully on the CPU).
Replace the customisation at the bottom of the hlt.py file

#User-defined customization functions
from HLTrigger.Configuration.customizeHLTforPatatrack import customizeHLTforPatatrack
process = customizeHLTforPatatrack(process)

with a more fine-grained one, described below.

legacy configuration

Run the HLT menu unchanged, adding only the Status_OnGPU and Status_OnCPU paths, without actually offloading any reconstruction to GPU:

#User-defined customization functions
from HLTrigger.Configuration.customizeHLTforPatatrack import *
process = customiseCommon(process)

ECAL-only changes

To check the impact of running the ECAL reconstruction on GPU vs CPU, apply only the ECAL changes:

#User-defined customization functions
from HLTrigger.Configuration.customizeHLTforPatatrack import *
process = customiseCommon(process)
process = customiseEcalLocalReconstruction(process)

HCAL-only changes

To check the impact of running the HCAL reconstruction on GPU vs CPU, apply only the HCAL changes:

#User-defined customization functions
from HLTrigger.Configuration.customizeHLTforPatatrack import *
process = customiseCommon(process)
process = customiseHcalLocalReconstruction(process)

Pixel local reconstruction changes

To check the impact of running the Pixel local reconstruction on GPU vs CPU, apply only the Pixel changes:

#User-defined customization functions
from HLTrigger.Configuration.customizeHLTforPatatrack import *
process = customiseCommon(process)
process = customisePixelLocalReconstruction(process)

Pixel track reconstruction changes

To check the impact of running the Pixel track reconstruction on GPU vs CPU, apply only the Pixel and Tracking changes.
Clearly, for this comparison to be meaningful, the previous one needs to be understood first.

#User-defined customization functions
from HLTrigger.Configuration.customizeHLTforPatatrack import *
process = customiseCommon(process)
process = customisePixelLocalReconstruction(process)
process = customisePixelTrackReconstruction(process)

The ECAL-only comparison did not reveal significant differences.

The HCAL-only comparison showed significant differences in a few % of the events (order of 10% of the accepted events).

The Pixel local reconstruction comparison showed significant differences in a few % of the events (order of 10% of the accepted events), while affecting fewer paths than the HCAL one.

I think that looking at the Pixel track comparison makes sense only after fixing the local reconstruction one.

Updates

  • for running with recent IBs, please use "Fix the HLT GPU customisation for running on the CPU" (#35497).

  • the 12.1.0-pre3 relvals can also be used, for example /store/relval/CMSSW_12_1_0_pre3/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_121X_mcRun3_2021_realistic_v2-v1/10000/0eb14c4a-e363-424a-9c0c-2688c7d32c74.root .

@fwyzard
Contributor Author

fwyzard commented Sep 23, 2021

assign hlt, heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous,hlt

@fwyzard,@Martin-Grunewald,@makortel,@missirol you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

A new Issue was created by @fwyzard Andrea Bocci.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Sep 23, 2021

@silviodonato now there is a GitHub issue...

@fwyzard
Contributor Author

fwyzard commented Sep 23, 2021

The HCAL discrepancies are partially addressed by @mariadalfonso in #35355 / #35357

@mmusich
Contributor

mmusich commented Sep 23, 2021

assign trk-dpg

@mmusich
Contributor

mmusich commented Sep 23, 2021

assign hcal-dpg

@cmsbuild
Contributor

New categories assigned: hcal-dpg,trk-dpg

@mmusich,@georgia14,@tsusa,@mseidel42 you have been requested to review this Pull request/Issue and eventually sign? Thanks

@fwyzard
Contributor Author

fwyzard commented Sep 23, 2021

thanks

@VinInn
Contributor

VinInn commented Sep 23, 2021

what does Other mean?

@fwyzard
Contributor Author

fwyzard commented Sep 23, 2021

Other means that the EDFilter responsible for accepting/rejecting the event (i.e. the last module that runs on a path) is different in the two cases; it's usually safe to ignore.

@VinInn
Contributor

VinInn commented Sep 23, 2021

Then it would be better to run with 1k or even 10k events, just to understand which ones are seriously systematically affected.
IMHO the only ones worth a deeper analysis from the list above are the two HLT_DoubleMu4 events and whatever is going on with all those HLT_QuadPFJet paths and their various combinatorial thresholds...
(and the bizarre behaviour of HLT_PFHT350_v19 compared to all the other HLT_PFHTxyz_v17 paths: are there other v19s?)

@silviodonato
Contributor

Then it would be better to run with 1k or even 10k events, just to understand which ones are seriously systematically affected.
IMHO the only ones worth a deeper analysis from the list above are the two HLT_DoubleMu4 events and whatever is going on with all those HLT_QuadPFJet paths and their various combinatorial thresholds...
(and the bizarre behaviour of HLT_PFHT350_v19 compared to all the other HLT_PFHTxyz_v17 paths: are there other v19s?)

The difference between HLT_PFHT350_v19 and HLT_PFHT370_v17 is only in the L1seeds, caloHT cut (320 vs 300), and PFHT cut (370 vs 350). All the filters cut on the same objects (hltHtMhtJet30 and hltPFHTJet30).

@tsusa
Contributor

tsusa commented Sep 27, 2021

@fwyzard is there a similar comparison between gpu and cpu (non-legacy wf) ?

@fwyzard
Contributor Author

fwyzard commented Sep 27, 2021

Hi @tsusa,

the instructions above under Pixel local reconstruction changes should compare

  • the legacy pixel tracks (on CPU) running on top of the legacy pixel local reconstruction (on CPU)
  • the legacy pixel tracks (on CPU) running on top of the new pixel local reconstruction (on GPU)

while those under Pixel track reconstruction changes should compare

  • the Patatrack pixel tracks (on CPU) running on top of the legacy pixel local reconstruction (on CPU)
  • the Patatrack pixel tracks (on GPU) running on top of the new pixel local reconstruction (on GPU)

Are you looking for something else ?

@tsusa
Contributor

tsusa commented Sep 27, 2021

Hi @fwyzard,
yes, the comparison

  • the legacy pixel tracks (on CPU) running on top of the cpu pixel local reconstruction (on CPU)
  • the legacy pixel tracks (on CPU) running on top of the new pixel local reconstruction (on GPU)

@fwyzard
Contributor Author

fwyzard commented Sep 27, 2021

You can make this comparison by creating the configuration for the Pixel local reconstruction changes, running it on the CPU (e.g. setting CUDA_VISIBLE_DEVICES= or running on a machine without GPUs) and comparing the results with those of the legacy reconstruction.

@VinInn
Contributor

VinInn commented Sep 27, 2021

The legacy pixel local reconstruction is NOT supposed to be identical to the "new" pixel local reconstruction (and is not required to be).

@fwyzard
Contributor Author

fwyzard commented Sep 27, 2021

Well, this is news to me.
Why not ?
What is different ?

@mmusich
Contributor

mmusich commented Sep 27, 2021

@fwyzard
Contributor Author

fwyzard commented Sep 27, 2021

ah, yes - still, how much different do we expect it to be ?

(and anyway this is outside the scope of the discrepancy discussed in this issue, which should use the new reconstruction for both CPU and GPU workflows)

@mmusich
Contributor

mmusich commented Sep 27, 2021

ah, yes - still, how much different do we expect it to be ?

There are plots in the talk linked above. I will let you judge for yourself.
Mind that the situation has been improved by #34286

@VinInn
Contributor

VinInn commented Sep 30, 2021

Ran twice on GPU and got:

Found 100 matching events, out of which 0 have different HLT results

@silviodonato
Contributor

About scouting: 17 events out of 1268 have a GPU - CPU MET difference larger than 5 GeV.

treeCPU.Scan("double_hltScoutingPFPacker_pfMetPt_HLTCPU.obj : gpu.double_hltScoutingPFPacker_pfMetPt_HLTGPU.obj : gpu2.double_hltScoutingPFPacker_pfMetPt_HLTGPU2.obj ","abs(double_hltScoutingPFPacker_pfMetPt_HLTCPU.obj - gpu.double_hltScoutingPFPacker_pfMetPt_HLTGPU.obj)>5")
************************************************
*    Row   * double_hl *  gpu.doub *  gpu2.dou *
************************************************
*       80 * 26.076620 * 20.724564 * 20.727596 *
*       95 * 5.7123356 * 10.943406 * 10.943855 *
*      105 * 45.371674 * 63.077634 * 44.873158 *
*      276 * 28.489796 * 23.359113 * 23.370310 *
*      325 * 43.596063 * 61.206629 * 61.189337 *
*      452 * 47.897351 * 53.124322 * 53.121180 *
*      548 * 63.809485 * 57.314162 * 63.809362 *
*      553 * 16.567702 * 3.0462222 * 3.0462224 *
*      602 * 5.3528798 * 12.538358 * 12.539316 *
*      605 * 82.321715 * 72.832689 * 81.878138 *
*      660 * 3.3754865 * 9.5853813 * 9.5641984 *
*      662 * 32.959398 * 11.968470 * 11.992302 *
*      975 * 37.379707 * 24.026722 * 24.026722 *
*     1073 * 26.614738 *  21.42725 * 21.427250 *
*     1126 * 40.773425 * 54.061522 * 54.962562 *
*     1132 * 19.311899 * 12.621779 * 12.622841 *
*     1208 * 33.183377 * 27.939318 * 33.538982 *
************************************************
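For reference, a guess at how the trees may have been set up for the Scan above; the file names are placeholders, and the tree name Events and the friend aliases gpu / gpu2 are only inferred from the branch expressions, so treat this as a sketch rather than the actual macro used:

// scan_met.C -- hypothetical ROOT macro reproducing the Scan above
#include "TFile.h"
#include "TTree.h"

void scan_met() {
  TFile* fcpu = TFile::Open("output-cpu.root");  // placeholder file names
  TTree* treeCPU = (TTree*)fcpu->Get("Events");
  treeCPU->AddFriend("gpu = Events", "output-gpu.root");    // first GPU run
  treeCPU->AddFriend("gpu2 = Events", "output-gpu2.root");  // second GPU run
  treeCPU->Scan(
      "double_hltScoutingPFPacker_pfMetPt_HLTCPU.obj : "
      "gpu.double_hltScoutingPFPacker_pfMetPt_HLTGPU.obj : "
      "gpu2.double_hltScoutingPFPacker_pfMetPt_HLTGPU2.obj",
      "abs(double_hltScoutingPFPacker_pfMetPt_HLTCPU.obj - gpu.double_hltScoutingPFPacker_pfMetPt_HLTGPU.obj)>5");
}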

@fwyzard
Contributor Author

fwyzard commented Mar 14, 2022

The issue of duplicate pixels was already discussed in this HyperNews thread https://hypernews.cern.ch/HyperNews/CMS/get/pixelOfflineSW/1485.html.

On the HN Danek wrote that it is hard to know the reason without doing a more detailed study, and that he would try to monitor it to see if there are any systematic effects.
Do you know if there was any follow up ?

The bottom line is that there is no easy way to tell which is more correct, so one needs to make a choice and accept that it will not always be the right thing to do.

Unless some way to distinguish the good and bad pixel hits can be found, I would suggest one of the following:

  • if the downstream code can properly handle duplicate pixels, the unpacker could produce them all;
  • if the downstream code cannot (be made to) handle duplicate pixels, the unpacker (or an additional module) should sum the energy from the duplicate pixels.

Does anybody have a better suggestion ?
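For illustration, here is a minimal standalone sketch of the second option, merging duplicate pixels by summing their charge before clustering; the Digi struct and the (row, col) key are simplifications for this sketch, not the real CMSSW types:

#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Digi {
  uint16_t row;
  uint16_t col;
  uint16_t adc;
};

// Merge digis that share the same (row, col) by summing their charge.
std::vector<Digi> sumDuplicates(std::vector<Digi> const& digis) {
  std::map<std::pair<uint16_t, uint16_t>, uint32_t> charge;
  for (auto const& d : digis)
    charge[{d.row, d.col}] += d.adc;  // duplicates accumulate here

  std::vector<Digi> merged;
  merged.reserve(charge.size());
  for (auto const& [coords, adc] : charge) {
    // clamp to the 16-bit ADC field to avoid overflow
    auto clamped = static_cast<uint16_t>(std::min<uint32_t>(adc, 0xffff));
    merged.push_back({coords.first, coords.second, clamped});
  }
  return merged;
}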

@mmusich
Contributor

mmusich commented Mar 14, 2022

if the downstream code cannot (be made to) handle duplicate pixels, the unpacker (or an additional module) should sum the energy from the duplicate pixels.

While keeping either one of the duplicates has some (50%?) chance of doing the right thing, summing them is clearly wrong in all cases. Moreover, if one of the duplicate pixels ends up at the edge of the cluster, it might screw up the determination of the position in the CPE code.

@VinInn
Contributor

VinInn commented Mar 14, 2022

Is it possible that the problem of pixels not belonging to the event is not limited to duplicate pixels?
When they are duplicates we can detect them; if not, we still have spurious hits around (even in a legitimate cluster).

@ferencek
Contributor

ferencek commented Mar 14, 2022

I agree with @mmusich: summing up the charges or taking their average is almost certainly wrong, so taking only one at least has a good chance of being correct. Regarding Vincenzo's comment above, indeed, based on what Danek wrote in the HN thread, there could be spurious hits present in the event caused by the same mechanism, which we would never detect as problematic if they are not duplicated. In fact, if the underlying cause is a bit flip in the pixel address, as opposed to stale content in the data buffer, we possibly end up with two problems at the same time: a missing hit and a fake hit in the same event.

@VinInn
Contributor

VinInn commented Mar 14, 2022

would "a bit flip" make duplicate more likely? (a pixel in the same cluster with a bit-flipped may generate a duplicate?).
readout is not random if I remember well, so one can look to the pattern in the buffer to understand what is going on.
Could the duplicate rate used to globally estimate the effect?
Do we know if it occurs all the time and everywhere?

@fwyzard
Contributor Author

fwyzard commented Mar 14, 2022

I agree with @mmusich: summing up the charges or taking their average is almost certainly wrong, so taking only one at least has a good chance of being correct.

In the presence of something completely unexpected ("duplicate" pixels), how can you be certain that taking one or the other is correct ?

Also, the "right" one could be

  • always the first one
  • always the last one
  • one at random

If it is always the first one or always the last one, and we implement the other case in the unpacker, we are effectively always making the wrong choice -- which is not any different from using the sum or the average.

@mmusich
Contributor

mmusich commented Mar 14, 2022

In the presence of something completely unexpected ("duplicate" pixels), how can you be certain that taking one or the other is correct ?

I am not certain, but without knowing for sure (that requires more studies) summing pixels would be exchanging a possible mistake for a certain mistake.

@fwyzard
Contributor Author

fwyzard commented Mar 14, 2022

OK, I guess I see things differently, but in the end it doesn't matter: as long as we don't know, any algorithm that is fully reproducible should be an improvement over the present situation, and I'm fine with that.

The problem is that, as @VinInn pointed out, "first" and "last" are not easy to implement in a parallel algorithm.
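As an illustration of the difficulty (and of one possible way around it), here is a standalone sketch, not the CMSSW/CUDA code: rather than relying on the order in which threads happen to process the raw data, each pixel coordinate keeps the candidate with the smallest raw-data word index, which makes "first" well defined even when the words are unpacked in parallel:

#include <atomic>
#include <cstdint>
#include <limits>
#include <vector>

// One packed slot per pixel coordinate; slots must be initialised to kEmpty.
constexpr uint64_t kEmpty = std::numeric_limits<uint64_t>::max();

// Pack (wordIndex, adc) so that comparing packed values compares wordIndex first.
inline uint64_t pack(uint32_t wordIndex, uint16_t adc) { return (uint64_t(wordIndex) << 16) | adc; }

// Called concurrently from many threads, once per unpacked word.
void keepFirst(std::vector<std::atomic<uint64_t>>& slots, uint32_t pixelKey, uint32_t wordIndex, uint16_t adc) {
  uint64_t candidate = pack(wordIndex, adc);
  uint64_t current = slots[pixelKey].load();
  // atomic "min": install the candidate only while it has a smaller word index
  while (candidate < current && !slots[pixelKey].compare_exchange_weak(current, candidate)) {
  }
}

Keeping the last occurrence instead would use the largest word index; the extra pass and memory are, of course, part of the cost discussed later in this thread.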

@silviodonato
Contributor

The most significant effect is when we do not reconstruct a pixel cluster because of duplicates.
Example (see above: event 181451699, detId 303042568, x, y = 159, 292).

  • GPUs see two duplicated pixels with charge [22019, 100].
  • CPUs reconstruct no cluster, because they see only one pixel with charge 100, which is below the pixel cluster threshold.

My 2 cents: we should take the maximum of the N duplicated pixels.

Our top priority is to reconstruct the pixel clusters correctly 1) in signal events and 2) close to the signal objects (jets or leptons).
Assuming that the duplicated pixels come from a random bunch crossing, I think we should take the maximum of the two pixel charges, because the probability of having a pixel (in a region of interest) with a large charge from a random bunch crossing should be smaller than that from a signal event.
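For concreteness, a tiny standalone sketch (not CMSSW code, and with a purely illustrative threshold value) of how the three policies discussed in this thread behave on the example above, where the same pixel is seen twice with charges 22019 and 100:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  // charges reported for the same (row, col) by the duplicated digis
  std::vector<uint32_t> duplicates{22019, 100};
  // illustrative per-pixel charge threshold, not the actual HLT value
  const uint32_t threshold = 1000;

  const uint32_t keepLast = duplicates.back();                                        // current CPU behaviour
  const uint32_t summed = std::accumulate(duplicates.begin(), duplicates.end(), 0u);  // "sum the charges"
  const uint32_t largest = *std::max_element(duplicates.begin(), duplicates.end());   // "keep the maximum"

  auto report = [&](const char* name, uint32_t q) {
    std::cout << name << q << (q >= threshold ? "  -> pixel kept" : "  -> pixel dropped") << '\n';
  };
  report("keep last: ", keepLast);  // 100: dropped, no cluster is reconstructed
  report("sum:       ", summed);    // 22119: kept
  report("maximum:   ", largest);   // 22019: kept
}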

@VinInn
Contributor

VinInn commented Mar 15, 2022

On the other hand, duplicates may be due to a bit flip: most probably the original pixel was close by, even in the same cluster.
In such a case the sum looks more correct to me.

I think a detailed study is needed. We do not even know whether the duplicate pixels are contiguous in the raw data or not.
We do not know the rate, the location, the occupancy.
We do not know whether it is correlated with luminosity, fluence or even just "wall time".

(And CMS has ignored all this for 20 years.)

@fwyzard
Contributor Author

fwyzard commented Mar 15, 2022

@VinInn how does the clustering code behave in the presence of "duplicate" pixels ?

@VinInn
Contributor

VinInn commented Mar 15, 2022

I posted the relevant lines above:
  • on CPU, just load the latest;
  • on GPU, clusterize them independently: at CPE time they will be summed.
I have not checked what the template CPE does.

@silviodonato
Contributor

Sorry, I got confused by the code in the second copy_to_buffer (where pixels are added).

The line that fills the buffer later used by make_clusters is https://cmssdt.cern.ch/dxr/CMSSW/source/RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc#294, and indeed it is a set, not an add. So the last pixel in the raw data wins.

@VinInn is there a trivial way to sum the pixels also on the GPU side?
Just to avoid that having two pixels might create some differences (a second-order effect).

By the way, we finally found a bug in the L2 tau reco! Now the trigger with the largest difference should be HLT_PFHT400_FivePFJet_100_100_60_30_30_DoublePFBTagDeepCSV_4p5_v8

Total Accepted OLD Accepted NEW Gained Lost trigger
1461195 261 258 13 -16 HLT_PFHT400_FivePFJet_100_100_60_30_30_DoublePFBTagDeepCSV_4p5_v8

Ganesh will update the study, including the L2 tau NN workaround and using the sum of the duplicated pixels, in the next few days.

@VinInn
Contributor

VinInn commented Mar 15, 2022 via email

@mmusich
Contributor

mmusich commented Mar 16, 2022

In case people active in this thread want to contribute to the discussion, the issue about duplicate pixels is going to be tackled today at the Joint TRK DPG/POG meeting

@VinInn
Contributor

VinInn commented Mar 16, 2022

Following Danek's presentation I made a small test using the code below,
which detects (and removes) a duplicate pixel (the second one) in the digis.
It should make the GPU reproduce the current CPU behaviour.
It is trivial to remove both instead.

diff --git a/DataFormats/SiPixelDigi/interface/SiPixelDigiConstants.h b/DataFormats/SiPixelDigi/interface/SiPixelDigiConstants.h
index cfadce7cf45..9f066bfbc31 100644
--- a/DataFormats/SiPixelDigi/interface/SiPixelDigiConstants.h
+++ b/DataFormats/SiPixelDigi/interface/SiPixelDigiConstants.h
@@ -55,6 +55,7 @@ namespace sipixelconstants {
     inline constexpr uint32_t getRow(uint32_t ww) { return ((ww >> ROW_shift) & ROW_mask); }
     inline constexpr uint32_t getDCol(uint32_t ww) { return ((ww >> DCOL_shift) & DCOL_mask); }
     inline constexpr uint32_t getPxId(uint32_t ww) { return ((ww >> PXID_shift) & PXID_mask); }
+    inline constexpr uint32_t removeADC(uint32_t ww) { return (ww >> ADC_bits); } // ADC_shift ==0: let's keep it simple
   }  // namespace functions
 }  // namespace sipixelconstants

diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index e66e322ddff..2fd5b943906 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -440,6 +440,16 @@ namespace pixelgpudetails {
         }
       }

+     // remove duplicate pixels (for the time being keep the first)
+     auto noADC = sipixelconstants::removeADC(ww);
+     auto noADCm1 =  sipixelconstants::removeADC(word[gIndex-1]);  // yes reading one word before buffer
+     // auto noADCp1 =  sipixelconstants::removeADC(word[gIndex+1]);  // fine, we are reading one word beyond the end of the buffer
+     if (noADC==noADCm1) {
+       auto globalPix = frameConversion(barrel, side, layer, rocIdInDetUnit, localPix);
+       printf("dup pix at %d %d\n",globalPix.row,globalPix.col);
+       continue;
+     }
+
       pixelgpudetails::Pixel globalPix = frameConversion(barrel, side, layer, rocIdInDetUnit, localPix);
       xx[gIndex] = globalPix.row;  // origin shifting by 1 0-159
       yy[gIndex] = globalPix.col;  // origin shifting by 1 0-415

Ran 5000 events in Run 323775, Event 33310106, LumiSection 53,
and got:
[innocent@patatrack02 raw]$ grep dup dup.log | wc
8158 41289 153656
[innocent@patatrack02 raw]$ grep dup dup.log | sort -u | wc
3185 16424 61596

So they are there: about 1.5 per event, many in the same location (OK, I should have printed the module id as well).
Of course I cannot exclude that there are also duplicate pixels that are not contiguous.

@VinInn
Contributor

VinInn commented Mar 17, 2022

So, in Run 323775, Event 32123778, LumiSection 53,
module "1" is full of duplicate pixels and most of them are NOT contiguous;
see a (partial) dump below (the 0 0 entries are the removed duplicates).

Notice how many times 159 208 appears...

151 208
150 208
149 208
156 208
159 208
146 208
153 208
152 208
159 208
142 208
141 208
140 208
137 208
159 208
111 208
110 209
111 208
108 208
159 208
0 0
153 208
152 208
119 209
118 209
117 209
116 209
124 208
121 209
152 208
120 209
159 208
128 208
125 208
126 208
131 208
132 208
129 208
135 208
0 0
152 208
135 208
134 208
82 211
85 211
86 211
83 211
84 211
89 211
95 211
0 0
88 211
159 208
0 0
91 211
92 211
99 210
98 210
97 208
96 210

@ferencek
Contributor

ferencek commented Mar 17, 2022

Following Danek's presentation I made a small test using the code below, which detects (and removes) a duplicate pixel (the second one) in the digis. It should make the GPU reproduce the current CPU behaviour. It is trivial to remove both instead.

Isn't the CPU code keeping the last occurrence (https://github.com/cms-sw/cmssw/blob/CMSSW_12_2_0/RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc#L292), assuming the charge crosses the threshold, which is always the case for the ChannelThreshold of 10?

Based on Danek's investigation that the duplicate pixels seem to be coming from noisy ROCs, perhaps the best strategy would be to drop such pixels.

@VinInn
Contributor

VinInn commented Mar 17, 2022

Yes @ferencek, you are right.
OK, it is trivial to keep the second and remove the first. Still, the occurrences seem to be random (see above), so detecting contiguous duplicates is not enough.

@VinInn
Contributor

VinInn commented Mar 21, 2022

So I also implemented the N^2 duplicate finding at module level (in the clusterizer).
Here are the timing results.

(Summary: 789 Hz instead of 845.)

baseline

|Counters | nEvents | nHits | nCells | nTuples | nFitTacks  |  nLooseTracks  |  nGoodTracks | nUsedHits | nDupHits | nKilledCells | nUsedCells | nZeroTrackCells ||
Counters Raw 5000 81755950 857898799 32625252 14696728 5972444 26803352 37513969 24198090 3930759 55470593 116918358
Counters Norm 5000 ||  16351.2|  171579.8|  6525.1|  5360.7|  2939.3|  1194.5|  7502.8|  4839.6|  0.005|  0.065|  0.136||


Running 4 times over 5000 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
   848.0 ±   0.5 ev/s (4900 events)
   845.5 ±   0.7 ev/s (4900 events)
   843.0 ±   0.5 ev/s (4900 events)
   844.2 ±   0.5 ev/s (4900 events)
 --------------------
   845.2 ±   2.1 ev/s

 GPU activities:   11.74%  1.29890s      5000  259.78us  25.247us  982.72us  void gpuClustering::findClus<bool=0>(unsigned short const *, unsigned short const , unsigned short const , unsigned int const *, unsigned int*, unsigned int const$
                    9.00%  995.16ms      5000  199.03us  23.040us  1.2858ms  kernel_connect(cms::cuda::AtomicPairCounter*, cms::cuda::AtomicPairCounter*, TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::SimpleV$
                    0.66%  72.547ms      5000  14.509us  5.4080us  79.776us  pixelgpudetails::RawToDigi_kernel(SiPixelROCsStatusAndMapping const *, unsigned char const *, unsigned int, unsigned int const *, unsigned char const *, unsigned $
                    0.58%  64.291ms      5000  12.858us  3.4240us  57.663us  void gpuCalibPixel::calibDigis<bool=1>(unsigned short*, unsigned short const *, unsigned short const , gpuCalibPixel::calibDigis<bool=1>, SiPixelGainForHLTonGPU c$


remove just contiguous

Running 4 times over 5000 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
   840.5 ±   0.5 ev/s (4900 events)
   844.1 ±   0.5 ev/s (4900 events)
   840.8 ±   0.7 ev/s (4900 events)
   846.1 ±   0.6 ev/s (4900 events)
 --------------------
   842.9 ±   2.7 ev/s


 GPU activities:   11.66%  1.31490s      5000  262.98us  25.632us  1.1196ms  void gpuClustering::findClus<bool=0>(unsigned short const *, unsigned short const , unsigned short const , unsigned int const *, unsigned int*, unsigned int const$
                    8.89%  1.00227s      5000  200.45us  23.232us  1.2877ms  kernel_connect(cms::cuda::AtomicPairCounter*, cms::cuda::AtomicPairCounter*, TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::SimpleV$
                    0.68%  76.954ms      5000  15.390us  6.3360us  105.89us  pixelgpudetails::RawToDigi_kernel(SiPixelROCsStatusAndMapping const *, unsigned char const *, unsigned int, unsigned int const *, unsigned char const *, unsigned $
                    0.59%  66.322ms      5000  13.264us  2.7200us  65.920us  void gpuCalibPixel::calibDigis<bool=1>(unsigned short*, unsigned short const *, unsigned short const , gpuCalibPixel::calibDigis<bool=1>, SiPixelGainForHLTonGPU c$


keep only last in module (N^2 algo at module level)

||Counters | nEvents | nHits | nCells | nTuples | nFitTacks  |  nLooseTracks  |  nGoodTracks | nUsedHits | nDupHits | nKilledCells | nUsedCells | nZeroTrackCells ||
Counters Raw 5000 81754152 857886560 32624614 14696528 5972245 26802897 37513560 24197830 3930818 55469912 116915978
Counters Norm 5000 ||  16350.8|  171577.3|  6524.9|  5360.6|  2939.3|  1194.4|  7502.7|  4839.6|  0.005|  0.065|  0.136||


Running 4 times over 5000 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
   792.0 ±   0.5 ev/s (4900 events)
   789.4 ±   0.7 ev/s (4900 events)
   790.0 ±   0.7 ev/s (4900 events)
   785.2 ±   0.5 ev/s (4900 events)
 --------------------
   789.1 ±   2.9 ev/s


 GPU activities:   15.44%  1.86159s      5000  372.32us  31.968us  1.3527ms  void gpuClustering::findClus<bool=0>(unsigned int*, unsigned short*, unsigned short const *, unsigned short const , unsigned int const *, gpuClustering::findClus<$
                    8.34%  1.00613s      5000  201.23us  23.136us  1.2789ms  kernel_connect(cms::cuda::AtomicPairCounter*, cms::cuda::AtomicPairCounter*, TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::SimpleV$
                    0.65%  78.570ms      5000  15.714us  5.9520us  50.591us  pixelgpudetails::RawToDigi_kernel(SiPixelROCsStatusAndMapping const *, unsigned char const *, unsigned int, unsigned int const *, unsigned char const *, unsigned $
                    0.56%  67.428ms      5000  13.485us  3.4240us  76.895us  void gpuCalibPixel::calibDigis<bool=1>(unsigned short*, unsigned short const *, unsigned short const , gpuCalibPixel::calibDigis<bool=1>, SiPixelGainForHLTonGPU c$


@VinInn
Contributor

VinInn commented Mar 21, 2022

To those who wish to test the above, please run
git cms-merge-topic VinInn:DupPix
in today's (or a later) 12_4_X IB.

@fwyzard
Contributor Author

fwyzard commented Mar 21, 2022

@VinInn thank you very much for the detailed test !

As you suggested, removing only the nearby duplicates has a negligible impact, from 845 ± 2 ev/s to 843 ± 3, but doing the full scan has a huge cost, bringing the throughput down to 789 ± 3 ev/s.

And I understand that the cost would have to be paid in any case, not only in the hopefully rare runs and modules which are affected.

@VinInn
Contributor

VinInn commented Mar 25, 2022

I spent the last few days trying to understand why the Fishbone was not reproducible on GPU (even after doing the full combinatorics).

In the end it was this:

-          if (d[ic] != d[jc] && cos12 * cos12 >= 0.99999f * n[ic] * n[jc]) {
+          if (d[ic] != d[jc] && cos12 * cos12 >= 0.99999f * (n[ic] * n[jc]) ) {

I hope everybody appreciates the level of sophistication required to obtain bit-wise reproducibility in the presence of unordered collections.
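To spell out why the parentheses matter (a standalone sketch, not the fishbone code): n[ic] * n[jc] is exactly symmetric under swapping the two cells, so 0.99999f * (n[ic] * n[jc]) gives the same cut regardless of the order in which an unordered collection is traversed, whereas the unparenthesised form is evaluated as (0.99999f * n[ic]) * n[jc], which single-precision rounding can make order-dependent:

#include <cstdio>
#include <random>

int main() {
  std::mt19937 gen(12345);
  std::uniform_real_distribution<float> dist(0.1f, 10.f);
  const float k = 0.99999f;

  int asymmetric = 0;
  const int trials = 1000000;
  for (int i = 0; i < trials; ++i) {
    float a = dist(gen);  // plays the role of n[ic]
    float b = dist(gen);  // plays the role of n[jc]
    // left-to-right evaluation: (k * a) * b need not equal (k * b) * a
    if ((k * a * b) != (k * b * a))
      ++asymmetric;
    // forced grouping: a * b == b * a exactly, so this never differs
    if ((k * (a * b)) != (k * (b * a)))
      return 1;
  }
  std::printf("%d of %d pairs change the unparenthesised result when a and b are swapped\n", asymmetric, trials);
  return 0;
}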

@silviodonato
Contributor

I spent the last few days trying to understand why the Fishbone was not reproducible on GPU (even after doing the full combinatorics).

In the end it was this:

-          if (d[ic] != d[jc] && cos12 * cos12 >= 0.99999f * n[ic] * n[jc]) {
+          if (d[ic] != d[jc] && cos12 * cos12 >= 0.99999f * (n[ic] * n[jc]) ) {

I hope everybody appreciates the level of sophistication required to obtain bit-wise reproducibility in the presence of unordered collections.

Many thanks Vincenzo!

@VinInn
Contributor

VinInn commented Mar 25, 2022

I would like to point out that a similar combinatorial algorithm running on CPU using GPU output will NOT reproduce either (unless the collection is somehow sorted).
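A sketch of the usual remedy (illustrative standalone code, not CMSSW types): sort the input collection by a key that depends only on the hit content before running the order-sensitive combinatorial step, so the result no longer depends on how the producer happened to order it:

#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

struct Hit {
  uint32_t detId;
  float x, y, z;
};

// Impose a canonical order on the hits before any order-sensitive processing.
void canonicalOrder(std::vector<Hit>& hits) {
  std::sort(hits.begin(), hits.end(), [](Hit const& a, Hit const& b) {
    // any total order that depends only on the hit content will do
    return std::tie(a.detId, a.x, a.y, a.z) < std::tie(b.detId, b.x, b.y, b.z);
  });
}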

@VinInn
Contributor

VinInn commented Mar 29, 2022

Here is the PR on the fishbone:
#37398

@silviodonato
Contributor

#37359
