DeepBoostedJetTagInfoProducer failure in PromptReco_Run381443_ParkingSingleMuon4 (CMSSW_14_0_7 on AMD arch) #45190

Open
gpetruc opened this issue Jun 11, 2024 · 27 comments

Comments

@gpetruc
Contributor

gpetruc commented Jun 11, 2024

Hello,

There's another PromptReco failure that, like #45189, seems to be reproducible on AMD but not on Intel.
CMS-talk thread: https://cms-talk.web.cern.ch/t/paused-job-for-promptreco-run381443-parkingsinglemuon4-deepboostedjettaginfoproducer/42164

Exception:

Begin processing the 1st record. Run 381443, Event 2226011497, LumiSection 1038 on stream 0 at 11-Jun-2024 11:17:40.704 CEST
Matched new: [Fatal Exception]
    An exception of category 'InvalidReference' occurred while
       [0] Processing  Event run: 381443 lumi: 1038 event: 2226011497 stream: 0
       [1] Running path 'dqmoffline_step'
       [2] Prefetching for module ParticleNetJetTagMonitor/'particleNetAK8HbbTagMonitoring'
       [3] Prefetching for module BTagProbabilityToDiscriminator/'pfParticleNetAK4DiscriminatorsJetTagsForRECO'
       [4] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetAK4JetTagsForRECO'
       [5] Calling method for module DeepBoostedJetTagInfoProducer/'pfParticleNetAK4TagInfosForRECO'
    Exception Message:
    BadRefCore RefCore: Request to resolve a null or invalid reference to a product of type 'std::vector<reco::Vertex>' has been detected.
    Please modify the calling code to test validity before dereferencing.

Recipe to reproduce it on an AMD EL8 machine:

export SCRAM_ARCH=el8_amd64_gcc12
cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/DeepBoostedJet/job/WMTaskSpace/cmsRun1/PSet.pkl .
cat > PSet_one.py <<END
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

process.source.eventsToProcess = cms.untracked.VEventRange("381443:1038:2226011497",)
process.options.wantSummary = cms.untracked.bool(True)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1
END
cmsRun PSet_one.py 2>&1 | tee PSet_one.log  
@cmsbuild
Contributor

cmsbuild commented Jun 11, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @gpetruc.

@rappoccio, @smuzaffar, @makortel, @Dr15Jones, @antoniovilela, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor

assign RecoBTag/FeatureTools

@cmsbuild
Contributor

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@jfernan2
Contributor

type pf

@cmsbuild cmsbuild added the pf label Jun 11, 2024
@jfernan2
Contributor

type btv

@cmsbuild cmsbuild added the btv label Jun 11, 2024
@mandrenguyen
Contributor

mandrenguyen commented Jun 17, 2024

I would like to reproduce this. Anyone have a pointer for finding an AMD machine I can use interactively? Either one running EL8 or one on which I can run Singularity.

@mmusich
Contributor

mmusich commented Jun 17, 2024

Anyone have a pointer for finding an AMD machine I can use interactively?

e.g. on lxplus800:

[musich@lxplus800 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           16
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7313 16-Core Processor
Stepping:            1
CPU MHz:             3000.134
BogoMIPS:            6000.26
Virtualization:      AMD-V
L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid arch_capabilities

I tested the reproducer above (#45190 (comment)); it fails with:

----- Begin Fatal Exception 17-Jun-2024 19:45:02 CEST-----------------------
An exception of category 'InvalidReference' occurred while
   [0] Processing  Event run: 381443 lumi: 1038 event: 2226011497 stream: 0
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module ParticleNetJetTagMonitor/'particleNetAK8HbbTagMonitoring'
   [3] Prefetching for module BTagProbabilityToDiscriminator/'pfParticleNetAK4DiscriminatorsJetTagsForRECO'
   [4] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetAK4JetTagsForRECO'
   [5] Calling method for module DeepBoostedJetTagInfoProducer/'pfParticleNetAK4TagInfosForRECO'
Exception Message:
BadRefCore RefCore: Request to resolve a null or invalid reference to a product of type 'std::vector<reco::Vertex>' has been detected.
Please modify the calling code to test validity before dereferencing.
----- End Fatal Exception -------------------------------------------------

@mandrenguyen
Contributor

The code is crashing at this line:

vtx_ass = vtx_ass_from_pfcand(*reco_cand, pv_ass_quality, pv_ass);

This appears to be fixed by conditioning that line with:
if(pv_ass.isNonnull())

No clue why this only shows up on AMD though.

@VinInn
Contributor

VinInn commented Jun 19, 2024

In vtx_ass_from_pfcand there is the statement
if (pfcand.trackRef().isNonnull() && pv->trackWeight(pfcand.trackRef()) > 0.5 && pv_ass_quality == 7)
If the first clause fails, the others SHALL not be evaluated.

Adding a cout before the call, I get this on Intel (lxplus806):
std::cout << ">>>> " << icand << ' ' << pv_ass_quality << ' ' << (reco_cand->trackRef().isNonnull() ? "okTk" : "noTk") << (pv_ass.isNonnull() ? "okPV" : "nullPV" )<< std::endl;

>>>> 0 6 okTkokPV
>>>> 1 0 noTknullPV
>>>> 2 2 okTkokPV
>>>> 3 0 noTknullPV
>>>> 4 7 okTkokPV
>>>> 5 6 okTkokPV
>>>> 6 2 okTkokPV
>>>> 7 6 okTkokPV
>>>> 8 0 noTknullPV
>>>> 9 2 okTkokPV
>>>> 10 0 noTknullPV
>>>> 11 0 noTknullPV
>>>> 12 0 noTknullPV
>>>> 13 0 noTknullPV
>>>> 14 0 noTknullPV
>>>> 15 0 noTknullPV
>>>> 16 6 okTkokPV
>>>> 0 1 okTkokPV
>>>> 1 6 okTkokPV
>>>> 2 7 okTkokPV
>>>> 3 7 okTkokPV
>>>> 4 2 okTkokPV
>>>> 5 7 okTkokPV
>>>> 6 0 noTknullPV
>>>> 7 2 okTkokPV
>>>> 8 0 noTknullPV
>>>> 9 0 noTknullPV
>>>> 10 0 noTknullPV
>>>> 11 0 noTknullPV
>>>> 12 0 noTknullPV
>>>> 13 0 noTknullPV
>>>> 14 2 okTkokPV
>>>> 15 6 okTkokPV

Each time there is a Tk there is a PV as well (and vice versa).

@VinInn
Contributor

VinInn commented Jun 19, 2024

On AMD (lxplus800):

>>>> 0 6 okTkokPV
>>>> 1 0 noTknullPV
>>>> 2 2 okTkokPV
>>>> 3 0 noTknullPV
>>>> 4 7 okTkokPV
>>>> 5 6 okTkokPV
>>>> 6 2 okTkokPV
>>>> 7 6 okTkokPV
>>>> 8 0 noTknullPV
>>>> 9 2 okTkokPV
>>>> 10 0 noTknullPV
>>>> 11 0 noTknullPV
>>>> 12 0 noTknullPV
>>>> 13 0 noTknullPV
>>>> 14 0 noTknullPV
>>>> 15 0 noTknullPV
>>>> 16 6 okTkokPV
>>>> 0 1 okTkokPV
>>>> 1 6 okTkokPV
>>>> 2 7 okTkokPV
>>>> 3 7 okTkokPV
>>>> 4 2 okTkokPV
>>>> 5 7 okTkokPV
>>>> 6 0 noTknullPV
>>>> 7 2 okTkokPV
>>>> 8 0 noTknullPV
>>>> 9 0 noTknullPV
>>>> 10 0 noTknullPV
>>>> 11 0 noTknullPV
>>>> 12 0 noTknullPV
>>>> 13 0 noTknullPV
>>>> 14 2 okTkokPV
>>>> 15 6 okTkokPV
>>>> 16 0 okTknullPV
----- Begin Fatal Exception 19-Jun-2024 15:13:26 CEST-----------------------
An exception of category 'InvalidReference' occurred while
   [0] Processing  Event run: 381443 lumi: 1038 event: 2226011497 stream: 0
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module ParticleNetJetTagMonitor/'particleNetAK8HbbTagMonitoring'
   [3] Prefetching for module BTagProbabilityToDiscriminator/'pfParticleNetAK4DiscriminatorsJetTagsForRECO'
   [4] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetAK4JetTagsForRECO'
   [5] Calling method for module DeepBoostedJetTagInfoProducer/'pfParticleNetAK4TagInfosForRECO'
Exception Message:
BadRefCore RefCore: Request to resolve a null or invalid reference to a product of type 'std::vector<reco::Vertex>' has been detected.
Please modify the calling code to test validity before dereferencing.

so WHO is this 16th (actually 17th) candidate?
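
A long debug dump like the one above can be scanned automatically for the one combination that triggers the exception: a candidate with a valid track but a null PV reference. The sketch below is only an illustration and assumes the exact line format produced by the std::cout statement quoted earlier; the function name `find_track_without_pv` is hypothetical and not part of CMSSW.

```python
import re

# Matches the debug lines printed by the std::cout statement quoted above,
# e.g. ">>>> 16 0 okTknullPV": candidate index, PV-association quality, flags.
LINE_RE = re.compile(r"^>>>> (\d+) (\d+) (okTk|noTk)(okPV|nullPV)$")


def find_track_without_pv(lines):
    """Return (line_number, candidate_index, quality) for okTk + nullPV lines."""
    hits = []
    for line_number, line in enumerate(lines, start=1):
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip exception text and any other log output
        index, quality, track_flag, pv_flag = match.groups()
        if track_flag == "okTk" and pv_flag == "nullPV":
            hits.append((line_number, int(index), int(quality)))
    return hits


if __name__ == "__main__":
    # Last three debug lines of the AMD log above.
    sample = [
        ">>>> 14 2 okTkokPV",
        ">>>> 15 6 okTkokPV",
        ">>>> 16 0 okTknullPV",
    ]
    print(find_track_without_pv(sample))
```

Running the same function over the Intel log should return an empty list, since every line there pairs okTk with okPV and noTk with nullPV.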

@VinInn
Contributor

VinInn commented Jun 19, 2024

I printed the size of the vector, and indeed it is 16 on Intel and 17 on AMD...
Very fishy. It needs full debugging, as this should not be possible (I suspect a memory issue).
valgrind may help.

@VinInn
Contributor

VinInn commented Jun 19, 2024

The input jet seems different

@VinInn
Contributor

VinInn commented Jun 20, 2024

In the event there are 123 jets (sic). Jet 2 has 16 constituents on Intel and 17 on AMD; all others have the same number.
A spurious pfCand or a difference in the jet algo?
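
A comparison like this can be scripted once the per-jet constituent counts have been dumped on both machines. The following is a minimal sketch, assuming the counts are available as two plain lists in the same jet order; the function name `differing_jets` and the example numbers (apart from the 16 versus 17 reported above) are illustrative only.

```python
def differing_jets(counts_a, counts_b):
    """Return (jet_index, n_a, n_b) for jets whose constituent counts differ."""
    if len(counts_a) != len(counts_b):
        raise ValueError("the two runs do not contain the same number of jets")
    return [
        (index, n_a, n_b)
        for index, (n_a, n_b) in enumerate(zip(counts_a, counts_b))
        if n_a != n_b
    ]


if __name__ == "__main__":
    # Illustrative counts for the first three jets (0-based index).
    intel = [17, 16, 12]
    amd = [17, 17, 12]
    print(differing_jets(intel, amd))
```

Raising on a length mismatch is a deliberate choice here: if the two runs disagree on the number of jets, an index-by-index comparison is no longer meaningful.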

@VinInn
Contributor

VinInn commented Jun 20, 2024

Anyhow, this is the protection I suggest adding:

diff --git a/RecoBTag/FeatureTools/src/deep_helpers.cc b/RecoBTag/FeatureTools/src/deep_helpers.cc
index 76b443542b3..faf1649d9b8 100644
--- a/RecoBTag/FeatureTools/src/deep_helpers.cc
+++ b/RecoBTag/FeatureTools/src/deep_helpers.cc
@@ -150,7 +150,7 @@ namespace btagbtvdeep {

   float vtx_ass_from_pfcand(const reco::PFCandidate &pfcand, int pv_ass_quality, const reco::VertexRef &pv) {
     float vtx_ass = pat::PackedCandidate::PVAssociationQuality(qualityMap[pv_ass_quality]);
-    if (pfcand.trackRef().isNonnull() && pv->trackWeight(pfcand.trackRef()) > 0.5 && pv_ass_quality == 7) {
+    if (pv.isNonnull() && pfcand.trackRef().isNonnull() && pv->trackWeight(pfcand.trackRef()) > 0.5 && pv_ass_quality == 7) {
       vtx_ass = pat::PackedCandidate::UsedInFitTight;
     }
     return vtx_ass;

Of course, there is plenty of room for optimization pretty much everywhere.
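
The protection relies on short-circuit evaluation: the validity checks come first, so the dereference is never reached for a null reference. The sketch below mimics that ordering in Python as an illustration only. `FakeRef` is a hypothetical stand-in for an edm::Ref, and the track weight is simplified to a dictionary entry rather than a real `trackWeight` lookup.

```python
class FakeRef:
    """Hypothetical stand-in for an edm::Ref; not a CMSSW class."""

    def __init__(self, target=None):
        self._target = target

    def is_nonnull(self):
        return self._target is not None

    def get(self):
        # Plays the role of operator->, which throws on a null reference.
        if self._target is None:
            raise RuntimeError("attempt to dereference a null reference")
        return self._target


def used_in_fit_tight(track_ref, pv_ref, pv_ass_quality):
    """Guarded condition: validity checks first, dereference afterwards."""
    return (
        pv_ref.is_nonnull()
        and track_ref.is_nonnull()
        and pv_ref.get()["track_weight"] > 0.5
        and pv_ass_quality == 7
    )


if __name__ == "__main__":
    track = FakeRef("some track")
    # The AMD-only case: valid track, null PV. No exception is raised.
    print(used_in_fit_tight(track, FakeRef(None), 0))
    print(used_in_fit_tight(track, FakeRef({"track_weight": 0.9}), 7))
```

With the null check missing, the first call would reach `pv_ref.get()` and raise, which corresponds to the BadRefCore exception in the logs.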

@VinInn
Contributor

VinInn commented Jun 26, 2024

The input file is no longer there:

Failed to open the file 'root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024E/ParkingSingleMuon4/RAW/v1/000/381/443/00000/95d48fcb-0633-415c-a5cf-f2caeebab628.root?eos.app=cmst0'

@VinInn
Contributor

VinInn commented Jun 27, 2024

Is there a way to recover the input file? I would really like to better understand the origin of the difference between AMD and Intel.

@mmusich
Contributor

mmusich commented Jun 27, 2024

Is there a way to recover the input file? I would really like to better understand the origin of the difference between AMD and Intel.

@germanfgv @LinaresToine please comment.

@missirol
Contributor

Should now be available at

/eos/cms/store/data/Run2024E/ParkingSingleMuon4/RAW/v1/000/381/443/00000/95d48fcb-0633-415c-a5cf-f2caeebab628.root

@missirol
Contributor

missirol commented Jun 30, 2024

On AMD, the generalTracks collection has one more track than in the Intel case, and that track has the following properties:

pt=0.0130999 eta=-3.36499 phi=-0.951959 ptError=0.0195098 dzError=-nan

@VinInn
Contributor

VinInn commented Jun 30, 2024

In principle we have removed all "raw" Ofast flags that could produce a difference.
Maybe it is TensorFlow.
I would tag this issue tracking-pog @slava77

@slava77
Contributor

slava77 commented Jul 1, 2024

In principle we have removed all "raw" Ofast flags that could produce a difference.

As I recall the evidence was that there are fewer differences between AMD and Intel; there was no evidence that the results become identical.

@slava77
Contributor

slava77 commented Jul 1, 2024

pt=0.0130999 eta=-3.36499 phi=-0.951959 ptError=0.0195098 dzError=-nan

Is covariance(i_dsz, i_dsz) also nan or is it negative?

@missirol
Contributor

missirol commented Jul 1, 2024

pt=0.0130999 eta=-3.36499 phi=-0.951959 ptError=0.0195098 dzError=-nan

Is covariance(i_dsz, i_dsz) also nan or is it negative?

It is negative. Patch in [*] and output below.

XXX pt=0.0130999 eta=-3.36499 phi=-0.951959 dzError=-nan vtxIdMinSignif=-1 covariance(4, 4)=-0.281146

[*]

diff --git a/CommonTools/RecoAlgos/src/PrimaryVertexAssignment.cc b/CommonTools/RecoAlgos/src/PrimaryVertexAssignment.cc
index fad6b30333b..05042d01cca 100644
--- a/CommonTools/RecoAlgos/src/PrimaryVertexAssignment.cc
+++ b/CommonTools/RecoAlgos/src/PrimaryVertexAssignment.cc
@@ -5,6 +5,7 @@
 #include "DataFormats/Math/interface/deltaR.h"
 #include "TrackingTools/IPTools/interface/IPTools.h"
 #include "FWCore/Utilities/interface/isFinite.h"
+#include "FWCore/MessageLogger/interface/MessageLogger.h"
 
 std::pair<int, PrimaryVertexAssignment::Quality> PrimaryVertexAssignment::chargedHadronVertex(
     const reco::VertexCollection& vertices,
@@ -184,6 +185,10 @@ std::pair<int, PrimaryVertexAssignment::Quality> PrimaryVertexAssignment::charge
   // all other tracks could be non-B secondaries and we just attach them with closest Z
   if (vtxIdMinSignif >= 0)
     return {vtxIdMinSignif, PrimaryVertexAssignment::OtherDz};
+
+edm::LogPrint("AAAA") << "XXX pt=" << track->pt() << " eta=" << track->eta() << " phi=" << track->phi() << " dzError=" << track->dzError() << " vtxIdMinSignif=" << vtxIdMinSignif
+<< " covariance(4, 4)=" << track->covariance(4, 4);
+
   //If for some reason even the dz failed (when?) we consider the track not assigned
   return {-1, PrimaryVertexAssignment::Unassigned};
 }
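
The negative covariance(4, 4) is consistent with the -nan in dzError, assuming the reported error is derived from the square root of that diagonal element. The sketch below illustrates this with standard-library Python only. Python's `math.sqrt` raises on negative input, so the helper `c_style_sqrt` (a hypothetical name) returns NaN instead, imitating the behaviour of the C/C++ `sqrt`. `has_physical_variance` is likewise a hypothetical sanity check, not an existing CMSSW function.

```python
import math


def c_style_sqrt(x):
    """Return NaN for negative input, as C/C++ sqrt does, instead of raising."""
    return math.sqrt(x) if x >= 0.0 else math.nan


def has_physical_variance(variance):
    """A variance must be finite and non-negative to yield a usable error."""
    return math.isfinite(variance) and variance >= 0.0


if __name__ == "__main__":
    cov_44 = -0.281146  # value printed in the output above
    print(c_style_sqrt(cov_44))
    print(has_physical_variance(cov_44))
```

A check of this kind on the diagonal elements would flag the extra AMD track before its errors are used downstream.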

@VinInn
Contributor

VinInn commented Jul 1, 2024

Why only on AMD?
(Or better: why is the track not there at all on Intel?)

@slava77
Contributor

slava77 commented Jul 1, 2024

type tracking

@mandrenguyen
Contributor

It seems we now also have a different failure that only occurs on AMD:
#45398
Just cross-posting it here.
