
NanoAOD VertexException in PromptReco_Run381515_ParkingVBF0 (CMSSW 14_0_7, on AMD arch) #45189

Open
gpetruc opened this issue Jun 11, 2024 · 25 comments


@gpetruc (Contributor) commented Jun 11, 2024

A PromptReco job failure in the NanoAOD step was observed at the Tier0 with the following error message.
cms-talk thread: https://cms-talk.web.cern.ch/t/paused-job-for-promptreco-run381515-parkingvbf0-vertexexception/42163

----- Begin Fatal Exception 11-Jun-2024 10:46:46 CEST-----------------------
An exception of category 'VertexException' occurred while
   [0] Processing  Event run: 381515 lumi: 384 event: 765632765 stream: 0
   [1] Running path 'write_NANOAOD_step'
   [2] Prefetching for module PoolOutputModule/'write_NANOAOD'
   [3] Prefetching for module SimplePATMuonFlatTableProducer/'muonTable'
   [4] Calling method for module MuonBeamspotConstraintValueMapProducer/'muonBSConstrain'
Exception Message:
BasicSingleVertexState::could not invert weight matrix
----- End Fatal Exception -------------------------------------------------

The exception appears to be reproducible running on a single event, but only on AMD: the job fails at the Tier0 (AMD EPYC 7763) and on my desktop (AMD Ryzen 9 5950X), but not on an Intel machine I tested (Intel Xeon Silver 4216).

Instructions to reproduce it on an EL8 AMD machine:

export SCRAM_ARCH=el8_amd64_gcc12
cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/vertexException/job/WMTaskSpace/cmsRun1/PSet.pkl .
cat > PSet_one.py <<END
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

process.source.eventsToProcess = cms.untracked.VEventRange("381515:384:765632765",)
process.options.wantSummary = cms.untracked.bool(True)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1
END
cmsRun PSet_one.py 2>&1 | tee PSet_one.log

@cmsbuild (Contributor) commented Jun 11, 2024

cms-bot internal usage

@cmsbuild (Contributor)

A new Issue was created by @gpetruc.

@Dr15Jones, @rappoccio, @makortel, @smuzaffar, @antoniovilela, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones (Contributor)

Assign RecoMuon/GlobalTrackingTools

@cmsbuild (Contributor)

New categories assigned: reconstruction

@jfernan2, @mandrenguyen you have been requested to review this Pull request/Issue and eventually sign. Thanks

@jfernan2 (Contributor)

type muon

@cmsbuild added the muon label Jun 11, 2024

@gpetruc (Contributor, Author) commented Jun 17, 2024

Adding @namapane as I believe the exception comes from the 14_0_X backport of #42646

@namapane (Contributor)

Adding @namapane as I believe the exception comes from the 14_0_X backport of #42646

Thanks, looking into it.

@namapane (Contributor)

The problem seems to be that this event has a PV which, despite being isValid(), has a ~zero covariance() matrix:

auto pv = pvHandle->at(0);
cout << pv.isValid() << endl << pv.covariance() << endl;

gives:

1
[  6.30659e-07  2.46501e-08 -1.17693e-06
   2.46501e-08  5.49215e-07 -3.72355e-07
  -1.17693e-06 -3.72355e-07  6.69129e-06 ]

This goes through SingleTrackVertexConstraint::constrain() -> KalmanVertexTrackUpdator::update() -> KVFHelper::vertexChi2, which then fails at this point.
The event also contains a 2.8e9 GeV muon, so there must be something pathological.
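
For illustration only, here is a minimal standalone ROOT/C++ sketch (not CMSSW code, and not part of the producer) that loads the covariance printed above into an SMatrix and checks whether it can be inverted; the real failure happens deeper in the Kalman-filter update, where the weight (inverse-covariance) matrix is built, and whether the inversion succeeds can depend on the exact floating-point code path:

// Standalone illustration; compile with e.g.
//   g++ cov_check.cc $(root-config --cflags --libs) -o cov_check
#include <iostream>
#include "Math/SMatrix.h"

int main() {
  // General 3x3 matrix filled with the PV covariance values quoted above
  ROOT::Math::SMatrix<double, 3, 3> cov;
  cov(0, 0) = 6.30659e-07;  cov(0, 1) = 2.46501e-08;  cov(0, 2) = -1.17693e-06;
  cov(1, 0) = 2.46501e-08;  cov(1, 1) = 5.49215e-07;  cov(1, 2) = -3.72355e-07;
  cov(2, 0) = -1.17693e-06; cov(2, 1) = -3.72355e-07; cov(2, 2) = 6.69129e-06;

  ROOT::Math::SMatrix<double, 3, 3> weight = cov;
  const bool ok = weight.Invert();  // false if the inversion fails (numerically singular)
  std::cout << "weight-matrix inversion ok: " << std::boolalpha << ok << "\n";
  return ok ? 0 : 1;
}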

I can't think of a simple protection that checks whether the covariance matrix is sensible, so I think the easiest solution is to catch the exception in MuonBeamspotConstraintValueMapProducer.
Let me know if you have objections or better suggestions.

@namapane (Contributor)

In the meantime I made a PR for this in master: #45243.
I suppose it needs to be backported to 14_0_X; if so, let me know.

@VinInn (Contributor) commented Jun 26, 2024

The file is no longer there:

Failed to open the file 'root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024E/ParkingVBF0/RAW/v1/000/381/515/00000/05c3f64e-bfd1-4969-af4a-c91a9ccd723f.root?eos.app=cmst0'

@mmusich (Contributor) commented Jun 27, 2024

The file is no longer there

@germanfgv @LinaresToine please comment.

@namapane (Contributor)

I managed to test the fix #45243 before the file disappeared.
I'm not sure how to reproduce the problem again if you think that's not enough, unless the file can be retrieved somehow.

@mmusich (Contributor) commented Sep 2, 2024

It seems another job at the Tier0 crashed with similar features: https://cmsweb.cern.ch/t0_reqmon/data/jobdetail/PromptReco_Run384981_JetMET1.
I think the job initially failed on AMD and was then retried on Intel (where it got past the crash, but somehow the job didn't finish correctly). @LinaresToine might give more details.

@namapane FYI.

@LinaresToine

Hello all,
I have saved the tarball of the latest occurrence in
/eos/home-c/cmst0/public/PausedJobs/Run2024G/VertexException
I also saved the input ROOT file there so the error can be reproduced.

@namapane (Contributor) commented Sep 2, 2024

Thanks @mmusich for the heads-up.
I am leaving for a one-week holiday, so I can check this one only when I'm back. In the meantime, did the job include the fix #45243?

@mmusich (Contributor) commented Sep 2, 2024

I am leaving for a one-week holiday, so I can check this one only when I'm back. In the meantime, did the job include the fix #45243?

I think so: the job was run in CMSSW_14_0_14, which should have included #45396 (it entered CMSSW_14_0_12).

@LinaresToine commented Sep 2, 2024

Is this parallel to #45189? The new occurrence is for JetMET1, which seems to belong to the mentioned issue.

@mmusich (Contributor) commented Sep 2, 2024

Is this parallel to #45189?

What do you mean? This issue is #45189.

@LinaresToine

Thanks Marco, I meant #45520. As you mentioned on cms-talk, they refer to different modules.

@mmusich (Contributor) commented Sep 3, 2024

I have saved the tarball of the latest occurrence in
/eos/home-c/cmst0/public/PausedJobs/Run2024G/VertexException
I also saved the input ROOT file there so the error can be reproduced.

Thanks, I can reproduce the crash (on an AMD machine, lxplus800 in my case) with the following script:

#!/bin/bash
export SCRAM_ARCH=el8_amd64_gcc12
scram p CMSSW_14_0_14
cd CMSSW_14_0_14/src
eval `scram runtime -sh`
cp /eos/home-c/cmst0/public/PausedJobs/Run2024G/VertexException/vocms0314.cern.ch-2761618-12-log.tar.gz .
tar xf vocms0314.cern.ch-2761618-12-log.tar.gz
cp -pr ./job/WMTaskSpace/cmsRun1/PSet.pkl .
cat > PSet_one.py <<END
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

process.source.skipEvents=cms.untracked.uint32(766)
process.options.wantSummary = cms.untracked.bool(True)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1
END
cmsRun PSet_one.py 2>&1 | tee PSet_one.log

This results immediately (at the first event) in:

----- Begin Fatal Exception 03-Sep-2024 09:37:22 CEST-----------------------
An exception of category 'VertexException' occurred while
   [0] Processing  Event run: 384981 lumi: 572 event: 1260938254 stream: 0
   [1] Running path 'write_NANOAOD_step'
   [2] Prefetching for module PoolOutputModule/'write_NANOAOD'
   [3] Prefetching for module SimplePATMuonFlatTableProducer/'muonTable'
   [4] Calling method for module MuonBeamspotConstraintValueMapProducer/'muonBSConstrain'
Exception Message:
BasicSingleVertexState::could not invert weight matrix 
----- End Fatal Exception -------------------------------------------------

@mmusich (Contributor) commented Sep 3, 2024

With this simple patch:

diff --git a/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc b/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
index 74459f475cb..a83f3d98268 100644
--- a/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
+++ b/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
@@ -65,15 +65,21 @@ private:
         // Protect for mis-reconstructed beamspots (note that
         // SingleTrackVertexConstraint uses the width for the constraint,
         // not the error)
+
         if ((BeamWidthXError / BeamWidthX < 0.3) && (BeamWidthYError / BeamWidthY < 0.3)) {
-          SingleTrackVertexConstraint::BTFtuple btft =
-              stvc.constrain(ttkb->build(muon.muonBestTrack()), *beamSpotHandle);
-          if (std::get<0>(btft)) {
-            const reco::Track& trkBS = std::get<1>(btft).track();
-            pts.push_back(trkBS.pt());
-            ptErrs.push_back(trkBS.ptError());
-            chi2s.push_back(std::get<2>(btft));
-            tbd = false;
+          try {
+            SingleTrackVertexConstraint::BTFtuple btft =
+                stvc.constrain(ttkb->build(muon.muonBestTrack()), *beamSpotHandle);
+
+            if (std::get<0>(btft)) {
+              const reco::Track& trkBS = std::get<1>(btft).track();
+              pts.push_back(trkBS.pt());
+              ptErrs.push_back(trkBS.ptError());
+              chi2s.push_back(std::get<2>(btft));
+              tbd = false;
+            }
+          } catch (const VertexException& exc) {
+            // Update failed; give up.
           }
         }
       }

the crash that one can reproduce with the recipe at #45189 (comment) is circumvented.
I'll let @cms-sw/reconstruction-l2 provide a patch to CMSSW, in case it is useful and correct to implement it.
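
For completeness, here is a sketch of how such a patch could be checked locally (standard CMSSW checkout and build steps; the package and file names come from the diff above, the rest is an assumption):

#!/bin/bash
# Assumes the CMSSW_14_0_14 area and PSet_one.py from the recipe above already exist.
cd CMSSW_14_0_14/src
eval `scram runtime -sh`
git cms-addpkg RecoMuon/GlobalTrackingTools
# apply the change to plugins/MuonBeamspotConstraintValueMapProducer.cc as in the diff above,
# then rebuild and rerun the single-event config
scram b -j 8
cmsRun PSet_one.py 2>&1 | tee PSet_one_patched.log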

@24LopezR (Contributor) commented Sep 4, 2024

Hi @mmusich, the patch looks good; let me test it too to double-check, and I will implement it in CMSSW. If I understand correctly, it needs to be backported to 14_0_X, right?

@mmusich (Contributor) commented Sep 4, 2024

Hi @24LopezR

The patch looks good; let me test it too to double-check, and I will implement it in CMSSW.

Thank you.

If I understand correctly, it needs to be backported to 14_0_X, right?

Correct. It needs to go into 14_2_X (master), 14_1_X (for HIon), and 14_0_X (for pp).

@jfernan2 (Contributor) commented Sep 6, 2024

+1

@cmsbuild (Contributor) commented Sep 6, 2024

This issue is fully signed and ready to be closed.
