
NanoAOD VertexException in PromptReco_Run381515_ParkingVBF0 (CMSSW 14_0_7, on AMD arch) #45189

Open
gpetruc opened this issue Jun 11, 2024 · 25 comments


@gpetruc (Contributor) commented Jun 11, 2024

A PromptReco job failure in the NanoAOD step was observed at the Tier0 with the following error message.
cms-talk thread: https://cms-talk.web.cern.ch/t/paused-job-for-promptreco-run381515-parkingvbf0-vertexexception/42163

----- Begin Fatal Exception 11-Jun-2024 10:46:46 CEST-----------------------
An exception of category 'VertexException' occurred while
   [0] Processing  Event run: 381515 lumi: 384 event: 765632765 stream: 0
   [1] Running path 'write_NANOAOD_step'
   [2] Prefetching for module PoolOutputModule/'write_NANOAOD'
   [3] Prefetching for module SimplePATMuonFlatTableProducer/'muonTable'
   [4] Calling method for module MuonBeamspotConstraintValueMapProducer/'muonBSConstrain'
Exception Message:
BasicSingleVertexState::could not invert weight matrix
----- End Fatal Exception -------------------------------------------------

The exception appears to be reproducible running on a single event, but only on AMD: the job fails at the Tier0 (AMD EPYC 7763) and on my desktop (AMD Ryzen 9 5950X), but not on an Intel machine I tested (Intel Xeon Silver 4216).

Instructions to reproduce it on an EL8 AMD machine:

export SCRAM_ARCH=el8_amd64_gcc12
cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/vertexException/job/WMTaskSpace/cmsRun1/PSet.pkl .
cat > PSet_one.py <<END
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

process.source.eventsToProcess = cms.untracked.VEventRange("381515:384:765632765",)
process.options.wantSummary = cms.untracked.bool(True)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1
END
cmsRun PSet_one.py 2>&1 | tee PSet_one.log

@cmsbuild (Contributor) commented Jun 11, 2024

cms-bot internal usage

@cmsbuild (Contributor)

A new Issue was created by @gpetruc.

@Dr15Jones, @rappoccio, @makortel, @smuzaffar, @antoniovilela, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones (Contributor)

Assign RecoMuon/GlobalTrackingTools

@cmsbuild (Contributor)

New categories assigned: reconstruction

@jfernan2, @mandrenguyen you have been requested to review this Pull request/Issue and eventually sign. Thanks

@jfernan2 (Contributor)

type muon

@cmsbuild added the muon label Jun 11, 2024

@gpetruc (Contributor, Author) commented Jun 17, 2024

Adding @namapane as I believe the exception comes from the 14_0_X backport of #42646

@namapane (Contributor)

Adding @namapane as I believe the exception comes from the 14_0_X backport of #42646

Thanks, looking into it.

@namapane (Contributor)

The problem seems to be that this event has a PV which, despite being isValid(), has a ~zero covariance() matrix:

auto pv = pvHandle->at(0);
cout << pv.isValid() << endl << pv.covariance() << endl;

gives:

1
[  6.30659e-07  2.46501e-08 -1.17693e-06
   2.46501e-08  5.49215e-07 -3.72355e-07
  -1.17693e-06 -3.72355e-07  6.69129e-06 ]

This goes through SingleTrackVertexConstraint::constrain() -> KalmanVertexTrackUpdator::update() -> KVFHelper::vertexChi2, which then fails at this point.
The event also contains a 2.8e9 GeV muon, so there must be something pathological.
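
For illustration only, here is a minimal standalone ROOT/C++ sketch (not CMSSW code, and not part of the producer) that loads the covariance printed above into an SMatrix and checks whether it can be inverted; the real failure happens deeper in the Kalman-filter update, where the weight (inverse-covariance) matrix is built, and whether the inversion succeeds can depend on the exact floating-point code path:

// Standalone illustration; compile with e.g.
//   g++ cov_check.cc $(root-config --cflags --libs) -o cov_check
#include <iostream>
#include "Math/SMatrix.h"

int main() {
  // General 3x3 matrix filled with the PV covariance values quoted above
  ROOT::Math::SMatrix<double, 3, 3> cov;
  cov(0, 0) = 6.30659e-07;  cov(0, 1) = 2.46501e-08;  cov(0, 2) = -1.17693e-06;
  cov(1, 0) = 2.46501e-08;  cov(1, 1) = 5.49215e-07;  cov(1, 2) = -3.72355e-07;
  cov(2, 0) = -1.17693e-06; cov(2, 1) = -3.72355e-07; cov(2, 2) = 6.69129e-06;

  ROOT::Math::SMatrix<double, 3, 3> weight = cov;
  const bool ok = weight.Invert();  // false if the inversion fails (numerically singular)
  std::cout << "weight-matrix inversion ok: " << std::boolalpha << ok << "\n";
  return ok ? 0 : 1;
}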

I can't think of a simple protection that checks whether the covariance matrix is sensible, so I think the easiest solution is to catch the exception in MuonBeamspotConstraintValueMapProducer.
Let me know if you have objections or better suggestions.

@namapane (Contributor)

In the meantime I made a PR for this in master: #45243.
I suppose it needs to be backported to 14_0_X; if so, let me know.

@VinInn (Contributor) commented Jun 26, 2024

The file is no longer there:

Failed to open the file 'root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024E/ParkingVBF0/RAW/v1/000/381/515/00000/05c3f64e-bfd1-4969-af4a-c91a9ccd723f.root?eos.app=cmst0'

@mmusich (Contributor) commented Jun 27, 2024

The file is no longer there

@germanfgv @LinaresToine please comment.

@namapane (Contributor)

I managed to test the fix #45243 before the file disappeared.
I'm not sure how to reproduce the problem again if you think that's not enough, unless the file can be retrieved somehow.

@mmusich (Contributor) commented Sep 2, 2024

It seems another job at the Tier0 crashed with similar features: https://cmsweb.cern.ch/t0_reqmon/data/jobdetail/PromptReco_Run384981_JetMET1.
I think the job initially failed on AMD and was then retried on Intel (where it got past the crash, but somehow the job didn't finish correctly). @LinaresToine might give more details.

@namapane FYI.

@LinaresToine

Hello all,
I have saved the tarball of the latest occurrence in
/eos/home-c/cmst0/public/PausedJobs/Run2024G/VertexException
I also saved the input ROOT file there so the error can be reproduced.

@namapane (Contributor) commented Sep 2, 2024

Thanks @mmusich for the heads-up.
I am leaving for a one-week holiday, so I can check this one only when I'm back. In the meantime, did the job include the fix #45243?

@mmusich (Contributor) commented Sep 2, 2024

I am leaving for a one-week holiday, so I can check this one only when I'm back. In the meantime, did the job include the fix #45243?

I think so: the job was run in CMSSW_14_0_14, which should have included #45396 (it entered CMSSW_14_0_12).

@LinaresToine commented Sep 2, 2024

Is this parallel to #45189? The new occurrence is for JetMET1, which seems to belong to the mentioned issue.

@mmusich (Contributor) commented Sep 2, 2024

Is this parallel to #45189?

What do you mean? This issue is #45189.

@LinaresToine

Thanks Marco, I meant #45520. As you mentioned on cms-talk, they refer to different modules.

@mmusich (Contributor) commented Sep 3, 2024

I have saved the tarball of the latest occurrence in
/eos/home-c/cmst0/public/PausedJobs/Run2024G/VertexException
I also saved the input ROOT file there so the error can be reproduced.

Thanks, I can reproduce the crash (on an AMD machine, lxplus800 in my case) with the following script:

#!/bin/bash
export SCRAM_ARCH=el8_amd64_gcc12
scram p CMSSW_14_0_14
cd CMSSW_14_0_14/src
eval `scram runtime -sh`
cp /eos/home-c/cmst0/public/PausedJobs/Run2024G/VertexException/vocms0314.cern.ch-2761618-12-log.tar.gz .
tar xf vocms0314.cern.ch-2761618-12-log.tar.gz
cp -pr ./job/WMTaskSpace/cmsRun1/PSet.pkl .
cat > PSet_one.py <<END
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

process.source.skipEvents=cms.untracked.uint32(766)
process.options.wantSummary = cms.untracked.bool(True)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1
END
cmsRun PSet_one.py 2>&1 | tee PSet_one.log

This results immediately (at the first event) in:

----- Begin Fatal Exception 03-Sep-2024 09:37:22 CEST-----------------------
An exception of category 'VertexException' occurred while
   [0] Processing  Event run: 384981 lumi: 572 event: 1260938254 stream: 0
   [1] Running path 'write_NANOAOD_step'
   [2] Prefetching for module PoolOutputModule/'write_NANOAOD'
   [3] Prefetching for module SimplePATMuonFlatTableProducer/'muonTable'
   [4] Calling method for module MuonBeamspotConstraintValueMapProducer/'muonBSConstrain'
Exception Message:
BasicSingleVertexState::could not invert weight matrix 
----- End Fatal Exception -------------------------------------------------

@mmusich (Contributor) commented Sep 3, 2024

With this simple patch:

diff --git a/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc b/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
index 74459f475cb..a83f3d98268 100644
--- a/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
+++ b/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
@@ -65,15 +65,21 @@ private:
         // Protect for mis-reconstructed beamspots (note that
         // SingleTrackVertexConstraint uses the width for the constraint,
         // not the error)
+
         if ((BeamWidthXError / BeamWidthX < 0.3) && (BeamWidthYError / BeamWidthY < 0.3)) {
-          SingleTrackVertexConstraint::BTFtuple btft =
-              stvc.constrain(ttkb->build(muon.muonBestTrack()), *beamSpotHandle);
-          if (std::get<0>(btft)) {
-            const reco::Track& trkBS = std::get<1>(btft).track();
-            pts.push_back(trkBS.pt());
-            ptErrs.push_back(trkBS.ptError());
-            chi2s.push_back(std::get<2>(btft));
-            tbd = false;
+          try {
+            SingleTrackVertexConstraint::BTFtuple btft =
+                stvc.constrain(ttkb->build(muon.muonBestTrack()), *beamSpotHandle);
+
+            if (std::get<0>(btft)) {
+              const reco::Track& trkBS = std::get<1>(btft).track();
+              pts.push_back(trkBS.pt());
+              ptErrs.push_back(trkBS.ptError());
+              chi2s.push_back(std::get<2>(btft));
+              tbd = false;
+            }
+          } catch (const VertexException& exc) {
+            // Update failed; give up.
           }
         }
       }

the crash that one can reproduce with the recipe at #45189 (comment) is circumvented.
I'll let @cms-sw/reconstruction-l2 provide a patch to CMSSW, in case it is useful and correct to implement it.
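
For completeness, here is a sketch of how such a patch could be checked locally (standard CMSSW checkout and build steps; the package and file names come from the diff above, the rest is an assumption):

#!/bin/bash
# Assumes the CMSSW_14_0_14 area and PSet_one.py from the recipe above already exist.
cd CMSSW_14_0_14/src
eval `scram runtime -sh`
git cms-addpkg RecoMuon/GlobalTrackingTools
# apply the change to plugins/MuonBeamspotConstraintValueMapProducer.cc as in the diff above,
# then rebuild and rerun the single-event config
scram b -j 8
cmsRun PSet_one.py 2>&1 | tee PSet_one_patched.log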

@24LopezR (Contributor) commented Sep 4, 2024

Hi @mmusich, the patch looks good; let me test it too to double-check, and I will implement it in CMSSW. If I understand correctly, it needs to be backported to 14_0_X, right?

@mmusich (Contributor) commented Sep 4, 2024

Hi @24LopezR

The patch looks good; let me test it too to double-check, and I will implement it in CMSSW.

Thank you.

If I understand correctly, it needs to be backported to 14_0_X, right?

Correct. It needs to go into 14_2_X (master), 14_1_X (for HIon), and 14_0_X (for pp).

@jfernan2 (Contributor) commented Sep 6, 2024

+1

@cmsbuild (Contributor) commented Sep 6, 2024

This issue is fully signed and ready to be closed.
