NanoAOD VertexException in PromptReco_Run381515_ParkingVBF0 (CMSSW 14_0_7, on AMD arch) #45189
A new Issue was created by @gpetruc. @Dr15Jones, @rappoccio, @makortel, @smuzaffar, @antoniovilela, @sextonkennedy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Assign RecoMuon/GlobalTrackingTools |
New categories assigned: reconstruction @jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks |
type muon |
The problem seems to be that this event has a PV which, despite being isValid(), has a ~zero covariance() matrix:
gives:
This goes through I can't think of a way of adding a simple protection to check for the covariance matrix to be sensible, so I think the easiest solution is to catch the exception in MuonBeamspotConstraintValueMapProducer. |
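To illustrate why a vertex that reports itself as valid can still fail further down the chain, here is a minimal sketch in plain Python. It is not CMSSW code: the names `det3`, `invert3` and `SingularMatrixError`, and the tolerance `rel_tol`, are assumptions made up for this example. It inverts a 3x3 matrix through the adjugate and refuses when the determinant is negligible compared with the size of the entries, which is the situation an almost-zero covariance matrix produces.

```python
class SingularMatrixError(ValueError):
    """Raised when a matrix cannot be inverted reliably (hypothetical name)."""


def det3(m):
    """Determinant of a 3x3 matrix given as three rows of three numbers."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def invert3(m, rel_tol=1e-12):
    """Invert a 3x3 matrix via the adjugate, refusing near-singular input."""
    scale = max(abs(x) for row in m for x in row)
    det = det3(m)
    # Compare the determinant with the cube of the largest entry, so that the
    # check does not depend on the units the matrix is expressed in.
    if scale == 0.0 or abs(det) <= rel_tol * scale**3:
        raise SingularMatrixError("could not invert matrix")
    (a, b, c), (d, e, f), (g, h, i) = m
    adjugate = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[x / det for x in row] for row in adjugate]
```

A check of this kind only flags matrices that are singular relative to their own scale. Whether a given threshold is physically sensible for a primary-vertex covariance is a separate question, which is consistent with the remark above that a simple protection is hard to define.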
In the meanwhile I made a PR for this in master: #45243 |
The file is no longer there.
|
@germanfgv @LinaresToine please comment. |
I managed to test the fix #45243 before it disappeared. |
It seems another job failed at Tier0 with a crash showing similar features: https://cmsweb.cern.ch/t0_reqmon/data/jobdetail/PromptReco_Run384981_JetMET1. @namapane FYI. |
Hello all |
I think so, the job was run in |
Is this parallel to #45189 ? The new occurrence is for JetMET1, which seems to belong in the mentioned issue. |
What do you mean? This issue is 45189. |
Thanks Marco, I meant #45520. As you mentioned in cmstalk, they refer to different modules |
Thanks, I can reproduce the crash (on an AMD machine,
#!/bin/bash
export SCRAM_ARCH=el8_amd64_gcc12
scram p CMSSW_14_0_14
cd CMSSW_14_0_14/src
eval `scram runtime -sh`
cp /eos/home-c/cmst0/public/PausedJobs/Run2024G/VertexException/vocms0314.cern.ch-2761618-12-log.tar.gz .
tar xf vocms0314.cern.ch-2761618-12-log.tar.gz
cp -pr ./job/WMTaskSpace/cmsRun1/PSet.pkl .
cat > PSet_one.py <<END
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
process.source.skipEvents=cms.untracked.uint32(766)
process.options.wantSummary = cms.untracked.bool(True)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1
END
cmsRun PSet_one.py 2>&1 | tee PSet_one.log
This results immediately (at the first event) in:
----- Begin Fatal Exception 03-Sep-2024 09:37:22 CEST-----------------------
An exception of category 'VertexException' occurred while
[0] Processing Event run: 384981 lumi: 572 event: 1260938254 stream: 0
[1] Running path 'write_NANOAOD_step'
[2] Prefetching for module PoolOutputModule/'write_NANOAOD'
[3] Prefetching for module SimplePATMuonFlatTableProducer/'muonTable'
[4] Calling method for module MuonBeamspotConstraintValueMapProducer/'muonBSConstrain'
Exception Message:
BasicSingleVertexState::could not invert weight matrix
----- End Fatal Exception ------------------------------------------------- |
With this simple patch:
diff --git a/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc b/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
index 74459f475cb..a83f3d98268 100644
--- a/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
+++ b/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
@@ -65,15 +65,21 @@ private:
         // Protect for mis-reconstructed beamspots (note that
         // SingleTrackVertexConstraint uses the width for the constraint,
         // not the error)
+
         if ((BeamWidthXError / BeamWidthX < 0.3) && (BeamWidthYError / BeamWidthY < 0.3)) {
-          SingleTrackVertexConstraint::BTFtuple btft =
-              stvc.constrain(ttkb->build(muon.muonBestTrack()), *beamSpotHandle);
-          if (std::get<0>(btft)) {
-            const reco::Track& trkBS = std::get<1>(btft).track();
-            pts.push_back(trkBS.pt());
-            ptErrs.push_back(trkBS.ptError());
-            chi2s.push_back(std::get<2>(btft));
-            tbd = false;
+          try {
+            SingleTrackVertexConstraint::BTFtuple btft =
+                stvc.constrain(ttkb->build(muon.muonBestTrack()), *beamSpotHandle);
+
+            if (std::get<0>(btft)) {
+              const reco::Track& trkBS = std::get<1>(btft).track();
+              pts.push_back(trkBS.pt());
+              ptErrs.push_back(trkBS.ptError());
+              chi2s.push_back(std::get<2>(btft));
+              tbd = false;
+            }
+          } catch (const VertexException& exc) {
+            // Update failed; give up.
           }
         }
       }
the crash that can be reproduced with the recipe at #45189 (comment) is circumvented. |
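The control flow of the patch (attempt the constraint for each muon, and on a vertex exception carry on instead of aborting the event) can be sketched outside CMSSW as follows. This is an illustrative Python stand-in, not the producer's code: `VertexError`, `constrain` and `constrained_pts` are hypothetical names, the "refit" is a placeholder, and falling back to the input pt is an assumption of this sketch. The patch itself only leaves `tbd` set so that the existing downstream handling takes over.

```python
class VertexError(Exception):
    """Stand-in for the C++ VertexException (hypothetical Python name)."""


def constrain(pt, weight_det):
    """Toy constraint that fails when the weight matrix is not invertible."""
    if weight_det == 0.0:
        raise VertexError("could not invert weight matrix")
    return pt * 0.5  # placeholder for a refitted value, not real physics


def constrained_pts(muons):
    """Return one (pt, was_constrained) pair per muon, never raising VertexError."""
    results = []
    for pt, weight_det in muons:
        try:
            results.append((constrain(pt, weight_det), True))
        except VertexError:
            # The constraint failed for this muon only: record the
            # unconstrained value and continue with the next muon.
            results.append((pt, False))
    return results
```

Catching only the specific exception type, as the patch does with `VertexException`, keeps unrelated failures visible rather than silently swallowing them.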
Hi @mmusich, the patch looks good, let me test it too to double check and I will implement it in CMSSW. If I understand correctly, it needs to be backported to 14_0_X, right? |
Hi @24LopezR
Thank you.
correct. It needs to go in 14_2_X (master), 14_1_X (for HIon) and 14_0_X (for pp). |
+1 |
This issue is fully signed and ready to be closed. |
A PromptReco job failure in the NanoAOD step was observed at Tier0 with the following error message:
cms-talk thread: https://cms-talk.web.cern.ch/t/paused-job-for-promptreco-run381515-parkingvbf0-vertexexception/42163
The exception appears to be reproducible when running on a single event, but only on AMD: the job fails at Tier0 (AMD EPYC 7763) and on my desktop (AMD Ryzen 9 5950X), but not on an Intel machine I tested (Intel Xeon Silver 4216).
Instructions to reproduce it on an EL8 AMD machine: