Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test aarch64 build with -mno-outline-atomics #36788

Closed
dan131riley opened this issue Jan 24, 2022 · 28 comments
Closed

Test aarch64 build with -mno-outline-atomics #36788

dan131riley opened this issue Jan 24, 2022 · 28 comments

Comments

@dan131riley
Copy link

We're seeing a large number of failures on aarch64. Most of them are uninformative exceptions of the form

----- Begin Fatal Exception 24-Jan-2022 10:42:19 CET-----------------------
An exception of category 'Vertex' occurred while
[...]
Exception Message:
Refitted track not found in list
----- End Fatal Exception -------------------------------------------------

but there's a handful of segmentation violations that seem to point to problems with atomic operations

#3  0x000040002d8d3930 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02716/slc7_aarch64_gcc11/cms/cmssw/CMSSW_12_3_X_2022-01-16-2300/lib/slc7_aarch64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  __aarch64_ldadd8_acq_rel () at ../../../libgcc/config/aarch64/lse.S:265
#6  0x000040002eca7cec in FastTimerService::Measurement::measure_and_accumulate(FastTimerService::AtomicResources&) () from /cvmfs/cms-ib.cern.ch/nweek-02716/slc7_aarch64_gcc11/cms/cmssw/CMSSW_12_3_X_2022-01-16-2300/lib/slc7_aarch64_gcc11/pluginHLTriggerTimerPlugins.so

The relatively new ThunderX2 nodes (olarm-201.cern.ch, olarm-202.cern.ch) support ARMv8.1 Large System Extension atomics, and gcc 10.1+ enables their use by default. I'd like a test build with -mno-outline-atomics in the gcc/g++ flags to see if that changes the behavior.

@cmsbuild
Copy link
Contributor

A new Issue was created by @dan131riley Dan Riley.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

assign core

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@smuzaffar
Copy link
Contributor

@dan131riley , cms-sw/cmsdist#7582 should test a full rebuild of cmssw for aarch64 with additional -mno-outline-atomics flag

@iarspider
Copy link
Contributor

Just for history: the exception is caused by trying to match track with pT = NaN. All the tracks in a given (bad) event have pT=NaN. This is probably related to the following message:

GammaContinuedFraction::a too large, ITMAX too small
GammaContinuedFraction::a too large, ITMAX too small

code, which is a bit misleading: the value of a is fine (2.0), but the value of x is NaN.

The sequence of calls leading to this is the following:

 0# GammaContinuedFraction(float, float) in /afs/cern.ch/user/c/cmsbuild/CMSSW_12_3_X_2022-01-23-0000/lib/slc7_aarch64_gcc11/libCommonToolsStatistics.so
 1# IncompleteGammaComplement::value(float, float) in /afs/cern.ch/user/c/cmsbuild/CMSSW_12_3_X_2022-01-23-0000/lib/slc7_aarch64_gcc11/libCommonToolsStatistics.so
 2# ChiSquaredProbability(double, double) in /afs/cern.ch/user/c/cmsbuild/CMSSW_12_3_X_2022-01-23-0000/lib/slc7_aarch64_gcc11/libCommonToolsStatistics.so
 3# ConversionVertexFinder::run(std::vector<reco::TransientTrack, std::allocator<reco::TransientTrack> > const&, reco::Vertex&) in /afs/cern.ch/user/c/cmsbuild/CMSSW_12_3_X_2022-01-23-0000/lib/slc7_aarch64_gcc11/libRecoEgammaEgammaPhotonAlgos.so
 4# ConversionProducer::buildCollection(edm::Event&, edm::EventSetup const&, std::multimap<float, edm::Ptr<reco::ConversionTrack>, std::less<float>, std::allocator<std::pair<float const, edm::Ptr<reco::ConversionTrack> > > > const&, std::multimap<double, edm::Ptr<reco::CaloCluster>, std::less<double>, std::allocator<std::pair<double const, edm::Ptr<reco::CaloCluster> > > > const&, std::multimap<double, edm::Ptr<reco::CaloCluster>, std::less<double>, std::allocator<std::pair<double const, edm::Ptr<reco::CaloCluster> > > > const&, reco::Vertex const&, std::vector<reco::Conversion, std::allocator<reco::Conversion> >&) in /afs/cern.ch/user/c/cmsbuild/CMSSW_12_3_X_2022-01-23-0000/lib/slc7_aarch64_gcc11/pluginRecoEgammaEgammaPhotonProducers.so
 5# ConversionProducer::produce(edm::Event&, edm::EventSetup const&) in /afs/cern.ch/user/c/cmsbuild/CMSSW_12_3_X_2022-01-23-0000/lib/slc7_aarch64_gcc11/pluginRecoEgammaEgammaPhotonProducers.so
 6# edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) in /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc11/cms/cmssw/CMSSW_12_3_X_2022-01-23-0000/lib/slc7_aarch64_gcc11/libFWCoreFramework.so

Basically, it tries to calculate ChiSquaredProbability for vertex with ChiSquared = NaN.

@makortel
Copy link
Contributor

Adding @cms-sw/reconstruction-l2 and @cms-sw/egamma-pog-l2 for their information

@smuzaffar
Copy link
Contributor

this is the stacktrace when theChiSquared was set to nan

#0  boost::intrusive_ptr<KinematicVertex>::~intrusive_ptr (this=<optimized out>, __in_chrg=<optimized out>)
    at /data/muzaffar/CMSSW_12_3_X_2022-01-26-2300/src/RecoVertex/KinematicFitPrimitives/src/KinematicVertex.cc:22
#1  ReferenceCountingPointer<KinematicVertex>::~ReferenceCountingPointer (this=<optimized out>, __in_chrg=<optimized out>)
    at /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc11/cms/cmssw/CMSSW_12_3_X_2022-01-26-2300/src/DataFormats/GeometrySurface/interface/ReferenceCounted.h:60
#2  KinematicVertex::KinematicVertex (this=this@entry=0xfffe94190980, state=..., totalChiSq=totalChiSq@entry=-nan(0x400000), degreesOfFr=degreesOfFr@entry=3)
    at /data/muzaffar/CMSSW_12_3_X_2022-01-26-2300/src/RecoVertex/KinematicFitPrimitives/src/KinematicVertex.cc:28
#3  0x0000ffff047b5570 in KinematicVertexFactory::vertex (degreesOfFr=3, totalChiSq=-nan(0x400000), state=<synthetic pointer>...)
    at /data/muzaffar/CMSSW_12_3_X_2022-01-26-2300/src/RecoVertex/KinematicFitPrimitives/interface/KinematicVertexFactory.h:24
#4  KinematicConstrainedVertexUpdatorT<2, 2>::update (this=0xfffe8b2b4800, inPar=..., inCov=..., lStates=..., lPoint=..., fieldValue=..., cs=cs@entry=0xffff0e346770)
    at /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc11/cms/cmssw/CMSSW_12_3_X_2022-01-26-2300/src/RecoVertex/KinematicFit/interface/KinematicConstrainedVertexUpdatorT.h:166
#5  0x0000ffff047b66e8 in KinematicConstrainedVertexFitterT<2, 2>::fit (this=this@entry=0xffff0e346690, part=..., cs=cs@entry=0xffff0e346770, pt=pt@entry=0x0)
    at /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc11/cms/cmssw/CMSSW_12_3_X_2022-01-26-2300/src/RecoVertex/KinematicFit/interface/KinematicConstrainedVertexFitterT.h:215
#6  0x0000ffff047ad6c8 in KinematicConstrainedVertexFitterT<2, 2>::fit (cs=0xffff0e346770, part=..., this=0xffff0e346690)
    at /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc11/cms/cmssw/CMSSW_12_3_X_2022-01-26-2300/src/RecoVertex/KinematicFit/interface/KinematicConstrainedVertexFitterT.h:55
#7  ConversionVertexFinder::run (this=0xfffe19501198, _pair=..., the_vertex=...) at /data/muzaffar/CMSSW_12_3_X_2022-01-26-2300/src/RecoEgamma/EgammaPhotonAlgos/src/ConversionVertexFinder.cc:69
#8  0x0000ffff0488c268 in ConversionProducer::buildCollection(edm::Event&, edm::EventSetup const&, std::multimap<float, edm::Ptr<reco::ConversionTrack>, std::less<float>, std::allocator<std::pair<float const, edm::Ptr<reco::ConversionTrack> > > > const&, std::multimap<double, edm::Ptr<reco::CaloCluster>, std::less<double>, std::allocator<std::pair<double const, edm::Ptr<reco::CaloCluster> > > > const&, std::multimap<double, edm::Ptr<reco::CaloCluster>, std::less<double>, std::allocator<std::pair<double const, edm::Ptr<reco::CaloCluster> > > > const&, reco::Vertex const&, std::vector<reco::Conversion, std::allocator<reco::Conversion> >&)
    () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc11/cms/cmssw/CMSSW_12_3_X_2022-01-26-2300/lib/slc7_aarch64_gcc11/pluginRecoEgammaEgammaPhotonProducers.so
#9  0x0000ffff0488e7e0 in ConversionProducer::produce(edm::Event&, edm::EventSetup const&) ()
   from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc11/cms/cmssw/CMSSW_12_3_X_2022-01-26-2300/lib/slc7_aarch64_gcc11/pluginRecoEgammaEgammaPhotonProducers.so
#10 0x0000ffffbe6c3abc in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) ()
   from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc11/cms/cmssw/CMSSW_12_3_X_2022-01-26-2300/lib/slc7_aarch64_gcc11/libFWCoreFramework.so

@smuzaffar
Copy link
Contributor

assign reconstruction

nan is generated due to divide by zero at

double fac = state.charge() / state.parameters().magneticFieldInInverseGeV(point).z();
where state.parameters().magneticFieldInInverseGeV(point).z(); is 0. I printed the various values and this is what I see

state.charge() = 1
state.parameters().magneticFieldInInverseGeV(point).z() = 0.000000
fac = inf

This is reproduceable for workflow 4.44 step3. The divide by zero is also visible for slc7_amd64_gcc10 IBs too but strangely it is not crashing for amd64 but for aarch64 it fails. You can reproduce it for CMSSW_12_3_X IB (for amd64) by either running runTheMatrix.py -i all -l 4.44 --ibeos or cmsRun /afs/cern.ch/user/m/muzaffar/public/wf4_44_step3/step3.py .

@cmsbuild
Copy link
Contributor

New categories assigned: reconstruction

@slava77,@jpata,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@smuzaffar
Copy link
Contributor

@cms-sw/reconstruction-l2 , did any one look in to this divide by zero issue ( #36788 (comment) )

@clacaputo
Copy link
Contributor

Hi @smuzaffar , I'm following your instruction for reproducing the error

@clacaputo
Copy link
Contributor

Hi @smuzaffar , I've tried in slc7_amd64_gcc10 [1], as suggested in #36788 (comment), but I can't reproduce the problem. How can I access aarch64 machine?

[1]

scram p -n CMSSW_12_3_X_2022-02-06-2300__I36788 CMSSW CMSSW_12_3_X_2022-02-06-2300
cd CMSSW_12_3_X_2022-02-06-2300__I36788/src/
cmsenv
scram b -j8
runTheMatrix.py -i all -l 4.44 --ibeos

@smuzaffar
Copy link
Contributor

smuzaffar commented Feb 7, 2022

@clacaputo , as mentioned in #36788 (comment) , it does not crash for amd64 but divide by zero is visible. e.g. if you git cms-addpkg RecoVertex/KinematicFitPrimitives and then edit RecoVertex/KinematicFitPrimitives/src/TrackKinematicStatePropagator.cc file and print the values of

fac
state.charge()
state.parameters().magneticFieldInInverseGeV(point).z()

after

double fac = state.charge() / state.parameters().magneticFieldInInverseGeV(point).z();
then you will notice that state.parameters().magneticFieldInInverseGeV(point).z()=0 and fac=inf . This causes aarch64 tests to crash while amd64 somehow ignore it

@clacaputo
Copy link
Contributor

@smuzaffar , thanks for the clarification

@smuzaffar
Copy link
Contributor

@cms-sw/reconstruction-l2 any update here?

@clacaputo
Copy link
Contributor

Hi @smuzaffar , I'm looking at it

@smuzaffar
Copy link
Contributor

smuzaffar commented Feb 22, 2022

@clacaputo, by the way the following change allowed failing workflows to run but I am sure this might not be a correct fix.

--- a/RecoVertex/KinematicFitPrimitives/src/TrackKinematicStatePropagator.cc
+++ b/RecoVertex/KinematicFitPrimitives/src/TrackKinematicStatePropagator.cc
@@ -77,6 +77,10 @@ KinematicState TrackKinematicStatePropagator::propagateToTheTransversePCACharged
   //for helix barrel plane crossing
   FreeTrajectoryState const& fState = state.freeTrajectoryState();
   const GlobalPoint& iP = referencePoint;
+  if (fabs(fState.parameters().magneticFieldInInverseGeV(iP).z())<0.000001){
+    LogDebug("RecoVertex/TrackKinematicStatePropagator") << "z value is zero! State is invalid\n";
+    return KinematicState();
+  }
   std::pair<HelixBarrelPlaneCrossingByCircle, BoundPlane::BoundPlanePointer> cros = planeCrossing(fState, iP);

   HelixBarrelPlaneCrossingByCircle planeCrossing = cros.first;

@smuzaffar
Copy link
Contributor

any progress?

@clacaputo
Copy link
Contributor

any progress?

Still digging. But it's not clear to me why magneticFieldInInverseGeV defined here

GlobalVector magneticFieldInInverseGeV() const { return 2.99792458e-3f * cachedMagneticField; }
and called from here
const GlobalTrajectoryParameters& parameters() const { return theGlobalParameters; }
and here
double fac = state.charge() / state.parameters().magneticFieldInInverseGeV(point).z();
is returning a value of 0 for same GlobalPoint of the tracks

For example:

Point:   (1000.990479,608.831177,3858.323730) 
fac: -inf
state.charge(): -1 
state.parameters().magneticFieldInInverseGeV(point).z() = 0
//Magnetic Field at the Point
------ x = 0.000000
------ y = 0.000000
------ z = 0.000000
//Magnetic Field in GeV at the Point
------ x_gev = 0.000000
------ y_gev = 0.000000
------ z_gev = 0.000000

Does anyone have an idea why B field should be zero?

@slava77
Copy link
Contributor

slava77 commented Mar 11, 2022

Does anyone have an idea why B field should be zero?

what else would it be at Point: (1000.990479,608.831177,3858.323730) ? This is apparently at 11.7 m in r and 36 m in z

@jpata
Copy link
Contributor

jpata commented May 16, 2022

type egamma

@smuzaffar
Copy link
Contributor

any update on this? We still have over 80 workflows failing for aarch64 IBs due to this. Error message in latest tests is

Exception Message:
Refitted track not found in list.
 pt used for comparison: -nan

@slava77
Copy link
Contributor

slava77 commented Aug 24, 2022

Where is the

double fac = state.charge() / state.parameters().magneticFieldInInverseGeV(point).z();

called?
IIUC, it's not a part of the crashing stack of calls

@cms-sw/egamma-pog-l2

@swagata87
Copy link
Contributor

There is a call to ChiSquaredProbabilityfrom here:

const float chi2Prob = ChiSquaredProbability(theConversionVertex.chi2(), theConversionVertex.ndof());
, this could be related to the crash reported here. But I'm not entirely sure.

@slava77
Copy link
Contributor

slava77 commented Aug 24, 2022

There is a call to ChiSquaredProbabilityfrom here:

const float chi2Prob = ChiSquaredProbability(theConversionVertex.chi2(), theConversionVertex.ndof());

, this could be related to the crash reported here. But I'm not entirely sure.

that's the end point of the problem (the input is nan).
Perhaps the easiest is to check for a nan here?

My other idea (implied in mentioning that the particle state is way outside of the tracker) is that a check for the particle/track to be inside of the tracker done upstream would be better

@swagata87
Copy link
Contributor

swagata87 commented Aug 24, 2022

I was a bit surprised to see this:

if (theConversionVertex.isValid()) {
const float chi2Prob = ChiSquaredProbability(theConversionVertex.chi2(), theConversionVertex.ndof());

It is being checked if theConversionVertex is valid or not, and the check passes although the theConversionVertex has nan chi2() and ndof()?
Will it make sense to update vertex's isValid() check to catch nan?

Alternatively, there is a check if the vertex is good, here:

reco::Vertex theConversionVertex; //by default it is invalid
bool goodVertex = checkVertex(ttk_l, ttk_r, magField, theConversionVertex);
//bail as early as possible in case the fit didn't return a good vertex
if (!goodVertex) {
continue;

we can improve the function where this check is done..
ie, we can either improve checkVertex function to catch the nan here,
inline bool checkVertex(const reco::TransientTrack& ttk_l,
const reco::TransientTrack& ttk_r,
MagneticField const& magField,
reco::Vertex& the_vertex) {
return vertexFinder_.run({ttk_l, ttk_r}, the_vertex);
;
or improve the ConversionVertexFinder::run function, defined in https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/RecoEgamma/EgammaPhotonAlgos/src/ConversionVertexFinder.cc

Of course, Slava's idea of fixing it earlier (ensuring track to be inside of the tracker) is really nice, but it's a bit beyond egamma's expertise.

@slava77
Copy link
Contributor

slava77 commented Aug 24, 2022

@clacaputo, by the way the following change allowed failing workflows to run but I am sure this might not be a correct fix.

--- a/RecoVertex/KinematicFitPrimitives/src/TrackKinematicStatePropagator.cc
+++ b/RecoVertex/KinematicFitPrimitives/src/TrackKinematicStatePropagator.cc
@@ -77,6 +77,10 @@ KinematicState TrackKinematicStatePropagator::propagateToTheTransversePCACharged
   //for helix barrel plane crossing
   FreeTrajectoryState const& fState = state.freeTrajectoryState();
   const GlobalPoint& iP = referencePoint;
+  if (fabs(fState.parameters().magneticFieldInInverseGeV(iP).z())<0.000001){
+    LogDebug("RecoVertex/TrackKinematicStatePropagator") << "z value is zero! State is invalid\n";
+    return KinematicState();
+  }
   std::pair<HelixBarrelPlaneCrossingByCircle, BoundPlane::BoundPlanePointer> cros = planeCrossing(fState, iP);

   HelixBarrelPlaneCrossingByCircle planeCrossing = cros.first;

After staring at the issue longer, I agree with the proposed solution

@smuzaffar
Copy link
Contributor

closing this one and will open a new one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants