-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test aarch64 build with -mno-outline-atomics #36788
Comments
A new Issue was created by @dan131riley Dan Riley. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@dan131riley , cms-sw/cmsdist#7582 should test a full rebuild of cmssw for aarch64 with additional |
Just for history: the exception is caused by trying to match track with
code, which is a bit misleading: the value of The sequence of calls leading to this is the following:
Basically, it tries to calculate ChiSquaredProbability for vertex with ChiSquared = NaN. |
Adding @cms-sw/reconstruction-l2 and @cms-sw/egamma-pog-l2 for their information |
this is the stacktrace when theChiSquared was set to nan
|
assign reconstruction
state.parameters().magneticFieldInInverseGeV(point).z(); is 0 . I printed the various values and this is what I see
This is reproduceable for workflow 4.44 step3. The divide by zero is also visible for |
New categories assigned: reconstruction @slava77,@jpata,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@cms-sw/reconstruction-l2 , did any one look in to this |
Hi @smuzaffar , I'm following your instruction for reproducing the error |
Hi @smuzaffar , I've tried in [1] scram p -n CMSSW_12_3_X_2022-02-06-2300__I36788 CMSSW CMSSW_12_3_X_2022-02-06-2300
cd CMSSW_12_3_X_2022-02-06-2300__I36788/src/
cmsenv
scram b -j8
runTheMatrix.py -i all -l 4.44 --ibeos |
@clacaputo , as mentioned in #36788 (comment) , it does not crash for
after
state.parameters().magneticFieldInInverseGeV(point).z()=0 and fac=inf . This causes aarch64 tests to crash while amd64 somehow ignore it
|
@smuzaffar , thanks for the clarification |
@cms-sw/reconstruction-l2 any update here? |
Hi @smuzaffar , I'm looking at it |
@clacaputo, by the way the following change allowed failing workflows to run but I am sure this might not be a correct fix.
|
any progress? |
Still digging. But it's not clear to me why cmssw/TrackingTools/TrajectoryParametrization/interface/GlobalTrajectoryParameters.h Line 102 in 6d2f660
0 for same GlobalPoint of the tracks
For example: Point: (1000.990479,608.831177,3858.323730)
fac: -inf
state.charge(): -1
state.parameters().magneticFieldInInverseGeV(point).z() = 0
//Magnetic Field at the Point
------ x = 0.000000
------ y = 0.000000
------ z = 0.000000
//Magnetic Field in GeV at the Point
------ x_gev = 0.000000
------ y_gev = 0.000000
------ z_gev = 0.000000 Does anyone have an idea why B field should be zero? |
what else would it be at |
type egamma |
any update on this? We still have over 80 workflows failing for aarch64 IBs due to this. Error message in latest tests is
|
Where is the
called? IIUC, it's not a part of the crashing stack of calls @cms-sw/egamma-pog-l2 |
There is a call to
|
that's the end point of the problem (the input is nan). My other idea (implied in mentioning that the particle state is way outside of the tracker) is that a check for the particle/track to be inside of the tracker done upstream would be better |
I was a bit surprised to see this: cmssw/RecoEgamma/EgammaPhotonProducers/src/ConversionProducer.cc Lines 585 to 586 in 6d2f660
It is being checked if theConversionVertex is valid or not, and the check passes although the theConversionVertex has nan chi2() and ndof()?Will it make sense to update vertex's isValid() check to catch nan ?
Alternatively, there is a check if the vertex is good, here: cmssw/RecoEgamma/EgammaPhotonProducers/src/ConversionProducer.cc Lines 534 to 539 in 6d2f660
we can improve the function where this check is done.. ie, we can either improve checkVertex function to catch the nan here,cmssw/RecoEgamma/EgammaPhotonProducers/src/ConversionProducer.cc Lines 161 to 165 in 6d2f660
or improve the ConversionVertexFinder::run function, defined in https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/RecoEgamma/EgammaPhotonAlgos/src/ConversionVertexFinder.cc
Of course, Slava's idea of fixing it earlier (ensuring track to be inside of the tracker) is really nice, but it's a bit beyond egamma's expertise. |
After staring at the issue longer, I agree with the proposed solution |
closing this one and will open a new one |
We're seeing a large number of failures on aarch64. Most of them are uninformative exceptions of the form
but there's a handful of segmentation violations that seem to point to problems with atomic operations
The relatively new ThunderX2 nodes (olarm-201.cern.ch, olarm-202.cern.ch) support ARMv8.1 Large System Extension atomics, and gcc 10.1+ enables their use by default. I'd like a test build with
-mno-outline-atomics
in the gcc/g++ flags to see if that changes the behavior.The text was updated successfully, but these errors were encountered: