Crashes in workflow 39434.911 #39445
Comments
A new Issue was created by @makortel Matti Kortelainen. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
This was due to a segmentation violation. |
In CMSSW_12_6_ROOT626_X_2022-09-18-2300 el8_amd64_gcc10 step 2 crashed with
The |
That is really weird. We disable ROOT's signal handling as part of RootInitHandler. |
@pcanal did ROOT change their signal handling in 6.26? |
In CMSSW_12_6_X_2022-09-18-0000 el8_amd64_gcc10 step 2 crashed with
|
In CMSSW_12_6_X_2022-09-17-1100 el8_amd64_gcc10 step 3 crashed with
|
In CMSSW_12_6_X_2022-09-16-2300 slc7_amd64_gcc10 step 3 crashed with
|
In CMSSW_12_6_X_2022-09-16-1100 slc7_amd64_gcc10 step 2 crashed with
|
In CMSSW_12_6_UBSAN_X_2022-09-16-1100 el8_amd64_gcc11 step 2 crashed with
I did not see any HCAL-related messages from UBSAN itself earlier in the log |
For CMSSW_12_6_X_2022-09-15-2300 el9_amd64_gcc11 step 2 the IB dashboard reports a timeout, but we did get some stack traces
|
CMSSW_12_6_X_2022-09-15-2300 seems to be the first IB where these crashes appeared. The HcalDDDRecConstants::getHCID() seems to play some role in most of them. |
assign geometry, upgrade (let's start with these, since the problem seems to be specific to Phase 2 DD4Hep workflow) |
New categories assigned: geometry,upgrade @mdhildreth,@AdrianoDee,@ianna,@Dr15Jones,@srimanob,@makortel,@bsunanda,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks |
The UBSAN log points to this line
|
I don't think so. |
Occurred in
both pointing to |
ASAN gives something helpful in CMSSW_12_6_ASAN_X_2022-09-19-1100 step 2
|
None of the PRs merged in CMSSW_12_6_X_2022-09-15-2300 (#39391, #39382, #39294, #39388) seem relevant. The earlier IB, CMSSW_12_6_X_2022-09-15-1100, has #37951, which looks like a possible cause (the crash is in HCAL geometry code, and the crashing workflow uses the D88 geometry, which contains C17 that was touched in that PR). @bsunanda could you take a look? |
It really looks like we're sometimes getting segfaults before InitRootHandlers runs, so we get the segfault handled by the ROOT handlers, then InitRootHandlers apparently runs, and it handles an additional segfault while exiting. It isn't consistent, so it seems like there must be a race condition there, dunno how that could be possible? |
@dan131riley could ROOT be handling a signal we are not? It seems ROOT is complaining about a floating point problem which we do not handle. |
@Dr15Jones found some discussion from three years ago at #28112 (comment). This looks like the same thing: we don't catch SIGFPE and something is enabling FPEs behind our backs. Then the buffer overflow is returning some garbage that triggers an FPE. |
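To make the distinction in the comment above concrete: with glibc's default floating-point environment, invalid operations only set a status flag and no SIGFPE is delivered; a signal is raised only once trapping has been enabled for that exception. The following is a minimal sketch, assuming Linux/glibc (`fegetexcept` is a glibc extension). The function names are hypothetical and not part of CMSSW or ROOT.

```cpp
#include <cassert>
#include <fenv.h>  // fegetexcept() is a glibc extension (Linux)

// True if any floating-point exception is currently set to trap (raise SIGFPE).
bool fpeTrapsEnabled() { return fegetexcept() != 0; }

// Performs a / b at run time and reports whether FE_INVALID was flagged.
// With traps masked this only sets a status flag; no signal is delivered.
bool divisionFlagsInvalid(double a, double b) {
  volatile double num = a, den = b;  // volatile keeps the division at run time
  feclearexcept(FE_ALL_EXCEPT);
  volatile double result = num / den;
  (void)result;
  return fetestexcept(FE_INVALID) != 0;
}

// Usage example: 0/0 flags FE_INVALID, an ordinary division does not.
void fpeFlagDemo() {
  assert(divisionFlagsInvalid(0.0, 0.0));
  assert(!divisionFlagsInvalid(1.0, 2.0));
}
```

A check like `fpeTrapsEnabled()` placed at suspect points could help narrow down where trapping gets switched on, although it only reports the state of the calling thread.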
Thinking about this some more, if something were turning on FPEs globally then I'd expect to get a ton of crashes, since we generate lots of NaNs and junk in tracking. So maybe something is doing a scoped manipulation of the FPE handler state, and that introduces a race condition. I'm going to add FPE handling to InitRootHandlers, as that might help identify what's resetting the FPE state if it is scoped. |
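For readers unfamiliar with what a "scoped manipulation of the FPE handler state" might look like, here is a minimal sketch, assuming Linux/glibc (`feenableexcept`, `fedisableexcept` and `fegetexcept` are glibc extensions). `ScopedFPETrap` is a hypothetical helper written for illustration; it is not the code suspected in this issue.

```cpp
#include <cassert>
#include <fenv.h>  // feenableexcept/fedisableexcept/fegetexcept: glibc extensions

// Hypothetical RAII helper (not CMSSW code): enables trapping for the given
// floating-point exceptions and restores the previous trap mask on destruction.
class ScopedFPETrap {
public:
  explicit ScopedFPETrap(int excepts) : previous_(fegetexcept()) {
    feclearexcept(FE_ALL_EXCEPT);  // do not trap on flags raised earlier
    feenableexcept(excepts);
  }
  ~ScopedFPETrap() {
    fedisableexcept(FE_ALL_EXCEPT);
    feenableexcept(previous_);  // put back whatever was enabled before
  }
  ScopedFPETrap(const ScopedFPETrap&) = delete;
  ScopedFPETrap& operator=(const ScopedFPETrap&) = delete;

private:
  int previous_;  // trap mask in effect when the guard was created
};

// Usage example: the trap mask after the scope equals the mask before it.
void scopedTrapDemo() {
  const int before = fegetexcept();
  {
    ScopedFPETrap guard(FE_DIVBYZERO);
    // Within this scope a floating-point division by zero would raise SIGFPE
    // on hardware that supports trapping.
  }
  assert(fegetexcept() == before);
}
```

The trap mask is part of the per-thread floating-point environment, and a thread created with `pthread_create` inherits the environment of its creator. One way such a guard could produce inconsistent behaviour (an assumption, not something established in this thread) is if threads are started while the guard is active and keep trapping enabled after the guard has gone out of scope.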
I am testing with CMSSW_12_6_X_2022-09-21-2300. I am not sure whether the PR has been reverted, but running 39434.911 I find that all 4 steps run satisfactorily. Please advise how I can reproduce this problem. Sunanda
|
In normal IBs the crashes occur randomly, so you may have to try to run many times or try to load the machine. Maybe try to run on an ASAN IB, e.g. CMSSW_12_6_ASAN_X_2022-09-21-1100? |
Possibly related, from the UBSAN IBs, unexpected(?) -1 seems like a problem:
followed by
and finally a segfault
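As a general illustration of why an unexpected -1 is worrying when it ends up being used as an index: reading a container at position -1 is undefined behaviour, and may return garbage or crash later rather than failing at the point of the bad lookup. The sketch below shows one defensive pattern. `checkedAt` and the sample data are hypothetical and do not reproduce the actual HCAL code or the log output referred to above.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper, not CMSSW code: returns the element at a signed index,
// or std::nullopt when the index is negative or past the end.
template <typename T>
std::optional<T> checkedAt(const std::vector<T>& values, int index) {
  if (index < 0 || static_cast<std::size_t>(index) >= values.size()) {
    return std::nullopt;
  }
  return values[static_cast<std::size_t>(index)];
}

// Usage example with made-up data: -1 and 3 are rejected, 1 is accepted.
void checkedAtDemo() {
  const std::vector<int> values{10, 20, 30};
  assert(!checkedAt(values, -1).has_value());
  assert(!checkedAt(values, 3).has_value());
  assert(checkedAt(values, 1).value() == 20);
}
```

Returning an empty `std::optional` forces the caller to decide what a missing entry means, instead of silently reading memory outside the container.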
|
OK Matti
|
urgent |
This problem got fixed by #39967 |
@cmsbuild, please close |
Workflow 39434.911 step 3 crashed in CMSSW_12_6_X_2022-09-18-2300 el9_amd64_gcc11 in function
HcalDDDRecConstants::getHCID()
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el9_amd64_gcc11/CMSSW_12_6_X_2022-09-18-2300/pyRelValMatrixLogs/run/39434.911_TTbar_14TeV+2026D88_DD4hep+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal/step3_TTbar_14TeV+2026D88_DD4hep+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal.log#/