Probably thread related crashes in aarch64 IBs #31123
A new Issue was created by @Dr15Jones Chris Jones. @Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
This one is failing while trying to write a numeric value to an ostream. This std implementation is calling the underlying |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
The routine RPCSimSetUp::setRPCSetUp does a tremendous amount of output formatting which is then never seen because the resulting string is passed to LogDebug. See cmssw/SimMuon/RPCDigitizer/src/RPCSimSetUp.cc Line 106 in 5b54e3a
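One way to avoid paying for formatting that is never shown is to build the message only when debug output is actually enabled. The following is a minimal sketch of that idea, not the CMSSW MessageLogger API: `DebugLog`, `lazy`, and `describeValues` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a debug-logging switch (not the CMSSW MessageLogger).
struct DebugLog {
  bool enabled = false;
  std::ostream* sink = &std::cerr;

  // Calls makeMessage() only when debug output is enabled, so the
  // formatting cost is skipped entirely in normal running.
  template <typename MakeMessage>
  void lazy(MakeMessage&& makeMessage) {
    if (!enabled) {
      return;  // no formatting work is done
    }
    assert(sink != nullptr);
    *sink << makeMessage() << '\n';
  }
};

// Example of formatting work that writes numeric values to a stream.
std::string describeValues(const std::vector<double>& values) {
  std::ostringstream os;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (i > 0) {
      os << ' ';
    }
    os << values[i];
  }
  return os.str();
}
```

A caller would pass a lambda, for example `log.lazy([&] { return describeValues(values); });`, so the stream insertions run only when `log.enabled` is true.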
|
The crash happened on thread 4 which has a 'corrupted' stack trace. I saw other RelVals which also had crashes where the thread that crashed had a 'corrupted' stack trace. |
Another RelVal with a corrupted stack is which reports the following modules being run at the time of the crash
with the stack being
|
with running modules
Notice that the problem happens again in |
with running modules
|
Here we have some incredibly deep stacks (because of ROOT IO) and the crash is in ROOT's thread-local handling. |
did not generate a trace back but the running modules were
Again we see a crash happening in |
we have another corrupted stack. This time the only module reported running is
although the stack traces for the threads do show 3 other modules running. |
we have another corrupted stack with modules running:
|
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/136.878_RunMuonEG2018C+RunMuonEG2018C+HLTDR2_2018+RECODR2_2018reHLT_skimMuonEG_Offline+HARVEST2018/step3_RunMuonEG2018C+RunMuonEG2018C+HLTDR2_2018+RECODR2_2018reHLT_skimMuonEG_Offline+HARVEST2018.log#/ also shows a corrupted stack with crash in
|
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/136.873_RunHLTPhy2018C+RunHLTPhy2018C+HLTDR2_2018+RECODR2_2018reHLT_Offline+HARVEST2018/step3_RunHLTPhy2018C+RunHLTPhy2018C+HLTDR2_2018+RECODR2_2018reHLT_Offline+HARVEST2018.log#/ doesn't have a stacktrace (it timed out) but shows the crash happened in
|
didn't have a traceback (it says it timed out) and shows running modules as
|
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/136.826_RunMuOnia2017E+RunMuOnia2017E+HLTDR2_2017+RECODR2_2017reHLT_skimMuOnia_Prompt+HARVEST2017/step3_RunMuOnia2017E+RunMuOnia2017E+HLTDR2_2017+RECODR2_2017reHLT_skimMuOnia_Prompt+HARVEST2017.log#/ has a corrupted stack trace with crash in
|
The crashes almost invariably happen during the first 4 events, so they are most likely related to something being called for the first time. |
shows a corrupted stack trace with crash in
|
has a crash in TBB's internals
|
has no stack trace and shows the running modules as
|
has no stack trace and only reports one module running
|
Has no stack trace and shows the running modules as
|
Seems to be reporting multiple simultaneous crash reports. No stack traces are given
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/130.0_SinglePiPt10+SinglePiPt10+DIGI+RECO/step3_SinglePiPt10+SinglePiPt10+DIGI+RECO.log#/ seems to have the same sort of behavior. |
Doesn't have a traceback and shows the running modules as
|
Backport of D99607, commit 6415f424bc. Original commit message: --- When using the large code model with FastISel (for example via clang -O0 which adds the optnone attribute), FP constants could still be materialized using adrp + ldr. Unconditionally enable the existing path for MachO to materialize the constant in code. [...] --- See the discussion in cms-sw/cmssw#31123 for context on the observed crashes.
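For context on why `adrp + ldr` is a problem under the large code model: `adrp` encodes a signed 21-bit immediate counted in 4 KiB pages, so it can only reach addresses within roughly ±4 GiB of the instruction. The sketch below just illustrates that range check; `pageDelta` and `adrpReachable` are hypothetical helper names, not LLVM functions.

```cpp
#include <cstdint>

// Difference in 4 KiB pages between the target and the instruction address.
constexpr std::int64_t pageDelta(std::uint64_t pc, std::uint64_t target) {
  return static_cast<std::int64_t>(target >> 12) -
         static_cast<std::int64_t>(pc >> 12);
}

// True if `target` can be addressed from `pc` with a single adrp, i.e. the
// page difference fits in a signed 21-bit immediate.
constexpr bool adrpReachable(std::uint64_t pc, std::uint64_t target) {
  return pageDelta(pc, target) >= -(std::int64_t{1} << 20) &&
         pageDelta(pc, target) < (std::int64_t{1} << 20);
}
```

If JIT-allocated data such as a constant pool ends up farther away than this from the code that references it, the page offset does not fit in the immediate, which is why the fix materializes the constant in code instead.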
The 23h00 IB seems to be taking a while for aarch64, but so far there are no TFormula crashes, all the crashes are in onnxruntime. |
Yes, we have issues with one of the ARM nodes (disk full); that is why the RelVal jobs crashed. We have restarted the jobs, but as we only have one ARM node now, it will take some time. |
Backport of D99607, commit 6415f424bc. Original commit message: --- When using the large code model with FastISel (for example via clang -O0 which adds the optnone attribute), FP constants could still be materialized using adrp + ldr. Unconditionally enable the existing path for MachO to materialize the constant in code. [...] --- See the discussion in cms-sw/cmssw#31123 for context on the observed crashes. (cherry picked from commit 9e104ac)
The fix is now merged upstream in LLVM and in ROOT
@dan131riley does this also involve Cling or is this a separate issue? |
@hahnjo The aarch64 IBs are still running slow, but it looks like the CMSSW_11_3 2021-04-07-2300 slc7_aarch64_gcc9 IB has finished, and I don't see any Cling-related crashes. There are lots of onnxruntime crashes; those are unrelated to Cling and ROOT, and there's a separate issue for that at #32899. The Cling crashes were common enough that one IB is enough to convince me that the problems have all been resolved and we can close this much-too-long ticket. Thanks! |
+1 The TCling issue seems to be resolved with the last fix, so let's close this issue (and open new ones for possible other crashes). |
…project#7758) Backport of D99607, commit 6415f424bc. Original commit message: --- When using the large code model with FastISel (for example via clang -O0 which adds the optnone attribute), FP constants could still be materialized using adrp + ldr. Unconditionally enable the existing path for MachO to materialize the constant in code. [...] --- See the discussion in cms-sw/cmssw#31123 for context on the observed crashes. (cherry picked from commit 9e104ac)
+reconstruction based on #31123 (comment)
|
+1 |
This issue is fully signed and ready to be closed. |
Backport of D27629, commit 18805ea951. Original commit message: --- Makes sure that the unwind info uses 64bits pcrel relocation if a large code model is specified and handle the corresponding relocation in the ExecutionEngine. This can happen with certain kernel configuration (the same as the one in https://reviews.llvm.org/D27609, found at least on the ArchLinux stock kernel and the one used on https://www.packet.net/) using the builtin JIT memory manager. Co-authored-by: Yichao Yu <yyc1992@gmail.com> Co-authored-by: Valentin Churavy <v.churavy@gmail.com> --- Note: The handling in ExecutionEngine was committed in a different revision and is already part of LLVM 9. We need the part about emitting relocations because eh_frame (allocated in a data section) may be more than 4Gb away from the code section it references. See the discussion in cms-sw/cmssw#31123 for context. (cherry picked from commit f481e8f)
…project#7807) Backport of D99607, commit 6415f424bc. Original commit message: --- When using the large code model with FastISel (for example via clang -O0 which adds the optnone attribute), FP constants could still be materialized using adrp + ldr. Unconditionally enable the existing path for MachO to materialize the constant in code. [...] --- See the discussion in cms-sw/cmssw#31123 for context on the observed crashes. (cherry picked from commit 9e104ac)
After switching to run the IB RelVals using multiple threads, we are seeing 'random' crashes in the aarch64 builds.