[ARM] Assertion failure in gpuVertexFinder in 11634.24 #37820

Open
makortel opened this issue May 5, 2022 · 17 comments

Comments

@makortel
Contributor

makortel commented May 5, 2022

Workflow 11634.24 step 2 has been failing on el8_aarch64_gcc10 at least since CMSSW_12_4_X_2022-04-28-2300 with

cmsRun: /data/cmsbuild/jenkins_a/workspace/build-any-ib/w/tmp/BUILDROOT/28e97d506f1bae1e45437cea84c399e8/opt/cmssw/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/src/RecoPixelVertexing/PixelVertexFinding/plugins/gpuFitVertices.h:73: void gpuVertexFinder::fitVertices(gpuVertexFinder::ZVertices*, gpuVertexFinder::WorkSpace*, float): Assertion `wv[i] > 0.f' failed.

Thread 1 (Thread 0x400040a8b730 (LWP 2277007) "cmsRun"):
#3  0x00004000430d8600 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000400040e52c74 in raise () from /lib64/libc.so.6
#6  0x0000400040e4096c in abort () from /lib64/libc.so.6
#7  0x0000400040e4c4c4 in __assert_fail_base () from /lib64/libc.so.6
#8  0x0000400040e4c530 in __assert_fail () from /lib64/libc.so.6
#9  0x00004000f25e0edc in gpuVertexFinder::Producer::make(TrackSoAHeterogeneousT<32768> const*, float, float) const () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/pluginRecoPixelVertexingPixelVertexFindingPlugins.so
#10 0x00004000f25d6100 in PixelVertexProducerCUDA::produceOnCPU(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/pluginRecoPixelVertexingPixelVertexFindingPlugins.so
#11 0x000040003ef70c50 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
#12 0x000040003ef6798c in edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
#13 0x000040003eec3a28 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
#14 0x000040003eec3d74 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
#15 0x000040003eec653c in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
#16 0x000040003f456ed8 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/libFWCoreConcurrency.so

Current Modules:
Module: PixelVertexProducerCUDA:hltPixelVerticesSoA@cpu (crashed)
Module: L1TGlobalProducer:hltGtStage2ObjectMap
Module: EcalRawToDigi:hltEcalDigisLegacy
Module: none

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_aarch64_gcc10/CMSSW_12_4_X_2022-05-04-2300/pyRelValMatrixLogs/run/11634.24_TTbar_14TeV+2021_0T+TTbar_14TeV_TuneCP5_GenSimINPUT+Digi+RecoNano+HARVESTNano+ALCA/step2_TTbar_14TeV+2021_0T+TTbar_14TeV_TuneCP5_GenSimINPUT+Digi+RecoNano+HARVESTNano+ALCA.log
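
For readers unfamiliar with the check in the log above: a comparison of the form `wv[i] > 0.f` is false for a zero weight, a negative weight, and a NaN alike, so the assertion message alone does not say which case occurred. The snippet below is only an illustrative sketch in Python, not the CMSSW code; `classify_weight` and the sample values are hypothetical.

```python
import math


def classify_weight(w):
    """Return why `w > 0` would be false, or None if the check passes.

    Hypothetical helper for illustration only; it is not part of CMSSW.
    """
    if math.isnan(w):
        return "nan"        # any ordered comparison with NaN is false
    if w > 0.0:
        return None         # the check passes
    return "zero" if w == 0.0 else "negative"


# Made-up sample weights, one per failure mode plus one valid value.
sample_weights = [2.5, 0.0, float("nan"), -1.0]
for i, w in enumerate(sample_weights):
    reason = classify_weight(w)
    print(i, "ok" if reason is None else "fails: " + reason)
```

Logging the offending value with a classification like this before the assertion fires would be one way to narrow down which case is hit on ARM.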

@makortel
Contributor Author

makortel commented May 5, 2022

assign reconstruction, heterogeneous

@cmsbuild
Contributor

cmsbuild commented May 5, 2022

New categories assigned: heterogeneous,reconstruction

@jpata,@slava77,@fwyzard,@clacaputo,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

cmsbuild commented May 5, 2022

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor Author

makortel commented May 5, 2022

FYI @VinInn @AdrianoDee

@fwyzard
Contributor

fwyzard commented May 7, 2022

This is a CPU-only workflow, right ?

@makortel
Contributor Author

makortel commented May 8, 2022

This is a CPU-only workflow, right ?

I think so. At least the SwitchProducer is using the @cpu case.

@jpata
Contributor

jpata commented May 16, 2022

type tracking

cmsbuild added the trk label May 16, 2022

@mmusich
Contributor

mmusich commented May 18, 2022

I think the type here should be tracking and not trk (vertexing is under tracking)

@aandvalenzuela
Contributor

aandvalenzuela commented Feb 21, 2023

Hello,
Just to keep track of this issue :)
This assertion failure is still present in the current release cycle:

cmsRun: /data/cmsbld/jenkins_b/workspace/build-any-ib/w/tmp/BUILDROOT/f4101ca38f0ff520e5922918c7986929/opt/cmssw/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-02-19-2300/src/RecoPixelVertexing/PixelVertexFinding/plugins/gpuFitVertices.h:70: void gpuVertexFinder::fitVertices(gpuVertexFinder::VtxSoAView&, gpuVertexFinder::WsSoAView&, float): Assertion `wv[i] > 0.f' failed.

See most recent stacktrace. And it is also present in LTO IBs since we build for ARM now.

@fwyzard
Contributor

fwyzard commented Aug 5, 2024

I assumed this stopped failing once the HLT menu for this workflow was moved to the "fake" menu ?

@cmsbuild
Contributor

cmsbuild commented Aug 5, 2024

cms-bot internal usage

@makortel
Contributor Author

makortel commented Aug 5, 2024

I assumed this stopped failing once the HLT menu for this workflow was moved to the "fake" menu ?

Quite possible. On a quick look I didn't see this particular error in the IBs of the past two weeks, but I also don't recall how frequent the failure was.

@mmusich
Contributor

mmusich commented Aug 5, 2024

I assumed this stopped failing once the HLT menu for this workflow was moved to the "fake" menu ?

I guess we can make it reappear real quick by allowing 2024 here:

return (fragment=="TTbar_13" or fragment=="TTbar_14TeV") and ('2017' in key or '2018' in key or '2021' in key) and ('FS' not in key)
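
As a rough sketch of what "allowing 2024" could look like, the condition above can be written with the list of years as a parameter. This is an assumption-laden illustration, not the actual relval code: the function name `matches`, the `years` parameter, and the key `"2024"` are hypothetical, while `"2021_0T"` is taken from the workflow name in the log.

```python
def matches(fragment, key, years=("2017", "2018", "2021")):
    """Same logic as the quoted return statement, with the years factored out.

    `matches` and `years` are hypothetical names used only for this sketch.
    """
    return (
        fragment in ("TTbar_13", "TTbar_14TeV")
        and any(year in key for year in years)
        and "FS" not in key
    )


# With the default years, a 2021 key is selected and a 2024 key is not.
print(matches("TTbar_14TeV", "2021_0T"))
print(matches("TTbar_14TeV", "2024"))

# Adding "2024" to the list selects the (hypothetical) 2024 key as well.
print(matches("TTbar_14TeV", "2024", years=("2017", "2018", "2021", "2024")))
```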

@fwyzard
Contributor

fwyzard commented Aug 5, 2024

Do you think 12834.402 should also trigger the issue ?
I can try running that by hand on lxplus-arm (ARM Neoverse-N1) to check.

@fwyzard
Contributor

fwyzard commented Aug 6, 2024

12834.402 does not seem to reproduce the issue, or at least not easily: I've run its step2 over 20 times on 100 events without problems on lxplus-arm.
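
For anyone who wants to repeat this kind of test, a minimal sketch of a retry loop is shown below. It is not the procedure actually used here: `run_repeatedly` is a hypothetical helper, and the command to run (for example `cmsRun` with a step2 configuration file) has to be supplied by the caller.

```python
import subprocess
import sys


def run_repeatedly(cmd, attempts):
    """Run `cmd` up to `attempts` times.

    Returns the 1-based index of the first run that exits with a non-zero
    status (an abort from a failed assertion counts), or None if every run
    succeeds. Hypothetical helper for illustration only.
    """
    for n in range(1, attempts + 1):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return n
    return None


# Stand-in command that always succeeds; replace it with the real job.
print(run_repeatedly([sys.executable, "-c", "pass"], attempts=3))
```

A call such as `run_repeatedly(["cmsRun", "step2.py"], 20)` would be the intended use, where `step2.py` is a placeholder for whatever configuration file the workflow generates.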

@mmusich
Contributor

mmusich commented Aug 6, 2024

Do you think 12834.402 should also trigger the issue ?

12834.402 does not seem to reproduce the issue,

I don't know if it is relevant but the original workflow 11634.24 forces the magnetic field to be 0T.

@makortel
Contributor Author

Given all the changes (CUDA-to-Alpaka, related fixes in the Alpaka code, HLT menu updates) maybe we have reached the time to close this issue?
