HLT crashes in Run 380399 #44923

Closed
trtomei opened this issue May 7, 2024 · 128 comments

@trtomei
Contributor

trtomei commented May 7, 2024

Crashes observed in collisions Run 380399. Stack traces:

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Mon May 6 09:33:54 CEST 2024
Thread 1 (Thread 0x7f9b37c4a640 (LWP 552137) "cmsRun"):
#0 0x00007f9b388160e1 in poll () from /lib64/libc.so.6
#1 0x00007f9b2eb792ff in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2 0x00007f9b2eb2cafc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3 0x00007f9b2eb2d460 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f9b38ac882b in __lll_lock_wait () from /lib64/libpthread.so.0
#6 0x00007f9b38ac1ad9 in pthread_mutex_lock () from /lib64/libpthread.so.0
#7 0x00007f9ad36b5266 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt::tryReuseCachedBlock(cms::alpakatools::CachingAllocator >::BlockDescriptor&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#8 0x00007f9ad36b641f in alpaka_cuda_async::EcalRawToDigiPortable::produce(alpaka_cuda_async::device::Event&, alpaka_cuda_async::device::EventSetup const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#9 0x00007f9ad36b1f73 in alpaka_cuda_async::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#10 0x00007f9b3b27a47f in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#11 0x00007f9b3b25ec2c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#12 0x00007f9b3b1e6f59 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#13 0x00007f9b3b1e74c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#14 0x00007f9b3b157bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007f9b39992281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f9b34120480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#16 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f9b34120480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#17 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#18 0x00007f9b3b16841b in edm::FinalWaitingTask::wait() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007f9b3b17224d in edm::EventProcessor::processRuns() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#20 0x00007f9b3b1727b1 in edm::EventProcessor::runToCompletion() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#21 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00007f9b3997e9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#23 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040517c in main ()
[ message truncated - showing only crashed thread ] 

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Mon May 6 09:25:57 CEST 2024
Thread 1 (Thread 0x7fcd588eb640 (LWP 2367851) "cmsRun"):
#0 0x00007fcd594b70e1 in poll () from /lib64/libc.so.6
#1 0x00007fcd4f8092ff in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2 0x00007fcd4f7bcafc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3 0x00007fcd4f7bd460 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007fcd59765020 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#6 0x00007fcd3ae57e86 in ?? () from /lib64/libcuda.so.1
#7 0x00007fcd3ab754d7 in ?? () from /lib64/libcuda.so.1
#8 0x00007fcd3ac46009 in ?? () from /lib64/libcuda.so.1
#9 0x00007fcd3c7b41b7 in ?? () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/libcudart.so.12
#10 0x00007fcd3c7f0490 in cudaEventQuery () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/libcudart.so.12
#11 0x00007fcd464a00a5 in cms::alpakatools::EventCache<alpaka::EventUniformCudaHipRt::get(alpaka::DevUniformCudaHipRt) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/libHeterogeneousCoreAlpakaCoreCudaAsync.so
#12 0x00007fcd464a436b in alpaka_cuda_async::detail::EDMetadataSentry::EDMetadataSentry(edm::StreamID) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/libHeterogeneousCoreAlpakaCoreCudaAsync.so
#13 0x00007fccf4392ed5 in alpaka_cuda_async::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#14 0x00007fcd5bf1b47f in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007fcd5beffc2c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#16 0x00007fcd5be87f59 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#17 0x00007fcd5be884c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#18 0x00007fcd5bdf8bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007fcd5a633281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7fcd54db0480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#20 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7fcd54db0480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#21 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#22 0x00007fcd5be0941b in edm::FinalWaitingTask::wait() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#23 0x00007fcd5be1324d in edm::EventProcessor::processRuns() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#24 0x00007fcd5be137b1 in edm::EventProcessor::runToCompletion() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#25 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#26 0x00007fcd5a61f9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#27 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#28 0x000000000040517c in main ()
[ message truncated - showing only crashed thread ] 

We tried to reproduce the crashes with the following recipe, but they did not reproduce.

#!/bin/bash -ex

# CMSSW_14_0_6_MULTIARCHS

hltGetConfiguration run:380399 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000130_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000145_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000225_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000211_fu-c2b01-28-01_pid552137.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000280_fu-c2b01-28-01_pid552137.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000281_fu-c2b01-28-01_pid552137.root'\
> hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

Frame #8 in the first stack trace seems to point to the alpaka_cuda_async::EcalRawToDigiPortable::produce() method.

@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI

Best regards,
Thiago (for FOG)

@cmsbuild
Contributor

cmsbuild commented May 7, 2024

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented May 7, 2024

A new Issue was created by @trtomei.

@makortel, @smuzaffar, @sextonkennedy, @rappoccio, @Dr15Jones, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@trtomei
Contributor Author

trtomei commented May 7, 2024

Nota bene: the EcalRawToDigi mention may be a red herring; the original ELOG here mentions something completely different:
http://cmsonline.cern.ch/cms-elog/1213820

The stack traces that I posted above are the ones I got from F3mon, though.

@missirol
Contributor

missirol commented May 7, 2024

For reference, here are the log files of the 2 jobs.
old_hlt_run380399_pid552137.log
old_hlt_run380399_pid2367851.log

As reported in the elog above, they both contain the following.

src/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h:77:
void alpaka_cuda_async::vertexFinder::fitVertices(const TAcc &, reco::ZVertexLayout<128UL,
false>::ViewTemplateFreeParams<128UL, false, true, false> &, vertexFinder::PixelVertexWSSoALayout<128UL,
false>::ViewTemplateFreeParams<128UL, false, true, false> &, float) [with TAcc =
alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, 1UL>, unsigned int>]:
block: [0,0,0], thread: [24,0,0] Assertion `wv[i] > 0.f` failed.

@makortel
Contributor

makortel commented May 7, 2024

assign heterogeneous, reconstruction

@cmsbuild
Contributor

cmsbuild commented May 7, 2024

New categories assigned: heterogeneous,reconstruction

@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

makortel commented May 7, 2024

So in both cases the assertion in

for (auto i : cms::alpakatools::uniform_elements(acc, foundClusters)) {
ALPAKA_ASSERT_ACC(wv[i] > 0.f);
zv[i] /= wv[i];
nn[i] = -1; // ndof
}

fails, after which the processes get stuck until they get killed 6-7 hours later. (And because the termination is external, the stack trace of only "the crashed thread" is meaningless on its own.)

@makortel
Contributor

makortel commented May 7, 2024

The state of cmsRun seems similar to #44769 (comment) : all the TBB threads are stuck in acquiring locks (old_hlt_run380399_pid2367851.log has two TBB threads that don't have any work to do), and the CUDA threads are doing something and presumably holding the lock.

@makortel
Contributor

makortel commented May 7, 2024

FYI @AdrianoDee

@mmusich
Contributor

mmusich commented May 7, 2024

Nota bene: the EcalRawToDigi may be a red herring, the original ELOG here mentions something completely different:

@trtomei, all circumstantial evidence points to vertex reconstruction. Can you edit the title to remove the EcalRawToDigi mention?

@trtomei trtomei changed the title from "HLT crashes in Run 380399 -- possibly related to EcalRawToDigi" to "HLT crashes in Run 380399" May 7, 2024
@dan131riley

For emphasis: external termination request means it timed out, likely due to a deadlock. The timeout signal will be delivered to thread 1, but to diagnose the deadlock we need to see the stack traces of all the threads.

@missirol
Contributor

Looks like this happened again. One LS (run 380624, LS 411) could not be "closed" because one process was stuck and had to be terminated manually hours later. The full log (see below) shows again

src/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h:77: [..]
 block: [0,0,0], thread: [21,0,0] Assertion `wv[i] > 0.f` failed.

similarly to #44923 (comment).

http://cmsonline.cern.ch/cms-elog/1214996

old_hlt_run380624_pid2375140.log

@VinInn
Contributor

VinInn commented May 12, 2024

There should be no asserts on the GPU in the first place.
In my opinion this is something that was already seen in the past: in standard CUDA in CMSSW we never had the asserts on.

@mmusich
Contributor

mmusich commented May 12, 2024

Maybe I'm missing something.

@VinInn
Contributor

VinInn commented May 12, 2024

The asserts in CUDA were compiled out because they are slow.
There is another thread on the subject. I understood that (by mistake) they are back since ALPAKA.

Of course the assert is still there on CPU, but it will not print
block: [0,0,0], thread: [21,0,0]
so this is coming from CUDA, and in the past it would not have triggered!

@fwyzard
Contributor

fwyzard commented May 12, 2024

Currently we do have asserts in Alpaka code, because the code is new and IMHO it's better to get meaningful errors with a well-defined location than random failures.

Within alpaka code:

  • assert is enabled as in host code, and so is enabled in CMSSW;
  • ALPAKA_ASSERT should behave in the same way as assert, but I don't think we use it;
  • ALPAKA_ASSERT_ACC can be conditionally enabled on CPU and disabled on GPU, but for the time being we keep it enabled also on GPU (see the sketch after this list).
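
For illustration, a minimal sketch of that conditional behaviour (simplified, not the actual alpaka implementation; the ALPAKA_DISABLE_ASSERT_ACC flag name is an assumption of this sketch):

#include <cassert>

// In GPU device compilations the accelerator-side assert can be compiled out
// independently of host-side asserts; otherwise it behaves like plain assert.
#if defined(__CUDA_ARCH__) && defined(ALPAKA_DISABLE_ASSERT_ACC)
#define ALPAKA_ASSERT_ACC(expr) ((void)0)  // no-op in device code
#else
#define ALPAKA_ASSERT_ACC(expr) assert(expr)  // host-style assert
#endif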

I understood that (by mistake) they are back since ALPAKA.

The plan is to change all assert(...) in alpaka code to ALPAKA_ASSERT_ACC(...), then measure the impact of enabling/disabling them.
I'd probably keep them enabled for a 1%-2% slowdown, disable them for a 5% slowdown.

The fact that asserts may be causing deadlocks is a different matter: in principle, a failed assert in gpu code should cause an asynchronous error, and the next CUDA runtime function should detect it and fail.
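
As a minimal standalone illustration of that mechanism (a sketch, not CMSSW code): a failed device-side assert aborts the kernel, and the next synchronising runtime call reports cudaErrorAssert.

#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void failingKernel() {
  assert(false);  // device-side assert: fires in every thread
}

int main() {
  failingKernel<<<1, 32>>>();
  // The launch is asynchronous; the failed assert surfaces on the next
  // runtime call, which returns cudaErrorAssert.
  cudaError_t err = cudaDeviceSynchronize();
  printf("%s\n", cudaGetErrorString(err));  // "device-side assert triggered"
  return 0;
}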

Do we have a way to reproduce the problem offline ?

@VinInn
Contributor

VinInn commented May 12, 2024

here is the old issue on CPU
#37820

@VinInn
Contributor

VinInn commented May 12, 2024

Did anybody reproduce the assert offline?

@gparida
Contributor

gparida commented May 24, 2024

This crash seems to appear again in collision run 381067. Attaching the error log to this comment:
old_hlt_run381067_pid1000282.log

We were unable to reproduce the error after running the HLT config again on the error stream files.

@VinInn
Contributor

VinInn commented May 24, 2024

I do not see any sign of the assert in the log from run 381067.
Can somebody post a script to reproduce it offline?

@mmusich
Contributor

mmusich commented May 24, 2024

Do not see any sign of assert in the log from run 381067:

it's at line 36196 in the log attached:

src/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h:77: void alpaka_cuda_async::vertexFinder::fitVertices(const TAcc &, reco::ZVertexLayout<128UL, false>::ViewTemplateFreeParams<128UL, false, true, false> &, vertexFinder::PixelVertexWSSoALayout<128UL, false>::ViewTemplateFreeParams<128UL, false, true, false> &, float) [with TAcc = alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, 1UL>, unsigned int>]: block: [0,0,0], thread: [24,0,0] Assertion `wv[i] > 0.f` failed.

@mmusich
Contributor

mmusich commented May 24, 2024

can somebody post a script to reproduce it offline?

@gparida if you tried to reproduce offline, can you post the script you used here?

@fwyzard
Contributor

fwyzard commented May 24, 2024

Note that (also) in this case, the job was stuck not processing data for a while, and was eventually killed.

@gparida
Contributor

gparida commented May 24, 2024

I tried reproducing it offline (CMSSW_14_0_7_MULTIARCHS), unsuccessfully, using the following:

#!/bin/bash -ex

# CMSSW_14_0_7_MULTIARCHS

hltGetConfiguration run:381067 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
'/store/group/tsg/FOG/error_stream_root/run381067/run381067_ls0164_index000095_fu-c2b04-28-01_pid1000282.root,'\
'/store/group/tsg/FOG/error_stream_root/run381067/run381067_ls0164_index000109_fu-c2b04-28-01_pid1000282.root'\
> hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

I also tried to rerun the job with

process.options.numberOfThreads = 32
process.options.numberOfStreams = 32

@mmusich
Contributor

mmusich commented Aug 5, 2024

Coming back to this suggestion, since this issue is still unsolved and causing crashes (approximately a few per week), would it make sense to integrate the following change, to have a bit more info in the log files when there is a crash?

as discussed at the last TSG meeting, I think that's a possible way forward.
Opened:

@missirol
Contributor

missirol commented Aug 5, 2024

Thanks @mmusich !

cmsbuild added a commit that referenced this issue Aug 5, 2024
instrument `fitVertices` to output more information when failing assert (issue #44923)
cmsbuild added a commit that referenced this issue Aug 6, 2024
[14.0.X] instrument `fitVertices` to output more information when failing assert (issue #44923)
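
For context, a minimal sketch of what such instrumentation could look like (hypothetical, not the actual PR diff): report the offending value before the assertion fires, so the production log carries more than the bare assert message.

for (auto i : cms::alpakatools::uniform_elements(acc, foundClusters)) {
  if (not(wv[i] > 0.f)) {
    // hypothetical diagnostic printout for the failing case
    printf("fitVertices: vertex %u has non-positive weight %f\n", i, wv[i]);
  }
  ALPAKA_ASSERT_ACC(wv[i] > 0.f);
  zv[i] /= wv[i];
  nn[i] = -1;  // ndof
}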
@fwyzard
Contributor

fwyzard commented Aug 6, 2024

compute-sanitizer --tool=racecheck --racecheck-report=all does report potential "RAW" (read-after-write) and "WAR" (write-after-read) hazards in RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h:

========= Error: Potential WAR hazard detected at __shared__ 0xafac in block (0,0,0) :
=========     Read Thread (479,0,0) at 0x6d30 in /data/user/fwyzard/issue44923/CMSSW_14_0_13_patch1_MULTIARCHS/src/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h:148:void alpaka_cuda_async::vertexFinder::splitVertices<alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, (unsigned long)1>, unsigned int>>(const T1 &, reco::ZVertexLayout<(unsigned long)128, (bool)0>::ViewTemplateFreeParams<(unsigned long)128, (bool)0, (bool)1, (bool)1> &, vertexFinder::PixelVertexWSSoALayout<(unsigned long)128, (bool)0>::ViewTemplateFreeParams<(unsigned long)128, (bool)0, (bool)1, (bool)1> &, float)
=========     Write Thread (352,0,0) at 0x5db0 in /data/user/fwyzard/issue44923/CMSSW_14_0_13_patch1_MULTIARCHS/src/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h:69:void alpaka_cuda_async::vertexFinder::splitVertices<alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, (unsigned long)1>, unsigned int>>(const T1 &, reco::ZVertexLayout<(unsigned long)128, (bool)0>::ViewTemplateFreeParams<(unsigned long)128, (bool)0, (bool)1, (bool)1> &, vertexFinder::PixelVertexWSSoALayout<(unsigned long)128, (bool)0>::ViewTemplateFreeParams<(unsigned long)128, (bool)0, (bool)1, (bool)1> &, float)
=========     Current Value : 0, Incoming Value : 0
=========     ...
========= 

Looking at the two lines of RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h (line 69 and line 148), I'm thinking that the problem may happen when the outer loop over vertices rolls over without any synchronisation, and one thread sets nq = 0 on line 69 while another is still looping over it on line 148.

Adding

diff --git a/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h b/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h
index e2ba0b46b8be..be3b20563663 100644
--- a/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h
+++ b/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h
@@ -150,6 +150,8 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE::vertexFinder {
           iv[it[k]] = igv;
       }
 
+      // synchronise the threads before starting the next iteration of the loop of the groups
+      alpaka::syncBlockThreads(acc);
     }  // loop on vertices
   }
 

before the end of the loop seems to make racecheck happy (well, it makes it complain about a different piece of code in alpaka_cuda_async::pixelClustering::FindClus)
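
To see the hazard class in isolation (a self-contained sketch with a hypothetical kernel, mirroring the pattern above rather than the actual CMSSW code):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void sharedCounterLoop(int* out, int rounds) {
  __shared__ int nq;  // block-shared counter, reused in every round
  for (int round = 0; round < rounds; ++round) {
    if (threadIdx.x == 0)
      nq = 0;          // reset the counter (cf. splitVertices.h:69)
    __syncthreads();   // all threads see the reset

    atomicAdd(&nq, 1); // all threads fill the counter
    __syncthreads();   // all threads see the final value

    int local = nq;    // all threads read the counter (cf. splitVertices.h:148)
    if (threadIdx.x == 0)
      out[round] = local;

    // Without this barrier, thread 0 may reset nq at the top of the next
    // round while slower threads are still reading it above: the
    // write-after-read hazard that racecheck reported.
    __syncthreads();
  }
}

int main() {
  int* out;
  cudaMallocManaged(&out, 4 * sizeof(int));
  sharedCounterLoop<<<1, 128>>>(out, 4);
  cudaDeviceSynchronize();
  printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
  cudaFree(out);
  return 0;
}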

@mmusich
Contributor

mmusich commented Aug 13, 2024

assign hlt

  • for book-keeping purposes

@cmsbuild
Contributor

New categories assigned: hlt

@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Contributor

mmusich commented Aug 13, 2024

Proposed fixes:

#45655 was included in CMSSW_14_0_14_MULTIARCHS, which was deployed on Aug 12, 2024 (see e-log: http://cmsonline.cern.ch/cms-elog/1230042) during run 384365.
No crashes of this type have been observed (so far) in the following physics fill 9996.

@mmusich
Contributor

mmusich commented Aug 15, 2024

+hlt

@makortel
Contributor

+heterogeneous

@mmusich
Contributor

mmusich commented Aug 19, 2024

@cms-sw/reconstruction-l2 please consider signing this if there is no other follow-up from your area, so that we can close this issue.

@jfernan2
Contributor

jfernan2 commented Sep 4, 2024

+1

@cmsbuild
Contributor

cmsbuild commented Sep 4, 2024

This issue is fully signed and ready to be closed.

@makortel
Contributor

makortel commented Sep 4, 2024

@cmsbuild, please close

@cmsbuild cmsbuild closed this as completed Sep 4, 2024
jyoti299 pushed a commit to jyoti299/cmssw that referenced this issue Oct 1, 2024