Runtime crash when forcing only pixel tracking+vertexing on serial_sync backend #45708

Closed
missirol opened this issue Aug 15, 2024 · 34 comments

Comments

@missirol
Contributor

The test in [1] crashes at runtime in CMSSW_14_0_14 when running on a machine with a GPU (I did not try on a machine without one). The test modifies a recent HLT pp menu by setting the backend of the Alpaka pixel-tracks and pixel-vertices SoA producers to "serial_sync" (in other words, offloading the pixel local reconstruction to GPUs, then forcing track and vertex reconstruction to run on CPU). This mimics the setup that the HIon group plans to implement in the lead-lead trigger menu of 2024 (see CMSHLT-3284) [*].

The stack trace is in [2]. The crash does not happen if one sets process.options.accelerators = ['cpu'] (the line that is commented out in [1]).

Is [1] supposed to work? If so, what's going wrong?


[*] This 'mixed' approach (pixel local reconstruction on GPU, tracking and vertexing on CPU) was already used in the 2023 HIon run, at that time with the CUDA-based implementation of the pixel reconstruction. Pixel tracking is currently not offloaded to GPUs in the HIon menu because doing so leads to excessive GPU memory consumption, and in turn to runtime crashes, in lead-lead events (at least with current data-taking conditions and current HLT hardware).

[1]

#!/bin/bash

[ $# -ge 1 ] || exit 1

hltLabel=hlt
outDir="${1}"

[ ! -d "${outDir}" ] || exit 1

mkdir -p "${outDir}"
cd "${outDir}"

hltGetConfiguration /dev/CMSSW_14_0_0/GRun/V173 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --output none \
  --max-events -1 \
  --paths 'MC_*Tracking*' \
  --input root://eoscms.cern.ch//eos/cms/store/group/tsg/STEAM/validations/GPUVsCPU/240814/raw_pickevents.root \
  > "${hltLabel}".py

cat <<@EOF >> "${hltLabel}".py
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.wantSummary = True

for foo in ['HLTAnalyzerEndpath', 'dqmOutput', 'MessageLogger']:
    if hasattr(process, foo):
        process.__delattr__(foo)

process.load('FWCore.MessageLogger.MessageLogger_cfi')

#process.options.accelerators = ['cpu']

#process.hltEcalDigisSoA.alpaka.backend = 'serial_sync'
#process.hltEcalUncalibRecHitSoA.alpaka.backend = 'serial_sync'
#process.hltHbheRecoSoA.alpaka.backend = 'serial_sync'
#process.hltParticleFlowRecHitHBHESoA.alpaka.backend = 'serial_sync'
#process.hltParticleFlowClusterHBHESoA.alpaka.backend = 'serial_sync'
#process.hltOnlineBeamSpotDevice.alpaka.backend = 'serial_sync'
#process.hltSiPixelClustersSoA.alpaka.backend = 'serial_sync'
#process.hltSiPixelRecHitsSoA.alpaka.backend = 'serial_sync'
process.hltPixelTracksSoA.alpaka.backend = 'serial_sync'
process.hltPixelVerticesSoA.alpaka.backend = 'serial_sync'
@EOF

CUDA_LAUNCH_BLOCKING=1 \
cmsRun "${hltLabel}".py &> "${hltLabel}".log
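
The two active backend overrides appended to the configuration in [1] can also be written as a small guarded helper. This is only a minimal sketch: it assumes nothing beyond the `<module>.alpaka.backend` attribute path used in the script, and both the `force_backend` helper and the `SimpleNamespace` stand-in for `process` are hypothetical, introduced so that the snippet runs without CMSSW.

```python
from types import SimpleNamespace


def force_backend(process, labels, backend="serial_sync"):
    """Set <module>.alpaka.backend for every label found in `process`.

    Returns the list of labels that were actually changed.
    (Hypothetical helper, not part of CMSSW.)
    """
    changed = []
    for label in labels:
        module = getattr(process, label, None)
        if module is None or not hasattr(module, "alpaka"):
            continue  # module not in this menu, or not an Alpaka module: skip
        module.alpaka.backend = backend
        changed.append(label)
    return changed


def _fake_module():
    # Stand-in for an Alpaka EDProducer configuration (illustration only);
    # the empty string is just a placeholder value.
    return SimpleNamespace(alpaka=SimpleNamespace(backend=""))


# Stand-in for the cms.Process built by hltGetConfiguration (illustration only).
process = SimpleNamespace(
    hltSiPixelClustersSoA=_fake_module(),
    hltSiPixelRecHitsSoA=_fake_module(),
    hltPixelTracksSoA=_fake_module(),
    hltPixelVerticesSoA=_fake_module(),
)

# Same selection as in [1]: only pixel tracks and pixel vertices are forced to CPU.
changed = force_backend(
    process, ["hltPixelTracksSoA", "hltPixelVerticesSoA", "hltNotInMenu"]
)
print(changed)
```

In an actual configuration the same function would be called on the real `process` object; the `hasattr`-style guard mirrors the one the script already uses before deleting modules.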

[2]

%MSG-i AlpakaService:  (NoModuleName) 15-Aug-2024 21:26:57 CEST pre-events
AlpakaServiceSerialSync succesfully initialised.
Found 1 device:
  - Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
%MSG
%MSG-i CUDAService:  (NoModuleName) 15-Aug-2024 21:26:57 CEST pre-events
CUDA runtime version 12.2, driver version 12.4, NVIDIA driver version 550.54.15
CUDA device 0: Tesla T4 (sm_75)
%MSG
%MSG-i AlpakaService:  (NoModuleName) 15-Aug-2024 21:26:57 CEST pre-events
AlpakaServiceCudaAsync succesfully initialised.
Found 1 device:
  - Tesla T4
%MSG
15-Aug-2024 21:27:02 CEST  Initiating request to open file root://eoscms.cern.ch//eos/cms/store/group/tsg/STEAM/validations/GPUVsCPU/240814/raw_pickevents.root
15-Aug-2024 21:27:04 CEST  Successfully opened file root://eoscms.cern.ch//eos/cms/store/group/tsg/STEAM/validations/GPUVsCPU/240814/raw_pickevents.root
Begin processing the 1st record. Run 381065, Event 559650924, LumiSection 307 on stream 0 at 15-Aug-2024 21:27:10.842 CEST


A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Thu Aug 15 21:27:15 CEST 2024
Thread 20 (Thread 0x7f272f1ff700 (LWP 2559414) "edm async pool"):
#0  0x00007f278283b0e1 in poll () from /lib64/libc.so.6
#1  0x00007f277c96543f in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f277c91a4bc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f277c91a640 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f26f923092f in alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 2ul>, unsigned int, alpaka_serial_sync::caPixelDoublets::GetDoubletsFromHisto<pixelTopology::Phase1>, alpaka_serial_sync::CACellT<pixelTopology::Phase1>*, unsigned int*, cms::alpakatools::SimpleVector<cms::alpakatools::VecArray<unsigned int, 36> >*, cms::alpakatools::SimpleVector<cms::alpakatools::VecArray<unsigned short, 48> >*, TrackingRecHitSoA<pixelTopology::Phase1>::Layout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, caStructures::OuterHitOfCellT<pixelTopology::Phase1>*, unsigned int&, unsigned int const&, alpaka_serial_sync::caPixelDoublets::CellCutsT<pixelTopology::Phase1> const&>::operator()() const () from /data/user/missirol/debug_CMSHLT3284/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableSerialSync.so
#6  0x00007f26f922272b in alpaka_serial_sync::CAHitNtupletGeneratorKernels<pixelTopology::Phase1>::buildDoublets(TrackingRecHitSoA<pixelTopology::Phase1>::Layout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, unsigned int, alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&) () from /data/user/missirol/debug_CMSHLT3284/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableSerialSync.so
#7  0x00007f26f922324e in alpaka_serial_sync::CAHitNtupletGenerator<pixelTopology::Phase1>::makeTuplesAsync(TrackingRecHitHost<pixelTopology::Phase1> const&, pixelCPEforDevice::ParamsOnDeviceT<pixelTopology::Phase1> const*, float, alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&) const () from /data/user/missirol/debug_CMSHLT3284/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableSerialSync.so
#8  0x00007f26f921c654 in alpaka_serial_sync::CAHitNtupletAlpaka<pixelTopology::Phase1>::produce(alpaka_serial_sync::device::Event&, alpaka_serial_sync::device::EventSetup const&) () from /data/user/missirol/debug_CMSHLT3284/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableSerialSync.so
#9  0x00007f26f92161b4 in alpaka_serial_sync::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) () from /data/user/missirol/debug_CMSHLT3284/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableSerialSync.so
#10 0x00007f278528beb1 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#11 0x00007f27852707be in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#12 0x00007f27851fb639 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#13 0x00007f27851fbba4 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#14 0x00007f27853ae178 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#15 0x00007f27839ac95b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7f26fafc3c00, waiter=..., this=0x7f2780673e80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#16 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7f2780673e80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#17 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:137
#18 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/market.cpp:599
#19 0x00007f27839aeb0e in tbb::detail::r1::rml::private_worker::run (this=0x7f277ccc7100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#20 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f277ccc7100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#21 0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#22 0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 19 (Thread 0x7f272ffff700 (LWP 2559413) "edm async pool"):
#0  0x00007f2782aea45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f27853aebf6 in edm::impl::WaitingThread::threadLoop() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#2  0x00007f2783178a73 in std::execute_native_thread_routine (__p=0x7f26eb16ae40) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#3  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#4  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 18 (Thread 0x7f2703160700 (LWP 2559278) "cmsRun"):
#0  0x00007f2782aecda6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007f2782aece98 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f277bf6b706 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdCl.so.3
#3  0x00007f277bf6b7b9 in RunRunnerThread () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdCl.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 17 (Thread 0x7f2703961700 (LWP 2559277) "cmsRun"):
#0  0x00007f2782aecda6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007f2782aece98 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f277bf6b706 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdCl.so.3
#3  0x00007f277bf6b7b9 in RunRunnerThread () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdCl.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 16 (Thread 0x7f2704162700 (LWP 2559276) "cmsRun"):
#0  0x00007f2782aecda6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007f2782aece98 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f277bf6b706 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdCl.so.3
#3  0x00007f277bf6b7b9 in RunRunnerThread () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdCl.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 15 (Thread 0x7f2704963700 (LWP 2559275) "cmsRun"):
#0  0x00007f2782aee180 in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f277c05af08 in XrdSysTimer::Wait(int) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#2  0x00007f277bee003c in XrdCl::TaskManager::RunTasks() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdCl.so.3
#3  0x00007f277bee0179 in RunRunnerThread () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdCl.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 14 (Thread 0x7f2705164700 (LWP 2559274) "cmsRun"):
#0  0x00007f2782846027 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f277c055282 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#2  0x00007f277c0511ed in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#3  0x00007f277c05a617 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 13 (Thread 0x7f2705965700 (LWP 2559273) "cmsRun"):
#0  0x00007f2782846027 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f277c055282 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#2  0x00007f277c0511ed in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#3  0x00007f277c05a617 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 12 (Thread 0x7f2706166700 (LWP 2559272) "cmsRun"):
#0  0x00007f2782846027 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f277c055282 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#2  0x00007f277c0511ed in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#3  0x00007f277c05a617 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x7f2706967700 (LWP 2559271) "cmsRun"):
#0  0x00007f2782846027 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f277c055282 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#2  0x00007f277c0511ed in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#3  0x00007f277c05a617 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7f2707168700 (LWP 2559270) "cmsRun"):
#0  0x00007f2782846027 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f277c055282 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#2  0x00007f277c0511ed in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#3  0x00007f277c05a617 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7f2707969700 (LWP 2559269) "cmsRun"):
#0  0x00007f2782846027 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f277c055282 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#2  0x00007f277c0511ed in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#3  0x00007f277c05a617 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f270816a700 (LWP 2559268) "cmsRun"):
#0  0x00007f2782846027 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f277c055282 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#2  0x00007f277c0511ed in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#3  0x00007f277c05a617 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f270896b700 (LWP 2559267) "cmsRun"):
#0  0x00007f2782846027 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f277c055282 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#2  0x00007f277c0511ed in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#3  0x00007f277c05a617 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f270916c700 (LWP 2559266) "cmsRun"):
#0  0x00007f2782846027 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f277c055282 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#2  0x00007f277c0511ed in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#3  0x00007f277c05a617 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f270996d700 (LWP 2559265) "cmsRun"):
#0  0x00007f2782846027 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f277c055282 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#2  0x00007f277c0511ed in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#3  0x00007f277c05a617 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el8_amd64_gcc12/lib/libXrdUtils.so.3
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f274d3de700 (LWP 2559177) "cuda-EvtHandlr"):
#0  0x00007f278283b0e1 in poll () from /lib64/libc.so.6
#1  0x00007f2778bc389f in ?? () from /lib64/libcuda.so.1
#2  0x00007f2778c91dcf in ?? () from /lib64/libcuda.so.1
#3  0x00007f2778bbe373 in ?? () from /lib64/libcuda.so.1
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f2755037700 (LWP 2559174) "cuda0000340000e"):
#0  0x00007f278283b0e1 in poll () from /lib64/libc.so.6
#1  0x00007f2778bc389f in ?? () from /lib64/libcuda.so.1
#2  0x00007f2778c91dcf in ?? () from /lib64/libcuda.so.1
#3  0x00007f2778bbe373 in ?? () from /lib64/libcuda.so.1
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f27558dd700 (LWP 2559167) "cmsRun"):
#0  0x00007f2782aee672 in waitpid () from /lib64/libpthread.so.0
#1  0x00007f277c918147 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f277c91a3ea in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f2783178a73 in std::execute_native_thread_routine (__p=0x7f276caab6f0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#4  0x00007f2782ae41ca in start_thread () from /lib64/libpthread.so.0
#5  0x00007f278274fe73 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f2781c6f640 (LWP 2559147) "cmsRun"):
#0  0x00007f27828109b8 in nanosleep () from /lib64/libc.so.6
#1  0x00007f27828108be in sleep () from /lib64/libc.so.6
#2  0x00007f277c917ff0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f27263f9070 in SiPixelFedCablingMap::pathToDetUnit(unsigned int) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libCondFormatsSiPixelObjects.so
#5  0x00007f272640167c in SiPixelQuality::getBadRocPositions(unsigned int const&, TrackerGeometry const&, SiPixelFedCabling const*) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libCondFormatsSiPixelObjects.so
#6  0x00007f2726533d15 in MeasurementTrackerImpl::initializePixelStatus(SiPixelQuality const*, SiPixelFedCabling const*, int, int) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginRecoTrackerMeasurementDetPlugins.so
#7  0x00007f272651cfdd in MeasurementTrackerESProducer::produce(CkfComponentsRecord const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginRecoTrackerMeasurementDetPlugins.so
#8  0x00007f27265288a2 in edm::eventsetup::CallbackBase<edm::ESProducer, edm::ESProducer::setWhatProduced<MeasurementTrackerESProducer, std::unique_ptr<MeasurementTracker, std::default_delete<MeasurementTracker> >, CkfComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<CkfComponentsRecord> >(MeasurementTrackerESProducer*, std::unique_ptr<MeasurementTracker, std::default_delete<MeasurementTracker> > (MeasurementTrackerESProducer::*)(CkfComponentsRecord const&), edm::eventsetup::CallbackSimpleDecorator<CkfComponentsRecord> const&, edm::es::Label const&)::{lambda(CkfComponentsRecord const&)#1}, std::unique_ptr<MeasurementTracker, std::default_delete<MeasurementTracker> >, CkfComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<CkfComponentsRecord> >::makeProduceTask<edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<MeasurementTrackerESProducer, std::unique_ptr<MeasurementTracker, std::default_delete<MeasurementTracker> >, CkfComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<CkfComponentsRecord> >(MeasurementTrackerESProducer*, std::unique_ptr<MeasurementTracker, std::default_delete<MeasurementTracker> > (MeasurementTrackerESProducer::*)(CkfComponentsRecord const&), edm::eventsetup::CallbackSimpleDecorator<CkfComponentsRecord> const&, edm::es::Label const&)::{lambda(CkfComponentsRecord const&)#1}, std::unique_ptr<MeasurementTracker, std::default_delete<MeasurementTracker> >, CkfComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<CkfComponentsRecord> >::prefetchAsync(edm::WaitingTaskHolder, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&, edm::ESParentContext const&)::{lambda(auto:1&&, auto:2&&, auto:3&&, auto:4&&)#1}::operator()<tbb::detail::d1::task_group*&, edm::ServiceWeakToken&, edm::eventsetup::EventSetupRecordImpl const*&, edm::EventSetupImpl const*&>(tbb::detail::d1::task_group*&, edm::ServiceWeakToken&, edm::eventsetup::EventSetupRecordImpl const*&, 
edm::EventSetupImpl const*&) const::{lambda(CkfComponentsRecord const&)#1}>(tbb::detail::d1::task_group*, edm::ServiceWeakToken const&, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, bool, tbb::detail::d1::task_group*&)::{lambda(std::__exception_ptr::exception_ptr const*)#1}::operator()(std::__exception_ptr::exception_ptr const*) const::{lambda()#2}::operator()() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginRecoTrackerMeasurementDetPlugins.so
#9  0x00007f2726528ac8 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::eventsetup::CallbackBase<edm::ESProducer, edm::ESProducer::setWhatProduced<MeasurementTrackerESProducer, std::unique_ptr<MeasurementTracker, std::default_delete<MeasurementTracker> >, CkfComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<CkfComponentsRecord> >(MeasurementTrackerESProducer*, std::unique_ptr<MeasurementTracker, std::default_delete<MeasurementTracker> > (MeasurementTrackerESProducer::*)(CkfComponentsRecord const&), edm::eventsetup::CallbackSimpleDecorator<CkfComponentsRecord> const&, edm::es::Label const&)::{lambda(CkfComponentsRecord const&)#1}, std::unique_ptr<MeasurementTracker, std::default_delete<MeasurementTracker> >, CkfComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<CkfComponentsRecord> >::makeProduceTask<edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<MeasurementTrackerESProducer, std::unique_ptr<MeasurementTracker, std::default_delete<MeasurementTracker> >, CkfComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<CkfComponentsRecord> >(MeasurementTrackerESProducer*, std::unique_ptr<MeasurementTracker, std::default_delete<MeasurementTracker> > (MeasurementTrackerESProducer::*)(CkfComponentsRecord const&), edm::eventsetup::CallbackSimpleDecorator<CkfComponentsRecord> const&, edm::es::Label const&)::{lambda(CkfComponentsRecord const&)#1}, std::unique_ptr<MeasurementTracker, std::default_delete<MeasurementTracker> >, CkfComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<CkfComponentsRecord> >::prefetchAsync(edm::WaitingTaskHolder, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&, edm::ESParentContext const&)::{lambda(auto:1&&, auto:2&&, auto:3&&, auto:4&&)#1}::operator()<tbb::detail::d1::task_group*&, edm::ServiceWeakToken&, edm::eventsetup::EventSetupRecordImpl const*&, edm::EventSetupImpl const*&>(tbb::detail::d1::task_group*&, 
edm::ServiceWeakToken&, edm::eventsetup::EventSetupRecordImpl const*&, edm::EventSetupImpl const*&) const::{lambda(CkfComponentsRecord const&)#1}>(tbb::detail::d1::task_group*, edm::ServiceWeakToken const&, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, bool, tbb::detail::d1::task_group*&)::{lambda(std::__exception_ptr::exception_ptr const*)#1}::operator()(std::__exception_ptr::exception_ptr const*) const::{lambda()#2}>(tbb::detail::d1::task_group&, tbb::detail::d1::task_group*&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginRecoTrackerMeasurementDetPlugins.so
#10 0x00007f27853af650 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#11 0x00007f27839b5281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f2780673e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#12 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f2780673e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#13 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#14 0x00007f278517ecfb in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#15 0x00007f278518866a in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#16 0x00007f2785188bc1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#17 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#18 0x00007f27839a19ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#19 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#20 0x000000000040517c in main ()

Current Modules:

Module: CAHitNtupletAlpakaPhase1@alpaka:hltPixelTracksSoA (crashed)
Module: none

A fatal system signal has occurred: segmentation violation
@cmsbuild
Copy link
Contributor

A new Issue was created by @missirol.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

assign heterogeneous, reconstruction, hlt

@cmsbuild
Copy link
Contributor

New categories assigned: heterogeneous,reconstruction,hlt

@Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

Let's tag @cms-sw/tracking-pog-l2

@makortel
Copy link
Contributor

The test modifies a recent HLT pp menu by setting the backend of the Alpaka pixel-tracks and pixel-vertices SoA producers to "serial_sync" (in other words, offloading the pixel local reconstruction to GPUs, then forcing track and vertex reconstruction to run on CPU).

Is [1] supposed to work ?

Theoretically I'd expect it to work, at least from the framework point of view.

@slava77
Copy link
Contributor

slava77 commented Aug 15, 2024

type tracking

@mmusich
Copy link
Contributor

mmusich commented Aug 16, 2024

@AdrianoDee FYI

@makortel
Copy link
Contributor

Compiling with debug symbols points the crash to occur in

@makortel
Copy link
Contributor

Some additional information from a debugger session

pIndex = 0
kl = 31
kk = 31
khh = 17
hoff = 256
phiBinner.off.m_v[hoff+kk] = 6504
phiBinner.content.m_capacity = 29601

so theoretically the p[0] should be valid (p = &(phiBinner.content.m_v[phiBinner.off.m_v[hoff+kk]])), assuming the phiBinner.content.m_v gets set properly.

Looking then at the HitsConstView<TrackerTraits> hh from which the phiBinner is obtained

hh.elements_ = 29601
# consistent with phiBinner.content.m_capacity

hh.phiBinnerStorageParameters_.addr_ = 0x7fff5393e580
phiBinner.content.m_v = 0x7fff5373e580
# phiBinner.content.m_v is exactly 2 MiB smaller than phiBinnerStorageParameters_.addr_ !
# ok, the "exactly 2 MiB" could be a coincidence
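The "exactly 2 MiB" observation can be checked with a few lines of arithmetic. This is only a sketch: the two addresses are copied from the debugger output above, and the constant and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Addresses copied from the debugger session quoted above.
constexpr std::uint64_t kStorageAddr = 0x7fff5393e580;  // hh.phiBinnerStorageParameters_.addr_
constexpr std::uint64_t kContentAddr = 0x7fff5373e580;  // phiBinner.content.m_v

// Distance in bytes between the two pointers (hypothetical helper name).
constexpr std::uint64_t pointerGapBytes() { return kStorageAddr - kContentAddr; }

// 2 MiB = 2 * 1024 * 1024 bytes = 0x200000.
static_assert(pointerGapBytes() == 2ull * 1024 * 1024, "gap is exactly 2 MiB");
```

The check only confirms the size of the gap; it says nothing about why the two pointers differ.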

phiBinner.content.m_v is set here

constexpr void init(I* v, int s) {
m_v = v;
m_capacity = s;
}

called from OneToManyAssocBase<...>::initStorage()
content.init(view.contentStorage, view.contentSize);

AFAICT initStorage() is called only in zeroAndInit kernel

and launchZero kernel

In particular, I see that the device-to-host copy of TrackingRecHitsSoACollection<TrackerTraits>

static auto copyAsync(TQueue& queue, TrackingRecHitDevice<TrackerTraits, TDevice> const& deviceData) {
TrackingRecHitHost<TrackerTraits> hostData(queue, deviceData.view().metadata().size());
alpaka::memcpy(queue, hostData.buffer(), deviceData.buffer());
#ifdef GPU_DEBUG
printf("TrackingRecHitsSoACollection: I'm copying to host.\n");
alpaka::wait(queue);
assert(deviceData.nHits() == hostData.nHits());
assert(deviceData.offsetBPIX2() == hostData.offsetBPIX2());
#endif
return hostData;
}

does not call the initStorage(), or set the phiBinner.content.m_v in any other way.

I see the HistoContainer unit test does call the initStorage() after the device-to-host copy

// We cannot update the contents address of the histo container before the copy from device happened
typename HistR::View hrv;
hrv.assoc = hr.data();
hrv.offSize = -1;
hrv.offStorage = nullptr;
hrv.contentSize = N;
hrv.contentStorage = hd.data();
hr->initStorage(hrv);

before inspecting the host-side data.


I think the device-to-host copy of TrackingRecHitsSoACollection<TrackerTraits> is missing the call to initStorage(). As a result, phiBinner.begin() returns a pointer to device memory, and p[pIndex] then segfaults.
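The failure mode can be reproduced in miniature without any GPU. The sketch below is not CMSSW code: ToyHits, rawCopy, pointsIntoSource and fixup are hypothetical names, and std::memcpy stands in for the raw buffer copy. It only shows that a byte-wise copy of an object holding a pointer into its own buffer leaves that pointer aimed at the source buffer until it is re-initialised.

```cpp
#include <cassert>
#include <cstring>

// Toy stand-in for the histogram "content" member: a raw pointer to storage
// that lives elsewhere in the same buffer.
struct Content {
  unsigned int* m_v = nullptr;
  int m_capacity = 0;
  constexpr void init(unsigned int* v, int s) {
    m_v = v;
    m_capacity = s;
  }
};

// Toy buffer (hypothetical): the pointer-holding object plus the storage it refers to.
struct ToyHits {
  static constexpr int kSize = 8;
  Content content;
  unsigned int storage[kSize] = {};
};

// Byte-wise copy of the whole buffer, standing in for the raw device-to-host copy.
inline void rawCopy(ToyHits& dst, ToyHits const& src) { std::memcpy(&dst, &src, sizeof(ToyHits)); }

// True if the copy's internal pointer still refers to the source's storage.
inline bool pointsIntoSource(ToyHits const& copy, ToyHits const& src) {
  return copy.content.m_v == src.storage;
}

// Re-point the internal pointer at the object's own storage (the role of initStorage()).
inline void fixup(ToyHits& h) { h.content.init(h.storage, ToyHits::kSize); }
```

After rawCopy(host, device), pointsIntoSource(host, device) holds until fixup(host) is called. In the real case the stale pointer refers to device memory, so dereferencing it on the host crashes instead of silently reading the wrong buffer.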

The comment

// We cannot update the contents address of the histo container before the copy from device happened

means that the copyAsync() function must synchronize with alpaka::wait() before calling initStorage(). This might be sufficient at least for subsequent testing.

For the longer term, assuming we'd want to remove this alpaka::wait() call, it could be fairly straightforward to extend the CopyToHost and CopyToDevice class templates to allow a post-copy modification operation (in a way, the present CopyToHost::copyAsync() resembles the acquire() method in ExternalWork/SynchronizingEDProducer, and the new function would correspond to the produce() method).
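A rough sketch of what such a two-phase interface could look like, using toy types only. ToyQueue, ToyProduct, ToyCopyToHost and transfer are hypothetical and are not the CMSSW or Alpaka API; the queue simply defers work until wait() so that the ordering constraint is visible.

```cpp
#include <cassert>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Toy queue (hypothetical): work is recorded when enqueued and only runs in
// wait(), standing in for an asynchronous device queue.
class ToyQueue {
public:
  void enqueue(std::function<void()> work) { pending_.push_back(std::move(work)); }
  void wait() {
    for (auto& w : pending_)
      w();
    pending_.clear();
  }

private:
  std::vector<std::function<void()>> pending_;
};

// Toy product: 'view' must point at this object's own 'data'.
struct ToyProduct {
  int* view = nullptr;
  int data[4] = {};
};

// Hypothetical two-phase copy trait.
struct ToyCopyToHost {
  // Phase 1: only enqueue the raw copy; no synchronization here.
  static void copyAsync(ToyQueue& queue, ToyProduct& dst, ToyProduct const& src) {
    queue.enqueue([&dst, &src] { std::memcpy(&dst, &src, sizeof(ToyProduct)); });
  }
  // Phase 2: called once the copy has completed; fixes host-side pointers.
  static void postCopy(ToyProduct& dst) { dst.view = dst.data; }
};

// Stand-in for the framework driving both phases. A real framework would be
// notified of completion asynchronously instead of blocking in wait().
inline void transfer(ToyQueue& queue, ToyProduct& dst, ToyProduct const& src) {
  ToyCopyToHost::copyAsync(queue, dst, src);
  queue.wait();
  ToyCopyToHost::postCopy(dst);
}
```

The point of the split is that the blocking wait moves out of copyAsync() and into whoever drives the two phases, which is where the analogy with acquire() and produce() comes from.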

missirol added a commit to missirol/cmssw that referenced this issue Aug 18, 2024
…ection<TrackerTraits>

Fix to the device-to-host copy of TrackingRecHitsSoACollection<TrackerTraits>,
in order to initialise the phiBinner data member on the host side.

A more complete explanation of the issue is provided by @makortel in
cms-sw#45708 (comment)
@mmusich
Copy link
Contributor

mmusich commented Aug 18, 2024

means that the copyAsync() function must synchronize with alpaka::wait() before calling initStorage(). This might be sufficient at least for subsequent testing.

shall we have a PR for this, while a more thorough fix is developed concerning cms-sw/framework-team#989 ?
I see that Marino has a commit missirol@62620da about it (I didn't test).
Let me remind that this is in the critical path for the building of the 2024 HIon menu. @cms-sw/core-l2

@missirol
Copy link
Contributor Author

missirol@62620da is my best-guess of a patch based on the explanations in #45708 (comment) (thanks @makortel for debugging the problem), but I don't know if it's correct.

I checked that it avoids the crash, and the trigger results are the same (modulo what I think are the usual small GPU-vs-CPU discrepancies) when running pixel tracking+vertexing on CPU (as in the reproducer in the description) vs running all Alpaka modules on GPU, but so far I only tested on O(10) events.

@makortel
Copy link
Contributor

shall we have a PR for this, while a more thorough fix is developed concerning cms-sw/framework-team#989 ?
I see that Marino has a commit missirol@62620da about it (I didn't test).

A fix along the lines of missirol@62620da is needed in any case. The cms-sw/framework-team#989 will only help to remove the alpaka::wait() call in missirol@62620da.

Let me remind that this is in the critical path for the building of the 2024 HIon menu.

Could you point me to a timeline?

Also, will the HLT use 14_0_X or 14_1_X for the HI data taking? (@missirol's test used 14_0_14, but my understanding is that 14_1_X would be the HI data taking release cycle). I'm asking early, because whether or not the outcome of cms-sw/framework-team#989 needs to be backported impacts how it will be done (because a 14_1_X-only change could use C++20 features).

missirol@62620da is my best-guess of a patch based on the explanations in #45708 (comment)

I'd believe the lines related to pbv (i.e. 46-51) would not be needed, but it would be good if e.g. @AdrianoDee could confirm.

I checked that it avoids the crash, and the trigger results are the same (modulo what I think are the usual small GPU-vs-CPU discrepancies) when running pixel tracking+vertexing on CPU (as in the reproducer in the description) vs running all Alpaka modules on GPU, but so far I only tested on O(10) events.

👍 A performance test (to see the cost of the alpaka::wait()) would also be interesting.

@mmusich
Copy link
Contributor

mmusich commented Aug 19, 2024

@makortel

Could you point me to a timeline?

please refer to this timeline (image: Screenshot from 2024-08-19 15-57-08)

notice that any further tracking update hinges on this ticket to enter first.

Also, will the HLT use 14_0_X or 14_1_X for the HI data taking?

HLT will use 14_1_X for actual data-taking, but we're still integrating updates in 14_0_X (and will continue doing so until we have CMSSW_14_1_0 out, when we'll move the confDB template for HLT menu development). Thus we'll need both a master PR and a backport of at least something along the lines of missirol@62620da in order to keep moving.

mmusich pushed a commit to mmusich/cmssw that referenced this issue Aug 19, 2024
…ection<TrackerTraits>

Fix to the device-to-host copy of TrackingRecHitsSoACollection<TrackerTraits>,
in order to initialise the phiBinner data member on the host side.

A more complete explanation of the issue is provided by @makortel in
cms-sw#45708 (comment)
@AdrianoDee
Copy link
Contributor

I'd believe the lines related to pbv (i.e. 46-51) would not be needed, but it would be good if e.g. @AdrianoDee could confirm.

I can confirm it (see #45743 (comment)).

@missirol
Copy link
Contributor Author

Sorry in advance for my ignorance...

missirol@62620da

I'd believe the lines related to pbv (i.e. 46-51) would not be needed, but it would be good if e.g. @AdrianoDee could confirm.

I don't know how to remove L46; I thought the initStorage method required a PhiBinnerView as function argument.

If I remove L47-51, the reproducer crashes as follows.

cmsRun: /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/src/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h:45: void cms::alpakatools::OneToManyAssocBase<I, ONES, SIZE>::initStorage(View) [with I = unsigned int; int ONES = 2561; int SIZE = -1]: Assertion `view.assoc == this' failed.

If I remove L48-51, the reproducer crashes as follows.

cmsRun: /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/src/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h:47: void cms::alpakatools::OneToManyAssocBase<I, ONES, SIZE>::initStorage(View) [with I = unsigned int; int ONES = 2561; int SIZE = -1]: Assertion `view.contentStorage' failed.
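The two assertion messages suggest that initStorage() needs at least the assoc and contentStorage fields of the view to be filled. A minimal sketch of that precondition check follows. ToyAssoc and viewIsValid are hypothetical, the field names follow the unit-test snippet quoted earlier in the thread, and the real OneToManyAssocBase may check more than this.

```cpp
#include <cassert>

// Toy stand-in (hypothetical) for a one-to-many association with external storage.
struct ToyAssoc {
  struct View {
    ToyAssoc* assoc = nullptr;
    int offSize = -1;
    unsigned int* offStorage = nullptr;
    int contentSize = -1;
    unsigned int* contentStorage = nullptr;
  };

  unsigned int* m_v = nullptr;
  int m_capacity = 0;

  // True when the view satisfies the two preconditions named in the
  // assertion messages: view.assoc == this and a non-null contentStorage.
  bool viewIsValid(View const& view) const {
    return view.assoc == this && view.contentStorage != nullptr;
  }

  // Mirrors the order of the two assertions reported in the crash output.
  void initStorage(View view) {
    assert(view.assoc == this);
    assert(view.contentStorage);
    m_v = view.contentStorage;
    m_capacity = view.contentSize;
  }
};
```

A default-constructed View fails the first check, and a View with only assoc set fails the second, which matches the order of the two crashes reported above.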

@makortel
Copy link
Contributor

Seems like they are necessary after all, thanks for trying out.

For longer term one could ask if the TrackingRecHitsSoA is really the right place for the PhiBinner etc. #44700 has some related discussion, although this question might really belong to #43796.

@mmusich
Copy link
Contributor

mmusich commented Aug 26, 2024

@mmusich
Copy link
Contributor

mmusich commented Aug 26, 2024

EDIT: it looks like these PRs generated the issue #45834, thus removing the hlt signature.

@makortel
Copy link
Contributor

For the longer term, assuming we'd want to remove this alpaka::wait() call, it could be fairly straightforward to extend the CopyToHost and CopyToDevice class templates to allow a post-copy modification operation (in a way, the present CopyToHost::copyAsync() resembles the acquire() method in ExternalWork/SynchronizingEDProducer, and the new function would correspond to the produce() method).

A possibility for CopyToHost<T>::postCopy() is added in #45801. (Such a facility is not needed for CopyToDevice, which can do a similar operation by enqueuing a kernel call to the queue.)

@fwyzard
Copy link
Contributor

fwyzard commented Sep 2, 2024

A possibility for CopyToHost<T>::postCopy() is added in #45801. (Such a facility is not needed for CopyToDevice, which can do a similar operation by enqueuing a kernel call to the queue.)

Thanks @makortel. I would suggest to try and adopt it for CMSSW 14.2.x, and stick to the simpler bugfix for 14.0.x/14.1.x.

@makortel
Copy link
Contributor

makortel commented Sep 3, 2024

A possibility for CopyToHost<T>::postCopy() is added in #45801. (Such a facility is not needed for CopyToDevice, which can do a similar operation by enqueuing a kernel call to the queue.)

Thanks @makortel. I would suggest to try and adopt it for CMSSW 14.2.x, and stick to the simpler bugfix for 14.0.x/14.1.x.

Ok.

youngwan-kim pushed a commit to youngwan-kim/cmssw that referenced this issue Sep 11, 2024
…ection<TrackerTraits>

Fix to the device-to-host copy of TrackingRecHitsSoACollection<TrackerTraits>,
in order to initialise the phiBinner data member on the host side.

A more complete explanation of the issue is provided by @makortel in
cms-sw#45708 (comment)
@mmusich
Copy link
Contributor

mmusich commented Sep 12, 2024

+hlt

@jfernan2
Copy link
Contributor

+1

@makortel
Copy link
Contributor

+heterogeneous

@makortel
Copy link
Contributor

@cmsbuild, please close

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

jyoti299 pushed a commit to jyoti299/cmssw that referenced this issue Oct 1, 2024
…ection<TrackerTraits>

Fix to the device-to-host copy of TrackingRecHitsSoACollection<TrackerTraits>,
in order to initialise the phiBinner data member on the host side.

A more complete explanation of the issue is provided by @makortel in
cms-sw#45708 (comment)