HLT Farm crashes in run 378366~378369 #44541

Open
wonpoint4 opened this issue Mar 25, 2024 · 33 comments

wonpoint4 commented Mar 25, 2024

Reporting the large number of GPU-related HLT crashes yesterday (elog):

  • Related to illegal memory access
  • The special Status_OnCPU path had a non-zero rate, which is unexpected as this path only occurs when there is no GPU available
  • Not fully understood, as the HLT menus were unchanged with respect to the previous runs
  • In order to suppress the crashes, all HLT menus were updated to disable all GPUs (elog)
  • DAQ experts confirmed these to be late crashes from the previous runs (elog)
  • Related to illegal memory access: the special Status_OnCPU path had a non-zero rate, unexpected as this path only occurs when there is no GPU available
  • Suspected to be related to the GPU drivers → in contact with DAQ experts

Here's the recipe to reproduce the crashes (tested with CMSSW_14_0_3 on lxplus8-gpu):

#!/bin/bash -ex

hltGetConfiguration adg:/cdaq/cosmic/commissioning2024/v1.1.0/HLT/V2 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/240325_run378367/files/run378367_ls0016_index000315_fu-c2b05-11-01_pid2219084.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

Here's another way to reproduce the crashes.

# log in to an online GPU development machine (or lxplus8-gpu) and create a CMSSW area for 14.0.2
cmsrel CMSSW_14_0_2
cd CMSSW_14_0_2/src
cmsenv
# copy the HLT configuration that reproduces the crash and run it
https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 378366 > hlt_run378366.py
cat after_menu.py >> hlt_run378366.py ### See after_menu.py below
mkdir run378366
cmsRun hlt_run378366.py &> run378366.log

vi after_menu.py

from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
    buBaseDir = '/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream',
    runNumber = 378366
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
    fileListMode = True,
    fileNames = (
        '/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run378366/run378366_ls0001_index000000_fu-c2b03-05-01_pid1739399.raw',
    )
)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1

@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI

cmsbuild commented Mar 25, 2024

cms-bot internal usage

@cmsbuild

A new Issue was created by @wonpoint4.

@antoniovilela, @smuzaffar, @rappoccio, @Dr15Jones, @sextonkennedy, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel

assign hlt, heterogeneous

@cmsbuild

New categories assigned: hlt,heterogeneous

@Martin-Grunewald,@mmusich,@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel

Running the reproducer with CUDA_LAUNCH_BLOCKING=1 shows

terminate called after throwing an instance of 'std::runtime_error'
  what():
src/HeterogeneousCore/CUDAUtilities/src/CachingDeviceAllocator.h, line 617:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorIllegalAddress: an illegal memory access was encountered

#3  0x00007f2d11fbf720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f2d18272acf in raise () from /lib64/libc.so.6
#6  0x00007f2d18245ea5 in abort () from /lib64/libc.so.6
#7  0x00007f2d18c4ea49 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#8  0x00007f2d18c5a06a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9  0x00007f2d18c590d9 in __cxa_call_terminate (ue_header=0x7f2c68e82820) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#10 0x00007f2d18c597f6 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=<optimized out>, context=0x7f2c69ff8380) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:688
#11 0x00007f2d1881f864 in _Unwind_RaiseException_Phase2 (exc=0x7f2c68e82820, context=0x7f2c69ff8380, frames_p=0x7f2c69ff8288) at ../../../libgcc/unwind.inc:64
#12 0x00007f2d188202bd in _Unwind_Resume (exc=0x7f2c68e82820) at ../../../libgcc/unwind.inc:242
#13 0x00007f2d0e2c2f5c in cms::cuda::free_device(int, void*) [clone .cold] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libHeterogeneousCoreCUDAUtilities.so
#14 0x00007f2ca620e028 in HBHERecHitProducerGPU::acquire(edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder) [clone .cold] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/pluginRecoLocalCaloHcalRecProducers.so
#15 0x00007f2d1ada1959 in edm::stream::doAcquireIfNeeded(edm::stream::impl::ExternalWork*, edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so
#16 0x00007f2d1ada8099 in edm::stream::EDProducerAdaptorBase::doAcquire(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so
#17 0x00007f2d1ad7b412 in edm::Worker::runAcquire(edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so
#18 0x00007f2d1ad7b596 in edm::Worker::runAcquireAfterAsyncPrefetch(std::__exception_ptr::exception_ptr, edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x00007f2d1ad18b0f in edm::Worker::AcquireTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>, void>::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so

FYI @cms-sw/hcal-dpg-l2

abdoulline commented Mar 26, 2024

The problem was caused by a change in the HCAL HB/HE raw data, namely the position of the trigger TS (the SOI, "Sample Of Interest") in the 8-TS digi array. The change was made on Sunday night (March 24) and was originally planned only for local LED runs, but (unintentionally) stayed in the subsequent global runs... It has now been reverted to the nominal configuration.

Thanks to the clarification from @mariadalfonso (who is in the US for a workshop): HCAL@GPU assumes both a fixed number of TS (8) and a fixed SOI (the 4th TS). So, an additional protection/warning will be added to HCAL@GPU upon Maria's return from the US.

mmusich commented Mar 26, 2024

So, an additional protection/warning will be added to HCAL@GPU upon Maria's return from the US

For the record, in neighboring runs there have also been crashes in the online DQM, see e.g. 378366.
It would be interesting to know whether that's due to the same kind of change (in which case a protection in the CPU code might be needed as well).

abdoulline commented Mar 26, 2024

@mmusich yes, the origin of the DQM crashes is the same.
It (the SOI move) revealed a lack of protection in one of the HCAL reco components (the signal time fit in MAHI) added at the end of 2022. It has been tracked down to a couple of "suboptimal" lines. A protection/workaround is being discussed.

fwyzard commented Mar 26, 2024

... if and when we have a full Alpaka implementation of the HCAL reconstruction, we will have a single code base to maintain :)

kakwok commented Mar 26, 2024

I'll make sure the Alpaka implementation has some protection against different SOI/TS configurations

@syuvivida

Hi @abdoulline @lwang046
Is there an estimate of when the hcalreco DQM client (and maybe the other hcal client as well?) will be updated? Thanks!!

Eiko for DQM-DC

abdoulline commented Mar 27, 2024

Hi @syuvivida
I suppose it shouldn't be a major issue/showstopper (as it wasn't in 2023), now that the HCAL digi format is back to the regular one after the aforementioned accident. It's rather a question of implementing additional protection, right?
The HCAL reconstruction convener, @igv4321, has been contacted (the "hcalreco" in question is used everywhere, not only in DQM).

@syuvivida

Hi @abdoulline
indeed, I was referring to adding the protection in the hcalreco client, sorry for not being explicit earlier. It is not a major issue now, but it would be nice to have the code in place before things are forgotten (as many new things may appear when 13.6 TeV collisions arrive). Thanks!!

Eiko

@abdoulline

@syuvivida
sure, we'll report to this open issue (to eventually ask for its closure).


saumyaphor4252 commented Mar 27, 2024

abdoulline commented Mar 27, 2024

@saumyaphor4252
yes, it was kind of predictable, unfortunately...
I'm afraid all the runs in the range 378361-378467 (the first run with the "regular" HCAL settings back was 378468) are affected.
If we exclude the runs that don't have HCAL in global, it's 378361-378432.
Can those be excluded/invalidated, given that the HCAL digi settings/configuration were "non-standard" anyway?

@igv4321 FYI

wonpoint4 changed the title from "HLT Farm GPU-related crashes in run 378366~378369" to "HLT Farm crashes in run 378366~378369" on Apr 5, 2024
@abdoulline

Just to add explicitly @mariadalfonso

missirol commented May 7, 2024

@cms-sw/hcal-dpg-l2

The problem was caused by a change in the HCAL HB/HE raw data, namely the position of the trigger TS (the SOI, "Sample Of Interest") in the 8-TS digi array. The change was made on Sunday night (March 24) and was originally planned only for local LED runs, but (unintentionally) stayed in the subsequent global runs... It has now been reverted to the nominal configuration.

Thanks to the clarification from @mariadalfonso (who is in the US for a workshop): HCAL@GPU assumes both a fixed number of TS (8) and a fixed SOI (the 4th TS). So, an additional protection/warning will be added to HCAL@GPU upon Maria's return from the US.

Will this be done for the CUDA implementation?

missirol commented May 7, 2024

@kakwok

I'll make sure the Alpaka implementation has some protection against different SOI/TS configurations

Is this included in #44910? (If so, where? Just out of curiosity.)

kakwok commented May 7, 2024

The problem was caused by a change in the HCAL HB/HE raw data, namely the position of the trigger TS (the SOI, "Sample Of Interest") in the 8-TS digi array. The change was made on Sunday night (March 24) and was originally planned only for local LED runs, but (unintentionally) stayed in the subsequent global runs... It has now been reverted to the nominal configuration.

Thanks to the clarification from @mariadalfonso (who is in the US for a workshop): HCAL@GPU assumes both a fixed number of TS (8) and a fixed SOI (the 4th TS). So, an additional protection/warning will be added to HCAL@GPU upon Maria's return from the US.

Hi @missirol, thanks for bringing this up; it's not included in #44910 yet. The issue seems to be a misconfiguration that MAHI does not currently support. I need more information about:

  • How to detect the mis-configuration in MAHI (Digisize? SOI?)
  • What is the desired behavior of MAHI (return zero energy with a warning/error?)
  • A way to reproduce the misconfiguration to test the protection/warning

Maybe @abdoulline or @mariadalfonso will have some ideas about these questions? Then we can discuss whether to include these changes in #44910.

abdoulline commented May 8, 2024

@kakwok @mariadalfonso
To my knowledge, MAHI (as is) cannot cope with a moved/changed SOI position (i.e. anything other than ==3).
So, we're talking (just) about letting MAHI die gracefully (instead of provoking a segfault) with an appropriate LogError.
The MAHI input is a QIE11DigiCollection with QIE11DataFrame constituents, which have:

bool soi()
https://cmssdt.cern.ch/lxr/source/DataFormats/HcalDigi/interface/QIE11DataFrame.h#0044
which is used for calculating

int presamples()
https://cmssdt.cern.ch/lxr/source/DataFormats/HcalDigi/interface/QIE11DataFrame.h#0079

Normally presamples == 3. Otherwise this is bad data originating from a misconfigured HCAL (as it was back on March 24-25), which shouldn't happen.
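
To illustrate the kind of check being discussed, here is a minimal C++ sketch (not existing CMSSW code: the helper name hasNominalSoi, the message category and the exact reporting are illustrative assumptions) that scans a QIE11DigiCollection and flags digis whose samples/presamples deviate from the nominal 8/3:

// Sketch only: a guard that could be run over the unpacked HCAL digis before MAHI.
// hasNominalSoi is a hypothetical helper name, not an existing CMSSW function.
#include "DataFormats/HcalDigi/interface/HcalDigiCollections.h"
#include "DataFormats/HcalDigi/interface/QIE11DataFrame.h"
#include "FWCore/MessageLogger/interface/MessageLogger.h"

bool hasNominalSoi(QIE11DigiCollection const& digis,
                   int expectedPresamples = 3,
                   int expectedSamples = 8) {
  for (unsigned int i = 0; i < digis.size(); ++i) {
    QIE11DataFrame const frame = digis[i];
    if (frame.presamples() != expectedPresamples || frame.samples() != expectedSamples) {
      // "Die gracefully with a LogError" instead of letting the fit run on unexpected data.
      edm::LogError("HcalMahiSoiCheck") << "Unexpected HCAL digi format: samples = " << frame.samples()
                                        << ", presamples = " << frame.presamples() << " (expected "
                                        << expectedSamples << " / " << expectedPresamples << ")";
      return false;
    }
  }
  return true;
}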

fwyzard commented May 8, 2024

@abdoulline thanks for the comments and suggestions.

IMHO there are various options that would work better than the current failure mode:

  • detecting the problem in the unpacker and producing an empty collection of digis
  • detecting the problem in the local reconstruction and producing an empty collection of rechits
  • detecting the problem in the local reconstruction and producing a collection of rechits with only "method 0" energy, and not MAHI energy

The LogError is fine - even though nobody will likely see it.
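
As a rough illustration of the second option (detect the problem in the local reconstruction and produce an empty collection of rechits), a stream producer could gate the reconstruction behind the digi check sketched above. This is only a schematic sketch under those assumptions; the class name, the "digiLabel" parameter and the hasNominalSoi helper are hypothetical, not the actual HLT module:

// Schematic sketch: emit an empty HBHE rechit collection instead of crashing on bad digis.
#include "DataFormats/HcalDigi/interface/HcalDigiCollections.h"
#include "DataFormats/HcalRecHit/interface/HcalRecHitCollections.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/EDGetToken.h"
#include "FWCore/Utilities/interface/EDPutToken.h"
#include "FWCore/Utilities/interface/InputTag.h"

// Guard sketched earlier in this thread (hypothetical helper).
bool hasNominalSoi(QIE11DigiCollection const& digis);

class HBHEGuardedRecHitProducer : public edm::stream::EDProducer<> {
public:
  explicit HBHEGuardedRecHitProducer(edm::ParameterSet const& ps)
      : digiToken_{consumes<QIE11DigiCollection>(ps.getParameter<edm::InputTag>("digiLabel"))},
        rechitToken_{produces<HBHERecHitCollection>()} {}

  void produce(edm::Event& event, edm::EventSetup const&) override {
    auto const& digis = event.get(digiToken_);
    HBHERecHitCollection rechits;
    if (hasNominalSoi(digis)) {
      // ... run the usual (MAHI) local reconstruction and fill `rechits` ...
    }
    // On bad data the collection stays empty: downstream modules see no HBHE rechits
    // for this event instead of the whole job crashing.
    event.emplace(rechitToken_, std::move(rechits));
  }

private:
  edm::EDGetTokenT<QIE11DigiCollection> const digiToken_;
  edm::EDPutTokenT<HBHERecHitCollection> const rechitToken_;
};

The third option (rechits with only "method 0" energy) would follow the same pattern, with an alternative fill instead of the empty collection.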

abdoulline commented May 8, 2024

@fwyzard
I agree - it's fair enough to detect an unexpected SOI shift at the unpacking step, before reconstruction.
It would need a configurable with the expected SOI (to compare against).
The same holds for the "entrance" of the reconstruction.

But the source of the problem was a general HCAL misconfiguration, and (I'd think) it had better be spotted and fixed asap rather than somehow mitigated on the fly?

An empty collection of RecHits would mean a large part of HCAL is out. It would severely alter most of the triggers, I suppose.

M0 is a very poor replacement for MAHI in HE (not only the absence of PU mitigation, but it could also induce an energy-scale difference), and it uses TS window limits from the DB (so they would need to be re-adjusted on the fly...).

Now (if it's not just about stopping the jobs) this issue may need to be discussed in the HCAL DPG.
🤔

fwyzard commented May 8, 2024

But the source of the problem was a general HCAL misconfiguration, and (I'd think) it had better be spotted and fixed asap rather than somehow mitigated on the fly?

I agree, but crashing the whole HLT farm is not the right way to detect the problem.

I'm happy with any solution that makes it clear the data is bad, but does not require cleaning up about 200 HLT nodes.

@mariadalfonso

Was this again a Phase Scan?
For these technical runs we should have another sequence, i.e. the CPU version.

MAHI on CPU can cope with a shifted SOI and also with an extended number of time slices, i.e. from 8 to 10, but in the GPU-CUDA implementation this is all kind of frozen.
Since it is being rewritten, we should solve this directly there.

abdoulline commented May 8, 2024

Hi Maria
@mariadalfonso

no, there have been no new instances of the issue since March 24-25 (the HCAL misconfiguration).
It was just a return to the pending subject...

So, the goal is (1) to not stop the HLT farm and (2) to detect the problem (make it known) asap if it happens, so that HCAL can be reconfigured asap.

@abdoulline

@kakwok

I just would like to draw your attention to Maria's suggestion:

MAHI on CPU can cope with a shifted SOI and also with an extended number of time slices, i.e. from 8 to 10, but in the GPU-CUDA implementation this is all kind of frozen. Since it is being rewritten, we should solve this directly there.

kakwok commented May 10, 2024

@abdoulline The current PR is already very big. I would prefer to implement functional changes after integrating the current PR. This will make the validation and integration much easier. But let's keep this improvement in mind for the (near) future.

mmusich commented Jul 24, 2024

The current PR is already very big. I would prefer to implement functional changes after integrating the current PR. This will make the validation and integration much easier. But let's keep this improvement in mind for the (near) future.

just for the record, mahi @ alpaka still crashes:

#!/bin/bash -ex

# List of run numbers
runs=(378366 378369)
     
# Base directory for input files on EOS
base_dir="/store/group/tsg/FOG/error_stream_root/run"

# Global tag for the HLT configuration
global_tag="140X_dataRun3_HLT_v3"

# EOS command (adjust this if necessary for your environment)
eos_cmd="eos"

# Loop over each run number
for run in "${runs[@]}"; do
  # Set the MALLOC_CONF environment variable
  # export MALLOC_CONF=junk:true

  # Construct the input directory path
  input_dir="${base_dir}${run}"

  # Find all root files in the input directory on EOS
  root_files=$(${eos_cmd} find -f "/eos/cms${input_dir}" -name "*.root" | awk '{print "root://eoscms.cern.ch/" $0}' | paste -sd, -)

  # Check if there are any root files found
  if [ -z "${root_files}" ]; then
    echo "No root files found for run ${run} in directory ${input_dir}."
    continue
  fi

  # Create filenames for the HLT configuration and log file
  hlt_config_file="hlt_run${run}.py"
  hlt_log_file="hlt_run${run}.log"

  # Generate the HLT configuration file
  hltGetConfiguration /online/collisions/2024/2e34/v1.4/HLT/V2 \
    --globaltag ${global_tag} \
    --data \
    --eras Run3 \
    --l1-emulator uGT \
    --l1 L1Menu_Collisions2024_v1_3_0_xml \
    --no-prescale \
    --no-output \
    --max-events -1 \
    --input ${root_files} > ${hlt_config_file}

  # Append additional options to the configuration file
  cat <<@EOF >> ${hlt_config_file}
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')  
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

  # Run the HLT configuration with cmsRun and redirect output to log file
  cmsRun ${hlt_config_file} &> ${hlt_log_file}

done

results in:

Thread 1 (Thread 0x7fe5a92e5640 (LWP 1447698) "cmsRun"):
#0  0x00007fe5a9ec0ac1 in poll () from /lib64/libc.so.6
#1  0x00007fe5a20660cf in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2  0x00007fe5a201a1ec in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFW
CoreServicesPlugins.so
#3  0x00007fe5a201a370 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fe4f1e0072e in alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 3ul>, unsigned int, alpaka_serial_sync::hcal::reconstruction::mahi::Kernel_prep_pulseMatrices_sameNumberOfSampl
es, float*, float*, float*, hcal::HcalMahiPulseOffsetsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, float*, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstView
TemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase0DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase1DigiSoALayout<128ul, false>
::ConstViewTemplateFreeParams<128ul, false, true, true> const&, signed char*, hcal::HcalMahiConditionsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalReco
ParamWithPulseShapeT<alpaka::DevCpu>::ConstView const&, float const&, float const&, float const&, bool const&, float const&, float const&, float const&>::operator()() const () from /cvmfs/cms.cern.ch/el8
_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#6  0x00007fe4f1e09388 in alpaka_serial_sync::hcal::reconstruction::runMahiAsync(alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstViewTemplateFreePa
rams<128ul, false, true, true> const&, hcal::HcalPhase0DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstViewTem
plateFreeParams<128ul, false, true, true> const&, hcal::HcalRecHitSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, true>, hcal::HcalMahiConditionsSoALayout<128ul, false>::ConstViewTemp
lateFreeParams<128ul, false, true, true> const&, hcal::HcalSiPMCharacteristicsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalRecoParamWithPulseShapeT<alp
aka::DevCpu>::ConstView const&, hcal::HcalMahiPulseOffsetsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, alpaka_serial_sync::hcal::reconstruction::ConfigParameters
 const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#7  0x00007fe4f1ddde29 in alpaka_serial_sync::HBHERecHitProducerPortable::produce(alpaka_serial_sync::device::Event&, alpaka_serial_sync::device::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_g
cc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#8  0x00007fe4f1de03d3 in alpaka_serial_sync::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MUL
TIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#9  0x00007fe5ac93b4cf in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12
/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#10 0x00007fe5ac91fc6c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/
CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#11 0x00007fe5ac8a7f69 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::excepti
on_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchA
ctionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#12 0x00007fe5ac8a84d5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_M
ULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#13 0x00007fe5aca5a1d8 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CM
SSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreConcurrency.so
#14 0x00007fe5ab051281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe5a7cdbe00) at /data/cmsbld/jenkins/worksp
ace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe5a7cdbe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_
pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1
-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007fe5ac82942b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#18 0x00007fe5ac83325d in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007fe5ac8337c1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#20 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#21 0x00007fe5ab03d9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8
_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#22 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#23 0x000000000040517c in main ()

Current Modules:

Module: alpaka_serial_sync::HBHERecHitProducerPortable:hltHbheRecoSoASerialSync (crashed)

A fatal system signal has occurred: segmentation violation

@kakwok any plans about this?

kakwok commented Jul 24, 2024 via email

mmusich commented Jul 24, 2024

Has there been any change of Hcal configuration for number of TS in the digi recently?

I don't know, but just to be clear, this is using old data (from runs 378366~378369) back in March. I think the agreement was to try to protect against it once we have mahi @ alpaka in the release.

kakwok commented Jul 24, 2024 via email

mmusich commented Jul 24, 2024

will be added in the next iteration.

The question is about the plan (timeline) for the next iteration.
