HLT Farm crashes in run 378366~378369 #44541

Open
wonpoint4 opened this issue Mar 25, 2024 · 33 comments

wonpoint4 commented Mar 25, 2024

Reporting the large number of GPU-related HLT crashes yesterday (elog):

  • Related to illegal memory access
  • The special Status_OnCPU path had a non-zero rate, which is unexpected as this path only occurs when there is no GPU available
  • Not fully understood, as the HLT menus were unchanged with respect to the previous runs
  • In order to suppress the crashes, all HLT menus were updated to disable all GPUs (elog)
  • DAQ experts confirmed these to be late crashes from the previous runs (elog)
  • Related to illegal memory access: the special Status_OnCPU path had a non-zero rate, unexpected as this path only occurs when there is no GPU available
  • Suspected to be related to the GPU drivers → in contact with DAQ experts

Here's the recipe to reproduce the crashes (tested with CMSSW_14_0_3 on lxplus8-gpu):

#!/bin/bash -ex

hltGetConfiguration adg:/cdaq/cosmic/commissioning2024/v1.1.0/HLT/V2 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/240325_run378367/files/run378367_ls0016_index000315_fu-c2b05-11-01_pid2219084.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

Here's another way to reproduce the crashes.

# log in to an online GPU development machine (or lxplus8-gpu) and create a CMSSW area for 14.0.2
cmsrel CMSSW_14_0_2
cd CMSSW_14_0_2/src
cmsenv
# copy the HLT configuration that reproduces the crash and run it
https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 378366 > hlt_run378366.py
cat after_menu.py >> hlt_run378366.py ### See after_menu.py below
mkdir run378366
cmsRun hlt_run378366.py &> run378366.log

vi after_menu.py

from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
    buBaseDir = '/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream',
    runNumber = 378366
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
    fileListMode = True,
    fileNames = (
        '/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run378366/run378366_ls0001_index000000_fu-c2b03-05-01_pid1739399.raw',
    )
)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1

@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI

cmsbuild commented Mar 25, 2024

cms-bot internal usage

@cmsbuild

A new Issue was created by @wonpoint4.

@antoniovilela, @smuzaffar, @rappoccio, @Dr15Jones, @sextonkennedy, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel

assign hlt, heterogeneous

@cmsbuild

New categories assigned: hlt,heterogeneous

@Martin-Grunewald,@mmusich,@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel

Running the reproducer with CUDA_LAUNCH_BLOCKING=1 shows

terminate called after throwing an instance of 'std::runtime_error'
  what():
src/HeterogeneousCore/CUDAUtilities/src/CachingDeviceAllocator.h, line 617:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorIllegalAddress: an illegal memory access was encountered

#3  0x00007f2d11fbf720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f2d18272acf in raise () from /lib64/libc.so.6
#6  0x00007f2d18245ea5 in abort () from /lib64/libc.so.6
#7  0x00007f2d18c4ea49 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#8  0x00007f2d18c5a06a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9  0x00007f2d18c590d9 in __cxa_call_terminate (ue_header=0x7f2c68e82820) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#10 0x00007f2d18c597f6 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=<optimized out>, context=0x7f2c69ff8380) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:688
#11 0x00007f2d1881f864 in _Unwind_RaiseException_Phase2 (exc=0x7f2c68e82820, context=0x7f2c69ff8380, frames_p=0x7f2c69ff8288) at ../../../libgcc/unwind.inc:64
#12 0x00007f2d188202bd in _Unwind_Resume (exc=0x7f2c68e82820) at ../../../libgcc/unwind.inc:242
#13 0x00007f2d0e2c2f5c in cms::cuda::free_device(int, void*) [clone .cold] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libHeterogeneousCoreCUDAUtilities.so
#14 0x00007f2ca620e028 in HBHERecHitProducerGPU::acquire(edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder) [clone .cold] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/pluginRecoLocalCaloHcalRecProducers.so
#15 0x00007f2d1ada1959 in edm::stream::doAcquireIfNeeded(edm::stream::impl::ExternalWork*, edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so
#16 0x00007f2d1ada8099 in edm::stream::EDProducerAdaptorBase::doAcquire(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so
#17 0x00007f2d1ad7b412 in edm::Worker::runAcquire(edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so
#18 0x00007f2d1ad7b596 in edm::Worker::runAcquireAfterAsyncPrefetch(std::__exception_ptr::exception_ptr, edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x00007f2d1ad18b0f in edm::Worker::AcquireTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>, void>::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so

FYI @cms-sw/hcal-dpg-l2

abdoulline commented Mar 26, 2024

The problem was caused by a change in the HCAL HB/HE raw data, namely the position of the trigger TS (the SOI, "Sample Of Interest") in the 8-TS digi array. The change was made on Sunday night (March 24) and was originally planned only for local LED runs, but (unintentionally) stayed in the subsequent global runs... It has now been reverted to the nominal configuration.

Thanks to the clarification from @mariadalfonso (who is in the US for a workshop): HCAL@GPU assumes both a fixed number of TS (8) and a fixed SOI (the 4th TS). So, an additional protection/warning will be added to HCAL@GPU upon Maria's return from the US.

mmusich commented Mar 26, 2024

So, an additional protection/warning will be added to HCAL@GPU upon Maria's return from the US

For the record, in neighboring runs there have also been crashes in the online DQM, see e.g. 378366.
It would be interesting to know whether that's due to the same kind of change (in which case a protection in the CPU code might be needed as well).

abdoulline commented Mar 26, 2024

@mmusich yes, the origin of the DQM crashes is the same.
It (the SOI move) revealed a lack of protection in one of the HCAL reco components (the signal time fit in MAHI) added at the end of 2022. It has been tracked down to a couple of "suboptimal" lines. A protection/workaround is being discussed.

fwyzard commented Mar 26, 2024

... if and when we have a full Alpaka implementation of the HCAL reconstruction, we will have a single code base to maintain :)

kakwok commented Mar 26, 2024

I'll make sure the Alpaka implementation has some protection against different SOI/TS configurations

@syuvivida

Hi @abdoulline @lwang046
Is there an estimate of when the hcalreco DQM client (and maybe the other hcal client as well?) will be updated? Thanks!!

Eiko for DQM-DC

abdoulline commented Mar 27, 2024

Hi @syuvivida
I suppose it shouldn't be a major issue/showstopper (as it wasn't in 2023), now that the HCAL digi format is back to the regular one after the aforementioned accident. It's rather a question of implementing additional protection, right?
The HCAL reconstruction convener, @igv4321, has been contacted (the "hcalreco" in question is used everywhere, not only in DQM).

@syuvivida

Hi @abdoulline
indeed, I was referring to adding the protection in the hcalreco client, sorry for not being explicit earlier. It is not a major issue now, but it would be nice to have the code in place before things are forgotten (as many new things may appear when 13.6 TeV collisions arrive). Thanks!!

Eiko

@abdoulline

@syuvivida
sure, we'll report to this open issue (to eventually ask for its closure).


saumyaphor4252 commented Mar 27, 2024

abdoulline commented Mar 27, 2024

@saumyaphor4252
yes, it was kind of predictable, unfortunately...
I'm afraid all the runs in the range 378361-378467 (the first run with the "regular" HCAL settings back was 378468) are affected.
If we exclude the runs that don't have HCAL in global, it's 378361-378432.
Can those be excluded/invalidated, given that the HCAL digi settings/configuration were "non-standard" anyway?

@igv4321 FYI

wonpoint4 changed the title from "HLT Farm GPU-related crashes in run 378366~378369" to "HLT Farm crashes in run 378366~378369" on Apr 5, 2024
@abdoulline

Just to add explicitly @mariadalfonso

missirol commented May 7, 2024

@cms-sw/hcal-dpg-l2

The problem was caused by a change in the HCAL HB/HE raw data, namely the position of the trigger TS (the SOI, "Sample Of Interest") in the 8-TS digi array. The change was made on Sunday night (March 24) and was originally planned only for local LED runs, but (unintentionally) stayed in the subsequent global runs... It has now been reverted to the nominal configuration.

Thanks to the clarification from @mariadalfonso (who is in the US for a workshop): HCAL@GPU assumes both a fixed number of TS (8) and a fixed SOI (the 4th TS). So, an additional protection/warning will be added to HCAL@GPU upon Maria's return from the US.

Will this be done for the CUDA implementation?

missirol commented May 7, 2024

@kakwok

I'll make sure the Alpaka implementation has some protection against different SOI/TS configurations

Is this included in #44910? (If so, where? Just out of curiosity.)

kakwok commented May 7, 2024

The problem was caused by a change in the HCAL HB/HE raw data, namely the position of the trigger TS (the SOI, "Sample Of Interest") in the 8-TS digi array. The change was made on Sunday night (March 24) and was originally planned only for local LED runs, but (unintentionally) stayed in the subsequent global runs... It has now been reverted to the nominal configuration.

Thanks to the clarification from @mariadalfonso (who is in the US for a workshop): HCAL@GPU assumes both a fixed number of TS (8) and a fixed SOI (the 4th TS). So, an additional protection/warning will be added to HCAL@GPU upon Maria's return from the US.

Hi @missirol, thanks for bringing this up; it's not included in #44910 yet. The issue seems to be a misconfiguration that MAHI does not currently support. I need more information about:

  • How to detect the mis-configuration in MAHI (Digisize? SOI?)
  • What is the desired behavior of MAHI (return zero energy with a warning/error?)
  • A way to reproduce the misconfiguration to test the protection/warning

Maybe @abdoulline or @mariadalfonso will have some ideas about these questions? Then we can discuss whether to include these changes in #44910.

abdoulline commented May 8, 2024

@kakwok @mariadalfonso
To my knowledge, MAHI (as is) cannot cope with a moved/changed SOI position (i.e. anything other than ==3).
So, we're talking (just) about letting MAHI die gracefully (instead of provoking a segfault) with an appropriate LogError.
The MAHI input is a QIE11DigiCollection with QIE11DataFrame constituents, which have:

bool soi()
https://cmssdt.cern.ch/lxr/source/DataFormats/HcalDigi/interface/QIE11DataFrame.h#0044
which is used for calculating

int presamples()
https://cmssdt.cern.ch/lxr/source/DataFormats/HcalDigi/interface/QIE11DataFrame.h#0079

Normally presamples == 3. Otherwise this is bad data originating from a misconfigured HCAL (as it was back on March 24-25), which shouldn't happen.
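
To illustrate the kind of check being discussed, here is a minimal C++ sketch (not existing CMSSW code: the helper name hasNominalSoi, the message category and the exact reporting are illustrative assumptions) that scans a QIE11DigiCollection and flags digis whose samples/presamples deviate from the nominal 8/3:

// Sketch only: a guard that could be run over the unpacked HCAL digis before MAHI.
// hasNominalSoi is a hypothetical helper name, not an existing CMSSW function.
#include "DataFormats/HcalDigi/interface/HcalDigiCollections.h"
#include "DataFormats/HcalDigi/interface/QIE11DataFrame.h"
#include "FWCore/MessageLogger/interface/MessageLogger.h"

bool hasNominalSoi(QIE11DigiCollection const& digis,
                   int expectedPresamples = 3,
                   int expectedSamples = 8) {
  for (unsigned int i = 0; i < digis.size(); ++i) {
    QIE11DataFrame const frame = digis[i];
    if (frame.presamples() != expectedPresamples || frame.samples() != expectedSamples) {
      // "Die gracefully with a LogError" instead of letting the fit run on unexpected data.
      edm::LogError("HcalMahiSoiCheck") << "Unexpected HCAL digi format: samples = " << frame.samples()
                                        << ", presamples = " << frame.presamples() << " (expected "
                                        << expectedSamples << " / " << expectedPresamples << ")";
      return false;
    }
  }
  return true;
}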

fwyzard commented May 8, 2024

@abdoulline thanks for the comments and suggestions.

IMHO there are various options that would work better than the current failure mode:

  • detecting the problem in the unpacker and producing an empty collection of digis
  • detecting the problem in the local reconstruction and producing an empty collection of rechits
  • detecting the problem in the local reconstruction and producing a collection of rechits with only "method 0" energy, and not MAHI energy

The LogError is fine - even though nobody will likely see it.
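
As a rough illustration of the second option (detect the problem in the local reconstruction and produce an empty collection of rechits), a stream producer could gate the reconstruction behind the digi check sketched above. This is only a schematic sketch under those assumptions; the class name, the "digiLabel" parameter and the hasNominalSoi helper are hypothetical, not the actual HLT module:

// Schematic sketch: emit an empty HBHE rechit collection instead of crashing on bad digis.
#include "DataFormats/HcalDigi/interface/HcalDigiCollections.h"
#include "DataFormats/HcalRecHit/interface/HcalRecHitCollections.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/EDGetToken.h"
#include "FWCore/Utilities/interface/EDPutToken.h"
#include "FWCore/Utilities/interface/InputTag.h"

// Guard sketched earlier in this thread (hypothetical helper).
bool hasNominalSoi(QIE11DigiCollection const& digis);

class HBHEGuardedRecHitProducer : public edm::stream::EDProducer<> {
public:
  explicit HBHEGuardedRecHitProducer(edm::ParameterSet const& ps)
      : digiToken_{consumes<QIE11DigiCollection>(ps.getParameter<edm::InputTag>("digiLabel"))},
        rechitToken_{produces<HBHERecHitCollection>()} {}

  void produce(edm::Event& event, edm::EventSetup const&) override {
    auto const& digis = event.get(digiToken_);
    HBHERecHitCollection rechits;
    if (hasNominalSoi(digis)) {
      // ... run the usual (MAHI) local reconstruction and fill `rechits` ...
    }
    // On bad data the collection stays empty: downstream modules see no HBHE rechits
    // for this event instead of the whole job crashing.
    event.emplace(rechitToken_, std::move(rechits));
  }

private:
  edm::EDGetTokenT<QIE11DigiCollection> const digiToken_;
  edm::EDPutTokenT<HBHERecHitCollection> const rechitToken_;
};

The third option (rechits with only "method 0" energy) would follow the same pattern, with an alternative fill instead of the empty collection.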

abdoulline commented May 8, 2024

@fwyzard
I agree - it's fair enough to detect an unexpected SOI shift at the unpacking step, before reconstruction.
It would need a configurable with the expected SOI (to compare against).
The same holds for the "entrance" of the reconstruction.

But the source of the problem was a general HCAL misconfiguration, and (I'd think) it had better be spotted and fixed asap rather than somehow mitigated on the fly?

An empty collection of RecHits would mean a large part of HCAL is out. It would severely alter most of the triggers, I suppose.

M0 is a very poor replacement for MAHI in HE (not only the absence of PU mitigation, but it could also induce an energy-scale difference), and it uses TS window limits from the DB (so they would need to be re-adjusted on the fly...).

Now (if it's not just about stopping the jobs) this issue may need to be discussed in the HCAL DPG.
🤔

fwyzard commented May 8, 2024

But the source of the problem was a general HCAL misconfiguration, and (I'd think) it had better be spotted and fixed asap rather than somehow mitigated on the fly?

I agree, but crashing the whole HLT farm is not the right way to detect the problem.

I'm happy with any solution that makes it clear the data is bad, but does not require cleaning up about 200 HLT nodes.

@mariadalfonso

Was this again a Phase Scan?
For these technical runs we should have another sequence, i.e. the CPU version.

MAHI on CPU can cope with a shifted SOI and also with an extended number of time slices, i.e. from 8 to 10, but in the GPU-CUDA implementation this is all kind of frozen.
Since it is being rewritten, we should solve this directly there.

abdoulline commented May 8, 2024

Hi Maria
@mariadalfonso

no, there have been no new instances of the issue since March 24-25 (the HCAL misconfiguration).
It was just a return to the pending subject...

So, the goal is (1) to not stop the HLT farm and (2) to detect the problem (make it known) asap if it happens, so that HCAL can be reconfigured asap.

@abdoulline

@kakwok

I just would like to draw your attention to Maria's suggestion:

MAHI on CPU can cope with a shifted SOI and also with an extended number of time slices, i.e. from 8 to 10, but in the GPU-CUDA implementation this is all kind of frozen. Since it is being rewritten, we should solve this directly there.

kakwok commented May 10, 2024

@abdoulline The current PR is already very big. I would prefer to implement functional changes after integrating the current PR. This will make the validation and integration much easier. But let's keep this improvement in mind for the (near) future.

mmusich commented Jul 24, 2024

The current PR is already very big. I would prefer to implement functional changes after integrating the current PR. This will make the validation and integration much easier. But let's keep this improvement in mind for the (near) future.

just for the record, mahi @ alpaka still crashes:

#!/bin/bash -ex

# List of run numbers
runs=(378366 378369)
     
# Base directory for input files on EOS
base_dir="/store/group/tsg/FOG/error_stream_root/run"

# Global tag for the HLT configuration
global_tag="140X_dataRun3_HLT_v3"

# EOS command (adjust this if necessary for your environment)
eos_cmd="eos"

# Loop over each run number
for run in "${runs[@]}"; do
  # Set the MALLOC_CONF environment variable
  # export MALLOC_CONF=junk:true

  # Construct the input directory path
  input_dir="${base_dir}${run}"

  # Find all root files in the input directory on EOS
  root_files=$(${eos_cmd} find -f "/eos/cms${input_dir}" -name "*.root" | awk '{print "root://eoscms.cern.ch/" $0}' | paste -sd, -)

  # Check if there are any root files found
  if [ -z "${root_files}" ]; then
    echo "No root files found for run ${run} in directory ${input_dir}."
    continue
  fi

  # Create filenames for the HLT configuration and log file
  hlt_config_file="hlt_run${run}.py"
  hlt_log_file="hlt_run${run}.log"

  # Generate the HLT configuration file
  hltGetConfiguration /online/collisions/2024/2e34/v1.4/HLT/V2 \
    --globaltag ${global_tag} \
    --data \
    --eras Run3 \
    --l1-emulator uGT \
    --l1 L1Menu_Collisions2024_v1_3_0_xml \
    --no-prescale \
    --no-output \
    --max-events -1 \
    --input ${root_files} > ${hlt_config_file}

  # Append additional options to the configuration file
  cat <<@EOF >> ${hlt_config_file}
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')  
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

  # Run the HLT configuration with cmsRun and redirect output to log file
  cmsRun ${hlt_config_file} &> ${hlt_log_file}

done

results in:

Thread 1 (Thread 0x7fe5a92e5640 (LWP 1447698) "cmsRun"):
#0  0x00007fe5a9ec0ac1 in poll () from /lib64/libc.so.6
#1  0x00007fe5a20660cf in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2  0x00007fe5a201a1ec in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFW
CoreServicesPlugins.so
#3  0x00007fe5a201a370 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fe4f1e0072e in alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 3ul>, unsigned int, alpaka_serial_sync::hcal::reconstruction::mahi::Kernel_prep_pulseMatrices_sameNumberOfSampl
es, float*, float*, float*, hcal::HcalMahiPulseOffsetsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, float*, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstView
TemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase0DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase1DigiSoALayout<128ul, false>
::ConstViewTemplateFreeParams<128ul, false, true, true> const&, signed char*, hcal::HcalMahiConditionsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalReco
ParamWithPulseShapeT<alpaka::DevCpu>::ConstView const&, float const&, float const&, float const&, bool const&, float const&, float const&, float const&>::operator()() const () from /cvmfs/cms.cern.ch/el8
_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#6  0x00007fe4f1e09388 in alpaka_serial_sync::hcal::reconstruction::runMahiAsync(alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstViewTemplateFreePa
rams<128ul, false, true, true> const&, hcal::HcalPhase0DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstViewTem
plateFreeParams<128ul, false, true, true> const&, hcal::HcalRecHitSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, true>, hcal::HcalMahiConditionsSoALayout<128ul, false>::ConstViewTemp
lateFreeParams<128ul, false, true, true> const&, hcal::HcalSiPMCharacteristicsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalRecoParamWithPulseShapeT<alp
aka::DevCpu>::ConstView const&, hcal::HcalMahiPulseOffsetsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, alpaka_serial_sync::hcal::reconstruction::ConfigParameters
 const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#7  0x00007fe4f1ddde29 in alpaka_serial_sync::HBHERecHitProducerPortable::produce(alpaka_serial_sync::device::Event&, alpaka_serial_sync::device::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_g
cc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#8  0x00007fe4f1de03d3 in alpaka_serial_sync::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MUL
TIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#9  0x00007fe5ac93b4cf in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12
/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#10 0x00007fe5ac91fc6c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/
CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#11 0x00007fe5ac8a7f69 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::excepti
on_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchA
ctionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#12 0x00007fe5ac8a84d5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_M
ULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#13 0x00007fe5aca5a1d8 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CM
SSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreConcurrency.so
#14 0x00007fe5ab051281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe5a7cdbe00) at /data/cmsbld/jenkins/worksp
ace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe5a7cdbe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_
pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1
-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007fe5ac82942b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#18 0x00007fe5ac83325d in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007fe5ac8337c1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#20 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#21 0x00007fe5ab03d9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8
_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#22 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#23 0x000000000040517c in main ()

Current Modules:

Module: alpaka_serial_sync::HBHERecHitProducerPortable:hltHbheRecoSoASerialSync (crashed)

A fatal system signal has occurred: segmentation violation

@kakwok any plans about this?

kakwok commented Jul 24, 2024 via email

mmusich commented Jul 24, 2024

Has there been any change of Hcal configuration for number of TS in the digi recently?

I don't know, but just to be clear, this is using old data (from runs 378366~378369) back in March. I think the agreement was to try to protect against it once we have mahi @ alpaka in the release.

kakwok commented Jul 24, 2024 via email

mmusich commented Jul 24, 2024

will be added in the next iteration.

The question is about the plan (timeline) for the next iteration.
