HLT Crashes in HCAL PFClustering in Cosmics Run 383219 #45477

Closed

Sam-Harper opened this issue Jul 17, 2024 · 20 comments

Comments

@Sam-Harper
Contributor

There were widespread crashes (~900) in cosmics run 383219.

No changes to the HLT / release had been made around this time, and no other runs had this issue, either immediately before or after. It should be noted that HCAL had just come back into global after doing tests, so it seems plausible that HCAL came back in a weird state and that this is the cause of the crashes. I therefore think HCAL experts should review this event (and this run) to ensure they were sending good data to us.

The crash is fully reproducible on the hilton and also on my local CPU-only machine. The crash happens only when the PFClustering is run; if it is not run, there is no crash.

An example event which crashes is at
/eos/cms/store/group/tsg/FOG/debug/240715_run383219/run383219_ls013_29703372.root

The cosmics menu run is at
/eos/cms/store/group/tsg/FOG/debug/240715_run38219/hlt.py

A minimal menu with just the HCAL reco is
/eos/cms/store/group/tsg/FOG/debug/240715_run383219/hltMinimal.py

The release was CMSSW_14_0_11_MULTIARCHS, but the crash is also reproduced in CMSSW_14_0_11.

The error on CPU is

At the end of topoClusterContraction, found large *pcrhFracSize = 2428285
----- Begin Fatal Exception 17-Jul-2024 08:23:05 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 383219 lumi: 13 event: 29703372 stream: 0
   [1] Running path 'DQM_HcalReconstruction_CPU_v8'
   [2] Calling method for module PFClusterSoAProducer@alpaka/'hltParticleFlowClusterHBHESoA'
Exception Message:
A std::exception was thrown.
Out of range index in ViewTemplateFreeParams::operator[]
----- End Fatal Exception -----------------------------------

The error on GPU is

At the end of topoClusterContraction, found large *pcrhFracSize = 2428285
At the end of topoClusterContraction, found large *pcrhFracSize = 2428285
Out of range index in ViewTemplateFreeParams::operator[]
(the above line repeats 2253 times, 2255 in total)
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_11_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_11_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_11_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_11_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_11_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_11_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_11_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_11_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
----- Begin Fatal Exception 17-Jul-2024 06:26:11 -----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 383219 lumi: 13 event: 29703372 stream: 0
   [1] Running path 'DQM_HcalReconstruction_v8'
   [2] Calling method for module alpaka_serial_sync::PFClusterSoAProducer/'hltParticleFlowClusterHBHESoASerialSync'
Exception Message:
A std::exception was thrown.
Out of range index in ViewTemplateFreeParams::operator[]
----- End Fatal Exception -------------------------------------------------

gpuCrash.log

@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI
@cms-sw/hcal-dpg-l2 FYI

@cmsbuild
Contributor

cmsbuild commented Jul 17, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @Sam-Harper.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor

fwyzard commented Jul 17, 2024

@cms-sw/pf-l2 FYI

@fwyzard
Contributor

fwyzard commented Jul 17, 2024

@waredjeb @jsamudio FYI

@swagata87
Contributor

type pf

@cmsbuild cmsbuild added the pf label Jul 17, 2024
@abdoulline

abdoulline commented Jul 17, 2024

(in the meantime)

"HCAL had just come back into global after doing tests" -

according to HCAL OPS conveners, the HCAL test in local was a standard sanity check (deployment) of the new L1 TriggerKey (TP LUT) with a single channel response correction update, which then went on to be used in Global. No other special tests or changes (such as configuration changes) were done.

HCAL DQM conveners have been asked to carefully review the plots of the Cosmics run in question.

@jsamudio
Contributor

The allocation of the PF rechit fraction SoA is currently the number of rechits nRH * 250. In this particular event, nRH = 9577, so the maximum index of the rechit fraction SoA is 2394250, while we need 2428285. The number of seeds (1899) and the number of topological clusters (257) both seem reasonable. In my mind this is just the same situation as #44634. Dynamic allocation of the rechit fraction SoA would probably alleviate this in a way that does not abuse the GPU memory.
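For illustration only, a standalone arithmetic check of the numbers quoted above (plain C++, not CMSSW code; the names are made up for the sketch):

#include <cstdio>

int main() {
  const long long nRH = 9577;            // rechits in event 29703372
  const long long capacity = nRH * 250;  // fixed factor of 250 fractions per rechit -> 2394250
  const long long needed = 2428285;      // size reported at the end of topoClusterContraction
  std::printf("capacity=%lld needed=%lld overflow=%lld\n", capacity, needed, needed - capacity);
  return 0;
}

This prints an overflow of 34035 indices, i.e. the event exceeds the fixed per-rechit budget rather than hitting a corrupted size.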

I remember during the Alpaka clustering development the recommendation was to allocate as much memory as needed in the configuration since dynamic allocations had a notable detriment to performance. Has anything changed in this regard? Otherwise the "safest" configuration is nRH*nRH and this would be unrealistic.

@fwyzard
Contributor

fwyzard commented Jul 17, 2024

I remember during the Alpaka clustering development the recommendation was to allocate as much memory as needed in the configuration since dynamic allocations had a notable detriment to performance. Has anything changed in this regard?

It depends on what is needed for the dynamic allocation.

If the only requirement is to change a configuration value with a runtime value, I don't expect any impact.

If it also requires splitting a kernel in two, it may add some overhead.

@jsamudio
Contributor

I remember during the Alpaka clustering development the recommendation was to allocate as much memory as needed in the configuration since dynamic allocations had a notable detriment to performance. Has anything changed in this regard?

It depends on what is needed for the dynamic allocation.

If the only requirement is to change a configuration value with a runtime value, I don't expect any impact.

If it also requires splitting a kernel in two, it may add some overhead.

In CUDA we had a cudaMemcpyAsync device to host with the number of rechit fractions needed, and some cms::cuda::make_device_unique using that number. These steps were taken between two CUDA kernel invocations in the .cu, equivalent to between two alpaka::exec in the .dev.cc in Alpaka. Is such a thing possible in Alpaka or would we need to split things in the .cc EDProducer?
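For context, a minimal sketch of the CUDA-era pattern described above, written as plain CUDA host code with hypothetical kernel and variable names (the actual code used cms::cuda::make_device_unique with the caching allocator rather than cudaMallocAsync):

#include <cuda_runtime.h>

// Hypothetical kernels standing in for the real clustering steps.
__global__ void countFractionsKernel(int* nFracNeeded) { /* ... */ }
__global__ void fillFractionsKernel(float* fracs, int nFracNeeded) { /* ... */ }

void runClustering(int* d_nFracNeeded, cudaStream_t stream) {
  countFractionsKernel<<<1, 1, 0, stream>>>(d_nFracNeeded);

  // Copy the required fraction count back to the host and wait for it.
  int nFracNeeded = 0;
  cudaMemcpyAsync(&nFracNeeded, d_nFracNeeded, sizeof(int), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  // Size the rechit fraction buffer from the runtime value instead of nRH * 250.
  float* d_fracs = nullptr;
  cudaMallocAsync(&d_fracs, nFracNeeded * sizeof(float), stream);
  fillFractionsKernel<<<1, 1, 0, stream>>>(d_fracs, nFracNeeded);
  cudaFreeAsync(d_fracs, stream);
}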

@Sam-Harper
Contributor Author

(in the meantime)

"HCAL had just come back into global after doing tests" -

according to HCAL OPS conveners, the HCAL test in local was a standard sanity check (deployment) of the new L1 TriggerKey (TP LUT) with a single channel response correction update, which then went on to be used in Global. No other special tests or changes (such as configuration changes) were done.

HCAL DQM conveners have been asked to carefully review the plots of the Cosmics run in question.

Thank you, Salavat, for the correction. Indeed, there was a small game of telephone which led to my misunderstanding that the laser alignment tests were ongoing, when in fact they started just after this run.

@abdoulline

abdoulline commented Jul 17, 2024

@Sam-Harper
my apologies, Sam...
In fact HCAL OPS has already realized/admitted:

  • the laser test was performed in local and the laser was then accidentally left on (-> firing) during the aforementioned Cosmics run, apparently because the laser-testing colleagues weren't properly notified about HCAL's move to the global Cosmics run... ☹️

Update:
DQM colleagues did confirm that the HCAL barrel occupancy in the problematic event pointed out by Sam in the intro is ~90% (~8k hits above ZeroSuppression), while in pp collisions it is kept at < 30% (and it is naturally lower in regular Cosmics).

@fwyzard
Contributor

fwyzard commented Jul 17, 2024

In CUDA we had a cudaMemcpyAsync device to host with the number of rechit fractions needed, and some cms::cuda::make_device_unique using that number.

These steps were taken between two CUDA kernel invocations in the .cu, equivalent to between two alpaka::exec in the .dev.cc in Alpaka.

Is such a thing possible in Alpaka or would we need to split things in the .cc EDProducer?

It's possible to implement the same logic in Alpaka, but (like for CUDA) you also need to split the EDProducer in two, to introduce the synchronisation after the memcpy.

@makortel
Contributor

assign hlt, reconstruction

@cmsbuild
Contributor

New categories assigned: hlt,reconstruction

@Martin-Grunewald,@mmusich,@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

In CUDA we had a cudaMemcpyAsync device to host with the number of rechit fractions needed, and some cms::cuda::make_device_unique using that number.

These steps were taken between two CUDA kernel invocations in the .cu, equivalent to between two alpaka::exec in the .dev.cc in Alpaka.

Is such a thing possible in Alpaka or would we need to split things in the .cc EDProducer?

It's possible to implement the same logic in Alpaka, but (like for CUDA) you also need to split the EDProducer in two, to introduce the synchronisation after the memcpy.

Just to clarify, given that PFClusterSoAProducer (this is the module in question, no?) is a stream::EDProducer<>, the "splitting in two" would mean changing the base class to stream::SynchronizingEDProducer, moving the first part of the code, up to the device-to-host memcpy(), into the acquire() member function, and leaving the rest in the produce() member function.
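As a rough skeleton of that split (illustrative only; the exact base class and member-function signatures should be checked against HeterogeneousCore/AlpakaCore, and the comments are assumptions about where the existing kernels would move):

class PFClusterSoAProducer : public stream::SynchronizingEDProducer<> {
public:
  void acquire(device::Event const& event, device::EventSetup const& setup) override {
    // Launch the kernels up to and including topoClusterContraction, then queue the
    // device-to-host copy of the number of rechit fractions. The framework performs
    // the synchronisation between acquire() and produce().
  }

  void produce(device::Event& event, device::EventSetup const& setup) override {
    // The copied count is now valid on the host: allocate the rechit fraction SoA
    // with the exact size, launch the remaining kernels, and put() the products.
  }
};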

@jfernan2
Contributor

+1
Solved by #46135

@mmusich
Contributor

mmusich commented Oct 5, 2024

proposed solutions:

@mmusich
Contributor

mmusich commented Oct 5, 2024

@cmsbuild
Contributor

cmsbuild commented Oct 5, 2024

This issue is fully signed and ready to be closed.

@makortel
Contributor

makortel commented Oct 7, 2024

@cmsbuild, please close

@cmsbuild cmsbuild closed this as completed Oct 7, 2024