HLT Crashes in HCAL PFClustering in Cosmics Run 383219 #45477

Closed

Sam-Harper opened this issue Jul 17, 2024 · 20 comments

Comments

@Sam-Harper
Contributor

There were widespread crashes (~900) in cosmics run 383219.

No changes to the HLT / release had been made around this time, and no other runs had this issue, either immediately before or after. It should be noted that HCAL had just come back into global after doing tests, so it seems plausible that HCAL came back in a weird state and that this is the cause of the crashes. I therefore think HCAL experts should review this event (and this run) to ensure they were sending good data to us.

The crash is fully reproducible on the hilton and also on my local CPU-only machine. The crash happens only when the PFClustering is run; if it is not run, there is no crash.

An example event which crashes is at
/eos/cms/store/group/tsg/FOG/debug/240715_run383219/run383219_ls013_29703372.root

The cosmics menu run is at
/eos/cms/store/group/tsg/FOG/debug/240715_run38219/hlt.py

A minimal menu with just the HCAL reco is
/eos/cms/store/group/tsg/FOG/debug/240715_run383219/hltMinimal.py

The release was CMSSW_14_0_11_MULTIARCHS, but the crash is also reproduced in CMSSW_14_0_11.

The error on CPU is

At the end of topoClusterContraction, found large *pcrhFracSize = 2428285
----- Begin Fatal Exception 17-Jul-2024 08:23:05 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 383219 lumi: 13 event: 29703372 stream: 0
   [1] Running path 'DQM_HcalReconstruction_CPU_v8'
   [2] Calling method for module PFClusterSoAProducer@alpaka/'hltParticleFlowClusterHBHESoA'
Exception Message:
A std::exception was thrown.
Out of range index in ViewTemplateFreeParams::operator[]
----- End Fatal Exception -----------------------------------

The error on GPU is

At the end of topoClusterContraction, found large *pcrhFracSize = 2428285
At the end of topoClusterContraction, found large *pcrhFracSize = 2428285
Out of range index in ViewTemplateFreeParams::operator[]
(the above line repeats 2253 times, 2255 in total)
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_11_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_11_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_11_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_11_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_11_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_11_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_11_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_11_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
----- Begin Fatal Exception 17-Jul-2024 06:26:11 -----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 383219 lumi: 13 event: 29703372 stream: 0
   [1] Running path 'DQM_HcalReconstruction_v8'
   [2] Calling method for module alpaka_serial_sync::PFClusterSoAProducer/'hltParticleFlowClusterHBHESoASerialSync'
Exception Message:
A std::exception was thrown.
Out of range index in ViewTemplateFreeParams::operator[]
----- End Fatal Exception -------------------------------------------------

gpuCrash.log

@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI
@cms-sw/hcal-dpg-l2 FYI

@cmsbuild
Contributor

cmsbuild commented Jul 17, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @Sam-Harper.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor

fwyzard commented Jul 17, 2024

@cms-sw/pf-l2 FYI

@fwyzard
Contributor

fwyzard commented Jul 17, 2024

@waredjeb @jsamudio FYI

@swagata87
Contributor

type pf

@cmsbuild cmsbuild added the pf label Jul 17, 2024
@abdoulline

abdoulline commented Jul 17, 2024

(in the meantime)

"HCAL had just come back into global after doing tests" -

according to HCAL OPS conveners, the HCAL test in local was a standard sanity check (deployment) of the new L1 TriggerKey (TP LUT) with a single channel response correction update, which then went on to be used in Global. No other special tests or changes (such as configuration changes) were done.

HCAL DQM conveners have been asked to carefully review the plots of the Cosmics run in question.

@jsamudio
Contributor

The allocation of the PF rechit fraction SoA is currently the number of rechits nRH * 250. In this particular event, nRH = 9577, so the maximum index of the rechit fraction SoA is 2394250, while we need 2428285. The number of seeds (1899) and the number of topological clusters (257) both seem reasonable. In my mind this is just the same situation as #44634. Dynamic allocation of the rechit fraction SoA would probably alleviate this in a way that does not abuse the GPU memory.
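For illustration only, a standalone arithmetic check of the numbers quoted above (plain C++, not CMSSW code; the names are made up for the sketch):

#include <cstdio>

int main() {
  const long long nRH = 9577;            // rechits in event 29703372
  const long long capacity = nRH * 250;  // fixed factor of 250 fractions per rechit -> 2394250
  const long long needed = 2428285;      // size reported at the end of topoClusterContraction
  std::printf("capacity=%lld needed=%lld overflow=%lld\n", capacity, needed, needed - capacity);
  return 0;
}

This prints an overflow of 34035 indices, i.e. the event exceeds the fixed per-rechit budget rather than hitting a corrupted size.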

I remember during the Alpaka clustering development the recommendation was to allocate as much memory as needed in the configuration since dynamic allocations had a notable detriment to performance. Has anything changed in this regard? Otherwise the "safest" configuration is nRH*nRH and this would be unrealistic.

@fwyzard
Contributor

fwyzard commented Jul 17, 2024

I remember during the Alpaka clustering development the recommendation was to allocate as much memory as needed in the configuration since dynamic allocations had a notable detriment to performance. Has anything changed in this regard?

It depends on what is needed for the dynamic allocation.

If the only requirement is to change a configuration value with a runtime value, I don't expect any impact.

If it also requires splitting a kernel in two, it may add some overhead.

@jsamudio
Contributor

I remember during the Alpaka clustering development the recommendation was to allocate as much memory as needed in the configuration since dynamic allocations had a notable detriment to performance. Has anything changed in this regard?

It depends on what is needed for the dynamic allocation.

If the only requirement is to change a configuration value with a runtime value, I don't expect any impact.

If it also requires splitting a kernel in two, it may add some overhead.

In CUDA we had a cudaMemcpyAsync device to host with the number of rechit fractions needed, and some cms::cuda::make_device_unique using that number. These steps were taken between two CUDA kernel invocations in the .cu, equivalent to between two alpaka::exec in the .dev.cc in Alpaka. Is such a thing possible in Alpaka or would we need to split things in the .cc EDProducer?
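For context, a minimal sketch of the CUDA-era pattern described above, written as plain CUDA host code with hypothetical kernel and variable names (the actual code used cms::cuda::make_device_unique with the caching allocator rather than cudaMallocAsync):

#include <cuda_runtime.h>

// Hypothetical kernels standing in for the real clustering steps.
__global__ void countFractionsKernel(int* nFracNeeded) { /* ... */ }
__global__ void fillFractionsKernel(float* fracs, int nFracNeeded) { /* ... */ }

void runClustering(int* d_nFracNeeded, cudaStream_t stream) {
  countFractionsKernel<<<1, 1, 0, stream>>>(d_nFracNeeded);

  // Copy the required fraction count back to the host and wait for it.
  int nFracNeeded = 0;
  cudaMemcpyAsync(&nFracNeeded, d_nFracNeeded, sizeof(int), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  // Size the rechit fraction buffer from the runtime value instead of nRH * 250.
  float* d_fracs = nullptr;
  cudaMallocAsync(&d_fracs, nFracNeeded * sizeof(float), stream);
  fillFractionsKernel<<<1, 1, 0, stream>>>(d_fracs, nFracNeeded);
  cudaFreeAsync(d_fracs, stream);
}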

@Sam-Harper
Contributor Author

(in the meantime)

"HCAL had just come back into global after doing tests" -

according to HCAL OPS conveners, the HCAL test in local was a standard sanity check (deployment) of the new L1 TriggerKey (TP LUT) with a single channel response correction update, which then went on to be used in Global. No other special tests or changes (such as configuration changes) were done.

HCAL DQM conveners have been asked to carefully review the plots of the Cosmics run in question.

Thank you, Salavat, for the correction. Indeed, there was a small game of telephone which led to my misunderstanding that the laser alignment tests were ongoing, when in fact they started just after this run.

@abdoulline

abdoulline commented Jul 17, 2024

@Sam-Harper
my apologies, Sam...
In fact HCAL OPS has already realized/admitted:

  • the laser test was performed in local and the laser was then accidentally left on (-> firing) during the aforementioned Cosmics run, apparently because the laser-testing colleagues weren't properly notified about HCAL's move to the global Cosmics run... ☹️

Update:
DQM colleagues did confirm that the HCAL barrel occupancy in the problematic event pointed out by Sam in the intro is ~90% (~8k hits above ZeroSuppression), while in pp collisions it is kept at < 30% (and it is naturally lower in regular Cosmics).

@fwyzard
Contributor

fwyzard commented Jul 17, 2024

In CUDA we had a cudaMemcpyAsync device to host with the number of rechit fractions needed, and some cms::cuda::make_device_unique using that number.

These steps were taken between two CUDA kernel invocations in the .cu, equivalent to between two alpaka::exec in the .dev.cc in Alpaka.

Is such a thing possible in Alpaka or would we need to split things in the .cc EDProducer?

It's possible to implement the same logic in Alpaka, but (like for CUDA) you also need to split the EDProducer in two, to introduce the synchronisation after the memcpy.

@makortel
Contributor

assign hlt, reconstruction

@cmsbuild
Contributor

New categories assigned: hlt,reconstruction

@Martin-Grunewald,@mmusich,@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

In CUDA we had a cudaMemcpyAsync device to host with the number of rechit fractions needed, and some cms::cuda::make_device_unique using that number.

These steps were taken between two CUDA kernel invocations in the .cu, equivalent to between two alpaka::exec in the .dev.cc in Alpaka.

Is such a thing possible in Alpaka or would we need to split things in the .cc EDProducer?

It's possible to implement the same logic in Alpaka, but (like for CUDA) you also need to split the EDProducer in two, to introduce the synchronisation after the memcpy.

Just to clarify, given that PFClusterSoAProducer (this is the module in question, no?) is a stream::EDProducer<>, the "splitting in two" would mean changing the base class to stream::SynchronizingEDProducer, moving the first part of the code, up to the device-to-host memcpy(), into the acquire() member function, and leaving the rest in the produce() member function.
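As a rough skeleton of that split (illustrative only; the exact base class and member-function signatures should be checked against HeterogeneousCore/AlpakaCore, and the comments are assumptions about where the existing kernels would move):

class PFClusterSoAProducer : public stream::SynchronizingEDProducer<> {
public:
  void acquire(device::Event const& event, device::EventSetup const& setup) override {
    // Launch the kernels up to and including topoClusterContraction, then queue the
    // device-to-host copy of the number of rechit fractions. The framework performs
    // the synchronisation between acquire() and produce().
  }

  void produce(device::Event& event, device::EventSetup const& setup) override {
    // The copied count is now valid on the host: allocate the rechit fraction SoA
    // with the exact size, launch the remaining kernels, and put() the products.
  }
};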

@jfernan2
Contributor

+1
Solved by #46135

@mmusich
Contributor

mmusich commented Oct 5, 2024

proposed solutions:

@mmusich
Contributor

mmusich commented Oct 5, 2024

@cmsbuild
Contributor

cmsbuild commented Oct 5, 2024

This issue is fully signed and ready to be closed.

@makortel
Contributor

makortel commented Oct 7, 2024

@cmsbuild, please close

@cmsbuild cmsbuild closed this as completed Oct 7, 2024