HLT GPU crash observed during the Heavy Ion test Run #40623

Open · denerslemos opened this issue Jan 26, 2023 · 16 comments

@denerslemos

Hi all,

During the HI test run at the end of 2022 we observed some crashes (roughly 1 in every ~2000 events) that appear to come from the HLT GPUs.

I have reproduced some of the errors using Run 362317, which is still available in the error stream (/store/error_stream/run362317). We followed the instructions in https://twiki.cern.ch/twiki/bin/view/CMS/HLTReportingFarmCrashes, but using the following settings for EvFDaqDirector and FedRawDataInputSource:

process.EvFDaqDirector.buBaseDir = '/store/error_stream'
process.EvFDaqDirector.runNumber = 362317

process.source.fileListMode = True
process.source.fileNames = cms.untracked.vstring('/store/error_stream/run362317/run362317_ls0011_index000031_fu-c2b02-21-01_pid3489238.raw')

I copy and paste the relevant part of the error here:

%MSG-w EvFDaqDirector:  DQMFileSaverPB:hltDQMFileSaverPB@beginRun  18-Jan-2023 14:50:51  Run: 362317
Transfer system mode definitions missing for -: streamDQMHistograms (permissive mode)
%MSG
too many pixels in module 44: 6932 > 6000
too many pixels in module 47: 6990 > 6000
too many pixels in module 45: 7330 > 6000
too many pixels in module 52: 6538 > 6000
too many pixels in module 54: 6788 > 6000
too many pixels in module 53: 6316 > 6000
too many pixels in module 40: 7670 > 6000
too many pixels in module 43: 7204 > 6000
too many pixels in module 42: 7392 > 6000
too many pixels in module 41: 7682 > 6000
too many pixels in module 51: 7362 > 6000
too many pixels in module 49: 7182 > 6000
too many pixels in module 48: 7334 > 6000
----- Begin Fatal Exception 18-Jan-2023 14:51:02 -----------------------
An exception of category 'CUDAError' occurred while
   [0] Processing  Event run: 362317 lumi: 11 event: 5176582 stream: 0
   [1] Running path 'AlCa_LumiPixelsCounts_ZeroBias_v4'
   [2] Calling method for module SiPixelRawToClusterCUDA/'hltSiPixelClustersGPU'
Exception Message:
Callback of CUDA stream 0x7fb8d7015670 in device 0 error cudaErrorIllegalAddress: an illegal memory access was encountered
----- End Fatal Exception -------------------------------------------------
%MSG-w FastMonitoringService:  PostProcessPath 18-Jan-2023 14:51:02   Run: 362317 Event: 5176582
 STREAM 0 earlyTermination -: ID:run: 362317 lumi: 11 event: 5176582 LS:11  FromThisContext
%MSG
terminate called after throwing an instance of 'std::runtime_error'
  what():
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_2-el8_amd64_gcc10/build/CMSSW_12_5_2-build/tmp/BUILDROOT/031342ca2bb2896e4fe0fed19213b336/opt/cmssw/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/src/CalibTracker/SiPixelESProducers/src/SiPixelGainCalibrationForHLTGPU.cc, line 77:
cudaCheck(cudaFreeHost(gainForHLTonHost_));
cudaErrorIllegalAddress: an illegal memory access was encountered

A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

Wed Jan 18 14:51:03  2023
Thread 8 (Thread 0x7fb8d7fff700 (LWP 1076578) "cmsRun"):
#0  0x00007fb90fa64d98 in nanosleep () from /lib64/libc.so.6
#1  0x00007fb90fa64c9e in sleep () from /lib64/libc.so.6
#2  0x00007fb905e9d3a0 in sig_pause_for_stacktrace () from /opt/offline/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fb90fa9355d in syscall () from /lib64/libc.so.6
#5  0x00007fb910bcd2c7 in tbb::detail::r1::futex_wait (comparand=2, futex=0x7fb90a8e912c) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-el8_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/semaphore.h:103
#6  tbb::detail::r1::binary_semaphore::P (this=0x7fb90a8e912c) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-el8_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/semaphore.h:290
#7  0x00007fb910bdf8f1 in tbb::detail::r1::rml::internal::thread_monitor::commit_wait (c=..., this=0x7fb90a8e9120) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-el8_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/rml_thread_monitor.h:243
#8  tbb::detail::r1::rml::private_worker::run (this=0x7fb90a8e9100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-el8_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/private_server.cpp:274
#9  tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fb90a8e9100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-el8_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/private_server.cpp:221
#10 0x00007fb90fd6b17a in start_thread () from /lib64/libpthread.so.0
#11 0x00007fb90fa98df3 in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7fb843ef0700 (LWP 1076577) "cmsRun"):
#0  0x00007fb90fd73cd6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fb90fd73dc8 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fb90286e0a2 in ?? () from /lib64/libcuda.so.1

.
.
.

Current Modules:

Module: none (crashed)
Module: none

A fatal system signal has occurred: abort signal

Since these are HI collisions, many tracks are produced, and it looks like the number of pixels per module exceeds some threshold, which I think crashes hltSiPixelClustersGPU. It would be good if we could solve this issue.

Thank you in advance,

Best regards,
Dener Lemos

@FHead
@missirol
@fwyzard
@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI

@cmsbuild

A new Issue was created by @denerslemos Dener Lemos.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel

assign heterogeneous, hlt, reconstruction

FYI @cms-sw/trk-dpg-l2, @cms-sw/tracking-pog-l2

@cmsbuild

New categories assigned: heterogeneous,hlt,reconstruction

@mandrenguyen, @missirol, @fwyzard, @clacaputo, @makortel, @Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign. Thanks

@makortel

Does the job use multiple threads/streams? If it does, the actual error could happen somewhere other than hltSiPixelClustersGPU, and that module may just be the first one to catch and report it (because, in CUDA, errors from asynchronous processing are reported by all CUDA API calls issued after the error).
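
As an illustration of the last point, here is a minimal standalone CUDA sketch (not CMSSW code; the null device pointer just stands in for any bad device access). The faulting kernel can belong to one module, while the error is reported by whichever later CUDA API call happens to check the status first:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void faultyKernel(int* out) {
  // 'out' is a null device pointer in this sketch, so this write is an
  // illegal memory access; it is only detected asynchronously, later on.
  out[threadIdx.x] = 42;
}

int main() {
  faultyKernel<<<1, 32>>>(nullptr);

  // The launch itself typically still reports success at this point.
  printf("right after launch  : %s\n", cudaGetErrorString(cudaGetLastError()));

  // The first call that observes the failure reports cudaErrorIllegalAddress,
  // and the error then "sticks" to the context, so unrelated calls issued
  // afterwards (e.g. from a different framework module) report it as well.
  printf("after synchronize   : %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

  int* unrelated = nullptr;
  printf("unrelated cudaMalloc: %s\n", cudaGetErrorString(cudaMalloc(&unrelated, 64)));
  return 0;
}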

@denerslemos

I ran using the default configuration, which I think is multi-threaded. Should I test it again using a single thread?

@AdrianoDee

AdrianoDee commented Feb 1, 2023

This goes away with something like

diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 48dfa98839d..6e3d50e6d5b 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -656,7 +656,7 @@ namespace pixelgpudetails {
           digis_d.view().moduleInd(), clusters_d.moduleStart(), digis_d.view().clus(), wordCounter);
       cudaCheck(cudaGetLastError());
 
-      threadsPerBlock = 256 + 128;  /// should be larger than 6000/16 aka (maxPixInModule/maxiter in the kernel)
+      threadsPerBlock = 256 + 128 + 128;  /// should be larger than 6000/16 aka (maxPixInModule/maxiter in the kernel)
       blocks = phase1PixelTopology::numberOfModules;
 #ifdef GPU_DEBUG
       std::cout << "CUDA findClus kernel launch with " << blocks << " blocks of " << threadsPerBlock << " threads\n";
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h b/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
index ed3510e4918..fe36d22ab46 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
@@ -141,7 +141,7 @@ namespace gpuClustering {
 
       //init hist  (ymax=416 < 512 : 9bits)
       //6000 max pixels required for HI operations with no measurable impact on pp performance
-      constexpr uint32_t maxPixInModule = 6000;
+      constexpr uint32_t maxPixInModule = 10000;
       constexpr auto nbins = isPhase2 ? 1024 : phase1PixelTopology::numColsInModule + 2;  //2+2;
       constexpr auto nbits = isPhase2 ? 10 : 9;                                           //2+2;
       using Hist = cms::cuda::HistoContainer<uint16_t, nbins, maxPixInModule, nbits, uint16_t>;
(after editing SiPixelRawToClusterGPUKernel.cu once more, so that threadsPerBlock also scales with the new limit, the final git diff is:)
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 48dfa98839d..e5d59b1540b 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -656,7 +656,7 @@ namespace pixelgpudetails {
           digis_d.view().moduleInd(), clusters_d.moduleStart(), digis_d.view().clus(), wordCounter);
       cudaCheck(cudaGetLastError());
 
-      threadsPerBlock = 256 + 128;  /// should be larger than 6000/16 aka (maxPixInModule/maxiter in the kernel)
+      threadsPerBlock = 256 + 128 + 128 + 128;  /// should be larger than 10000/16 aka (maxPixInModule/maxiter in the kernel)
       blocks = phase1PixelTopology::numberOfModules;
 #ifdef GPU_DEBUG
       std::cout << "CUDA findClus kernel launch with " << blocks << " blocks of " << threadsPerBlock << " threads\n";
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h b/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
index ed3510e4918..fe36d22ab46 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
@@ -141,7 +141,7 @@ namespace gpuClustering {
 
       //init hist  (ymax=416 < 512 : 9bits)
       //6000 max pixels required for HI operations with no measurable impact on pp performance
-      constexpr uint32_t maxPixInModule = 6000;
+      constexpr uint32_t maxPixInModule = 10000;
       constexpr auto nbins = isPhase2 ? 1024 : phase1PixelTopology::numColsInModule + 2;  //2+2;
       constexpr auto nbits = isPhase2 ? 10 : 9;                                           //2+2;
       using Hist = cms::cuda::HistoContainer<uint16_t, nbins, maxPixInModule, nbits, uint16_t>;

in 12_5_3. It would be useful to have a distribution of the number of pixels per module. When checking with RelValHydjetQ_B12_5020GeV_2021 (e.g.), I see that the distribution is well below 6000 (see below, for 5000 events). Is there a more representative sample?

[figure: distribution of the number of pixels per module in RelValHydjetQ_B12_5020GeV_2021 (5000 events), well below 6000]
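
For reference, the constraint quoted in the in-code comments above can be written as a small compile-time check. This is only a sketch: maxiter = 16 is taken from the "maxPixInModule/maxiter" comment in the diff, not from an actual CMSSW header.

#include <cstdint>

constexpr uint32_t maxPixInModule = 10000;                   // new per-module pixel limit
constexpr uint32_t maxiter = 16;                             // assumed from the kernel comment
constexpr uint32_t threadsPerBlock = 256 + 128 + 128 + 128;  // 640

// ceil(10000 / 16) = 625, so 640 threads per block are enough. The original
// 256 + 128 = 384 covered 384 * 16 = 6144 pixels (sufficient for 6000), while
// the intermediate 256 + 128 + 128 = 512 would cover only 8192 pixels, which
// is why the final diff bumps the value once more.
static_assert(threadsPerBlock * maxiter >= maxPixInModule,
              "threadsPerBlock must cover maxPixInModule / maxiter pixels");

int main() { return 0; }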

@mandrenguyen

The "B12" in the sample name means its peripheral events, so pretty light.
A MinBias Hydjet sample is here:
https://cmsweb.cern.ch/das/request?input=/MinBias_Hydjet_Drum5F_5p02TeV/Run3Winter22PbPbNoMixRECOMiniAOD-122X_mcRun3_2021_realistic_HI_v10-v3/MINIAODSIM

There's also real data from the test run, but the MB trigger was pretty noisy.
Maybe best to stick with MC.

@AdrianoDee

@mandrenguyen thanks! I'll check that then.

@missirol

Any chance that a fix for this could converge in time for 13_0_0?

@mmusich

mmusich commented Feb 24, 2023

> Any chance that a fix for this could converge in time for 13_0_0?

The fix seems to be pretty HI-dependent (I don't think we can reach that occupancy in pp).
So why the urgency? My understanding is that the next HI run will be processed in 13_2_X (as per this).
Does HLT plan to stick to 13_0_X instead?

@missirol

> (I don't think we can reach that occupancy in pp)

I didn't know that. I agree it's not urgent; I asked as a way to understand the status of the fix.

@AdrianoDee

AdrianoDee commented Feb 25, 2023

In principle this could be disentangled from pp (and that is my plan for the final fix). I have a PR basically ready for this, but with the whole Alpaka migration happening in the background I would wait for that to land and then apply this fix on top.

@missirol

missirol commented Aug 7, 2023

@AdrianoDee, was this issue resolved by #41632?

@AdrianoDee

@missirol yes, using the HIonPhase1 modules in place of the standard "pp" ones.

@mmusich

mmusich commented Oct 14, 2023

+hlt

@mandrenguyen

+1
