Set max channels separately EE and EB for ECAL #517
Validation summary
Reference release CMSSW_11_1_0 at b7ad279
Validation plots
/RelValTTbar_14TeV/CMSSW_11_1_0_pre8-PU_111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW
/RelValZMM_14/CMSSW_11_1_0_pre8-111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW
/RelValZEE_14/CMSSW_11_1_0_pre8-111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW
Throughput plots
/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53
logs and
|
It looks like there are still problems in the ECAL code, as reported in the TTbar step3.log. To reproduce, just run the TTbar step3.py using CMSSW_11_1_X plus this PR. Some changes to make the debugging simpler:
# process one event at a time
process.options.numberOfThreads = cms.untracked.uint32( 1 )
process.options.numberOfStreams = cms.untracked.uint32( 1 )
# skip the first 95 events
process.source.skipEvents = cms.untracked.uint32(95)
# silence the EcalDQM messages
process.MessageLogger.categories.append("EcalDQM")
process.MessageLogger.cerr.EcalDQM = cms.untracked.PSet(
limit = cms.untracked.int32(0)
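)
Running under cuda-memcheck with those changes, I got an invalid __global__ read in ecal::rechit::kernel_create_ecal_rehit (EcalRecHitBuilderKernels.cu:215), followed by many more errors, and eventually a segmentation fault. @amassiro @vkhristenko could you have a look? |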
@amassiro, it’s in rechit
…On Sat, 18 Jul 2020 at 13:20, Andrea Bocci ***@***.***> wrote:
It looks like there are still problems in the ECAL code, as reported in
the TTbar step3.log
<https://patatrack.web.cern.ch/patatrack/validation/pulls/3dd4b1cc050826346d8527b3ec41c9817ef678b4/RelValTTbar_14TeV-CMSSW_11_1_0_pre8-PU_111X_mcRun3_2021_realistic_v4-v1/testing-11634.512-step3.log>
.
To reproduce, just run the TTbar step3.py
<https://patatrack.web.cern.ch/patatrack/validation/pulls/3dd4b1cc050826346d8527b3ec41c9817ef678b4/RelValTTbar_14TeV-CMSSW_11_1_0_pre8-PU_111X_mcRun3_2021_realistic_v4-v1/testing-11634.512-step3.py>
using CMSSW_11_1_X plus this PR.
Some changes to make the debugging simpler:
# process one event at a time
process.options.numberOfThreads = cms.untracked.uint32( 1 )
process.options.numberOfStreams = cms.untracked.uint32( 1 )
# skip the first 95 events
process.source.skipEvents = cms.untracked.uint32(95)
# silence the EcalDQM messages
process.MessageLogger.categories.append("EcalDQM")
process.MessageLogger.cerr.EcalDQM = cms.untracked.PSet(
limit = cms.untracked.int32(0)
)
Running under cuda-memcheck with those changes, I got
18-Jul-2020 13:11:07 CEST Initiating request to open file file:/gpu_data/store/relval/CMSSW_11_1_0_pre8/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_111X_mcRun3_2021_realistic_v4-v1/20000/6767846A-04AA-AD40-BDAB-407450210E53.root
18-Jul-2020 13:11:09 CEST Successfully opened file file:/gpu_data/store/relval/CMSSW_11_1_0_pre8/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_111X_mcRun3_2021_realistic_v4-v1/20000/6767846A-04AA-AD40-BDAB-407450210E53.root
Begin processing the 1st record. Run 1, Event 6200, LumiSection 62 on stream 0 at 18-Jul-2020 13:11:20.606 CEST
ebdigis.size: 1440
eedigis.size: 748
Begin processing the 2nd record. Run 1, Event 6198, LumiSection 62 on stream 0 at 18-Jul-2020 13:11:22.404 CEST
ebdigis.size: 2145
eedigis.size: 654
Begin processing the 3rd record. Run 1, Event 6195, LumiSection 62 on stream 0 at 18-Jul-2020 13:11:22.958 CEST
ebdigis.size: 1804
eedigis.size: 1077
Begin processing the 4th record. Run 1, Event 6197, LumiSection 62 on stream 0 at 18-Jul-2020 13:11:23.542 CEST
ebdigis.size: 2661
eedigis.size: 818
Begin processing the 5th record. Run 1, Event 6199, LumiSection 62 on stream 0 at 18-Jul-2020 13:11:24.142 CEST
========= Invalid __global__ read of size 8
========= at 0x00000530 in /data/user/fwyzard/patatrack/validation/run_517.7dOQfOmvHU/testing/src/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitBuilderKernels.cu:215:ecal::rechit::kernel_create_ecal_rehit(int const *, unsigned int, bool, bool, bool, bool, bool, bool, bool, float, float, float, float, int const *, unsigned int const *, unsigned int const *, unsigned int, unsigned int, float const *, float const *, unsigned short const *, float const *, float const *, float const *, float const *, float const *, __int64 const *, __int64 const *, __int64 const *, float const *, float const *, float const *, __int64 const *, __int64 const *, __int64 const *, __int64, unsigned int const *, unsigned int const *, float const *, float const *, float const *, float const *, float const *, float const *, unsigned int const *, unsigned int const *, unsigned int*, unsigned int*, float*, float*, float*, float*, float*, float*, unsigned int*, unsigned int*, unsigned int*, unsigned int*, int, unsigned int, unsigned int)
========= by thread (12,0,0) in block (80,0,0)
========= Address 0x7fc2c17fcdf0 is out of bounds
========= Device Frame:/data/user/fwyzard/patatrack/validation/run_517.7dOQfOmvHU/testing/src/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitBuilderKernels.cu:215:ecal::rechit::kernel_create_ecal_rehit(int const *, unsigned int, bool, bool, bool, bool, bool, bool, bool, float, float, float, float, int const *, unsigned int const *, unsigned int const *, unsigned int, unsigned int, float const *, float const *, unsigned short const *, float const *, float const *, float const *, float const *, float const *, __int64 const *, __int64 const *, __int64 const *, float const *, float const *, float const *, __int64 const *, __int64 const *, __int64 const *, __int64, unsigned int const *, unsigned int const *, float const *, float const *, float const *, float const *, float const *, float const *, unsigned int const *, unsigned int const *, unsigned int*, unsigned int*, float*, float*, float*, float*, float*, float*, unsigned int*, unsigned int*, unsigned int*, unsigned int*, int, unsigned int, unsigned int) (ecal::rechit::kernel_crea
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x34e) [0x2c74be]
========= Host Frame:/data/user/fwyzard/patatrack/validation/run_517.7dOQfOmvHU/testing/external/slc7_amd64_gcc820/lib/libcudart.so.11.0 [0xf62b]
========= Host Frame:/data/user/fwyzard/patatrack/validation/run_517.7dOQfOmvHU/testing/external/slc7_amd64_gcc820/lib/libcudart.so.11.0 (cudaLaunchKernel + 0x1c1) [0x4f5b1]
========= Host Frame:/data/user/fwyzard/patatrack/validation/run_517.7dOQfOmvHU/testing/lib/slc7_amd64_gcc820/pluginRecoLocalCaloEcalRecProducersPlugins.so (_Z201__device_stub__ZN4ecal6rechit24kernel_create_ecal_rehitEPKijbbbbbbbffffS2_PKjS4_jjPKfS6_PKtS6_S6_S6_S6_S6_PKySA_SA_S6_S6_S6_SA_SA_SA_yS4_S4_S6_S6_S6_S6_S6_S6_S4_S4_PjSB_PfSC_SC_SC_SC_SC_SB_SB_SB_SB_ijjPKijbbbbbbbffffS0_PKjS2_jjPKfS4_PKtS4_S4_S4_S4_S4_PKyS8_S8_S4_S4_S4_S8_S8_S8_yS2_S2_S4_S4_S4_S4_S4_S4_S2_S2_PjS9_PfSA_SA_SA_SA_SA_S9_S9_S9_S9_ijj + 0x582) [0x1e90f2]
...
followed by many more errors, and eventually a segmentation fault.
@amassiro <https://github.com/amassiro> @vkhristenko
<https://github.com/vkhristenko> could you have a look ?
|
I'm looking at it ... but so far I could not find the error. One question: what does "process.validation_step" do? If I remove it, it runs with no errors; once I put it back, I get this error message:
|
More typesafe for sure!
…On Sat, 18 Jul 2020 at 18:51, Andrea Bocci ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In RecoLocalCalo/EcalRecProducers/plugins/DeclsForKernels.h
<#517 (comment)>:
> MYMALLOC(tcState, size);
- //cudaCheck(cudaMalloc((void**)&tcState, size * sizeof(TimeComputationState)));
}
I can add that.
Actually I was thinking of replacing the #define with a lambda - would it
make more sense ?
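The idea of replacing the `#define` with something typed can be sketched as follows. This is a host-only illustration, not the code in this PR: `std::malloc` stands in for `cudaMalloc` so that it compiles without CUDA, it uses a function template rather than a lambda, and `allocateTyped`, `FreeDeleter`, `Buffer` and the `TimeComputationState` layout shown here are hypothetical. The point is that the byte count is derived from the element type, so it cannot drift out of sync with the pointer being filled.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Placeholder for the real type; this layout is made up for illustration.
struct TimeComputationState {
  char value;
};

// Deleter matching the std::malloc stand-in (it would wrap cudaFree in a CUDA build).
struct FreeDeleter {
  void operator()(void* p) const noexcept { std::free(p); }
};

template <typename T>
using Buffer = std::unique_ptr<T[], FreeDeleter>;

// Allocate room for `count` objects of type T; the size in bytes is derived from T.
template <typename T>
Buffer<T> allocateTyped(std::size_t count) {
  void* raw = std::malloc(count * sizeof(T));
  if (raw == nullptr && count != 0)
    throw std::bad_alloc{};
  return Buffer<T>(static_cast<T*>(raw));
}
```

A call such as `auto tcState = allocateTyped<TimeComputationState>(size);` would then replace `MYMALLOC(tcState, size);`, and the memory is released automatically when `tcState` goes out of scope.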
|
It runs the ECAL-only validation, which is somehow adapted from the standard ECAL validation. If you remove it, does the |
@amassiro I've tried removing the
This was with |
@amassiro where does the * 10 come from? See cmssw/EventFilter/EcalRawToDigi/plugins/EcalCPUDigisProducer.cc, lines 136 to 141 in 88d42f7
|
sorry, I meant I left everything up to the validation step (excluded) |
It should be the 10 digits per channel: each channel has 10 int (10 sampled points from the electronics pulse shape) |
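To make the factor of 10 concrete, here is a minimal host-side sketch, not the actual CMSSW code. It assumes a flat `uint16_t` buffer with a channel-major layout and 10 samples per channel; the names `kSamplesPerChannel`, `digiBufferSize`, `sampleIndex` and `channelSamples` are hypothetical. A buffer sized with the channel count alone would be too short by a factor of 10 for this layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constant: 10 samples of the pulse shape per channel.
constexpr std::size_t kSamplesPerChannel = 10;

// Number of uint16_t elements needed to hold the digis of nChannels channels.
constexpr std::size_t digiBufferSize(std::size_t nChannels) {
  return nChannels * kSamplesPerChannel;
}

// Flat index of one sample of one channel (channel-major layout, assumed).
constexpr std::size_t sampleIndex(std::size_t channel, std::size_t sample) {
  return channel * kSamplesPerChannel + sample;
}

// Copy the samples of one channel out of the flat buffer, checking bounds first.
inline std::vector<std::uint16_t> channelSamples(std::vector<std::uint16_t> const& digis,
                                                 std::size_t channel) {
  assert(sampleIndex(channel, kSamplesPerChannel - 1) < digis.size());
  auto first = digis.begin() + static_cast<std::ptrdiff_t>(sampleIndex(channel, 0));
  return std::vector<std::uint16_t>(first, first + static_cast<std::ptrdiff_t>(kSamplesPerChannel));
}
```

For example, an event with 2145 barrel digis needs `digiBufferSize(2145)` elements, i.e. 21450 values, for the samples alone.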
EcalDataFrame has a static const for that; the hardcoded 10 can be replaced with it,
as is done in other places
…On Sun, 19 Jul 2020 at 16:04, Andrea Massironi ***@***.***> wrote:
@amassiro <https://github.com/amassiro> where does the * 10 comes from ?
https://github.com/cms-patatrack/cmssw/blob/88d42f7fd2ccd59844e67d4d8b48b3128052f842/EventFilter/EcalRawToDigi/plugins/EcalCPUDigisProducer.cc#L136-L141
It should be the 10 digits per channel: each channel has 10 int (10
sampled points from the electronics pulse shape)
|
…massiro/cmssw into amassiro-ecal-maxchannels-ebee-11-1-v2
The fix in 56536bc should deal with the error shown before. |
@@ -40,11 +41,10 @@ namespace ecal {
// FIXME: we should separate max channels parameter for eb and ee
// FIXME: replace hardcoded values
void allocate(ConfigurationParameters const &config, cudaStream_t cudaStream) {
digisEB.data = cms::cuda::make_device_unique<uint16_t[]>(config.maxChannels, cudaStream);
digisEE.data = cms::cuda::make_device_unique<uint16_t[]>(config.maxChannels, cudaStream);
Maybe the missing *10 here is also why we were not able to fully validate the MC workflow in the past?
... btw, now it should be fixed.
thanks @amassiro indeed I can re-run the failing workflows without crashes, and |
I'll re-run the tests one last time, and merge if they don't show additional failures. |
Validation summary
Reference release CMSSW_11_1_0 at b7ad279
Validation plots
/RelValTTbar_14TeV/CMSSW_11_1_0_pre8-PU_111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW
/RelValZMM_14/CMSSW_11_1_0_pre8-111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW
/RelValZEE_14/CMSSW_11_1_0_pre8-111X_mcRun3_2021_realistic_v4-v1/GEN-SIM-DIGI-RAW
Throughput plots
/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53
🚧 Validation running at fu-c2a02-35-02:/data/user/fwyzard/patatrack/validation/run_517.RyT4TIWegb ... |
So, making progress... There still is an issue under
However I think it makes sense to merge this PR, and look into this issue separately. |
Set max channels separately EE and EB for ECAL
Similar to #516, but now for 11_1_X release
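As an illustration of the idea in the title, a minimal host-side sketch follows. It is not the code in this PR: the `EcalChannelLimits` and `DigiBuffers` structs, their field names, and the use of `std::vector` in place of device memory are all assumptions. Each sample buffer is sized from its own limit, times the 10 samples per channel discussed in the conversation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// 10 samples per channel, as discussed in the conversation.
constexpr std::size_t kSamples = 10;

// Hypothetical configuration: one limit per subdetector instead of a shared one.
struct EcalChannelLimits {
  std::size_t maxChannelsEB;  // barrel
  std::size_t maxChannelsEE;  // endcaps
};

// Hypothetical host-side stand-in for the device buffers.
struct DigiBuffers {
  std::vector<std::uint16_t> digisEB;  // maxChannelsEB * kSamples values
  std::vector<std::uint16_t> digisEE;  // maxChannelsEE * kSamples values
  std::vector<std::uint32_t> idsEB;    // one id per barrel channel
  std::vector<std::uint32_t> idsEE;    // one id per endcap channel
};

// Size every buffer from the limit of the subdetector it belongs to.
inline DigiBuffers allocate(EcalChannelLimits const& limits) {
  DigiBuffers buffers;
  buffers.digisEB.resize(limits.maxChannelsEB * kSamples);
  buffers.digisEE.resize(limits.maxChannelsEE * kSamples);
  buffers.idsEB.resize(limits.maxChannelsEB);
  buffers.idsEE.resize(limits.maxChannelsEE);
  return buffers;
}
```

With the full detector as limits, `allocate(EcalChannelLimits{61200, 14648})` reserves 612000 barrel samples and 146480 endcap samples, so the endcap buffers no longer have to be sized for the much larger barrel.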