Do not use the ECAL calibrated rechits from the GPU workflow #592


@fwyzard fwyzard commented Dec 14, 2020

The ECAL calibrated rechits produced on the GPU are not yet correct.
Stop using them in the GPU workflows until they work and have been validated.

@fwyzard

fwyzard commented Dec 18, 2020

Validation summary

Reference release CMSSW_11_2_0_pre10 at 6c149b2
Development branch cms-patatrack/CMSSW_11_2_X_Patatrack at 6a192be
Testing branch cms-patatrack/CMSSW_11_2_X_Patatrack at 6a192be with PRs:

Validation plots

/RelValTTbar_14TeV/CMSSW_11_2_0_pre7-PU_112X_mcRun3_2021_realistic_v8-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 11634.5
  • tracking validation plots and summary for workflow 11634.501
  • tracking validation plots and summary for workflow 11634.502
  • tracking validation plots and summary for workflow 11634.505
  • tracking validation plots and summary for workflow 11634.506

/RelValZMM_14/CMSSW_11_2_0_pre7-112X_mcRun3_2021_realistic_v8-v2/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 11634.5
  • tracking validation plots and summary for workflow 11634.501
  • tracking validation plots and summary for workflow 11634.502
  • tracking validation plots and summary for workflow 11634.505
  • tracking validation plots and summary for workflow 11634.506

/RelValZEE_14/CMSSW_11_2_0_pre7-112X_mcRun3_2021_realistic_v8-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 11634.5
  • tracking validation plots and summary for workflow 11634.501
  • tracking validation plots and summary for workflow 11634.502
  • tracking validation plots and summary for workflow 11634.505
  • tracking validation plots and summary for workflow 11634.506

Validation plots (CPU vs GPU)

/RelValTTbar_14TeV/CMSSW_11_2_0_pre7-PU_112X_mcRun3_2021_realistic_v8-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflows 11634.502 and 11634.501
  • tracking validation plots and summary for workflows 11634.506 and 11634.505

/RelValZMM_14/CMSSW_11_2_0_pre7-112X_mcRun3_2021_realistic_v8-v2/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflows 11634.502 and 11634.501
  • tracking validation plots and summary for workflows 11634.506 and 11634.505

/RelValZEE_14/CMSSW_11_2_0_pre7-112X_mcRun3_2021_realistic_v8-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflows 11634.502 and 11634.501
  • tracking validation plots and summary for workflows 11634.506 and 11634.505

Throughput plots

/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53

scan-136.885502.png
zoom-136.885502.png
scan-136.885512.png
zoom-136.885512.png
scan-136.885522.png
zoom-136.885522.png

logs and nvprof/nvvp profiles

/RelValTTbar_14TeV/CMSSW_11_2_0_pre7-PU_112X_mcRun3_2021_realistic_v8-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 11634.5
  • development release, workflow 11634.5
  • development release, workflow 11634.501
  • development release, workflow 11634.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 11634.505
  • development release, workflow 11634.506
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 11634.511
  • development release, workflow 11634.512
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • cuda-memcheck --tool synccheck (report, log) found no CUDA-MEMCHECK results
  • development release, workflow 11634.521
  • development release, workflow 11634.522
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.885502
  • development release, workflow 136.885512
  • development release, workflow 136.885522
  • testing release, workflow 11634.5
  • testing release, workflow 11634.501
  • testing release, workflow 11634.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 11634.505
  • testing release, workflow 11634.506
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 11634.511
  • testing release, workflow 11634.512
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • cuda-memcheck --tool synccheck (report, log) found no CUDA-MEMCHECK results
  • testing release, workflow 11634.521
  • testing release, workflow 11634.522
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • cuda-memcheck --tool initcheck (report, log) found 0 errors
    • cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) found 0 errors
    • cuda-memcheck --tool synccheck (report, log) found 0 errors
  • testing release, workflow 136.885502
  • testing release, workflow 136.885512
  • testing release, workflow 136.885522

/RelValZMM_14/CMSSW_11_2_0_pre7-112X_mcRun3_2021_realistic_v8-v2/GEN-SIM-DIGI-RAW

  • reference release, workflow 11634.5
  • development release, workflow 11634.5
  • development release, workflow 11634.501
  • development release, workflow 11634.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 11634.505
  • development release, workflow 11634.506
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 11634.511
  • development release, workflow 11634.512
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • cuda-memcheck --tool synccheck (report, log) found no CUDA-MEMCHECK results
  • development release, workflow 11634.521
  • development release, workflow 11634.522
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.885502
  • development release, workflow 136.885512
  • development release, workflow 136.885522
  • testing release, workflow 11634.5
  • testing release, workflow 11634.501
  • testing release, workflow 11634.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 11634.505
  • testing release, workflow 11634.506
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 11634.511
  • testing release, workflow 11634.512
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • cuda-memcheck --tool synccheck (report, log) found no CUDA-MEMCHECK results
  • testing release, workflow 11634.521
  • testing release, workflow 11634.522
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.885502
  • testing release, workflow 136.885512
  • testing release, workflow 136.885522

/RelValZEE_14/CMSSW_11_2_0_pre7-112X_mcRun3_2021_realistic_v8-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 11634.5
  • development release, workflow 11634.5
  • development release, workflow 11634.501
  • development release, workflow 11634.502
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 11634.505
  • development release, workflow 11634.506
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 11634.511
  • development release, workflow 11634.512
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • cuda-memcheck --tool synccheck (report, log) found no CUDA-MEMCHECK results
  • development release, workflow 11634.521
  • development release, workflow 11634.522
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.885502
  • development release, workflow 136.885512
  • development release, workflow 136.885522
  • testing release, workflow 11634.5
  • testing release, workflow 11634.501
  • testing release, workflow 11634.502
    • step3.py: log
    • profile.py: log, profile and summary are missing, see the full log for more information
    • ⚠️ cuda-memcheck --tool initcheck did not run
    • ⚠️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all did not run
    • ⚠️ cuda-memcheck --tool synccheck did not run
  • testing release, workflow 11634.505
  • testing release, workflow 11634.506
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 11634.511
  • testing release, workflow 11634.512
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • cuda-memcheck --tool synccheck (report, log) found no CUDA-MEMCHECK results
  • testing release, workflow 11634.521
  • testing release, workflow 11634.522
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.885502
  • testing release, workflow 136.885512
  • testing release, workflow 136.885522
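
The three cuda-memcheck checks listed for the GPU workflows above could be reproduced with something like the following sketch. The tool options are the ones quoted in the report; running them on `cmsRun step3.py` is an assumption, since the report does not say which configuration the validation actually wraps, and `memcheck_commands` is a hypothetical helper name.

```python
import shlex

# Options copied from the validation report; one entry per cuda-memcheck tool.
MEMCHECK_TOOLS = {
    "initcheck": ["--tool", "initcheck"],
    "memcheck": ["--tool", "memcheck", "--leak-check", "full",
                 "--report-api-errors", "all"],
    "synccheck": ["--tool", "synccheck"],
}


def memcheck_commands(config="step3.py"):
    """Build one command line per tool (assumption: the job is `cmsRun <config>`)."""
    return {
        name: ["cuda-memcheck", *options, "cmsRun", config]
        for name, options in MEMCHECK_TOOLS.items()
    }


if __name__ == "__main__":
    # Only print the commands; nothing is executed here.
    for name, command in memcheck_commands().items():
        print(f"{name}: {shlex.join(command)}")
```

Each command list can be passed to `subprocess.run` on a machine with a GPU and a CMSSW environment set up.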

Logs

The full log is available at https://patatrack.web.cern.ch/patatrack/validation/pulls/4a188869c781252b40b258ed9e5e9128eddef122/log .

@thomreis

Hi @fwyzard, are any special permissions needed to see the validation plots? I get 404 errors or "not found".

@fwyzard

fwyzard commented Dec 18, 2020 via email

@thomreis

thomreis commented Dec 18, 2020

In my test, a comparison of the uncalibrated RecHits shows agreement between CPU and GPU:
CPU:
EcalUncalibratedRecHitsSorted_ecalMultiFitUncalibRecHit_EcalUncalibRecHitsEB_amplitude_cpu
GPU:
EcalUncalibratedRecHitsSorted_ecalMultiFitUncalibRecHit_EcalUncalibRecHitsEB_amplitude_gpu

Comparing the RecHits shows differences: more RecHits are found for the CPU version. (This test includes PR #592, so the same RecHit producer should run for the CPU and GPU workflows.)
CPU:
EcalRecHitsSorted_ecalRecHit_EcalRecHitsEB_energy_cpu
GPU:
EcalRecHitsSorted_ecalRecHit_EcalRecHitsEB_energy_gpu

@thomreis

The trigger report for the GPU configuration is not what I was expecting, though. It looks as if the CPU module also runs for the uncalibrated RecHits:

TrigReport        200        100        100          0          0 ecalMultiFitUncalibRecHit
TrigReport        200        100        100          0          0 ecalMultiFitUncalibRecHitGPU
TrigReport        200        100        100          0          0 ecalMultiFitUncalibRecHitSoA

So perhaps the agreement in the post above actually comes from comparing CPU outputs with CPU outputs.

For the RecHits, on the other hand, the GPU modules do not process any events, as expected.

TrigReport        200        100        100          0          0 ecalRecHit
TrigReport          0          0          0          0          0 ecalRecHitGPU
TrigReport          0          0          0          0          0 ecalRecHitSoA
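
A check like the one above can be automated with a small parser. This is only a sketch: it assumes the five numeric columns are Visited, Executed, Passed, Failed and Error, as in the framework's module summary, and the names `ModuleRow`, `parse_trigreport` and `idle_modules` are hypothetical.

```python
from typing import NamedTuple


class ModuleRow(NamedTuple):
    # Column meaning is an assumption based on the usual module summary layout.
    visited: int
    executed: int
    passed: int
    failed: int
    error: int
    name: str


def parse_trigreport(text):
    """Collect 'TrigReport <5 integers> <module name>' rows, skipping everything else."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 7 or parts[0] != "TrigReport":
            continue
        try:
            counts = [int(p) for p in parts[1:6]]
        except ValueError:
            continue  # header or separator line, not a data row
        rows.append(ModuleRow(*counts, parts[6]))
    return rows


def idle_modules(rows):
    """Names of modules that never executed on any event."""
    return [row.name for row in rows if row.executed == 0]


REPORT = """\
TrigReport        200        100        100          0          0 ecalRecHit
TrigReport          0          0          0          0          0 ecalRecHitGPU
TrigReport          0          0          0          0          0 ecalRecHitSoA
"""

print(idle_modules(parse_trigreport(REPORT)))  # ['ecalRecHitGPU', 'ecalRecHitSoA']
```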

@thomreis

Looking more closely at the configuration, it seems that ecalMultiFitUncalibRecHit is a conversion module from the GPU format to the CPU format, so this appears to be OK.

@thomreis

Since the RecHitProducer is the same for CPU and GPU, the differences in the RecHit energy plot probably come from the inputs to the module. Looking a bit closer at the UncalibRecHits, some variables do show differences between the CPU and the GPU versions.
Agreement is seen for amplitude, pedestal, while differences are seen for amplitudeError (0 for GPU), jitter (0 for GPU), chi2 (very small), OOTamplitudes, OOTchi2, flags, and aux (0 for GPU).
Which of these variables are used by the RecHitProducer?
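
A field-by-field comparison of this kind can be sketched as follows. The numbers are made up for illustration and are not taken from the validation, and `differing_fields` is a hypothetical helper, not part of CMSSW.

```python
import math


def differing_fields(cpu, gpu, rel_tol=1e-6, abs_tol=1e-9):
    """Return the names of fields whose values differ between the two collections."""
    diffs = []
    for field in sorted(cpu):
        a = cpu[field]
        b = gpu.get(field)
        # A missing field or a different number of hits counts as a difference.
        if b is None or len(a) != len(b):
            diffs.append(field)
            continue
        same = all(
            math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
            for x, y in zip(a, b)
        )
        if not same:
            diffs.append(field)
    return diffs


# Illustrative values only: two hits per collection.
cpu_hits = {
    "amplitude": [12.5, 3.1],
    "pedestal": [200.0, 201.5],
    "amplitudeError": [0.4, 0.3],
    "jitter": [0.02, -0.01],
}
gpu_hits = {
    "amplitude": [12.5, 3.1],
    "pedestal": [200.0, 201.5],
    "amplitudeError": [0.0, 0.0],  # left at 0, as reported for the GPU
    "jitter": [0.0, 0.0],          # left at 0, as reported for the GPU
}

print(differing_fields(cpu_hits, gpu_hits))  # ['amplitudeError', 'jitter']
```

The tolerances are arbitrary here; a field such as chi2, where only very small differences were seen, would pass or fail depending on how they are chosen.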

@thomreis

Hi @fwyzard, what does the error in cuda-memcheck --tool synccheck for the .512 WFs mean? Is there some issue with the synchronisation?

@fwyzard

fwyzard commented Dec 18, 2020

Hi @thomreis, sorry about that. You can disregard the synccheck errors; I believe they are false positives.

@fwyzard

fwyzard commented Dec 18, 2020

> Agreement is seen for amplitude, pedestal, while differences are seen for amplitudeError (0 for GPU), jitter (0 for GPU), chi2 (very small), OOTamplitudes, OOTchi2, flags, and aux (0 for GPU).
> Which of these variables are used by the RecHitProducer?

No idea ...

@fwyzard fwyzard merged commit 266112f into cms-patatrack:CMSSW_11_2_X_Patatrack Dec 21, 2020
@fwyzard fwyzard deleted the dont_use_ecalRecHit_from_GPU branch December 21, 2020 16:19
Labels
bug-fix ECAL ECAL-related developments