Redesign all GPU workflows to detect if a GPU is present, and fall back to CPU otherwise #33428

Merged 11 commits on May 11, 2021

@fwyzard fwyzard commented Apr 13, 2021

PR description:

Redesign the GPU workflows:

  • the CPU (e.g. ###.501) and GPU (###.502) workflows should now be as close as possible;
  • the implementation of the CPU and GPU workflows has been simplified;
  • all GPU workflows use the SwitchProducerCUDA mechanism to detect if a GPU is available and offload a module or task to the GPU; if not, they automatically fall back to the equivalent CPU modules and tasks;
  • when the "gpu" modifier is used, the pixel local reconstruction workflow uses the "HLT" payload type both on the CPU and on the GPU, for better consistency of the results;
  • the "Patatrack" pixel tracks reconstruction on CPU is based on a modifier (pixelNtupletFit) instead of a customisation, in line with the other workflows;
  • the HCAL-only workflows should follow more closely the implementation of the general reconstruction sequence, both for Run 2 (2018) and Run 3 scenarios.
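As an illustration of the fallback behaviour described in the list above, here is a minimal, self-contained sketch in plain Python. It is not the CMSSW SwitchProducerCUDA implementation: the class ToySwitchProducer, the priority values, the gpu_available callback and the module labels are all hypothetical, chosen only to show the idea of "use the GPU case if a device is present, otherwise use the CPU case".

```python
from typing import Callable, Dict


class ToySwitchProducer:
    """Toy stand-in for a switch producer: one module label per backend."""

    # Assumed ordering: when several backends are usable, the highest value wins.
    PRIORITY = {"cpu": 1, "cuda": 2}

    def __init__(self, gpu_available: Callable[[], bool], **cases: str) -> None:
        unknown = set(cases) - set(self.PRIORITY)
        if unknown:
            raise ValueError(f"unknown backend(s): {sorted(unknown)}")
        if "cpu" not in cases:
            raise ValueError("a CPU case is required as the fallback")
        self._gpu_available = gpu_available
        self._cases: Dict[str, str] = dict(cases)

    def chosen_case(self) -> str:
        """Return the name of the best backend that is usable right now."""
        usable = [name for name in self._cases
                  if name == "cpu" or self._gpu_available()]
        return max(usable, key=lambda name: self.PRIORITY[name])

    def chosen_module(self) -> str:
        """Return the (hypothetical) module label of the chosen backend."""
        return self._cases[self.chosen_case()]


# Hypothetical labels, for illustration only.
on_gpu_node = ToySwitchProducer(lambda: True, cpu="hitsLegacy", cuda="hitsFromSoA")
on_cpu_node = ToySwitchProducer(lambda: False, cpu="hitsLegacy", cuda="hitsFromSoA")
print(on_gpu_node.chosen_module())  # hitsFromSoA
print(on_cpu_node.chosen_module())  # hitsLegacy
```

The point of the sketch is that the same configuration object is valid on both kinds of machine; only the decision taken at run time differs.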

Some changes to the relevant EDProducers have made the definition of the workflows easier:

  • the SoA-to-legacy HCAL rechit producer has been updated to make the production of the SoA and/or legacy collections optional;
  • the legacy ECAL unpacker has been updated to declare only the event products it will actually produce;
  • the default labels used in many modules have been updated to reflect the labels used in the configuration.
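The first two points above share one idea: a module should declare only the products it will really produce. The toy Python model below sketches that idea; the actual producers are C++ EDProducers, and the class ToyRecHitConverter, its flags produce_legacy and produce_soa, and the product names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyRecHitConverter:
    """Toy converter that may emit an SoA collection, a legacy collection, or both."""

    produce_soa: bool = True      # hypothetical flag name
    produce_legacy: bool = True   # hypothetical flag name

    def declared_products(self) -> List[str]:
        """List only the collections this instance will actually put in the event."""
        products: List[str] = []
        if self.produce_soa:
            products.append("soa")
        if self.produce_legacy:
            products.append("legacy")
        return products


# A configuration that only needs the legacy format declares nothing else,
# so no downstream consumer can depend on a product that is never filled.
legacy_only = ToyRecHitConverter(produce_soa=False)
print(legacy_only.declared_products())  # ['legacy']
```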

Some other general changes and code cleanup:

  • remove some no-longer-used files as well as some commented-out code;
  • always clone() a module used in a SwitchProducerCUDA;
  • move the implementation of the gpuVertexFinder kernels from gpuVertexFinderImpl.h to gpuVertexFinder.cc.
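The PR does not spell out why modules used in a SwitchProducerCUDA should always be cloned. One plausible motivation, common to any Python-based configuration, is to avoid two cases sharing a single object, so that a later change to one case silently affects the other. The sketch below shows that aliasing effect with a toy class; ToyModule, build_cases and the threshold parameter are hypothetical and do not reproduce the CMSSW configuration classes.

```python
import copy


class ToyModule:
    """Toy configuration module: just a bag of parameters."""

    def __init__(self, **params):
        self.params = dict(params)

    def clone(self, **overrides):
        """Return an independent copy, optionally with some parameters changed."""
        duplicate = copy.deepcopy(self)
        duplicate.params.update(overrides)
        return duplicate


def build_cases(template, use_clone):
    """Build the two cases of a switch, either cloning or sharing the template."""
    cpu = template.clone() if use_clone else template
    cuda = template.clone() if use_clone else template
    return {"cpu": cpu, "cuda": cuda}


# Without clone(): both cases are the same object, so the edit leaks.
shared = build_cases(ToyModule(threshold=1.0), use_clone=False)
shared["cuda"].params["threshold"] = 2.0
print(shared["cpu"].params["threshold"])  # 2.0

# With clone(): each case owns its parameters.
cloned = build_cases(ToyModule(threshold=1.0), use_clone=True)
cloned["cuda"].params["threshold"] = 2.0
print(cloned["cpu"].params["threshold"])  # 1.0
```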

The update has been presented here: https://indico.cern.ch/event/1033022/#47-gpu-workflows .

PR validation:

The GPU workflows (e.g. ###.502) now also work without a GPU:

CUDA_VISIBLE_DEVICES= runTheMatrix.py -w upgrade -j 16 -l 10824.501,10824.502,10824.505,10824.506,10824.511,10824.512,10824.521,10824.522,11634.501,11634.502,11634.505,11634.506,11634.511,11634.512,11634.521,11634.522
...
10824.501_TTbar_13+2018_Patatrack_PixelOnlyCPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:11 2021-date Sat Apr 24 08:21:54 2021; exit: 0 0 0 0
10824.502_TTbar_13+2018_Patatrack_PixelOnlyGPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:08 2021-date Sat Apr 24 08:21:55 2021; exit: 0 0 0 0
10824.505_TTbar_13+2018_Patatrack_PixelOnlyTripletsCPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:12 2021-date Sat Apr 24 08:21:55 2021; exit: 0 0 0 0
10824.506_TTbar_13+2018_Patatrack_PixelOnlyTripletsGPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:08 2021-date Sat Apr 24 08:21:56 2021; exit: 0 0 0 0
10824.511_TTbar_13+2018_Patatrack_ECALOnlyCPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:03 2021-date Sat Apr 24 08:21:56 2021; exit: 0 0 0 0
10824.512_TTbar_13+2018_Patatrack_ECALOnlyGPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:03 2021-date Sat Apr 24 08:21:57 2021; exit: 0 0 0 0
10824.521_TTbar_13+2018_Patatrack_HCALOnlyCPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:03 2021-date Sat Apr 24 08:21:57 2021; exit: 0 0 0 0
10824.522_TTbar_13+2018_Patatrack_HCALOnlyGPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:03 2021-date Sat Apr 24 08:21:58 2021; exit: 0 0 0 0
11634.501_TTbar_14TeV+2021_Patatrack_PixelOnlyCPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:20 2021-date Sat Apr 24 08:21:58 2021; exit: 0 0 0 0
11634.502_TTbar_14TeV+2021_Patatrack_PixelOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:17 2021-date Sat Apr 24 08:21:59 2021; exit: 0 0 0 0
11634.505_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsCPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:20 2021-date Sat Apr 24 08:21:59 2021; exit: 0 0 0 0
11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:17 2021-date Sat Apr 24 08:22:00 2021; exit: 0 0 0 0
11634.511_TTbar_14TeV+2021_Patatrack_ECALOnlyCPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:13 2021-date Sat Apr 24 08:22:00 2021; exit: 0 0 0 0
11634.512_TTbar_14TeV+2021_Patatrack_ECALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:13 2021-date Sat Apr 24 08:22:01 2021; exit: 0 0 0 0
11634.521_TTbar_14TeV+2021_Patatrack_HCALOnlyCPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:12 2021-date Sat Apr 24 08:22:01 2021; exit: 0 0 0 0
11634.522_TTbar_14TeV+2021_Patatrack_HCALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:12 2021-date Sat Apr 24 08:22:02 2021; exit: 0 0 0 0
16 16 16 16 tests passed, 0 0 0 0 failed
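For reference, the command line used above can be rebuilt programmatically. The sketch below is a small helper, not part of runTheMatrix.py: the function names are made up for this example, and the suffix-to-name table is simply read off the log lines above. Setting CUDA_VISIBLE_DEVICES to an empty string hides all CUDA devices from the process, which is what forces the GPU workflows onto the CPU fallback.

```python
import os
from typing import Dict, List, Optional

# Base workflows and Patatrack suffixes, as they appear in the log above.
BASES = [10824, 11634]
SUFFIXES = {
    501: "PixelOnlyCPU", 502: "PixelOnlyGPU",
    505: "PixelOnlyTripletsCPU", 506: "PixelOnlyTripletsGPU",
    511: "ECALOnlyCPU", 512: "ECALOnlyGPU",
    521: "HCALOnlyCPU", 522: "HCALOnlyGPU",
}


def workflow_list() -> str:
    """Comma-separated list of workflows, in the format expected by the -l option."""
    return ",".join(f"{base}.{suffix}" for base in BASES for suffix in SUFFIXES)


def env_without_gpus(env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy of the environment in which no CUDA device is visible."""
    new_env = dict(os.environ if env is None else env)
    new_env["CUDA_VISIBLE_DEVICES"] = ""
    return new_env


def command(jobs: int = 16) -> List[str]:
    """Argument list equivalent to the command shown in the validation above."""
    return ["runTheMatrix.py", "-w", "upgrade", "-j", str(jobs), "-l", workflow_list()]


print(" ".join(command()))
```

Inside a CMSSW environment, the result could be passed to subprocess.run(command(), env=env_without_gpus(), check=True).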

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-33428/22092

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard (Andrea Bocci) for master.

It involves the following packages:

Configuration/PyReleaseValidation
RecoLocalCalo/EcalRecProducers
RecoLocalCalo/HcalRecProducers
RecoPixelVertexing/Configuration
RecoPixelVertexing/PixelTrackFitting
RecoPixelVertexing/PixelVertexFinding

@perrotta, @jordan-martins, @chayanit, @wajidalikhan, @kpedro88, @cmsbuild, @srimanob, @slava77, @jpata can you please review it and eventually sign? Thanks.
@fabiocos, @makortel, @felicepantaleo, @abdoulline, @GiacomoSguazzoni, @JanFSchulte, @rovere, @argiro, @Martin-Grunewald, @apsallid, @rchatter, @thomreis, @simonepigazzini, @ebrondol, @VinInn, @mtosi, @dgulhan, @slomeo, @mariadalfonso this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Apr 14, 2021

code checks

@fwyzard
Contributor Author

fwyzard commented Apr 14, 2021

please test

@fwyzard fwyzard marked this pull request as ready for review April 14, 2021 18:02
@fwyzard
Contributor Author

fwyzard commented Apr 14, 2021

@fwyzard marked this pull request as ready for review now

It's not actually completed, but I wanted to ask the bot to test it so far ...

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-33428/22108

@cmsbuild
Contributor

Pull request #33428 was updated. @perrotta, @jordan-martins, @chayanit, @wajidalikhan, @kpedro88, @srimanob, @slava77, @jpata can you please check and sign again.

@fwyzard
Contributor Author

fwyzard commented Apr 14, 2021

enable gpu

@fwyzard
Contributor Author

fwyzard commented Apr 14, 2021

@smuzaffar @silviodonato is there a way to ask the bot to run 11634.502 on the CPU?

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a96ae4/14239/summary.html
COMMIT: b955402
CMSSW: CMSSW_11_3_X_2021-04-14-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/33428/14239/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 1 differences found in the comparisons
  • DQMHistoTests: Total files compared: 38
  • DQMHistoTests: Total histograms compared: 2864426
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2864403
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 37 files compared)
  • Checked 160 log files, 37 edm output root files, 38 DQM output files
  • TriggerResults: no differences found
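As a quick sanity check of the DQMHistoTests numbers above, the per-category counts should add up to the total number of histograms compared. The short sketch below does that arithmetic; it assumes that failures, nulls, successes and skipped partition the total (missing objects are zero here, so they do not affect the sum), and the dictionary keys are names chosen for this example.

```python
# Counts copied from the comparison summary above.
summary = {
    "total": 2864426,
    "failures": 1,
    "nulls": 0,
    "successes": 2864403,
    "skipped": 22,
}


def accounted_for(counts: dict) -> int:
    """Sum of the per-category counts, excluding the reported total."""
    return sum(value for key, value in counts.items() if key != "total")


def failure_fraction(counts: dict) -> float:
    """Fraction of compared histograms that were flagged as failures."""
    return counts["failures"] / counts["total"]


# The categories add up to the reported total.
assert accounted_for(summary) == summary["total"]
print(f"failure fraction: {failure_fraction(summary):.2e}")
```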

@srimanob
Contributor

srimanob commented May 6, 2021

+Upgrade

@dpiparo
Contributor

dpiparo commented May 10, 2021

Maybe a question for @qliphy : do we know why this PR is not merged yet given the consensus about the changes it brings?

@qliphy
Contributor

qliphy commented May 10, 2021

@cms-sw/alca-l2 @cms-sw/pdmv-l2 Do you have any comment?

@qliphy
Contributor

qliphy commented May 10, 2021

> Maybe a question for @qliphy : do we know why this PR is not merged yet given the consensus about the changes it brings?

Several signatures are missing. This could be done quickly, otherwise we can discuss this at tomorrow's ORP.

@dpiparo
Contributor

dpiparo commented May 10, 2021

Thanks!

@jordan-martins
Contributor

+1

@yuanchao
Contributor

+1

@qliphy
Contributor

qliphy commented May 11, 2021

+operations

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Contributor

qliphy commented May 11, 2021

+1

@cmsbuild cmsbuild merged commit bb6b6ce into cms-sw:master May 11, 2021
@fwyzard fwyzard deleted the auto_gpu_workflows branch August 18, 2021 13:35