Code structure

Framework

The framework is a crude approximation of the CMS data processing software, CMSSW. CMSSW is a rather generic framework to process independent chunks of data. In CMS these chunks of data correspond to triggered proton-proton collisions, and are called events. The events are processed by modules, which in this project are all "producers" that can read objects from the event and insert new objects into the event (in CMSSW there are, in addition, analyzers, which can only read objects from the event, and filters, which can decide to stop the processing of the event).

The modules form a DAG based on their data dependencies. The modules are implemented in C++ (C++17 in general, while the CUDA code is C++14). A CMSSW job is configured in Python (this project does not provide a configuration mechanism).

The CMSSW framework is multi-threaded using Threading Building Blocks (TBB). An integral part of the multi-threading is the concept of a "concurrent event processor" that we call "a stream" (to disambiguate from CUDA streams, these streams are called "EDM streams" from now on). An EDM stream processes one event at a time ("processing" meaning that each module in the DAG is run on the event in some order respecting the data dependencies). A job may have multiple EDM streams, in which case the EDM streams process their events concurrently. Furthermore, modules processing the same event that are independent in the DAG are also run concurrently. All this potential concurrency is exposed as tasks to be run by the TBB task scheduler. We do not make any assumptions on how TBB runs these tasks in threads (e.g. the number of EDM streams and the number of threads may differ). For more information on the CMSSW framework see e.g.
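
To make the task-based concurrency concrete, here is a minimal, illustrative sketch (not the CMSSW scheduler; the module functions are hypothetical) of how two modules of the same event that are independent in the DAG can be run as TBB tasks:

```cpp
// Minimal sketch, not the CMSSW scheduler: two hypothetical, independent
// "modules" of the same event are exposed as TBB tasks; TBB decides on which
// threads (and in which order) they actually run.
#include <tbb/task_group.h>
#include <cstdio>

void produceBeamSpot() { std::printf("beam spot produced\n"); }  // hypothetical module work
void produceDigis()    { std::printf("digis produced\n"); }      // hypothetical module work

int main() {
  tbb::task_group modules;
  modules.run([] { produceBeamSpot(); });
  modules.run([] { produceDigis(); });
  modules.wait();  // a module depending on both would be scheduled only after this point
  return 0;
}
```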

The processing time and memory requirements can vary a lot across events. In addition, the filtering capability may affect which modules in the DAG can be run.

The main approximations of the framework in this project with respect to CMSSW are

  • producer modules only, and only stream and stream-ExternalWork producers
  • modules and the event products have no labels, and therefore the event can hold at most one product of each C++ type
  • no run time configuration mechanism
  • input data are fully read in memory at the beginning of the job
  • EventSetup system has a single (implicit) record, one IOV, all products are read from binary dumps at the beginning of the job.

Overall view on the use of CUDA

Our overall aims are to avoid blocking synchronization as much as possible and to keep all processing units (CPU cores, GPUs) as busy as we can doing useful work. It follows that we try to make all operations (memory transfers, kernel calls) asynchronous with the use of CUDA streams, and that we use callback functions (cudaStreamAddCallback()) to notify the CMSSW framework when the asynchronous work has finished.
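
The following sketch illustrates this pattern in isolation (the buffer, its size, and the notification flag are made up for the example; the real framework ties the callback into its task scheduling instead of polling a flag):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical "framework" notification: in the real framework this would
// re-enqueue the waiting module as a TBB task instead of setting a flag.
void CUDART_CB notifyFramework(cudaStream_t, cudaError_t status, void* userData) {
  if (status == cudaSuccess) {
    *static_cast<volatile bool*>(userData) = true;  // mark the async work as done
  }
}

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  float *d_buf, *h_buf;
  cudaMalloc(&d_buf, 1024 * sizeof(float));
  cudaHostAlloc(&h_buf, 1024 * sizeof(float), cudaHostAllocDefault);  // pinned host memory

  volatile bool done = false;
  // All operations are only queued on the stream and return immediately
  // (a memset stands in for real GPU work here) ...
  cudaMemsetAsync(d_buf, 0, 1024 * sizeof(float), stream);
  cudaMemcpyAsync(h_buf, d_buf, 1024 * sizeof(float), cudaMemcpyDeviceToHost, stream);
  // ... and the callback fires only after everything queued before it has completed.
  cudaStreamAddCallback(stream, notifyFramework, const_cast<bool*>(&done), 0);

  while (!done) {}  // the real framework does other useful work instead of spinning

  cudaFreeHost(h_buf);
  cudaFree(d_buf);
  cudaStreamDestroy(stream);
  return 0;
}
```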

We use a "caching allocator" (based on the one from CUB library) for both device and pinned host memory allocations. This approach allows us to amortize the cost of the cudaMalloc()/cudaFree() etc, while being able to easily re-use device/pinned host memory regions for temporary workspaces, to avoid "conservative" overallocation of memory, and to avoid constraining the scheduling of modules to multiple devices.

We use one CUDA stream for each EDM stream ("concurrent event") and each linear chain of GPU modules that pass data from one to the other in the device memory. In case of branches in the DAG of modules, additional CUDA streams are used since there is sub-event concurrency in the DAG that we want to expose to CUDA runtime.
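
A hedged sketch of the branching case (kernel and variable names are invented): the second branch runs on an additional CUDA stream that is ordered after the producer's work with a CUDA event, so the two branches can execute concurrently on the device.

```cuda
#include <cuda_runtime.h>

__global__ void producerKernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = i;                          // placeholder "producer" work
}
__global__ void branchAKernel(const float* data, int n) { /* placeholder consumer A */ }
__global__ void branchBKernel(const float* data, int n) { /* placeholder consumer B */ }

int main() {
  const int n = 1 << 16;
  float* d_data;
  cudaMalloc(&d_data, n * sizeof(float));

  cudaStream_t mainStream, branchStream;
  cudaStreamCreate(&mainStream);
  cudaStreamCreate(&branchStream);
  cudaEvent_t produced;
  cudaEventCreateWithFlags(&produced, cudaEventDisableTiming);

  producerKernel<<<(n + 255) / 256, 256, 0, mainStream>>>(d_data, n);
  cudaEventRecord(produced, mainStream);           // mark the end of the producer's work

  // Branch A continues on the producer's stream; branch B runs on an
  // additional stream that is ordered after the producer via the event.
  branchAKernel<<<(n + 255) / 256, 256, 0, mainStream>>>(d_data, n);
  cudaStreamWaitEvent(branchStream, produced, 0);
  branchBKernel<<<(n + 255) / 256, 256, 0, branchStream>>>(d_data, n);

  cudaDeviceSynchronize();
  cudaEventDestroy(produced);
  cudaStreamDestroy(branchStream);
  cudaStreamDestroy(mainStream);
  cudaFree(d_data);
  return 0;
}
```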

For more information see cms-sw/cmssw:HeterogeneousCore/CUDACore/README.md.

Application structure

The pixel tracking GPU prototype consists of five modules that are run according to their data dependencies, roughly in the following order: BeamSpotToCUDA, SiPixelRawToClusterCUDA, SiPixelRecHitCUDA, CAHitNtupletCUDA, and PixelVertexProducerCUDA.

The data dependencies of the modules form the following DAG (figure: Module DAG).

In addition, there are two modules, PixelTrackSoAFromCUDA and PixelVertexSoAFromCUDA, that transfer the tracks and vertices back to the CPU (enabled only with the --transfer parameter).

With the transfers the module DAG becomes the following (figure: Module DAG with transfers).

The application reads uncompressed raw data for just the pixel detector (about 250 kB/event). This configuration is somewhat artificial: almost nothing is transferred back to the CPU, at the moment there are no modules that would consume the data in the format produced by this workflow, and in offline data processing the input data is compressed.

For more information on the application see

BeamSpot transfer (BeamSpotToCUDA)

This module transfers information about the average beam collision region ("beam spot") from the host to the device for each event.

| Operation | Description |
| --- | --- |
| memcpy host-to-device 44 B | Transfer BeamSpotCUDA::Data for position and other information about the beam spot |

This module essentially only transfers data from the CPU to the GPU.
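
A minimal sketch of such a transfer (the struct below is a hypothetical stand-in for BeamSpotCUDA::Data, and the numbers are made up): the host copy lives in pinned memory so that the small host-to-device copy can be queued asynchronously on the module's CUDA stream.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for BeamSpotCUDA::Data: a small plain struct with the
// beam spot position and widths (the real struct has more members).
struct BeamSpotPOD {
  float x, y, z;
  float sigmaZ;
  float beamWidthX, beamWidthY;
};

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Pinned host memory so the copy can be truly asynchronous w.r.t. the CPU.
  BeamSpotPOD* h_bs;
  cudaHostAlloc(&h_bs, sizeof(BeamSpotPOD), cudaHostAllocDefault);
  *h_bs = BeamSpotPOD{0.1f, 0.05f, 0.02f, 3.5f, 0.001f, 0.001f};  // made-up numbers

  BeamSpotPOD* d_bs;
  cudaMalloc(&d_bs, sizeof(BeamSpotPOD));
  cudaMemcpyAsync(d_bs, h_bs, sizeof(BeamSpotPOD), cudaMemcpyHostToDevice, stream);

  cudaStreamSynchronize(stream);  // the real module relies on the callback mechanism instead
  cudaFree(d_bs);
  cudaFreeHost(h_bs);
  cudaStreamDestroy(stream);
  return 0;
}
```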

Raw-to-cluster (SiPixelRawToClusterCUDA)

This module unpacks and reformats the raw data into a format usable by the downstream modules, applies some calibration, and forms clusters of pixels on each pixel detector module.

The following memory transfers are done once per job on the first event from SiPixelRawToClusterCUDA.cc

| Operation | Description |
| --- | --- |
| memcpy host-to-device 1.44 MB | Transfer SiPixelFedCablingMapGPU for the pixel detector cabling map |
| memcpy host-to-device 57.6 kB | Transfer an array of module indices to be unpacked. This is to support regional unpacking in CMSSW, even though this functionality is not used in this project. |
| memcpy host-to-device 3.09 MB | Transfer gain calibration data |
| memcpy host-to-device 24.05 kB | Transfer SiPixelGainForHLTonGPU for gain calibration |
| memcpy host-to-device 8 B | Set the gain calibration data pointer in the SiPixelGainForHLTonGPU struct |

The following CUDA operations are issued for each event from SiPixelRawToClusterGPUKernel.cu

| Operation | Description |
| --- | --- |
| memcpy host-to-device 40 B | Transfer SiPixelDigisCUDA::DeviceConstView for the SoA of pixel digis (= unpacked raw data) |
| memset 3.6 MB | Zero the unpacking error array |
| memcpy host-to-device 16 B | Transfer `GPU::SimpleVector<PixelErrorCompact>` to provide an std::vector-like interface for the unpacking error array |
| memcpy host-to-device 32 B | Transfer SiPixelClustersCUDA::DeviceConstView for the SoA of pixel clusters |
| memcpy host-to-device X B | Transfer the raw data words |
| memcpy host-to-device X B | Transfer the IDs of the FEDs that provided data |
| pixelgpudetails::RawToDigi_kernel() | Kernel to unpack the raw data into a data structure usable by subsequent kernels |
| memcpy device-to-host 16 B | Transfer `GPU::SimpleVector<PixelErrorCompact>`, i.e. essentially the number of errors, to the host |
| gpuCalibPixel::calibDigis() | Calibrate the pixel digis (ADC counts) |
| gpuClustering::countModules() | Fill the starting index into the ADC (etc.) arrays for each active module |
| memcpy device-to-host 4 B | Transfer the number of active modules to the host |
| gpuClustering::findClus() | Cluster the digis on each pixel module |
| gpuClustering::clusterChargeCut() | Select the clusters whose aggregated electric charge is above a given threshold |
| pixelgpudetails::fillHitsModuleStart() | Find the "location" of the first hit of each module |
| memcpy device-to-host 4 B | Transfer the number of pixel clusters to the host |
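
To give a flavor of these kernels, here is a hedged sketch in the spirit of gpuClustering::countModules() (array names and the exact bookkeeping are simplified): the digis are stored ordered by module, so the first digi of each module is found by comparing each element with its predecessor, and the module starts are collected with an atomic counter.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Sketch only: digis are ordered by module id, so a module starts wherever the
// id differs from the previous element. moduleStart[0] counts active modules,
// moduleStart[1..] holds the index of the first digi of each module.
__global__ void countModulesSketch(const uint16_t* moduleId,   // module id of each digi
                                   uint32_t* moduleStart,
                                   int numDigis,
                                   int maxModules) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= numDigis) return;
  if (i == 0 || moduleId[i] != moduleId[i - 1]) {
    // This digi opens a new module: reserve a slot and record where it starts.
    uint32_t slot = atomicAdd(&moduleStart[0], 1u);
    if (slot < static_cast<uint32_t>(maxModules)) {
      moduleStart[slot + 1] = static_cast<uint32_t>(i);
    }
  }
}
```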

Kernel timings

Times below report minimum, maximum, mean, and standard deviation in microseconds/event. They are measured with one concurrent event (with two threads in practice) on an otherwise empty node, over the 1000 events.

| Kernel | V100 |
| --- | --- |
| pixelgpudetails::RawToDigi_kernel() | 4.8-8.9 (mean 6.1 stdev 0.5) us |
| gpuCalibPixel::calibDigis() | 3.1-4.6 (mean 3.7 stdev 0.2) us |
| gpuClustering::countModules() | 2.9-4.1 (mean 3.3 stdev 0.2) us |
| gpuClustering::findClus() | 43-490 (mean 82 stdev 34) us |
| gpuClustering::clusterChargeCut() | 12-15 (mean 12.4 stdev 0.6) us |
| pixelgpudetails::fillHitsModuleStart() | 5.4-6.8 (mean 5.8 stdev 0.3) us |

RecHit (SiPixelRecHitCUDA)

This module computes the 3D position estimate for each cluster.

The following memory transfers are done once per job on the first event from SiPixelRecHitCUDA

| Operation | Description |
| --- | --- |
| memcpy host-to-device 32 B | Transfer pixelCPEforGPU::ParamsOnGPU, the main struct of detector module parameters |
| memcpy host-to-device 16 B | Transfer pixelCPEforGPU::ParamsOnGPU parameters common to all modules (thickness, pitch) |
| memcpy host-to-device 3.56 kB | Transfer phase1PixelTopology::AverageGeometry, parameters of the module ladders in the barrel and z positions of the first forward disks |
| memcpy host-to-device 160 B | Transfer pixelCPEforGPU::LayerGeometry layer indices |
| memcpy host-to-device 208 kB | Transfer pixelCPEforGPU::DetParams detector module parameters |

The following CUDA operations are issued for each event from PixelRecHits.cu

| Operation | Description |
| --- | --- |
| memcpy host-to-device 152 B | Transfer TrackingRecHit2DSOAView for the SoA of pixel hits |
| gpuPixelRecHits::getHits() | Calculate the 3D position for each cluster |
| setHitsLayerStart() | Set the index of the first hit for each pixel detector layer |
| cudautils::countFromVector() | First kernel of four to fill a phi-binned histogram of the hits. This kernel counts the number of elements for each bin. |
| cub::DeviceScanInitKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| cub::DeviceScanKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| cudautils::fillFromVector() | Last kernel of four to fill a phi-binned histogram of the hits. This kernel fills each bin with the hits. |
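
The four-step histogram filling done by cudautils::countFromVector(), the two cub scan kernels, and cudautils::fillFromVector() follows a common count / prefix-sum / fill pattern. Below is a hedged, self-contained sketch of that pattern (the bin count, names and phi binning are simplified stand-ins; the real code uses cub::DeviceScan::InclusiveSum() on an offset array with one extra entry and the caching allocator for the temporary storage):

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

constexpr int kNumBins = 128;  // illustrative number of phi bins

__device__ int binOf(float phi) {
  // Map phi in [-pi, pi) to a bin index (the real code uses a fixed-point phi).
  int b = static_cast<int>((phi + 3.14159265f) / (2.f * 3.14159265f) * kNumBins);
  return min(max(b, 0), kNumBins - 1);
}

// Pass 1: count how many hits fall into each phi bin.
__global__ void countFromVectorSketch(const float* phi, int n, uint32_t* binCount) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) atomicAdd(&binCount[binOf(phi[i])], 1u);
}

// Pass 3: scatter each hit index into its bin, using the offsets from the scan.
__global__ void fillFromVectorSketch(const float* phi, int n, uint32_t* binFill,
                                     const uint32_t* binOffset, uint32_t* content) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    int b = binOf(phi[i]);
    uint32_t pos = binOffset[b] + atomicAdd(&binFill[b], 1u);
    content[pos] = static_cast<uint32_t>(i);
  }
}

// Host-side glue: count, turn the counts into bin offsets with cub, then fill.
void fillPhiHistogram(const float* d_phi, int n, uint32_t* d_binCount, uint32_t* d_binOffset,
                      uint32_t* d_binFill, uint32_t* d_content, cudaStream_t stream) {
  int blocks = (n + 255) / 256;
  cudaMemsetAsync(d_binCount, 0, kNumBins * sizeof(uint32_t), stream);
  cudaMemsetAsync(d_binFill, 0, kNumBins * sizeof(uint32_t), stream);
  countFromVectorSketch<<<blocks, 256, 0, stream>>>(d_phi, n, d_binCount);

  // Pass 2: prefix sum of the per-bin counts (the scan kernels seen in the table).
  void* d_temp = nullptr;
  std::size_t tempBytes = 0;
  cub::DeviceScan::ExclusiveSum(d_temp, tempBytes, d_binCount, d_binOffset, kNumBins, stream);
  cudaMalloc(&d_temp, tempBytes);
  cub::DeviceScan::ExclusiveSum(d_temp, tempBytes, d_binCount, d_binOffset, kNumBins, stream);

  fillFromVectorSketch<<<blocks, 256, 0, stream>>>(d_phi, n, d_binFill, d_binOffset, d_content);
  cudaStreamSynchronize(stream);  // only so that the temporary storage can be freed here
  cudaFree(d_temp);
}
```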

Kernel timings

Times below report minimum, maximum, mean, and standard deviation in microseconds/event. They are measured with one concurrent event (with two threads in practice) on an otherwise empty node, over the 1000 events.

| Kernel | V100 |
| --- | --- |
| gpuPixelRecHits::getHits() | 35-39 (mean 39 stdev 2) us |
| setHitsLayerStart() | 1.7-3.8 (mean 1.8 stdev 0.1) us |
| cudautils::countFromVector() | 2.4-4.2 (mean 3.1 stdev 0.3) us |
| cub::DeviceScanInitKernel() | 1.1-1.5 (mean 1.18 stdev 0.06) us |
| cub::DeviceScanKernel() | 3.1-4.1 (mean 3.5 stdev 0.2) us |
| cudautils::fillFromVector() | 2.8-4.5 (mean .3. stdev 0.3) us |

Pattern recognition (with Cellular Automaton) (CAHitNtupletCUDA)

This module performs the "pattern recognition":

  • create pairs of pixel hits on adjacent layers
  • connect the pairs to form ntuplets (triplets or quadruplets)
  • fit the ntuplets with a helix

The following CUDA operations are issued for each event from CAHitNtupletGeneratorKernels.cu and BrokenLineFitOnGPU.cu

| Operation | Description |
| --- | --- |
| memset 4 B | Zero the number of doublets (CA cells) |
| memset 36 B | Zero the counter of ntuplets |
| memset 197 kB | Zero the association structure from hits to ntuplets |
| gpuPixelDoublets::initDoublets() | Initialize the pair/doublet finding data structure |
| gpuPixelDoublets::getDoubletsFromHisto() | Create the hit pairs/doublets |
| memset 131 kB | Zero the ntuplet finding data structure |
| kernel_connect() | Connect compatible pairs/doublets |
| gpuPixelDoublets::fishbone() | Identify duplicate ntuplets (a single particle can induce multiple hits per layer because of redundancies in the detector) |
| kernel_find_ntuplets() | Find ntuplets from the doublet connection graph (the actual Cellular Automaton) |
| cudautils::finalizeBulk() | Finalize the data structures (Histogrammer/OneToManyAssoc) filled in the previous kernel |
| kernel_earlyDuplicateRemover() | Clean duplicate ntuplets |
| kernel_countMultiplicity() | Count the number of ntuplets with different numbers of hits |
| cub::DeviceScanInitKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| cub::DeviceScanKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| kernel_fillMultiplicity() | Fill a histogram of the ntuplets binned by the number of hits |
| kernel_fillHitDetIndices() | Fill, for each hit-tuple, a tuple with just the detector indices |
| kernelBLFastFit<3>() | First step of fitting triplets |
| kernelBLFit<3>() | Second step of fitting triplets |
| kernelBLFastFit<4>() | First step of fitting quadruplets |
| kernelBLFit<4>() | Second step of fitting quadruplets |
| kernelBLFastFit<4>() | First step of fitting pentuplets (only the first 4 hits) |
| kernelBLFit<4>() | Second step of fitting pentuplets (only the first 4 hits) |
| kernel_classifyTracks() | Classify the fitted tracks according to certain quality criteria |
| kernel_fastDuplicateRemover() | Identify duplicate tracks |
| kernel_countHitInTracks() | Count the number of times each hit is used in any ntuplet |
| cub::DeviceScanInitKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| cub::DeviceScanKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| kernel_fillHitInTracks() | Fill a histogram per hit with the ntuplets using that hit |
| kernel_tripletCleaner() | Clean triplets |
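
As an illustration of what the doublet building does conceptually, here is a hedged sketch loosely in the spirit of gpuPixelDoublets::getDoubletsFromHisto() (the data layout, the phi-window cut and all names are invented for the example): for each hit on the inner layer, the phi-binned histogram of the outer layer restricts the search to a few neighbouring phi bins, and compatible combinations are stored as doublets through an atomic counter.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Invented, simplified layout: outer-layer hits are grouped by phi bin, with
// binOffset[b]..binOffset[b+1] giving the slice of 'outerHit' indices in bin b
// (binOffset has kNumBins + 1 entries).
constexpr int kNumBins = 128;

__device__ int phiBin(float phi) {
  int b = static_cast<int>((phi + 3.14159265f) / (2.f * 3.14159265f) * kNumBins);
  return ((b % kNumBins) + kNumBins) % kNumBins;             // wrap around in phi
}

__global__ void getDoubletsSketch(const float* innerPhi, int nInner,
                                  const float* outerPhi,
                                  const uint32_t* binOffset,
                                  const uint32_t* outerHit,
                                  uint2* doublets, uint32_t* nDoublets,
                                  uint32_t maxDoublets, float maxDeltaPhi) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nInner) return;
  int centralBin = phiBin(innerPhi[i]);
  // Only the phi bin of the inner hit and its two neighbours need to be checked.
  for (int d = -1; d <= 1; ++d) {
    int b = ((centralBin + d) % kNumBins + kNumBins) % kNumBins;
    for (uint32_t k = binOffset[b]; k < binOffset[b + 1]; ++k) {
      uint32_t j = outerHit[k];
      float dphi = fabsf(innerPhi[i] - outerPhi[j]);
      if (dphi > 3.14159265f) dphi = 2.f * 3.14159265f - dphi;
      if (dphi < maxDeltaPhi) {                              // placeholder compatibility cut
        uint32_t slot = atomicAdd(nDoublets, 1u);
        if (slot < maxDoublets) doublets[slot] = make_uint2(i, j);
      }
    }
  }
}
```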

Kernel timings

Times below report minimum, maximum, mean, and standard deviation in microseconds/event. They are measured with one concurrent event (with two threads in practice) on an otherwise empty node, over the 1000 events.

| Kernel | V100 |
| --- | --- |
| gpuPixelDoublets::initDoublets() | 1.6-8.2 (mean 4 stdev 1) us |
| gpuPixelDoublets::getDoubletsFromHisto() | 28.2-195 (mean 85 stdev 27) us |
| kernel_connect() | 34.0-248 (mean 83 stdev 31) us |
| gpuPixelDoublets::fishbone() | 12.1-101 (mean 47 stdev 14) us |
| kernel_find_ntuplets() | 53.6-366 (mean 177 stdev 51) us |
| cudautils::finalizeBulk() | 1.8-2.4 (mean 2.0 stdev 0.1) us |
| kernel_earlyDuplicateRemover() | 6.1-27 (mean 15 stdev 3) us |
| kernel_countMultiplicity() | 2-10 (mean 4.5 stdev 1.4) us |
| cub::DeviceScanInitKernel() | 0.9-1.7 (mean 1.07 stdev 0.09) us |
| cub::DeviceScanKernel() | 1.6-3.9 (mean 1.7 stdev 0.1) us |
| kernel_fillMultiplicity() | 2.3-10 (mean 4.5 stdev 1.2) us |
| kernel_fillHitDetIndices() | 3.0-3.7 (mean 3.2 stdev 0.2) us |
| kernelBLFastFit<3>() | 6.1-12 (mean 7.1 stdev 0.5) us |
| kernelBLFit<3>() | 47-66 (mean 50 stdev 3) us |
| kernelBLFastFit<4>() | 6.4-10 (mean 7.7 stdev 0.7) us |
| kernelBLFit<4>() | 60-76 (mean 62 stdev 2) us |
| kernelBLFastFit<4>() | 5.0-8.7 (mean 5.8 stdev 0.4) us |
| kernelBLFit<4>() | 21-27 (mean 23 stdev 1) us |
| kernel_classifyTracks() | 3.2-6.3 (mean 3.9 stdev 0.4) us |
| kernel_fastDuplicateRemover() | 8.8-33 (mean 21 stdev 4) us |
| kernel_countHitInTracks() | 2.2-4.2 (mean 3.5 stdev 0.2) us |
| cub::DeviceScanInitKernel() | 1.1-1.7 (mean 1.2 stdev 0.1) us |
| cub::DeviceScanKernel() | 3.8-5.0 (mean 4.2 stdev 0.2) us |
| kernel_fillHitInTracks() | 3.2-4.6 (mean 2.6 stdev 0.2) us |
| kernel_tripletCleaner() | 3.7-11 (mean 6.4 stdev 1.1) us |

Vertexing (PixelVertexProducerCUDA)

This module reconstructs vertices, i.e. finds clusters of track "origin points" that likely correspond to proton-proton collision points.

The following CUDA operations are issued for each event from gpuVertexFinderImpl.h

| Operation | Description |
| --- | --- |
| gpuVertexFinder::init() | Zero the data structures |
| gpuVertexFinder::loadTracks() | Fill the vertexing data structures with information from the tracks |
| gpuVertexFinder::vertexFinderOneKernel() | Vertex reconstruction in a single kernel: cluster tracks into vertices, fit the vertex parameters (first pass), split vertices that are likely merged from multiple proton-proton interactions, fit the vertex parameters again (second pass, after splitting), and sort the vertices by the sum of the squared transverse momenta of their tracks |
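
As an illustration of the last step only, here is a hedged sketch (all names are invented; the real code performs the clustering, fitting, splitting and sorting inside one kernel): the sum of the squared transverse momenta is accumulated per vertex and the vertex indices are then ordered by that sum.

```cuda
#include <thrust/execution_policy.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>
#include <cstdint>

// Accumulate sum(pT^2) per vertex from the track-to-vertex association.
__global__ void sumPt2Sketch(const float* trackPt, const int32_t* trackVertex,
                             int nTracks, float* vertexSumPt2) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nTracks) return;
  int v = trackVertex[i];
  if (v >= 0) atomicAdd(&vertexSumPt2[v], trackPt[i] * trackPt[i]);
}

// d_vertexOrder is assumed to be pre-filled with 0..nVertices-1; after the
// sort it lists the vertex indices in increasing sum(pT^2).
void sortVerticesByPt2(const float* d_trackPt, const int32_t* d_trackVertex, int nTracks,
                       float* d_vertexSumPt2, int32_t* d_vertexOrder, int nVertices,
                       cudaStream_t stream) {
  cudaMemsetAsync(d_vertexSumPt2, 0, nVertices * sizeof(float), stream);
  sumPt2Sketch<<<(nTracks + 255) / 256, 256, 0, stream>>>(d_trackPt, d_trackVertex,
                                                          nTracks, d_vertexSumPt2);
  cudaStreamSynchronize(stream);  // the real code keeps everything on one CUDA stream instead
  thrust::sort_by_key(thrust::device, d_vertexSumPt2, d_vertexSumPt2 + nVertices, d_vertexOrder);
}
```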

Kernel timings

Times below report minimum, maximum, mean, and standard deviation in microseconds/event. They are measured with one concurrent event (with two threads in practice) on an otherwise empty node, over the 1000 events.

| Kernel | V100 |
| --- | --- |
| gpuVertexFinder::init() | 1.1-1.7 (mean 1.13 stdev 0.07) us |
| gpuVertexFinder::loadTracks() | 2.7-4.1 (mean 3.4 stdev 0.2) us |
| gpuVertexFinder::vertexFinderOneKernel() | 63.9-237 (mean 100 stdev 21) us |

Transfer tracks to host (PixelTrackSoAFromCUDA)

This module transfers the pixel tracks from the device to the host.

| Operation | Description |
| --- | --- |
| memcpy device-to-host 3.97 MB | Transfer pixeltrack::TrackSoA |

Transfer vertices to host (PixelVertexSoAFromCUDA)

This module transfers the pixel vertices from the device to the host.

| Operation | Description |
| --- | --- |
| memcpy device-to-host 117 kB | Transfer ZVertexSoA |