Code structure

Framework

The framework is a crude approximation of the CMS data processing software, CMSSW. CMSSW is a rather generic framework to process independent chunks of data. In CMS these chunks of data correspond to triggered proton-proton collisions, and are called events. The events are processed by modules, which in this project are all "producers" that can read objects from the event and insert new objects into the event (in CMSSW there are, in addition, analyzers, which can only read objects from the event, and filters, which can decide to stop the processing of the event).

The modules form a DAG based on their data dependencies. The modules are implemented in C++ (C++17 in general, while the CUDA code is C++14). A CMSSW job is configured in Python (this project does not provide a configuration mechanism).

The CMSSW framework is multi-threaded using Threading Building Blocks (TBB). An integral part of the multi-threading is the concept of a "concurrent event processor" that we call "a stream" (to disambiguate from CUDA streams, these streams are called "EDM streams" from now on). An EDM stream processes one event at a time ("processing" meaning that each module in the DAG is run on the event in some order respecting the data dependencies). A job may have multiple EDM streams, in which case the EDM streams process their events concurrently. Furthermore, modules processing the same event that are independent in the DAG are also run concurrently. All this potential concurrency is exposed as tasks to be run by the TBB task scheduler. We do not make any assumptions on how TBB runs these tasks in threads (e.g. the number of EDM streams and the number of threads may differ). For more information on the CMSSW framework see e.g.
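
To make the task-based concurrency concrete, here is a minimal, illustrative sketch (not the CMSSW scheduler; the module functions are hypothetical) of how two modules of the same event that are independent in the DAG can be run as TBB tasks:

```cpp
// Minimal sketch, not the CMSSW scheduler: two hypothetical, independent
// "modules" of the same event are exposed as TBB tasks; TBB decides on which
// threads (and in which order) they actually run.
#include <tbb/task_group.h>
#include <cstdio>

void produceBeamSpot() { std::printf("beam spot produced\n"); }  // hypothetical module work
void produceDigis()    { std::printf("digis produced\n"); }      // hypothetical module work

int main() {
  tbb::task_group modules;
  modules.run([] { produceBeamSpot(); });
  modules.run([] { produceDigis(); });
  modules.wait();  // a module depending on both would be scheduled only after this point
  return 0;
}
```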

The processing time and memory requirements can vary a lot across events. In addition, the filtering capability may affect which modules in the DAG can be run.

The main approximations of the framework in this project with respect to CMSSW are

  • producer modules only, and only stream and stream-ExternalWork producers
  • modules and the event products have no labels, and therefore the event can hold at most one product of each C++ type
  • no run time configuration mechanism
  • input data are fully read in memory at the beginning of the job
  • EventSetup system has a single (implicit) record, one IOV, all products are read from binary dumps at the beginning of the job.

Overall view on the use of CUDA

Our overall aims are to avoid blocking synchronization as much as possible and to keep all processing units (CPU cores, GPUs) as busy as we can doing useful work. It follows that we try to make all operations (memory transfers, kernel calls) asynchronous with the use of CUDA streams, and that we use callback functions (cudaStreamAddCallback()) to notify the CMSSW framework when the asynchronous work has finished.
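
The following sketch illustrates this pattern in isolation (the buffer, its size, and the notification flag are made up for the example; the real framework ties the callback into its task scheduling instead of polling a flag):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical "framework" notification: in the real framework this would
// re-enqueue the waiting module as a TBB task instead of setting a flag.
void CUDART_CB notifyFramework(cudaStream_t, cudaError_t status, void* userData) {
  if (status == cudaSuccess) {
    *static_cast<volatile bool*>(userData) = true;  // mark the async work as done
  }
}

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  float *d_buf, *h_buf;
  cudaMalloc(&d_buf, 1024 * sizeof(float));
  cudaHostAlloc(&h_buf, 1024 * sizeof(float), cudaHostAllocDefault);  // pinned host memory

  volatile bool done = false;
  // All operations are only queued on the stream and return immediately
  // (a memset stands in for real GPU work here) ...
  cudaMemsetAsync(d_buf, 0, 1024 * sizeof(float), stream);
  cudaMemcpyAsync(h_buf, d_buf, 1024 * sizeof(float), cudaMemcpyDeviceToHost, stream);
  // ... and the callback fires only after everything queued before it has completed.
  cudaStreamAddCallback(stream, notifyFramework, const_cast<bool*>(&done), 0);

  while (!done) {}  // the real framework does other useful work instead of spinning

  cudaFreeHost(h_buf);
  cudaFree(d_buf);
  cudaStreamDestroy(stream);
  return 0;
}
```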

We use a "caching allocator" (based on the one from CUB library) for both device and pinned host memory allocations. This approach allows us to amortize the cost of the cudaMalloc()/cudaFree() etc, while being able to easily re-use device/pinned host memory regions for temporary workspaces, to avoid "conservative" overallocation of memory, and to avoid constraining the scheduling of modules to multiple devices.

We use one CUDA stream for each EDM stream ("concurrent event") and each linear chain of GPU modules that pass data from one to the other in the device memory. In case of branches in the DAG of modules, additional CUDA streams are used since there is sub-event concurrency in the DAG that we want to expose to CUDA runtime.
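
A hedged sketch of the branching case (kernel and variable names are invented): the second branch runs on an additional CUDA stream that is ordered after the producer's work with a CUDA event, so the two branches can execute concurrently on the device.

```cuda
#include <cuda_runtime.h>

__global__ void producerKernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = i;                          // placeholder "producer" work
}
__global__ void branchAKernel(const float* data, int n) { /* placeholder consumer A */ }
__global__ void branchBKernel(const float* data, int n) { /* placeholder consumer B */ }

int main() {
  const int n = 1 << 16;
  float* d_data;
  cudaMalloc(&d_data, n * sizeof(float));

  cudaStream_t mainStream, branchStream;
  cudaStreamCreate(&mainStream);
  cudaStreamCreate(&branchStream);
  cudaEvent_t produced;
  cudaEventCreateWithFlags(&produced, cudaEventDisableTiming);

  producerKernel<<<(n + 255) / 256, 256, 0, mainStream>>>(d_data, n);
  cudaEventRecord(produced, mainStream);           // mark the end of the producer's work

  // Branch A continues on the producer's stream; branch B runs on an
  // additional stream that is ordered after the producer via the event.
  branchAKernel<<<(n + 255) / 256, 256, 0, mainStream>>>(d_data, n);
  cudaStreamWaitEvent(branchStream, produced, 0);
  branchBKernel<<<(n + 255) / 256, 256, 0, branchStream>>>(d_data, n);

  cudaDeviceSynchronize();
  cudaEventDestroy(produced);
  cudaStreamDestroy(branchStream);
  cudaStreamDestroy(mainStream);
  cudaFree(d_data);
  return 0;
}
```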

For more information see cms-sw/cmssw:HeterogeneousCore/CUDACore/README.md.

Application structure

The pixel tracking GPU prototype consists of five modules that are run according to their data dependencies, roughly in the following order: BeamSpotToCUDA, SiPixelRawToClusterCUDA, SiPixelRecHitCUDA, CAHitNtupletCUDA, and PixelVertexProducerCUDA.

The data dependencies of the modules form the following DAG (figure: Module DAG).

In addition, there are two modules, PixelTrackSoAFromCUDA and PixelVertexSoAFromCUDA, that transfer the tracks and vertices back to the CPU (enabled only with the --transfer parameter).

With the transfers the module DAG becomes the following (figure: Module DAG with transfers).

The application reads uncompressed raw data for just the pixel detector (about 250 kB/event). This configuration is somewhat artificial: almost nothing is transferred back to the CPU, at the moment there are no modules that would consume the data in the format produced by this workflow, and in offline data processing the input data is compressed.

For more information on the application see

BeamSpot transfer (BeamSpotToCUDA)

This module transfers information about the average beam collision region ("beam spot") from the host to the device for each event.

| Operation | Description |
| --- | --- |
| memcpy host-to-device 44 B | Transfer BeamSpotCUDA::Data for position and other information about the beam spot |

This module essentially only transfers data from the CPU to the GPU.
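
A minimal sketch of such a transfer (the struct below is a hypothetical stand-in for BeamSpotCUDA::Data, and the numbers are made up): the host copy lives in pinned memory so that the small host-to-device copy can be queued asynchronously on the module's CUDA stream.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for BeamSpotCUDA::Data: a small plain struct with the
// beam spot position and widths (the real struct has more members).
struct BeamSpotPOD {
  float x, y, z;
  float sigmaZ;
  float beamWidthX, beamWidthY;
};

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Pinned host memory so the copy can be truly asynchronous w.r.t. the CPU.
  BeamSpotPOD* h_bs;
  cudaHostAlloc(&h_bs, sizeof(BeamSpotPOD), cudaHostAllocDefault);
  *h_bs = BeamSpotPOD{0.1f, 0.05f, 0.02f, 3.5f, 0.001f, 0.001f};  // made-up numbers

  BeamSpotPOD* d_bs;
  cudaMalloc(&d_bs, sizeof(BeamSpotPOD));
  cudaMemcpyAsync(d_bs, h_bs, sizeof(BeamSpotPOD), cudaMemcpyHostToDevice, stream);

  cudaStreamSynchronize(stream);  // the real module relies on the callback mechanism instead
  cudaFree(d_bs);
  cudaFreeHost(h_bs);
  cudaStreamDestroy(stream);
  return 0;
}
```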

Raw-to-cluster (SiPixelRawToClusterCUDA)

This module unpacks and reformats the raw data into a format usable by the downstream modules, applies some calibration, and forms clusters of pixels on each pixel detector module.

The following memory transfers are done once per job on the first event from SiPixelRawToClusterCUDA.cc

| Operation | Description |
| --- | --- |
| memcpy host-to-device 1.44 MB | Transfer SiPixelFedCablingMapGPU for the pixel detector cabling map |
| memcpy host-to-device 57.6 kB | Transfer an array of module indices to be unpacked. This is to support regional unpacking in CMSSW, even though this functionality is not used in this project. |
| memcpy host-to-device 3.09 MB | Transfer gain calibration data |
| memcpy host-to-device 24.05 kB | Transfer SiPixelGainForHLTonGPU for gain calibration |
| memcpy host-to-device 8 B | Set the gain calibration data pointer in the SiPixelGainForHLTonGPU struct |

The following CUDA operations are issued for each event from SiPixelRawToClusterGPUKernel.cu

| Operation | Description |
| --- | --- |
| memcpy host-to-device 40 B | Transfer SiPixelDigisCUDA::DeviceConstView for the SoA of pixel digis (= unpacked raw data) |
| memset 3.6 MB | Zero the unpacking error array |
| memcpy host-to-device 16 B | Transfer `GPU::SimpleVector<PixelErrorCompact>` to provide an std::vector-like interface for the unpacking error array |
| memcpy host-to-device 32 B | Transfer SiPixelClustersCUDA::DeviceConstView for the SoA of pixel clusters |
| memcpy host-to-device X B | Transfer the raw data words |
| memcpy host-to-device X B | Transfer the IDs of the FEDs that provided data |
| pixelgpudetails::RawToDigi_kernel() | Kernel to unpack the raw data into a data structure usable by subsequent kernels |
| memcpy device-to-host 16 B | Transfer `GPU::SimpleVector<PixelErrorCompact>`, i.e. essentially the number of errors, to the host |
| gpuCalibPixel::calibDigis() | Calibrate the pixel digis (ADC counts) |
| gpuClustering::countModules() | Fill the starting index into the ADC (etc.) arrays for each active module |
| memcpy device-to-host 4 B | Transfer the number of active modules to the host |
| gpuClustering::findClus() | Cluster the digis on each pixel module |
| gpuClustering::clusterChargeCut() | Select the clusters whose aggregated electric charge is above a given threshold |
| pixelgpudetails::fillHitsModuleStart() | Find the "location" of the first hit of each module |
| memcpy device-to-host 4 B | Transfer the number of pixel clusters to the host |
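
To give a flavor of these kernels, here is a hedged sketch in the spirit of gpuClustering::countModules() (array names and the exact bookkeeping are simplified): the digis are stored ordered by module, so the first digi of each module is found by comparing each element with its predecessor, and the module starts are collected with an atomic counter.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Sketch only: digis are ordered by module id, so a module starts wherever the
// id differs from the previous element. moduleStart[0] counts active modules,
// moduleStart[1..] holds the index of the first digi of each module.
__global__ void countModulesSketch(const uint16_t* moduleId,   // module id of each digi
                                   uint32_t* moduleStart,
                                   int numDigis,
                                   int maxModules) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= numDigis) return;
  if (i == 0 || moduleId[i] != moduleId[i - 1]) {
    // This digi opens a new module: reserve a slot and record where it starts.
    uint32_t slot = atomicAdd(&moduleStart[0], 1u);
    if (slot < static_cast<uint32_t>(maxModules)) {
      moduleStart[slot + 1] = static_cast<uint32_t>(i);
    }
  }
}
```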

Kernel timings

Times below report minimum, maximum, mean, and standard deviation in microseconds/event. They are measured with one concurrent event (with two threads in practice) on an otherwise empty node, over the 1000 events.

| Kernel | V100 |
| --- | --- |
| pixelgpudetails::RawToDigi_kernel() | 4.8-8.9 (mean 6.1 stdev 0.5) us |
| gpuCalibPixel::calibDigis() | 3.1-4.6 (mean 3.7 stdev 0.2) us |
| gpuClustering::countModules() | 2.9-4.1 (mean 3.3 stdev 0.2) us |
| gpuClustering::findClus() | 43-490 (mean 82 stdev 34) us |
| gpuClustering::clusterChargeCut() | 12-15 (mean 12.4 stdev 0.6) us |
| pixelgpudetails::fillHitsModuleStart() | 5.4-6.8 (mean 5.8 stdev 0.3) us |

RecHit (SiPixelRecHitCUDA)

This module computes the 3D position estimate for each cluster.

The following memory transfers are done once per job on the first event from SiPixelRecHitCUDA

| Operation | Description |
| --- | --- |
| memcpy host-to-device 32 B | Transfer pixelCPEforGPU::ParamsOnGPU, the main struct of detector module parameters |
| memcpy host-to-device 16 B | Transfer pixelCPEforGPU::ParamsOnGPU parameters common to all modules (thickness, pitch) |
| memcpy host-to-device 3.56 kB | Transfer phase1PixelTopology::AverageGeometry, parameters of the module ladders in the barrel and z positions of the first forward disks |
| memcpy host-to-device 160 B | Transfer pixelCPEforGPU::LayerGeometry layer indices |
| memcpy host-to-device 208 kB | Transfer pixelCPEforGPU::DetParams detector module parameters |

The following CUDA operations are issued for each event from PixelRecHits.cu

| Operation | Description |
| --- | --- |
| memcpy host-to-device 152 B | Transfer TrackingRecHit2DSOAView for the SoA of pixel hits |
| gpuPixelRecHits::getHits() | Calculate the 3D position for each cluster |
| setHitsLayerStart() | Set the index of the first hit for each pixel detector layer |
| cudautils::countFromVector() | First kernel of four to fill a phi-binned histogram of the hits. This kernel counts the number of elements for each bin. |
| cub::DeviceScanInitKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| cub::DeviceScanKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| cudautils::fillFromVector() | Last kernel of four to fill a phi-binned histogram of the hits. This kernel fills each bin with the hits. |
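
The four-step histogram filling done by cudautils::countFromVector(), the two cub scan kernels, and cudautils::fillFromVector() follows a common count / prefix-sum / fill pattern. Below is a hedged, self-contained sketch of that pattern (the bin count, names and phi binning are simplified stand-ins; the real code uses cub::DeviceScan::InclusiveSum() on an offset array with one extra entry and the caching allocator for the temporary storage):

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

constexpr int kNumBins = 128;  // illustrative number of phi bins

__device__ int binOf(float phi) {
  // Map phi in [-pi, pi) to a bin index (the real code uses a fixed-point phi).
  int b = static_cast<int>((phi + 3.14159265f) / (2.f * 3.14159265f) * kNumBins);
  return min(max(b, 0), kNumBins - 1);
}

// Pass 1: count how many hits fall into each phi bin.
__global__ void countFromVectorSketch(const float* phi, int n, uint32_t* binCount) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) atomicAdd(&binCount[binOf(phi[i])], 1u);
}

// Pass 3: scatter each hit index into its bin, using the offsets from the scan.
__global__ void fillFromVectorSketch(const float* phi, int n, uint32_t* binFill,
                                     const uint32_t* binOffset, uint32_t* content) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    int b = binOf(phi[i]);
    uint32_t pos = binOffset[b] + atomicAdd(&binFill[b], 1u);
    content[pos] = static_cast<uint32_t>(i);
  }
}

// Host-side glue: count, turn the counts into bin offsets with cub, then fill.
void fillPhiHistogram(const float* d_phi, int n, uint32_t* d_binCount, uint32_t* d_binOffset,
                      uint32_t* d_binFill, uint32_t* d_content, cudaStream_t stream) {
  int blocks = (n + 255) / 256;
  cudaMemsetAsync(d_binCount, 0, kNumBins * sizeof(uint32_t), stream);
  cudaMemsetAsync(d_binFill, 0, kNumBins * sizeof(uint32_t), stream);
  countFromVectorSketch<<<blocks, 256, 0, stream>>>(d_phi, n, d_binCount);

  // Pass 2: prefix sum of the per-bin counts (the scan kernels seen in the table).
  void* d_temp = nullptr;
  std::size_t tempBytes = 0;
  cub::DeviceScan::ExclusiveSum(d_temp, tempBytes, d_binCount, d_binOffset, kNumBins, stream);
  cudaMalloc(&d_temp, tempBytes);
  cub::DeviceScan::ExclusiveSum(d_temp, tempBytes, d_binCount, d_binOffset, kNumBins, stream);

  fillFromVectorSketch<<<blocks, 256, 0, stream>>>(d_phi, n, d_binFill, d_binOffset, d_content);
  cudaStreamSynchronize(stream);  // only so that the temporary storage can be freed here
  cudaFree(d_temp);
}
```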

Kernel timings

Times below report minimum, maximum, mean, and standard deviation in microseconds/event. They are measured with one concurrent event (with two threads in practice) on an otherwise empty node, over the 1000 events.

| Kernel | V100 |
| --- | --- |
| gpuPixelRecHits::getHits() | 35-39 (mean 39 stdev 2) us |
| setHitsLayerStart() | 1.7-3.8 (mean 1.8 stdev 0.1) us |
| cudautils::countFromVector() | 2.4-4.2 (mean 3.1 stdev 0.3) us |
| cub::DeviceScanInitKernel() | 1.1-1.5 (mean 1.18 stdev 0.06) us |
| cub::DeviceScanKernel() | 3.1-4.1 (mean 3.5 stdev 0.2) us |
| cudautils::fillFromVector() | 2.8-4.5 (mean .3. stdev 0.3) us |

Pattern recognition (with Cellular Automaton) (CAHitNtupletCUDA)

This module performs the "pattern recognition":

  • create pairs of pixel hits on adjacent layers
  • connect the pairs to form ntuplets (triplets or quadruplets)
  • fit the ntuplets with a helix

The following CUDA operations are issued for each event from CAHitNtupletGeneratorKernels.cu and BrokenLineFitOnGPU.cu

| Operation | Description |
| --- | --- |
| memset 4 B | Zero the number of doublets (CA cells) |
| memset 36 B | Zero the counter of ntuplets |
| memset 197 kB | Zero the association structure from hits to ntuplets |
| gpuPixelDoublets::initDoublets() | Initialize the pair/doublet finding data structure |
| gpuPixelDoublets::getDoubletsFromHisto() | Create the hit pairs/doublets |
| memset 131 kB | Zero the ntuplet finding data structure |
| kernel_connect() | Connect compatible pairs/doublets |
| gpuPixelDoublets::fishbone() | Identify duplicate ntuplets (a single particle can induce multiple hits per layer because of redundancies in the detector) |
| kernel_find_ntuplets() | Find ntuplets from the doublet connection graph (the actual Cellular Automaton) |
| cudautils::finalizeBulk() | Finalize the data structures (Histogrammer/OneToManyAssoc) filled in the previous kernel |
| kernel_earlyDuplicateRemover() | Clean duplicate ntuplets |
| kernel_countMultiplicity() | Count the number of ntuplets with different numbers of hits |
| cub::DeviceScanInitKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| cub::DeviceScanKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| kernel_fillMultiplicity() | Fill a histogram of the ntuplets binned by the number of hits |
| kernel_fillHitDetIndices() | Fill, for each hit-tuple, a tuple with just the detector indices |
| kernelBLFastFit<3>() | First step of fitting triplets |
| kernelBLFit<3>() | Second step of fitting triplets |
| kernelBLFastFit<4>() | First step of fitting quadruplets |
| kernelBLFit<4>() | Second step of fitting quadruplets |
| kernelBLFastFit<4>() | First step of fitting pentuplets (only the first 4 hits) |
| kernelBLFit<4>() | Second step of fitting pentuplets (only the first 4 hits) |
| kernel_classifyTracks() | Classify the fitted tracks according to certain quality criteria |
| kernel_fastDuplicateRemover() | Identify duplicate tracks |
| kernel_countHitInTracks() | Count the number of times each hit is used in any ntuplet |
| cub::DeviceScanInitKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| cub::DeviceScanKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
| kernel_fillHitInTracks() | Fill a histogram per hit with the ntuplets using that hit |
| kernel_tripletCleaner() | Clean triplets |
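
As an illustration of what the doublet building does conceptually, here is a hedged sketch loosely in the spirit of gpuPixelDoublets::getDoubletsFromHisto() (the data layout, the phi-window cut and all names are invented for the example): for each hit on the inner layer, the phi-binned histogram of the outer layer restricts the search to a few neighbouring phi bins, and compatible combinations are stored as doublets through an atomic counter.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Invented, simplified layout: outer-layer hits are grouped by phi bin, with
// binOffset[b]..binOffset[b+1] giving the slice of 'outerHit' indices in bin b
// (binOffset has kNumBins + 1 entries).
constexpr int kNumBins = 128;

__device__ int phiBin(float phi) {
  int b = static_cast<int>((phi + 3.14159265f) / (2.f * 3.14159265f) * kNumBins);
  return ((b % kNumBins) + kNumBins) % kNumBins;             // wrap around in phi
}

__global__ void getDoubletsSketch(const float* innerPhi, int nInner,
                                  const float* outerPhi,
                                  const uint32_t* binOffset,
                                  const uint32_t* outerHit,
                                  uint2* doublets, uint32_t* nDoublets,
                                  uint32_t maxDoublets, float maxDeltaPhi) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nInner) return;
  int centralBin = phiBin(innerPhi[i]);
  // Only the phi bin of the inner hit and its two neighbours need to be checked.
  for (int d = -1; d <= 1; ++d) {
    int b = ((centralBin + d) % kNumBins + kNumBins) % kNumBins;
    for (uint32_t k = binOffset[b]; k < binOffset[b + 1]; ++k) {
      uint32_t j = outerHit[k];
      float dphi = fabsf(innerPhi[i] - outerPhi[j]);
      if (dphi > 3.14159265f) dphi = 2.f * 3.14159265f - dphi;
      if (dphi < maxDeltaPhi) {                              // placeholder compatibility cut
        uint32_t slot = atomicAdd(nDoublets, 1u);
        if (slot < maxDoublets) doublets[slot] = make_uint2(i, j);
      }
    }
  }
}
```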

Kernel timings

Times below report minimum, maximum, mean, and standard deviation in microseconds/event. They are measured with one concurrent event (with two threads in practice) on an otherwise empty node, over the 1000 events.

| Kernel | V100 |
| --- | --- |
| gpuPixelDoublets::initDoublets() | 1.6-8.2 (mean 4 stdev 1) us |
| gpuPixelDoublets::getDoubletsFromHisto() | 28.2-195 (mean 85 stdev 27) us |
| kernel_connect() | 34.0-248 (mean 83 stdev 31) us |
| gpuPixelDoublets::fishbone() | 12.1-101 (mean 47 stdev 14) us |
| kernel_find_ntuplets() | 53.6-366 (mean 177 stdev 51) us |
| cudautils::finalizeBulk() | 1.8-2.4 (mean 2.0 stdev 0.1) us |
| kernel_earlyDuplicateRemover() | 6.1-27 (mean 15 stdev 3) us |
| kernel_countMultiplicity() | 2-10 (mean 4.5 stdev 1.4) us |
| cub::DeviceScanInitKernel() | 0.9-1.7 (mean 1.07 stdev 0.09) us |
| cub::DeviceScanKernel() | 1.6-3.9 (mean 1.7 stdev 0.1) us |
| kernel_fillMultiplicity() | 2.3-10 (mean 4.5 stdev 1.2) us |
| kernel_fillHitDetIndices() | 3.0-3.7 (mean 3.2 stdev 0.2) us |
| kernelBLFastFit<3>() | 6.1-12 (mean 7.1 stdev 0.5) us |
| kernelBLFit<3>() | 47-66 (mean 50 stdev 3) us |
| kernelBLFastFit<4>() | 6.4-10 (mean 7.7 stdev 0.7) us |
| kernelBLFit<4>() | 60-76 (mean 62 stdev 2) us |
| kernelBLFastFit<4>() | 5.0-8.7 (mean 5.8 stdev 0.4) us |
| kernelBLFit<4>() | 21-27 (mean 23 stdev 1) us |
| kernel_classifyTracks() | 3.2-6.3 (mean 3.9 stdev 0.4) us |
| kernel_fastDuplicateRemover() | 8.8-33 (mean 21 stdev 4) us |
| kernel_countHitInTracks() | 2.2-4.2 (mean 3.5 stdev 0.2) us |
| cub::DeviceScanInitKernel() | 1.1-1.7 (mean 1.2 stdev 0.1) us |
| cub::DeviceScanKernel() | 3.8-5.0 (mean 4.2 stdev 0.2) us |
| kernel_fillHitInTracks() | 3.2-4.6 (mean 2.6 stdev 0.2) us |
| kernel_tripletCleaner() | 3.7-11 (mean 6.4 stdev 1.1) us |

Vertexing (PixelVertexProducerCUDA)

This module reconstructs vertices, i.e. finds clusters of track "origin points" that likely correspond to proton-proton collision points.

The following CUDA operations are issued for each event from gpuVertexFinderImpl.h

| Operation | Description |
| --- | --- |
| gpuVertexFinder::init() | Zero the data structures |
| gpuVertexFinder::loadTracks() | Fill the vertexing data structures with information from the tracks |
| gpuVertexFinder::vertexFinderOneKernel() | Vertex reconstruction in a single kernel: cluster tracks into vertices, fit the vertex parameters (first pass), split vertices that are likely merged from multiple proton-proton interactions, fit the vertex parameters again (second pass, after splitting), and sort the vertices by the sum of the squared transverse momenta of their tracks |
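
As an illustration of the last step only, here is a hedged sketch (all names are invented; the real code performs the clustering, fitting, splitting and sorting inside one kernel): the sum of the squared transverse momenta is accumulated per vertex and the vertex indices are then ordered by that sum.

```cuda
#include <thrust/execution_policy.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>
#include <cstdint>

// Accumulate sum(pT^2) per vertex from the track-to-vertex association.
__global__ void sumPt2Sketch(const float* trackPt, const int32_t* trackVertex,
                             int nTracks, float* vertexSumPt2) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nTracks) return;
  int v = trackVertex[i];
  if (v >= 0) atomicAdd(&vertexSumPt2[v], trackPt[i] * trackPt[i]);
}

// d_vertexOrder is assumed to be pre-filled with 0..nVertices-1; after the
// sort it lists the vertex indices in increasing sum(pT^2).
void sortVerticesByPt2(const float* d_trackPt, const int32_t* d_trackVertex, int nTracks,
                       float* d_vertexSumPt2, int32_t* d_vertexOrder, int nVertices,
                       cudaStream_t stream) {
  cudaMemsetAsync(d_vertexSumPt2, 0, nVertices * sizeof(float), stream);
  sumPt2Sketch<<<(nTracks + 255) / 256, 256, 0, stream>>>(d_trackPt, d_trackVertex,
                                                          nTracks, d_vertexSumPt2);
  cudaStreamSynchronize(stream);  // the real code keeps everything on one CUDA stream instead
  thrust::sort_by_key(thrust::device, d_vertexSumPt2, d_vertexSumPt2 + nVertices, d_vertexOrder);
}
```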

Kernel timings

Times below report minimum, maximum, mean, and standard deviation in microseconds/event. They are measured with one concurrent event (with two threads in practice) on an otherwise empty node, over the 1000 events.

| Kernel | V100 |
| --- | --- |
| gpuVertexFinder::init() | 1.1-1.7 (mean 1.13 stdev 0.07) us |
| gpuVertexFinder::loadTracks() | 2.7-4.1 (mean 3.4 stdev 0.2) us |
| gpuVertexFinder::vertexFinderOneKernel() | 63.9-237 (mean 100 stdev 21) us |

Transfer tracks to host (PixelTrackSoAFromCUDA)

This module transfers the pixel tracks from the device to the host.

| Operation | Description |
| --- | --- |
| memcpy device-to-host 3.97 MB | Transfer pixeltrack::TrackSoA |

Transfer vertices to host (PixelVertexSoAFromCUDA)

This module transfers the pixel vertices from the device to the host.

| Operation | Description |
| --- | --- |
| memcpy device-to-host 117 kB | Transfer ZVertexSoA |