- Choose the right metric:
  - GFLOP/s for compute-bound kernels
  - Bandwidth (GB/s) for memory-bound kernels
- Reductions have very low arithmetic intensity: about 1 flop per element loaded. They are therefore memory-bound, and the target is peak memory bandwidth.
- This project reaches about 92% of the theoretical memory bandwidth of the NVIDIA RTX 3070 Ti, which is 608 GB/s.

| Kernel | Bandwidth (GB/s) | Percent of Theoretical Bandwidth |
|---|---|---|
| Shared | 85.9 | 14.1% |
| Sequential | 110.6 | 18.2% |
| Grid Stride Loop | 536.7 | 88.2% |
| Unroll Last Warp | 540.1 | 88.8% |
| Unroll Loop | 539.7 | 88.7% |
| Vectorized | 558.5 | 91.8% |
| Prefetch | 552.5 | 90.8% |
| Completely Unroll | 559.3 | 91.9% |
| Warp Sync | 558.4 | 91.8% |
| Theoretical Bandwidth | 608.3 | 100.0% |

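
The bandwidth figures above can be related to a kernel's run time with a little arithmetic. The following is a minimal sketch, not the project's actual benchmarking code: the helper names `effective_bandwidth_gbs` and `percent_of_peak` are hypothetical, and it assumes that the bytes read from the input dominate memory traffic (the reduction's output is negligible by comparison) and that 1 GB = 10^9 bytes.

```cpp
#include <cstddef>

// Hypothetical helper: effective bandwidth in GB/s (1 GB = 1e9 bytes) for a
// kernel that reads n elements of elem_bytes bytes each in elapsed_ms
// milliseconds. Only input reads are counted; the few bytes written by a
// reduction are assumed to be negligible.
double effective_bandwidth_gbs(std::size_t n, std::size_t elem_bytes,
                               double elapsed_ms) {
    const double bytes =
        static_cast<double>(n) * static_cast<double>(elem_bytes);
    const double seconds = elapsed_ms * 1e-3;
    return bytes / seconds / 1e9;
}

// Hypothetical helper: measured bandwidth as a percentage of the peak.
double percent_of_peak(double measured_gbs, double peak_gbs) {
    return 100.0 * measured_gbs / peak_gbs;
}
```

For example, `percent_of_peak(559.3, 608.3)` gives roughly 91.9, which matches the "Completely Unroll" row. As an illustration with made-up numbers, reducing 2^26 `float` values in 0.5 ms would correspond to `effective_bandwidth_gbs(1u << 26, sizeof(float), 0.5)`, or about 537 GB/s.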
To build the project, follow these steps:
- Open a terminal and navigate to the `parallel_reduction_optimization` directory.
- Create a `build` directory: `mkdir build`
- Navigate to the `build` directory: `cd build`
- Configure the project: `cmake ..`
- Build the project: `make -j4`
- Run the reduction: `./reduction`