CUDA Parallel Reduction Optimization

Optimization Goal

Choose the right metric:
- GFLOP/s: for compute-bound kernels
- Bandwidth: for memory-bound kernels
  - Reductions have very low arithmetic intensity (1 flop per element loaded), so they are memory-bound and bandwidth is the metric to optimize

The best kernel in this project reaches about 92% (559.3 GB/s) of the theoretical memory bandwidth of the NVIDIA RTX 3070 Ti, which is 608.3 GB/s.
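
As a rough illustration of the metric, effective bandwidth is the number of bytes the kernel reads divided by its run time. The sketch below is host-side code only; the function names `effective_bandwidth_gbs` and `percent_of_peak` are hypothetical and not taken from this repository, and it assumes each input element is read from global memory exactly once.

```cuda
#include <cstddef>

// Hypothetical helper (not part of this repository).
// Effective bandwidth = bytes read from global memory / kernel time.
// 1 GB is taken as 1e9 bytes, the convention used for quoted peak bandwidth.
double effective_bandwidth_gbs(std::size_t num_elements,
                               std::size_t bytes_per_element,
                               double elapsed_ms) {
    double bytes = static_cast<double>(num_elements) *
                   static_cast<double>(bytes_per_element);
    double seconds = elapsed_ms * 1e-3;
    return bytes / seconds / 1e9;
}

// Hypothetical helper: measured bandwidth as a percentage of the peak value.
double percent_of_peak(double measured_gbs, double peak_gbs) {
    return 100.0 * measured_gbs / peak_gbs;
}
```

For example, `percent_of_peak(559.3, 608.3)` evaluates to roughly 91.9, which is where the "about 92%" figure comes from.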

| Kernel | Bandwidth (GB/s) | % of Theoretical Bandwidth |
| --- | --- | --- |
| Shared | `85.9` | 14.1% |
| Sequential | `110.6` | 18.2% |
| Grid Stride Loop | `536.7` | 88.2% |
| Unroll Last Warp | `540.1` | 88.8% |
| Unroll Loop | `539.7` | 88.7% |
| Vectorized | `558.5` | 91.8% |
| Prefetch | `552.5` | 90.8% |
| Completely Unroll | `559.3` | 91.9% |
| Warp Sync | `558.4` | 91.8% |
| Theoretical Bandwidth | `608.3` | 100.0% |
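
The largest jump in the results comes with the grid-stride loop. The following is a minimal sketch of that idea combined with a warp-level shuffle reduction. It is not the code in `src`: the names `warp_reduce_sum`, `reduce_sum_kernel`, and `reduce_sum`, the launch configuration, and the use of `atomicAdd` to combine block results are all assumptions made for illustration, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Sum the values held by the 32 lanes of a warp using register shuffles.
// Lane 0 ends up with the warp total.
__inline__ __device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Grid-stride reduction: each thread accumulates many elements in a register,
// then warps and blocks combine their partial sums.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__global__ void reduce_sum_kernel(const float* in, float* out, size_t n) {
    float sum = 0.0f;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        sum += in[i];

    __shared__ float warp_sums[32];  // one slot per warp in the block
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;

    sum = warp_reduce_sum(sum);
    if (lane == 0) warp_sums[warp] = sum;
    __syncthreads();

    // The first warp reduces the per-warp partial sums.
    if (warp == 0) {
        int num_warps = blockDim.x / 32;
        sum = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        sum = warp_reduce_sum(sum);
        if (lane == 0) atomicAdd(out, sum);  // one atomic per block
    }
}

// Hypothetical host wrapper: copies the input to the device, launches the
// kernel with a fixed grid, and returns the sum.
float reduce_sum(const std::vector<float>& h_in) {
    size_t n = h_in.size();
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    const int threads = 256;   // multiple of 32, as the kernel assumes
    const int blocks = 1024;   // fixed grid; the grid-stride loop covers any n
    reduce_sum_kernel<<<blocks, threads>>>(d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Because the grid size is fixed, each thread processes several elements sequentially before any synchronization happens, which keeps many loads in flight and reduces the share of time spent in the shared-memory and shuffle stages.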

To build and run the project, follow these steps:

1. Open a terminal and navigate to the `parallel_reduction_optimization` directory.
2. Create a build directory:
   ```
   mkdir build
   ```
3. Navigate to the build directory:
   ```
   cd build
   ```
4. Configure the project:
   ```
   cmake ..
   ```
5. Build the project:
   ```
   make -j4
   ```
6. Run the reduction:
   ```
   ./reduction
   ```
