Iteratively optimizing a `reduce_sum` operation in CUDA until we reach >95% of the effective bandwidth of a `jnp.sum` reference. This code accompanies the blog post Embarrassingly Parallel Reduction in CUDA.
Effective bandwidth achieved on an RTX 3090 (`N = 1<<25` elements):
# | Kernel | Bandwidth (GB/s) | Relative to `jnp.sum` |
---|---|---|---|
1 | Vector Loads | 9.9 | 1.1% |
2 | Interleaved Addressing | 223 | 24.7% |
3 | Non-divergent Threads | 317 | 36.3% |
4 | Sequential Addressing | 331 | 38.0% |
5 | Reduce on First Loads | 618 | 70.9% |
6 | Warp Unrolling | 859 | 98.6% |
0 | `jnp.sum` reference | 871 | 100% |
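The kernels build on each other; to give a flavor of the final stage, below is a minimal sketch combining sequential addressing (4), reduce on first loads (5), and warp unrolling (6). It is an illustrative reconstruction, not the repo's actual code: the kernel name, `BLOCK_SIZE = 256`, and the `atomicAdd` finish are assumptions, and the last warp uses `__syncwarp()` for correctness on Volta and newer GPUs.

```cuda
// Illustrative sketch only; not the repo's actual kernel.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed; the repo may use a different size

// Once only 32 partial sums remain they live in a single warp, so the
// remaining tree steps can drop the block-wide __syncthreads(). The
// volatile qualifier forces shared-memory reads/writes; __syncwarp()
// keeps the warp converged on Volta and newer GPUs.
__device__ void warp_reduce(volatile float *sdata, int tid) {
    sdata[tid] += sdata[tid + 32]; __syncwarp();
    sdata[tid] += sdata[tid + 16]; __syncwarp();
    sdata[tid] += sdata[tid + 8];  __syncwarp();
    sdata[tid] += sdata[tid + 4];  __syncwarp();
    sdata[tid] += sdata[tid + 2];  __syncwarp();
    sdata[tid] += sdata[tid + 1];  __syncwarp();
}

__global__ void reduce_sum_kernel(const float *in, float *out, int n) {
    __shared__ float sdata[BLOCK_SIZE];
    int tid = threadIdx.x;
    // Kernel 5's trick: each thread adds two elements while loading,
    // halving the number of blocks needed.
    int i = blockIdx.x * (BLOCK_SIZE * 2) + tid;
    float v = (i < n) ? in[i] : 0.0f;
    if (i + BLOCK_SIZE < n) v += in[i + BLOCK_SIZE];
    sdata[tid] = v;
    __syncthreads();

    // Kernel 4's sequential addressing: contiguous threads stay active,
    // avoiding divergence and shared-memory bank conflicts.
    for (int s = BLOCK_SIZE / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // Kernel 6: unroll the last warp.
    if (tid < 32) warp_reduce(sdata, tid);
    if (tid == 0) atomicAdd(out, sdata[0]);  // one partial sum per block
}

int main() {
    int n = 1 << 25;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    // Fill the input with ones so the expected sum is exactly n.
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;
    cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    int grid = (n + BLOCK_SIZE * 2 - 1) / (BLOCK_SIZE * 2);
    reduce_sum_kernel<<<grid, BLOCK_SIZE>>>(d_in, d_out, n);

    float result;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", result, n);
    free(h);
    return 0;
}
```

The single `atomicAdd` per block keeps the sketch short; launching a second reduction pass over the per-block partials is a common alternative.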
```bash
# Compile
nvcc -arch=native -O3 --use_fast_math reduce_sum.cu -lcublas -lcublasLt -o ./reduce_sum
# Run
./reduce_sum <1...6>
```
Benchmarking setup borrowed from karpathy/llm.c.
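Effective bandwidth is just bytes read divided by kernel time. Below is a hedged sketch of the usual CUDA-event timing pattern (the actual harness follows llm.c and may differ), reusing `reduce_sum_kernel`, `grid`, `d_in`, `d_out`, and `n` from the sketch above:

```cuda
// Hypothetical timing fragment; the repo's llm.c-style harness may differ.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
reduce_sum_kernel<<<grid, BLOCK_SIZE>>>(d_in, d_out, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// The kernel reads n floats once, so effective bandwidth is
// n * 4 bytes / time. At 871 GB/s, 1<<25 floats (128 MiB) take ~0.15 ms.
double gbps = (double)n * sizeof(float) / (ms * 1e-3) / 1e9;
printf("%.1f GB/s\n", gbps);
```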
MIT