MasterSkepticista/parallel_reductions_cuda

Parallel Reductions in CUDA

Iteratively optimizing a reduce_sum operation in CUDA until it reaches more than 95% of the effective bandwidth of the jnp.sum reference. This code accompanies the blog post Embarrassingly Parallel Reduction in CUDA.
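For orientation, here is a minimal sketch of what a block-level reduce_sum kernel can look like. It combines two of the steps listed under Results (sequential addressing and reducing on the first load) and is not the code from reduce_sum.cu. The names `reduce_sum_sketch`, `reduce_sum_device`, and `kBlockSize` are hypothetical, and float32 inputs are an assumption. Per-block partial sums are combined with `atomicAdd` purely to keep the example short.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // hypothetical choice; must be a power of two

// Each block sums up to 2 * kBlockSize inputs, then adds its partial sum to *out.
__global__ void reduce_sum_sketch(const float* in, float* out, int n) {
  __shared__ float smem[kBlockSize];
  const int tid = threadIdx.x;
  const int i = blockIdx.x * (2 * kBlockSize) + tid;

  // Reduce on first load: add two elements while reading from global memory.
  float v = 0.0f;
  if (i < n) v += in[i];
  if (i + kBlockSize < n) v += in[i + kBlockSize];
  smem[tid] = v;
  __syncthreads();

  // Sequential addressing: active threads stay contiguous, stride halves each step.
  for (int s = kBlockSize / 2; s > 0; s >>= 1) {
    if (tid < s) smem[tid] += smem[tid + s];
    __syncthreads();
  }

  // Thread 0 holds the block's partial sum.
  if (tid == 0) atomicAdd(out, smem[0]);
}

// Host wrapper (hypothetical): copies the input to the GPU and returns the sum.
// Error checking is omitted for brevity.
float reduce_sum_device(const std::vector<float>& h_in) {
  const int n = static_cast<int>(h_in.size());
  if (n == 0) return 0.0f;

  float* d_in = nullptr;
  float* d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_out, sizeof(float));
  cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemset(d_out, 0, sizeof(float));

  const int blocks = (n + 2 * kBlockSize - 1) / (2 * kBlockSize);
  reduce_sum_sketch<<<blocks, kBlockSize>>>(d_in, d_out, n);

  float h_out = 0.0f;
  cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);  // waits for the kernel
  cudaFree(d_in);
  cudaFree(d_out);
  return h_out;
}
```

Calling `reduce_sum_device(std::vector<float>(1 << 25, 1.0f))` exercises the same problem size as the benchmark. A sketch like this is meant to show the structure of the reduction, not to reproduce the bandwidth figures reported here.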

Results

Effective bandwidth achieved on an RTX-3090 (N=1<<25 elements):

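| # | Kernel | Bandwidth (GB/s) | Relative to jnp.sum |
|---|--------|------------------|---------------------|
| 1 | Vector Loads | 9.9 | 1.1% |
| 2 | Interleaved Addressing | 223 | 24.7% |
| 3 | Non-divergent Threads | 317 | 36.3% |
| 4 | Sequential Addressing | 331 | 38.0% |
| 5 | Reduce on First Loads | 618 | 70.9% |
| 6 | Warp Unrolling | 859 | 98.6% |
| 0 | jnp.sum reference | 871 | 100% |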
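As a rough guide to reading these numbers, effective bandwidth for a reduction is commonly computed as input bytes read divided by kernel time. The worked example below assumes float32 inputs (4 bytes each), counts only the input reads, and takes 1 GB as 10^9 bytes. The benchmark code may account for bytes differently.

```latex
\[
\mathrm{BW}_{\mathrm{eff}} = \frac{N \cdot 4\ \text{bytes}}{t},
\qquad N = 2^{25} \;\Rightarrow\; N \cdot 4 = 134{,}217{,}728\ \text{bytes} \approx 0.134\ \text{GB}
\]
\[
t \approx \frac{0.134\ \text{GB}}{859\ \text{GB/s}} \approx 156\ \mu\text{s}
\quad \text{(kernel 6)},
\qquad
t \approx \frac{0.134\ \text{GB}}{9.9\ \text{GB/s}} \approx 13.6\ \text{ms}
\quad \text{(kernel 1)}
\]
```

Under these assumptions, the gap between the first and last kernel corresponds to roughly an 87x difference in runtime for the same input.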

Run benchmarks

# Compile
nvcc -arch=native -O3 --use_fast_math reduce_sum.cu -lcublas -lcublasLt -o ./reduce_sum 

# Run one kernel, selected by its number (1 to 6, matching the Results table)
./reduce_sum <1...6>

Acknowledgements

Benchmarking setup borrowed from karpathy/llm.c.

License

MIT
