GPU MODE Lecture 8: CUDA Performance Checklist – Christian Mills
Lecture #8 provides a comprehensive guide to CUDA performance optimization techniques, covering key concepts like memory coalescing, occupancy, control divergence, tiling, privatization, thread coarsening, and algorithm rewriting with better math, illustrated with practical examples and profiling using NCU to improve kernel performance.
I am actually a bit skeptical about the benefits of thread coarsening for kernels as simple as vector addition, or more generally for kernels that lack enough redundant work to trade parallelism for increased memory-access and compute efficiency. I ran the vector addition example on an A100, and although I do get a 2x improvement with thread coarsening:
VecAdd execution time: 0.006144 ms
VecAddCoarsened execution time: 0.003072 ms
the speedup vanishes as the workload N increases.
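For reference, my understanding of the two kernels being compared is sketched below (a coarsening factor of 2 is assumed; the lecture's actual code may differ):

```cuda
// Baseline: one thread per element.
__global__ void VecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Coarsened: each thread handles two consecutive elements,
// so only half as many threads are launched.
__global__ void VecAddCoarsened(const float* a, const float* b, float* c, int n) {
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i < n)     c[i]     = a[i]     + b[i];
    if (i + 1 < n) c[i + 1] = a[i + 1] + b[i + 1];
}
```

My guess is that at small N the launch is too small to fill the GPU, so the timing mostly reflects launch and scheduling overhead, which coarsening happens to reduce; once N is large enough that the kernel is memory-bandwidth bound, both versions move the same number of bytes and the difference disappears.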
Hi @mredenti,
The GPU Mode Discord channel would be a better place to discuss your findings from going through the lectures. These are just my personal notes and not part of the official lecture series.
https://christianjmills.com/posts/cuda-mode-notes/lecture-008/