This repository provides a step-by-step implementation of optimized General Matrix Multiplication (GEMM) using CUDA. It's based on the excellent article by Lei Mao: CUDA Matrix Multiplication Optimization.
The repository includes several versions of the GEMM implementation, each building on the previous one with additional optimizations (a minimal sketch of the naive kernel follows this list):
- Naive Implementation (v00)
- Coalesced Memory Access (v01)
- Shared Memory Usage, 2D Block Tiling (v02)
- 2D Block Tiling, 1D Thread Tiling (v03)
- 2D Block Tiling, 2D Thread Tiling (v04)
- 2D Block Tiling, 2D Thread Tiling, load transposed A to enable vectorized memory access (VMA) (v05)
- 2D Block Tiling, 2D Warp Tiling, 2D Thread Tiling, load transposed A to enable vectorized memory access (VMA) (v06)
- Use the WMMA API to leverage Tensor Cores (v07)
- Use CUTLASS CuTe to leverage Tensor Cores (v08, v09)
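As a reference point for the later optimizations, here is a minimal sketch of a naive GEMM kernel in the spirit of v00. It assumes row-major storage and the C = alpha * A * B + beta * C convention; the function and variable names are illustrative, not the repository's exact code.

```cuda
#include <cuda_runtime.h>

// Naive GEMM sketch: C = alpha * A * B + beta * C with row-major
// m x k A, k x n B, and m x n C. One thread computes one element of C,
// reading A and B directly from global memory.
__global__ void gemm_naive(size_t m, size_t n, size_t k, float alpha,
                           float const* A, float const* B, float beta, float* C)
{
    size_t const row{blockIdx.y * blockDim.y + threadIdx.y};
    size_t const col{blockIdx.x * blockDim.x + threadIdx.x};
    if (row < m && col < n)
    {
        float acc{0.0f};
        for (size_t i{0}; i < k; ++i)
        {
            acc += A[row * k + i] * B[i * n + col];
        }
        C[row * n + col] = alpha * acc + beta * C[row * n + col];
    }
}

// Launch sketch: a 2D grid of 2D blocks covering the m x n output.
// dim3 const block{16U, 16U};
// dim3 const grid{static_cast<unsigned>((n + block.x - 1U) / block.x),
//                 static_cast<unsigned>((m + block.y - 1U) / block.y)};
// gemm_naive<<<grid, block>>>(m, n, k, alpha, d_A, d_B, beta, d_C);
```

Each later version keeps this arithmetic but reorganizes how data moves through global memory, shared memory, registers, and (from v07 onward) Tensor Cores.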
A small but crucial fix has been applied to v06 and v07: Lei Mao's original code used `__syncwarp()`, which could lead to race conditions, so it has been replaced with `__syncthreads()` to ensure proper synchronization across all threads in a block.
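To illustrate why the barrier must be block-wide, here is a simplified shared-memory tiled kernel in the style of v02 (not the repository's v06/v07 code; the names and the row-major convention are assumptions for the sketch). The shared tiles are written and read cooperatively by every warp in the block, so both synchronization points need `__syncthreads()`; `__syncwarp()` would only order the threads of a single warp, and another warp could overwrite a tile that is still being read.

```cuda
#include <cuda_runtime.h>

// Block-tiled GEMM sketch: stages TILE x TILE tiles of A and B in shared memory.
// Launch with blockDim = (TILE, TILE), e.g. gemm_block_tiled<16><<<grid, block>>>(...).
template <size_t TILE>
__global__ void gemm_block_tiled(size_t m, size_t n, size_t k, float alpha,
                                 float const* A, float const* B, float beta, float* C)
{
    __shared__ float A_tile[TILE][TILE];
    __shared__ float B_tile[TILE][TILE];

    size_t const row{blockIdx.y * TILE + threadIdx.y};
    size_t const col{blockIdx.x * TILE + threadIdx.x};
    float acc{0.0f};

    for (size_t t{0}; t < (k + TILE - 1) / TILE; ++t)
    {
        // Every thread of the block loads one element of each tile (zero-padded at the edges).
        size_t const a_col{t * TILE + threadIdx.x};
        size_t const b_row{t * TILE + threadIdx.y};
        A_tile[threadIdx.y][threadIdx.x] = (row < m && a_col < k) ? A[row * k + a_col] : 0.0f;
        B_tile[threadIdx.y][threadIdx.x] = (b_row < k && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // block-wide: all warps must finish writing before any warp reads

        for (size_t i{0}; i < TILE; ++i)
        {
            acc += A_tile[threadIdx.y][i] * B_tile[i][threadIdx.x];
        }
        __syncthreads();  // block-wide: all warps must finish reading before the tiles are reused
    }

    if (row < m && col < n)
    {
        C[row * n + col] = alpha * acc + beta * C[row * n + col];
    }
}
```

The same reasoning carries over to the warp-tiled and WMMA kernels: the shared-memory tiles are shared across warps, so a warp-level barrier is not sufficient.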
Build the project:

```bash
git submodule init
git submodule update
cmake -B build
cmake --build build
```

Run the FP32 and FP16 GEMM profilers:

```bash
./build/src/profile_cuda_gemm_fp32
./build/src/profile_cuda_gemm_fp16
```
This project is based on the work of Lei Mao and Siboehm. I am grateful for their detailed explanations and implementations.