GEMM-101: CUDA Matrix Multiplication Optimization

Overview

This repository provides a step-by-step implementation of optimized General Matrix Multiplication (GEMM) using CUDA. It's based on the excellent article by Lei Mao: CUDA Matrix Multiplication Optimization.

Implementations

The repository includes several versions of the GEMM implementation, each building upon the previous one with additional optimizations (a minimal sketch of the naive starting point follows the list):

  1. Naive Implementation (v00)
  2. Coalesced Memory Access (v01)
  3. Shared Memory Usage, 2D Block Tiling (v02)
  4. 2D Block Tiling, 1D Thread Tiling (v03)
  5. 2D Block Tiling, 2D Thread Tiling (v04)
  6. 2D Block Tiling, 2D Thread Tiling, load transposed A to enable vectorized memory access (VMA) (v05)
  7. 2D Block Tiling, 2D Warp Tiling, 2D Thread Tiling, load transposed A to enable VMA (v06)
  8. Use the WMMA API to leverage Tensor Cores (v07)
  9. Use CUTLASS CuTe to leverage Tensor Cores (v08, v09)
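
For orientation, here is a minimal sketch of the naive starting point (v00). The kernel name, parameter names, and the row-major layout below are assumptions for illustration; this is not the repository's actual code.

```cuda
// Minimal sketch of a naive GEMM kernel, C = alpha * A * B + beta * C,
// in the spirit of v00. A is M x K, B is K x N, C is M x N, all row-major.
__global__ void gemm_v00_naive(size_t M, size_t N, size_t K, float alpha,
                               float const* A, float const* B, float beta,
                               float* C) {
    // Mapping threadIdx.x to the row means consecutive threads in a warp
    // touch elements that are K (for A) or N (for C) apart, so global memory
    // accesses are not coalesced. v01 simply swaps the mapping so that
    // consecutive threads handle consecutive columns.
    size_t const row = blockIdx.x * blockDim.x + threadIdx.x;
    size_t const col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (size_t k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}
```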

Important Note

A small but crucial fix has been applied to versions v06 and v07. Lei Mao's original code used __syncwarp(), which could lead to race conditions; it has been replaced with __syncthreads() to ensure proper synchronization across all threads in a block.
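
The pattern involved is roughly the following (a simplified sketch with an illustrative tile size, not the repository's exact code): the whole thread block cooperatively stages a tile into shared memory, and threads then read data written by other warps, so a block-wide barrier is required.

```cuda
// Simplified shared-memory tiling sketch (v02 onward); assumes the kernel is
// launched with blockDim = (TILE, TILE). Row-major A (M x K), B (K x N), C (M x N).
constexpr size_t TILE = 32;

__global__ void gemm_tiled_sketch(size_t M, size_t N, size_t K, float alpha,
                                  float const* A, float const* B, float beta,
                                  float* C) {
    __shared__ float A_tile[TILE][TILE];
    __shared__ float B_tile[TILE][TILE];

    size_t const row = blockIdx.y * TILE + threadIdx.y;
    size_t const col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (size_t k0 = 0; k0 < K; k0 += TILE) {
        // Every thread in the block writes one element of each shared tile.
        A_tile[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
        B_tile[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;

        // __syncwarp() only orders threads within one warp; threads in other
        // warps may still be writing the tiles, so the reads below could see
        // stale data. A block-wide __syncthreads() is needed here.
        __syncthreads();

        for (size_t k = 0; k < TILE; ++k) {
            acc += A_tile[threadIdx.y][k] * B_tile[k][threadIdx.x];
        }

        // Barrier again before the next iteration overwrites the tiles.
        __syncthreads();
    }

    if (row < M && col < N) {
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}
```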

Build and Benchmark

```bash
git submodule init
git submodule update
cmake -B build
cmake --build build
./build/src/profile_cuda_gemm_fp32
./build/src/profile_cuda_gemm_fp16
```

Credits

This project is based on the work of Lei Mao and Siboehm. I am grateful for their detailed explanations and implementations.