This repository provides a step-by-step implementation of optimized General Matrix Multiplication (GEMM) using CUDA. It's based on the excellent article by Lei Mao: CUDA Matrix Multiplication Optimization.
The repository includes several versions of the GEMM implementation, each building on the previous one with additional optimizations (a minimal sketch of the naive kernel follows this list):
- Naive Implementation (v00)
- Coalesced Memory Access (v01)
- Shared Memory Usage, 2D Block Tiling (v02)
- 2D Block Tiling, 1D Thread Tiling (v03)
- 2D Block Tiling, 2D Thread Tiling (v04)
- 2D Block Tiling, 2D Thread Tiling, load transposed A to enable vectorized memory access (VMA) (v05)
- 2D Block Tiling, 2D Warp Tiling, 2D Thread Tiling, load transposed A to enable vectorized memory access (VMA) (v06)
- Use the WMMA API to leverage Tensor Cores (v07)
- Use CUTLASS CuTe to leverage Tensor Cores (v08, v09)
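As a reference point for the later optimizations, here is a minimal sketch of a naive GEMM kernel in the spirit of v00. It assumes row-major storage and the C = alpha * A * B + beta * C convention; the function and variable names are illustrative, not the repository's exact code.

```cuda
#include <cuda_runtime.h>

// Naive GEMM sketch: C = alpha * A * B + beta * C with row-major
// m x k A, k x n B, and m x n C. One thread computes one element of C,
// reading A and B directly from global memory.
__global__ void gemm_naive(size_t m, size_t n, size_t k, float alpha,
                           float const* A, float const* B, float beta, float* C)
{
    size_t const row{blockIdx.y * blockDim.y + threadIdx.y};
    size_t const col{blockIdx.x * blockDim.x + threadIdx.x};
    if (row < m && col < n)
    {
        float acc{0.0f};
        for (size_t i{0}; i < k; ++i)
        {
            acc += A[row * k + i] * B[i * n + col];
        }
        C[row * n + col] = alpha * acc + beta * C[row * n + col];
    }
}

// Launch sketch: a 2D grid of 2D blocks covering the m x n output.
// dim3 const block{16U, 16U};
// dim3 const grid{static_cast<unsigned>((n + block.x - 1U) / block.x),
//                 static_cast<unsigned>((m + block.y - 1U) / block.y)};
// gemm_naive<<<grid, block>>>(m, n, k, alpha, d_A, d_B, beta, d_C);
```

Each later version keeps this arithmetic but reorganizes how data moves through global memory, shared memory, registers, and (from v07 onward) Tensor Cores.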
A small but crucial fix has been applied to v06 and v07: Lei Mao's original code used `__syncwarp()`, which could lead to race conditions, so it has been replaced with `__syncthreads()` to ensure proper synchronization across all threads in a block.
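To illustrate why the barrier must be block-wide, here is a simplified shared-memory tiled kernel in the style of v02 (not the repository's v06/v07 code; the names and the row-major convention are assumptions for the sketch). The shared tiles are written and read cooperatively by every warp in the block, so both synchronization points need `__syncthreads()`; `__syncwarp()` would only order the threads of a single warp, and another warp could overwrite a tile that is still being read.

```cuda
#include <cuda_runtime.h>

// Block-tiled GEMM sketch: stages TILE x TILE tiles of A and B in shared memory.
// Launch with blockDim = (TILE, TILE), e.g. gemm_block_tiled<16><<<grid, block>>>(...).
template <size_t TILE>
__global__ void gemm_block_tiled(size_t m, size_t n, size_t k, float alpha,
                                 float const* A, float const* B, float beta, float* C)
{
    __shared__ float A_tile[TILE][TILE];
    __shared__ float B_tile[TILE][TILE];

    size_t const row{blockIdx.y * TILE + threadIdx.y};
    size_t const col{blockIdx.x * TILE + threadIdx.x};
    float acc{0.0f};

    for (size_t t{0}; t < (k + TILE - 1) / TILE; ++t)
    {
        // Every thread of the block loads one element of each tile (zero-padded at the edges).
        size_t const a_col{t * TILE + threadIdx.x};
        size_t const b_row{t * TILE + threadIdx.y};
        A_tile[threadIdx.y][threadIdx.x] = (row < m && a_col < k) ? A[row * k + a_col] : 0.0f;
        B_tile[threadIdx.y][threadIdx.x] = (b_row < k && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // block-wide: all warps must finish writing before any warp reads

        for (size_t i{0}; i < TILE; ++i)
        {
            acc += A_tile[threadIdx.y][i] * B_tile[i][threadIdx.x];
        }
        __syncthreads();  // block-wide: all warps must finish reading before the tiles are reused
    }

    if (row < m && col < n)
    {
        C[row * n + col] = alpha * acc + beta * C[row * n + col];
    }
}
```

The same reasoning carries over to the warp-tiled and WMMA kernels: the shared-memory tiles are shared across warps, so a warp-level barrier is not sufficient.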
Build the project:

```bash
git submodule init
git submodule update
cmake -B build
cmake --build build
```

Run the FP32 and FP16 GEMM profilers:

```bash
./build/src/profile_cuda_gemm_fp32
./build/src/profile_cuda_gemm_fp16
```
This project is based on the work of Lei Mao and Siboehm. I am grateful for their detailed explanations and implementations.