# GEMM-101: CUDA Matrix Multiplication Optimization

## Overview

This repository provides a step-by-step implementation of optimized General Matrix Multiplication (GEMM) in CUDA, based on Lei Mao's excellent article, CUDA Matrix Multiplication Optimization.

## Implementations

The repository includes several versions of the GEMM kernel, each building on the previous one with additional optimizations:

1. Naive Implementation (v00; sketched first after this list)
2. Coalesced Memory Access (v01)
3. Shared Memory Usage, 2D Block Tiling (v02; sketched second after this list)
4. 2D Block Tiling, 1D Thread Tiling (v03)
5. 2D Block Tiling, 2D Thread Tiling (v04)
6. 2D Block Tiling, 2D Thread Tiling, load transposed A to enable vectorized memory access (v05)
7. 2D Block Tiling, 2D Warp Tiling, 2D Thread Tiling, load transposed A to enable vectorized memory access (v06)
8. Use the WMMA API to leverage Tensor Cores (v07; sketched third after this list)
9. Use CUTLASS CuTe to leverage Tensor Cores (v08, v09)
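
For orientation, here is a minimal sketch of what the naive kernel (v00) computes: one thread per element of C, reading A and B straight from global memory. The kernel name, argument list, and indexing are illustrative assumptions, not the repository's actual code.

```cuda
// Hypothetical naive GEMM (v00-style): C = alpha * A * B + beta * C,
// all matrices row-major, one thread per output element.
__global__ void gemm_naive(size_t m, size_t n, size_t k, float alpha,
                           float const* A, float const* B, float beta,
                           float* C)
{
    size_t const row{blockIdx.y * blockDim.y + threadIdx.y};
    size_t const col{blockIdx.x * blockDim.x + threadIdx.x};
    if (row < m && col < n)
    {
        float acc{0.0f};
        // Dot product of row `row` of A with column `col` of B.
        for (size_t i{0}; i < k; ++i)
        {
            acc += A[row * k + i] * B[i * n + col];
        }
        C[row * n + col] = alpha * acc + beta * C[row * n + col];
    }
}
```

Each thread issues on the order of 2k global-memory reads for a single output element; the later versions exist to cut exactly this traffic.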
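
The shared-memory version (v02) is sketched second: the whole thread block cooperatively stages square tiles of A and B in shared memory, so each global element is read once per tile instead of once per output element. `TILE`, the kernel name, and the assumption that m, n, and k are multiples of `TILE` are simplifications for the sketch.

```cuda
// Hypothetical 2D block-tiled GEMM (v02-style), launched with a
// TILE x TILE thread block per TILE x TILE output tile.
template <size_t TILE>
__global__ void gemm_block_tiled(size_t m, size_t n, size_t k, float alpha,
                                 float const* A, float const* B, float beta,
                                 float* C)
{
    __shared__ float A_tile[TILE][TILE];
    __shared__ float B_tile[TILE][TILE];

    size_t const row{blockIdx.y * TILE + threadIdx.y};
    size_t const col{blockIdx.x * TILE + threadIdx.x};

    float acc{0.0f};
    for (size_t t{0}; t < k / TILE; ++t)
    {
        // Each thread stages one element of each tile.
        A_tile[threadIdx.y][threadIdx.x] = A[row * k + t * TILE + threadIdx.x];
        B_tile[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // Tiles fully written before anyone reads them.

        for (size_t i{0}; i < TILE; ++i)
        {
            acc += A_tile[threadIdx.y][i] * B_tile[i][threadIdx.x];
        }
        __syncthreads();  // Everyone done reading before the next overwrite.
    }
    C[row * n + col] = alpha * acc + beta * C[row * n + col];
}
```

The thread and warp tiling of v03-v06 push the same idea one level down: each thread (or warp) then computes a small sub-tile from registers instead of a single element.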
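
Third, a sketch of the Tensor Core path (v07-style) through the WMMA API: each warp accumulates one 16x16 tile of C from fp16 inputs into fp32. The single-warp-per-block launch, the row-major-A / column-major-B layout, and the multiple-of-16 sizes are assumptions chosen to keep the sketch short.

```cuda
#include <mma.h>

using namespace nvcuda;

// Hypothetical WMMA GEMM (v07-style): one 32-thread block (a single warp)
// per 16x16 tile of C = A * B; A row-major, B column-major, fp32 accumulate.
__global__ void gemm_wmma(size_t m, size_t n, size_t k,
                          half const* A, half const* B, float* C)
{
    size_t const tile_row{blockIdx.y};  // which 16-row band of C
    size_t const tile_col{blockIdx.x};  // which 16-column band of C

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // March along K, one 16x16x16 Tensor Core op per step.
    for (size_t t{0}; t < k; t += 16)
    {
        wmma::load_matrix_sync(a_frag, A + tile_row * 16 * k + t, k);
        wmma::load_matrix_sync(b_frag, B + tile_col * 16 * k + t, k);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + tile_row * 16 * n + tile_col * 16, c_frag,
                            n, wmma::mem_row_major);
}
```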

## Important Note

A small but crucial fix has been applied to v06 and v07. Lei Mao's original code used `__syncwarp()`, which could lead to race conditions because the shared-memory tiles are written and read by threads from different warps; it has been replaced with `__syncthreads()` to ensure proper synchronization across all threads in a block.
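
The pattern behind the fix looks roughly like the following (a hypothetical kernel, not the repository's code): a shared-memory buffer is written by all threads of the block and then read back by threads that may belong to a different warp, so a block-wide barrier is required between the write and the read.

```cuda
// __syncwarp() only orders the threads of ONE warp; when a shared buffer
// crosses warp boundaries, __syncthreads() is the correct barrier.
template <size_t BLOCK_SIZE>
__global__ void shared_tile_example(float const* in, float* out)
{
    __shared__ float tile[BLOCK_SIZE];

    tile[threadIdx.x] = in[blockIdx.x * BLOCK_SIZE + threadIdx.x];

    // __syncwarp();  // Race: another warp may not have written yet.
    __syncthreads();  // Fix: all warps in the block observe the writes.

    // Read an element that a different warp may have written.
    out[blockIdx.x * BLOCK_SIZE + threadIdx.x] =
        tile[BLOCK_SIZE - 1 - threadIdx.x];
}
```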

## Build and Benchmark

```bash
git submodule init
git submodule update
cmake -B build
cmake --build build
./build/src/profile_cuda_gemm_fp32
./build/src/profile_cuda_gemm_fp16
```

## Credits

This project is based on the work of Lei Mao and siboehm. I am grateful for their detailed explanations and implementations.
