CUTLASS 2.0

@kerrmudgeon kerrmudgeon released this 22 Nov 17:40
7c0cd26

Substantially refactored for:

  • Better performance, particularly for native Turing Tensor Cores
  • Robust and durable templates spanning the design space
  • Encapsulated functionality embodying modern C++11 programming techniques
  • Optimized containers and data types for efficient, generic, portable device code

Updates to:

  • Quick start guide
  • Documentation
  • Utilities
  • CUTLASS Profiler

Native Turing Tensor Cores

  • Efficient GEMM kernels targeting Turing Tensor Cores
  • Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands

Coverage of existing CUTLASS functionality

  • GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
  • Volta Tensor Cores through native mma.sync and through the WMMA API
  • Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
  • Batched GEMM operations
  • Complex-valued GEMMs
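
For readers unfamiliar with the term, a batched GEMM computes many independent products C_b = alpha * A_b * B_b + beta * C_b in a single call. The host-side sketch below only illustrates that arithmetic for column-major matrices stored at a fixed stride; it is not CUTLASS code, and the function name `reference_batched_gemm` and its parameter list are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (NOT part of CUTLASS): strided batched GEMM
// on the host. For each batch b it computes
//   C_b = alpha * A_b * B_b + beta * C_b
// where A_b is m x k, B_b is k x n, C_b is m x n, all column-major.
void reference_batched_gemm(int m, int n, int k, int batch_count,
                            float alpha,
                            const std::vector<float>& A, int lda, std::size_t stride_a,
                            const std::vector<float>& B, int ldb, std::size_t stride_b,
                            float beta,
                            std::vector<float>& C, int ldc, std::size_t stride_c) {
  for (int b = 0; b < batch_count; ++b) {
    // Each batch starts a fixed number of elements after the previous one.
    const float* a = A.data() + b * stride_a;
    const float* bm = B.data() + b * stride_b;
    float* c = C.data() + b * stride_c;

    for (int j = 0; j < n; ++j) {
      for (int i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (int p = 0; p < k; ++p) {
          acc += a[i + p * lda] * bm[p + j * ldb];
        }
        c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
      }
    }
  }
}
```

A GPU implementation distributes these independent products across threadblocks rather than looping over them, but the result per batch is the same as in this reference loop.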

Note: a host compiler supporting C++11 or greater is required.
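
One way to catch an unsupported host compiler early is a compile-time check on the standard `__cplusplus` macro. This is a minimal sketch, not something the release itself ships; the helper name `host_compiler_has_cxx11` is an assumption for illustration. Note that MSVC reports an older `__cplusplus` value unless `/Zc:__cplusplus` is passed, so the check may need adapting there.

```cpp
#include <cassert>

// Hypothetical helper (not part of CUTLASS): true when the compiler
// advertises C++11 or newer through the standard __cplusplus macro.
constexpr bool host_compiler_has_cxx11() {
  return __cplusplus >= 201103L;
}

// Fails the build with a readable message on pre-C++11 compilers.
static_assert(host_compiler_has_cxx11(),
              "A host compiler supporting C++11 or greater is required.");
```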