CUTLASS 2.0

kerrmudgeon released this 22 Nov 17:40

Substantially refactored for:

Better performance, particularly for native Turing Tensor Cores
Robust and durable templates spanning the design space
Encapsulated functionality embodying modern C++11 programming techniques
Optimized containers and data types for efficient, generic, portable device code

Updates to:

Quick start guide
Documentation
Utilities
CUTLASS Profiler

Native Turing Tensor Cores

Efficient GEMM kernels targeting Turing Tensor Cores
Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands
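
To illustrate what mixed-precision integer operands mean arithmetically, here is a minimal host-side reference in C++11. It is not CUTLASS code and does not use Tensor Cores. It only shows 8-bit integer operands being multiplied and summed in a 32-bit accumulator. The name `ref_gemm_s8`, the argument order, and the layouts (A row-major, B column-major, C row-major) are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference, not part of CUTLASS.
// Computes C = alpha * (A x B) + beta * C where
//   A is M x K, row-major,    8-bit signed integers
//   B is K x N, column-major, 8-bit signed integers
//   C is M x N, row-major,    32-bit signed integers
void ref_gemm_s8(int M, int N, int K,
                 std::int32_t alpha,
                 const std::vector<std::int8_t>& A,
                 const std::vector<std::int8_t>& B,
                 std::int32_t beta,
                 std::vector<std::int32_t>& C) {
  const std::size_t m_ext = static_cast<std::size_t>(M);
  const std::size_t n_ext = static_cast<std::size_t>(N);
  const std::size_t k_ext = static_cast<std::size_t>(K);

  // Densely packed operands are assumed (no padding between rows or columns).
  assert(A.size() == m_ext * k_ext);
  assert(B.size() == k_ext * n_ext);
  assert(C.size() == m_ext * n_ext);

  for (std::size_t m = 0; m < m_ext; ++m) {
    for (std::size_t n = 0; n < n_ext; ++n) {
      // Accumulate in 32 bits: each product of two 8-bit values needs up to
      // 16 bits, and the sum over K grows beyond that.
      std::int32_t acc = 0;
      for (std::size_t k = 0; k < k_ext; ++k) {
        acc += static_cast<std::int32_t>(A[m * k_ext + k]) *
               static_cast<std::int32_t>(B[n * k_ext + k]);
      }
      C[m * n_ext + n] = alpha * acc + beta * C[m * n_ext + n];
    }
  }
}
```

With K = 3 and every operand equal to 127, the accumulated value is 3 x 127 x 127 = 48387, which already exceeds the signed 16-bit range. This is why the accumulator is wider than the operands.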

Coverage of existing CUTLASS functionality

GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
Volta Tensor Cores through native mma.sync and through the WMMA API
Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
Batched GEMM operations
Complex-valued GEMMs
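
As a rough illustration of what a batched GEMM computes, the sketch below runs the same GEMM over several independent problems stored back to back in memory. It is a plain C++11 host reference, not the CUTLASS API. The name `ref_batched_gemm`, the row-major layouts, and the convention of a fixed element stride between consecutive matrices in a batch are all assumptions for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not part of CUTLASS.
// For each batch index b, computes C_b = alpha * (A_b x B_b) + beta * C_b.
// All matrices are row-major and densely packed; matrix b of an operand
// starts b * stride elements after matrix 0 of that operand.
void ref_batched_gemm(int M, int N, int K, int batch_count,
                      float alpha,
                      const std::vector<float>& A, std::size_t stride_A,
                      const std::vector<float>& B, std::size_t stride_B,
                      float beta,
                      std::vector<float>& C, std::size_t stride_C) {
  const std::size_t m_ext = static_cast<std::size_t>(M);
  const std::size_t n_ext = static_cast<std::size_t>(N);
  const std::size_t k_ext = static_cast<std::size_t>(K);
  const std::size_t batches = static_cast<std::size_t>(batch_count);

  // Strides must be large enough that matrices in a batch do not overlap.
  assert(stride_A >= m_ext * k_ext);
  assert(stride_B >= k_ext * n_ext);
  assert(stride_C >= m_ext * n_ext);
  if (batches == 0) return;
  assert(A.size() >= (batches - 1) * stride_A + m_ext * k_ext);
  assert(B.size() >= (batches - 1) * stride_B + k_ext * n_ext);
  assert(C.size() >= (batches - 1) * stride_C + m_ext * n_ext);

  for (std::size_t b = 0; b < batches; ++b) {
    // Base pointers of the b-th problem in each operand.
    const float* a = A.data() + b * stride_A;
    const float* bm = B.data() + b * stride_B;
    float* c = C.data() + b * stride_C;

    for (std::size_t m = 0; m < m_ext; ++m) {
      for (std::size_t n = 0; n < n_ext; ++n) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < k_ext; ++k) {
          acc += a[m * k_ext + k] * bm[k * n_ext + n];
        }
        c[m * n_ext + n] = alpha * acc + beta * c[m * n_ext + n];
      }
    }
  }
}
```

Each batch entry is independent of the others, which is what makes the operation a natural fit for a single kernel launch covering all problems rather than one launch per problem.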

Note: a host compiler supporting C++11 or greater is required.
