Release CUTLASS 3.5.0 · NVIDIA/cutlass

Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + TMA im2col.
- Native implementation in CUTLASS 3.x using CuTe, mirroring the design hierarchy of GEMMs.
- Support for 1D, 2D, and 3D convolutions in a rank-agnostic fashion.
- Support for Fprop, Dgrad, and Wgrad algorithms.
- CUTLASS profiler support for 2D and 3D convolutions implemented via the 3.x API.
- NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until the 3.7 release. Your feedback on the design is welcome!
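To make the "implicit GEMM" framing concrete, the sketch below computes the GEMM extents that a 2D Fprop convolution corresponds to under the conventional mapping (M = N·P·Q, N = K, K = C·R·S). The `Conv2dProblem` and `GemmShape` types and the `fprop_gemm_shape` helper are hypothetical names for illustration and are not CUTLASS types; the 3.x kernels built on TMA im2col may organize these modes differently internally.

```cpp
#include <cstdint>

// Hypothetical illustration types; these are NOT part of the CUTLASS API.
struct Conv2dProblem {
  std::int64_t n, h, w, c;              // activations: batch, height, width, input channels
  std::int64_t k, r, s;                 // filters: output channels, filter height, filter width
  std::int64_t pad_h, pad_w;            // symmetric zero padding
  std::int64_t stride_h, stride_w;      // convolution strides
  std::int64_t dilation_h, dilation_w;  // filter dilations
};

struct GemmShape {
  std::int64_t m, n, k;
};

// Standard output-extent formula for one spatial dimension.
inline std::int64_t output_extent(std::int64_t in, std::int64_t pad, std::int64_t filter,
                                  std::int64_t stride, std::int64_t dilation) {
  return (in + 2 * pad - dilation * (filter - 1) - 1) / stride + 1;
}

// Conventional Fprop mapping: every output pixel is a GEMM row, every output
// channel is a GEMM column, and the reduction runs over C * R * S.
inline GemmShape fprop_gemm_shape(const Conv2dProblem& p) {
  const std::int64_t out_h = output_extent(p.h, p.pad_h, p.r, p.stride_h, p.dilation_h);
  const std::int64_t out_w = output_extent(p.w, p.pad_w, p.s, p.stride_w, p.dilation_w);
  return GemmShape{p.n * out_h * out_w, p.k, p.c * p.r * p.s};
}
```

For example, a 1x56x56x64 activation convolved with 128 filters of size 3x3 (padding 1, stride 1, dilation 1) gives a 56x56 output and an equivalent GEMM of 3136 x 128 x 576. The matrix is never materialized; the kernel gathers the operand tiles on the fly.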
Support for Ada (SM89) FP8 tensor cores via the 2.x API. Requires CUDA 12.4 or newer.
Ampere gather/scatter convolution example in CuTe and CUTLASS 3.x.
- Showcases how custom kernels can be written and optimized using CUTLASS 3.x and CuTe, and the general strategy for implementing convolutions as specializations of GETTs.
- Implementation of a coarse-grained sparse gather/scatter kernel achieving peak performance on Ampere-class tensor cores.
32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices.
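A rough way to see why smaller tiles help skinny problems is to measure how much of the tiled work lands inside the actual matrix. The sketch below is an illustration only: `tile_utilization` is a hypothetical helper, and the tile shapes used in the example are assumptions, not the exact shapes added in this release.

```cpp
#include <cstdint>

// Round-up integer division.
inline std::int64_t ceil_div(std::int64_t a, std::int64_t b) {
  return (a + b - 1) / b;
}

// Fraction of the elements covered by whole tiles that belong to the real
// m x n output. A value of 1.0 means no work is spent on padding.
inline double tile_utilization(std::int64_t m, std::int64_t n,
                               std::int64_t tile_m, std::int64_t tile_n) {
  const std::int64_t padded_m = ceil_div(m, tile_m) * tile_m;
  const std::int64_t padded_n = ceil_div(n, tile_n) * tile_n;
  return static_cast<double>(m * n) / static_cast<double>(padded_m * padded_n);
}
```

With a 16 x 4096 output, a 128 x 128 tile yields a utilization of 0.125, whereas a 16 x 128 tile yields 1.0. Real kernel performance also depends on occupancy, memory traffic, and instruction shape, so this ratio is only a first-order guide.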
Updates to CuTe documentation for cute::Tensor<>, MMA atoms, and an overhauled CuTe GEMM tutorial series.
Extensions to CuTe to support L2 prefetching and TMA store+reductions.
Removed the C++11 requirement on a few CUTLASS 2.x API header files; all CUTLASS files now require C++17.
Fixes to greatly reduce build warnings.
Updates and bugfixes from the community (thanks!)
