Release CUTLASS 2.9.0 · NVIDIA/cutlass

CUTLASS 2.9.0

First-layer convolution kernels specialized for small channel counts and reduced alignment
- Few channels specialization for reduced alignment capabilities
- Fixed channels further specialized when channel count perfectly matches the access vector size
- Unit tests
- Python-based instance emitter in the CUTLASS Library and support in the Profiler
BLAS3 operators accelerated by Tensor Cores
- Supported types: f32, cf32, f64, cf64
- HERK with emitter
- SYRK with emitter
- SYMM with emitter
- TRMM with emitter
- Unit tests
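
For readers less familiar with the rank-k update operators listed above, the following is a minimal host-side sketch of what SYRK computes under the standard BLAS definition: C = alpha * A * A^T + beta * C, with only one triangle of C written. It is a plain C++ reference, not CUTLASS code. The function name `syrk_lower_reference`, the row-major layout, and the choice of the lower triangle are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not part of the CUTLASS API).
// Computes C = alpha * A * A^T + beta * C for a row-major n-by-k matrix A,
// updating only the lower triangle (j <= i) of the row-major n-by-n matrix C.
void syrk_lower_reference(std::size_t n, std::size_t k, double alpha,
                          const std::vector<double>& A, double beta,
                          std::vector<double>& C) {
  assert(A.size() == n * k);
  assert(C.size() == n * n);
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j <= i; ++j) {
      // Dot product of row i and row j of A gives element (i, j) of A * A^T.
      double acc = 0.0;
      for (std::size_t p = 0; p < k; ++p) {
        acc += A[i * k + p] * A[j * k + p];
      }
      C[i * n + j] = alpha * acc + beta * C[i * n + j];
    }
  }
}
```

Elements above the diagonal are left untouched, which is why a symmetric update needs roughly half the work of a general GEMM of the same shape. HERK follows the same pattern with the conjugate transpose in place of the transpose.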
CUTLASS Python demonstrating JIT compilation of CUTLASS kernels and a Python-based runtime using CUDA Python
- Python-based runtime interoperable with existing emitters
GEMM + Softmax example
Optimal performance using CUDA 11.6 Update 2
Updates and bugfixes from the community (thanks!)
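
As a rough illustration of the computation behind the GEMM + Softmax example listed above, here is an unfused host-side reference in plain C++. It assumes the softmax is taken across each row of the GEMM output and that all matrices are row-major. The function name `gemm_softmax_reference` is hypothetical and the code does not use or mirror any CUTLASS API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not part of the CUTLASS API).
// Returns softmax applied to each row of A * B, where A is m-by-k and
// B is k-by-n, both row-major. The result is m-by-n, row-major.
std::vector<double> gemm_softmax_reference(std::size_t m, std::size_t n,
                                           std::size_t k,
                                           const std::vector<double>& A,
                                           const std::vector<double>& B) {
  assert(n > 0);
  assert(A.size() == m * k);
  assert(B.size() == k * n);
  std::vector<double> D(m * n, 0.0);
  for (std::size_t i = 0; i < m; ++i) {
    double* row = D.data() + i * n;
    // GEMM: compute row i of A * B.
    for (std::size_t j = 0; j < n; ++j) {
      double acc = 0.0;
      for (std::size_t p = 0; p < k; ++p) {
        acc += A[i * k + p] * B[p * n + j];
      }
      row[j] = acc;
    }
    // Softmax: subtract the row maximum before exponentiating for stability.
    const double row_max = *std::max_element(row, row + n);
    double sum = 0.0;
    for (std::size_t j = 0; j < n; ++j) {
      row[j] = std::exp(row[j] - row_max);
      sum += row[j];
    }
    for (std::size_t j = 0; j < n; ++j) {
      row[j] /= sum;
    }
  }
  return D;
}
```

A fused kernel would aim to produce the same values while avoiding a separate pass over the GEMM output. This sketch is only meant as a correctness reference for small inputs.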
