CUTLASS 2.9.0
- First layer Convolution kernels specialized for small channel counts and reduced alignment
  - Few channels specialization for reduced alignment capabilities
  - Fixed channels further specialized when channel count perfectly matches the access vector size
  - Unit tests
- Python-based instance emitter in the CUTLASS Library and support in the Profiler
- BLAS3 operators accelerated by Tensor Cores
- CUTLASS Python demonstrating JIT compilation of CUTLASS kernels and a Python-based runtime using CUDA Python
  - Python-based runtime interoperable with existing emitters
- GEMM + Softmax example
- Optimal performance using CUDA 11.6u2
- Updates and bugfixes from the community (thanks!)
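
For readers unfamiliar with the GEMM + Softmax entry above, the following is a minimal pure-Python sketch of the math such a fused operation produces, assuming a row-wise softmax applied to the output of a plain matrix product. It is only a reference for checking results, not the CUTLASS kernel or its API; the function names `gemm`, `softmax_rows`, and `gemm_softmax` are hypothetical and chosen for illustration.

```python
import math


def gemm(a, b):
    """Plain matrix product of a (M x K) and b (K x N), both lists of rows."""
    k = len(b)
    n = len(b[0])
    return [
        [sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
        for row in a
    ]


def softmax_rows(m):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gemm_softmax(a, b):
    """Reference result: softmax applied to each row of a @ b (assumed semantics)."""
    return softmax_rows(gemm(a, b))


if __name__ == "__main__":
    a = [[1.0, 0.0], [0.0, 1.0]]
    b = [[1.0, 2.0], [3.0, 4.0]]
    for row in gemm_softmax(a, b):
        print(row)
```

A fused kernel would typically avoid materializing the intermediate product in global memory; this sketch keeps the two steps separate so each can be checked on its own.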