CUTLASS 2.6.0

@kerrmudgeon released this 03 Sep 16:52
a01feb9

  • Optimal performance when compiled with the CUDA 11.4 Toolkit
  • Fused operators with GEMM and Convolution
  • Support for 64-bit tensor strides and leading dimensions in GEMMs
  • Affine rank=2 matrix layouts
  • Batched GEMV preview implementation
  • New strided Dgrad implementation
    • Faster than the previous implementation: cuts redundant math by 4x
    • Supported through the new Dy and w analytic iterators and the existing cutlass::conv::device::ImplicitGemmConvolution interface
  • Quaternion-valued GEMM and Convolution in single- and double-precision (targeting CUDA Cores)
  • Many improvements to the epilogue
    • New option to not fully unroll the epilogue, which reduces code size and improves performance when using complicated elementwise operations
    • Performance improvement for FP16 tensor core kernels
    • Bug fixes
  • Enhanced Clang support: the combination of Clang 13 and CUDA 11.4 can build and run kernels for Pascal and Ampere
  • Updated minimum CUDA Toolkit requirement to 10.2
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!
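
To illustrate two of the items above (affine rank=2 layouts and 64-bit strides), here is a minimal Python sketch of the underlying arithmetic. It is not CUTLASS code: the helper names `affine2_offset` and `fits_int32` are hypothetical, and the sketch assumes an affine rank-2 layout maps a (row, column) coordinate to a linear offset as the dot product of the coordinate with a pair of independent strides. Row-major and column-major layouts then fall out as special cases where one stride is 1.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31


def affine2_offset(row, col, stride_row, stride_col):
    """Linear element offset of (row, col) when both dimensions have their own stride.

    Hypothetical helper for illustration; not part of the CUTLASS API.
    """
    return row * stride_row + col * stride_col


def fits_int32(value):
    """True if the value is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


# Row-major and column-major are affine layouts with one unit stride.
rows, cols = 4, 3
row_major = affine2_offset(2, 1, stride_row=cols, stride_col=1)
col_major = affine2_offset(2, 1, stride_row=1, stride_col=rows)

# A strided view: every second row and every second column of a
# row-major buffer with 6 columns, so neither stride is 1.
strided_view = affine2_offset(2, 1, stride_row=2 * 6, stride_col=2)

# With a leading dimension of 50,000 elements, the offset of row 50,000
# no longer fits in a signed 32-bit integer, which is the kind of case
# that motivates 64-bit strides and leading dimensions.
big_offset = affine2_offset(50_000, 0, stride_row=50_000, stride_col=1)

print(row_major, col_major, strided_view)
print(big_offset, fits_int32(big_offset))
```

Python integers do not overflow, so the sketch only checks whether the offset would fit in 32 bits. In C++ the same product computed in 32-bit arithmetic would wrap or be undefined.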