Testbed

RTX 4090
CUDA 12.1
CUTLASS 3.5.1
Triton 3.1.0

Warm-up: 100 times
Execution: 100 times

Performance
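
The warm-up and execution counts listed above describe a standard timing loop: run the kernel 100 times untimed, then time 100 more runs. The following is a minimal sketch of such a loop, not the harness actually used for these measurements. The names `benchmark`, `fn`, and `sync` are hypothetical, and reporting the mean time per call is an assumption, since the source does not say how the 100 timed runs are aggregated.

```python
import time


def benchmark(fn, warmup=100, iters=100, sync=None):
    """Return the mean wall-clock time per call of fn, in milliseconds.

    fn   : zero-argument callable that launches the work to measure (hypothetical).
    sync : optional zero-argument callable that blocks until queued work has
           finished (hypothetical hook; needed when fn launches work asynchronously).
    """
    # Warm-up runs: absorb one-time costs (JIT compilation, caches, clock ramp-up).
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warm-up work is done before the clock starts

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for all timed work to finish before reading the clock
    elapsed = time.perf_counter() - start

    return elapsed * 1e3 / iters


if __name__ == "__main__":
    # CPU-only stand-in workload so the sketch runs without a GPU.
    ms = benchmark(lambda: sum(range(10_000)), warmup=100, iters=100)
    print(f"mean time per call: {ms:.4f} ms")
```

For GPU kernels, launches are asynchronous, so `sync` would need to be a device synchronization call such as `torch.cuda.synchronize`; without it the loop measures only launch overhead. CUDA events are a common alternative to host-side timers for this purpose.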