What's Changed
- [FA2] split-q + tiling-qk D=512 performance🎉 by @DefTruth in #178
- [FA2] split-q + tiling-qk D=512 performance🎉 by @DefTruth in #179
- [FA2] split-q + tiling-qk D=512 performance🎉 by @DefTruth in #180
- [Doc] Refactor README.md to improve readability✔️ by @DefTruth in #181
- [Doc] Refactor README.md for better readability✔️ by @DefTruth in #182
- [FA2] flash-attn-mma 3080/L20/4090 bench✔️ by @DefTruth in #183
- [FA2] flash-attn-mma 3080/L20/4090 bench✔️ by @DefTruth in #184
- [FA2] fa2/hgemm manually smem swizzle🎉 by @DefTruth in #185
Nsight Compute metrics for flash_attn_mma_stages_split_q_tiling_qk_swizzle_kernel (shared-memory bank conflicts on ldmatrix loads):
void flash_attn_mma_stages_split_q_tiling_qk_swizzle_kernel<512, 16, 8, 16, 8, 1, 8, 1, 1, 16, 1, 64, 2, 0, 0, 8>(__half *, __half *, __half *, __half *, int, int) (8, 48, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
Section: Command line profiler metrics
------------------------------------------------------------------ ----------- ------------
Metric Name                                                        Metric Unit Metric Value
------------------------------------------------------------------ ----------- ------------
sm__sass_l1tex_data_bank_conflicts_pipe_lsu_mem_shared_op_ldsm.avg                        0
sm__sass_l1tex_data_bank_conflicts_pipe_lsu_mem_shared_op_ldsm.max                        0
sm__sass_l1tex_data_bank_conflicts_pipe_lsu_mem_shared_op_ldsm.min                        0
sm__sass_l1tex_data_bank_conflicts_pipe_lsu_mem_shared_op_ldsm.sum                        0
------------------------------------------------------------------ ----------- ------------
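The zero bank-conflict counts above are what a shared-memory swizzle is meant to achieve. The host-side C++ sketch below is a simplified model of a generic XOR swizzle and is not the kernel's actual indexing. The tile layout (64 half values per row, i.e. eight 16-byte chunks) and the names `swizzle_chunk`, `byte_offset` and `max_conflict_degree` are assumptions made for illustration. It models one 8x8 ldmatrix phase, in which 8 threads each supply the address of a 16-byte row, and counts how many of those rows land on the same bank.

```cpp
#include <array>

// Assumed tile layout (hypothetical, not taken from the kernel):
// each row holds 64 half values = 128 bytes = 8 chunks of 16 bytes.
constexpr int kChunksPerRow = 8;
constexpr int kBytesPerChunk = 16;
constexpr int kBanks = 32;        // shared memory: 32 banks
constexpr int kBytesPerBank = 4;  // each bank is 4 bytes wide

// Generic XOR swizzle: permute the 16-byte chunk index by the row index.
inline int swizzle_chunk(int row, int chunk) {
  return chunk ^ (row % kChunksPerRow);
}

// Byte offset of logical (row, chunk) in the tile, with or without swizzle.
inline int byte_offset(int row, int chunk, bool swizzled) {
  const int c = swizzled ? swizzle_chunk(row, chunk) : chunk;
  return (row * kChunksPerRow + c) * kBytesPerChunk;
}

// Model one 8x8 ldmatrix phase: rows 0..7 of the same logical chunk are
// read together. Returns the largest number of rows that hit one bank
// (1 means conflict-free, 8 means an 8-way conflict).
inline int max_conflict_degree(int chunk, bool swizzled) {
  std::array<int, kBanks> hits{};
  for (int row = 0; row < 8; ++row) {
    const int first_bank =
        (byte_offset(row, chunk, swizzled) / kBytesPerBank) % kBanks;
    for (int b = 0; b < kBytesPerChunk / kBytesPerBank; ++b) {
      ++hits[(first_bank + b) % kBanks];
    }
  }
  int worst = 0;
  for (int h : hits) worst = (h > worst) ? h : worst;
  return worst;
}
```

Under these assumptions, `max_conflict_degree(c, false)` is 8 for every chunk `c`, because a 128-byte row stride maps all 8 rows to the same 4 banks, while `max_conflict_degree(c, true)` is 1, because the XOR spreads the 8 rows over 8 distinct chunks that together cover all 32 banks.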
Full Changelog: v2.6.11...v2.6.12