
cuda : fix flash_attn kernel to produce same results as CPU #3

Merged
FSSRepo merged 4 commits into Pints-AI:flash-attn-cuda on Feb 1, 2024

Conversation

ggerganov

No description provided.


FSSRepo commented Feb 1, 2024

@ggerganov Thank you very much for the help; it's evident that you know what you're doing!

ggerganov (Author)

@FSSRepo This is ready to merge. I've tried to improve the performance on RTX 2060, but I think it is still slower than master in most cases. I suspect there is room for improvement in the "online softmax" loop that would give the necessary speedup, but there might be some additional CUDA tricks that I'm not aware of yet.

I propose that we merge this and continue trying to improve the performance.
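For reference, the "online softmax" loop in question keeps a running row maximum and a running sum of exponentials, and rescales the output accumulator each time a new block of scores arrives, so no second pass over the scores is needed. Below is a minimal sketch of that per-row update as a hypothetical device helper (names and layout are illustrative, not the actual kernel code):

```cuda
// Online softmax update for one attention row, applied block by block.
// M and S start at -INFINITY and 0; acc starts at all zeros.
// After the last block, the attention output is acc[d] / S.
__device__ void online_softmax_update(
        float * M,            // running row maximum
        float * S,            // running sum of exp(score - M)
        float * acc,          // running weighted sum of V rows (length head_dim)
        const float * scores, // new block of attention scores for this row
        const float * v,      // corresponding V rows, [block_len][head_dim]
        int block_len, int head_dim) {
    // 1. new running maximum over this block
    float m_new = *M;
    for (int j = 0; j < block_len; ++j) {
        m_new = fmaxf(m_new, scores[j]);
    }
    // 2. rescale the previous sum and accumulator by exp(M_old - M_new)
    const float scale = expf(*M - m_new);
    float s_new = *S * scale;
    for (int d = 0; d < head_dim; ++d) {
        acc[d] *= scale;
    }
    // 3. accumulate the new block with the updated maximum
    for (int j = 0; j < block_len; ++j) {
        const float p = expf(scores[j] - m_new);
        s_new += p;
        for (int d = 0; d < head_dim; ++d) {
            acc[d] += p * v[j*head_dim + d];
        }
    }
    *M = m_new;
    *S = s_new;
}
```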

FSSRepo merged commit 43f7156 into Pints-AI:flash-attn-cuda on Feb 1, 2024
FSSRepo commented Feb 1, 2024

@ggerganov

I suspect there is room for improvement in the "online softmax" loop that would give the necessary speedup, but there might be some additional CUDA tricks that I'm not aware of yet.

It seems that there is high register pressure because each 16x16 tensor core fragment uses 8 registers per thread (256 elements spread across the 32 threads of a warp). I assume that the distribution of the 65,536 registers available per block depends on the number of threads launched per block.

There must be a way to reduce the number of fragments loaded at the same time. Additionally, half2 types should be used instead of half types to make better use of the registers, since each register is 32 bits wide and can hold a full half2 (two half values).
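As a rough illustration of the half2 point, here is a minimal sketch (hypothetical kernel, not part of this PR) of packing pairs of half values into half2 so that each 32-bit register carries two elements and __hfma2 processes both lanes per instruction:

```cuda
#include <cuda_fp16.h>

// acc[i] += s * x[i] over n half elements, processed two at a time as half2.
// Assumes n is even and both pointers are half2-aligned.
__global__ void scale_add_half2(half * acc, const half * x, half s, int n) {
    half2 * acc2 = reinterpret_cast<half2 *>(acc);
    const half2 * x2 = reinterpret_cast<const half2 *>(x);
    const half2 s2 = __half2half2(s); // broadcast the scale into both lanes

    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n/2;
         i += gridDim.x*blockDim.x) {
        // two half elements per fused multiply-add, one register per half2
        acc2[i] = __hfma2(s2, x2[i], acc2[i]);
    }
}
```

Compared with keeping one half per 32-bit register, this halves the register footprint of the data and doubles the arithmetic throughput of the FP16 path.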
