
cuda : fix flash_attn kernel to produce same results as CPU #3

Merged
FSSRepo merged 4 commits into Pints-AI:flash-attn-cuda on Feb 1, 2024

Conversation

ggerganov

No description provided.


FSSRepo commented Feb 1, 2024

@ggerganov Thank you very much for the help; it's evident that you know what you're doing!

ggerganov (Author)

@FSSRepo This is ready to merge. I've tried to improve the performance on RTX 2060, but I think it is still slower than master in most cases. I suspect there is room for improvement in the "online softmax" loop that would give the necessary speedup, but there might be some additional CUDA tricks that I'm not aware of yet.

I propose that we merge this and continue trying to improve the performance.
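For reference, the "online softmax" loop in question keeps a running row maximum and a running sum of exponentials, and rescales the output accumulator each time a new block of scores arrives, so no second pass over the scores is needed. Below is a minimal sketch of that per-row update as a hypothetical device helper (names and layout are illustrative, not the actual kernel code):

```cuda
// Online softmax update for one attention row, applied block by block.
// M and S start at -INFINITY and 0; acc starts at all zeros.
// After the last block, the attention output is acc[d] / S.
__device__ void online_softmax_update(
        float * M,            // running row maximum
        float * S,            // running sum of exp(score - M)
        float * acc,          // running weighted sum of V rows (length head_dim)
        const float * scores, // new block of attention scores for this row
        const float * v,      // corresponding V rows, [block_len][head_dim]
        int block_len, int head_dim) {
    // 1. new running maximum over this block
    float m_new = *M;
    for (int j = 0; j < block_len; ++j) {
        m_new = fmaxf(m_new, scores[j]);
    }
    // 2. rescale the previous sum and accumulator by exp(M_old - M_new)
    const float scale = expf(*M - m_new);
    float s_new = *S * scale;
    for (int d = 0; d < head_dim; ++d) {
        acc[d] *= scale;
    }
    // 3. accumulate the new block with the updated maximum
    for (int j = 0; j < block_len; ++j) {
        const float p = expf(scores[j] - m_new);
        s_new += p;
        for (int d = 0; d < head_dim; ++d) {
            acc[d] += p * v[j*head_dim + d];
        }
    }
    *M = m_new;
    *S = s_new;
}
```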

FSSRepo merged commit 43f7156 into Pints-AI:flash-attn-cuda on Feb 1, 2024
FSSRepo commented Feb 1, 2024

@ggerganov

I suspect there is room for improvement in the "online softmax" loop that would give the necessary speedup, but there might be some additional CUDA tricks that I'm not aware of yet.

It seems that there is high register pressure because each 16x16 tensor core fragment uses 8 registers per thread (256 elements spread across the 32 threads of a warp). I assume that the distribution of the 65,536 registers available per block depends on the number of threads launched per block.

There must be a way to reduce the number of fragments loaded at the same time. Additionally, half2 types should be used instead of half types to make better use of the registers, since each register is 32 bits wide and can hold a full half2 (two half values).
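As a rough illustration of the half2 point, here is a minimal sketch (hypothetical kernel, not part of this PR) of packing pairs of half values into half2 so that each 32-bit register carries two elements and __hfma2 processes both lanes per instruction:

```cuda
#include <cuda_fp16.h>

// acc[i] += s * x[i] over n half elements, processed two at a time as half2.
// Assumes n is even and both pointers are half2-aligned.
__global__ void scale_add_half2(half * acc, const half * x, half s, int n) {
    half2 * acc2 = reinterpret_cast<half2 *>(acc);
    const half2 * x2 = reinterpret_cast<const half2 *>(x);
    const half2 s2 = __half2half2(s); // broadcast the scale into both lanes

    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n/2;
         i += gridDim.x*blockDim.x) {
        // two half elements per fused multiply-add, one register per half2
        acc2[i] = __hfma2(s2, x2[i], acc2[i]);
    }
}
```

Compared with keeping one half per 32-bit register, this halves the register footprint of the data and doubles the arithmetic throughput of the FP16 path.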
