Fix sdpa flash attention op for et llama deployment (#4322)
Summary:
Pull Request resolved: #4322

We retrofitted flash attention CPU from ATen. The retrofit was to make it calculate attention for a) batched prefill and b) decode with a different start_pos. For b, there was a bug when the kv cache's seqlen dim is split; as a result, the attention calculation was not correct. A comment in the code explains the issue in detail.

bypass-github-export-checks

ghstack-source-id: 234634902
Reviewed By: larryliu0820
Differential Revision: D60011925
fbshipit-source-id: 50921846b329e449a4a767cf28c7a55d507217bd
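The sketch below is not the ExecuTorch kernel; it is a minimal NumPy illustration of the class of bug described above. When a flash-attention-style kernel splits the kv cache's seqlen dimension into blocks, the validity/causal check for a decode step at `start_pos` has to use the key's global position (block start plus local offset); using only the local offset within a block gives wrong masking once the dimension is split. All names (`block_size`, `attention_decode`, etc.) are illustrative assumptions.

```python
import numpy as np

def attention_decode(q, k_cache, v_cache, start_pos, block_size=4):
    """Single-token decode at global position `start_pos`.
    Caches are padded to max_seq_len; only positions <= start_pos are valid."""
    head_dim = q.shape[-1]
    scale = 1.0 / np.sqrt(head_dim)
    max_seq_len = k_cache.shape[0]
    scores = np.full(max_seq_len, -np.inf)

    # Walk the kv cache's seqlen dimension in blocks, the way a
    # flash-attention-style kernel does after splitting that dimension.
    for block_start in range(0, max_seq_len, block_size):
        block_end = min(block_start + block_size, max_seq_len)
        k_blk = k_cache[block_start:block_end]        # (blk, head_dim)
        s = (k_blk @ q) * scale                       # (blk,)

        # The check must use the *global* key position
        # (block_start + local offset). Comparing the local offset alone
        # against start_pos would keep/drop the wrong keys once the
        # seqlen dim is split into blocks.
        global_pos = np.arange(block_start, block_end)
        scores[block_start:block_end] = np.where(global_pos <= start_pos, s, -np.inf)

    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                              # masked entries contribute 0
    return probs @ v_cache

# Usage (hypothetical shapes): decode the token at position 6 against a
# cache padded to length 16.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k_cache = rng.standard_normal((16, 8))
v_cache = rng.standard_normal((16, 8))
out = attention_decode(q, k_cache, v_cache, start_pos=6)
```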
1 parent 9d85965 · commit 6dbb4dc
Showing 2 changed files with 168 additions and 1 deletion.