integrate gpu pallas flash attention. Reduce prefill time for llama70b #1305

jwyang-google · 2025-02-24T18:44:19Z

Reduces prefill time for the llama70b model from 123ms to 77ms on 8 H100 chips.
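For reviewers who want the improvement in relative terms, here is a minimal sketch of the arithmetic using only the two latencies quoted above. The helper name `prefill_speedup` is hypothetical and is not part of this PR or of MaxText.

```python
def prefill_speedup(before_ms: float, after_ms: float) -> tuple[float, float]:
    """Return (fractional latency reduction, speedup factor).

    Hypothetical helper for illustration only; not part of the PR.
    """
    reduction = (before_ms - after_ms) / before_ms  # share of time saved
    speedup = before_ms / after_ms                  # how many times faster
    return reduction, speedup


# Latencies quoted in the PR description: 123ms before, 77ms after.
reduction, speedup = prefill_speedup(123.0, 77.0)
print(f"{reduction:.1%} lower prefill latency, {speedup:.2f}x speedup")
```

With the quoted numbers this works out to roughly a 37% reduction in prefill latency, or about a 1.6x speedup.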

Description

Start with a short description of what the PR does and how this is a change from
the past.

The rest of the description includes relevant details and context, examples:

If the change fixes a bug or a GitHub issue, please include a link, e.g.,:
FIXES: b/123456
FIXES: #123456

Please describe how you tested this change, and include any instructions and/or
commands to reproduce.

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed.

jwyang-google requested review from gobbleturk, khatwanimohit, bvandermoon, vipannalla and RissyRan as code owners February 24, 2025 18:44

jwyang-google assigned tohaowu Feb 24, 2025

vipannalla approved these changes Feb 24, 2025


jwyang-google force-pushed the gpu_pallas_flash branch from 784d87d to 8f9ebfd February 24, 2025 20:06

jwyang-google requested review from richjames0, rni418 and gagika as code owners February 24, 2025 20:06

tohaowu approved these changes Feb 24, 2025


github-actions bot added the pull ready label Feb 24, 2025

jwyang-google force-pushed the gpu_pallas_flash branch 4 times, most recently from 256247a to 963519b February 24, 2025 22:17

add gpu pallas flash kernel.

09ff3c0

jwyang-google force-pushed the gpu_pallas_flash branch from 963519b to 09ff3c0 February 24, 2025 23:38

Merge branch 'main' into gpu_pallas_flash

493cbc3