
# [Roadmap] FlashAttention3 Support as SGLang Attention Backend #4709

Open · 8 of 14 tasks · hebiao064 opened this issue Mar 24, 2025 · 4 comments

hebiao064 (Collaborator) commented Mar 24, 2025

Functionality:

Documentation:

Perf Optimization and Accuracy Problems:

Success Criteria:

  • The latency should be on par with vLLM's FlashAttention3 and SGLang's FlashInfer implementations (see the launch sketch below)
  • The accuracy should be on par with vLLM's FlashAttention3 and SGLang's FlashInfer implementations

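To compare the two backends head to head, both can be launched the same way with only the attention backend switched. A minimal sketch, using the `--attention-backend` values from the benchmark command later in this thread and a placeholder model path:

```bash
# FA3 backend (sketch; the model path is a placeholder)
python -m sglang.launch_server \
  --model-path /path/to/Meta-Llama-3.1-8B-Instruct \
  --attention-backend fa3

# FlashInfer baseline for the on-par comparison
python -m sglang.launch_server \
  --model-path /path/to/Meta-Llama-3.1-8B-Instruct \
  --attention-backend flashinfer
```
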
Other issues we surfaced but did not scope into this task:

zcnrex (Contributor) commented Mar 27, 2025

Will work on speculative decoding

zhyncs (Member) commented Mar 27, 2025

ref #4686

hebiao064 (Collaborator, Author) commented Mar 28, 2025

I'll add accuracy and latency benchmarks after each major feature is introduced in this issue:

Accuracy:

After the initial PR #4680:

| Model | FA3 Accuracy | FlashInfer Accuracy |
| --- | --- | --- |
| Meta-Llama-3.1-8B-Instruct | 0.793 | 0.789 |
| Qwen2.5-7B-Instruct | 0.823 | 0.789 |
| Gemma-2-9B | 0.724 (Torch Native is 0.730) | 0.132 (potential bug!) |

After #4832 and #4855 (Page Size > 1):

Accuracy (with CUDA Graph):

| Model | FA3 Accuracy | FlashInfer Accuracy |
| --- | --- | --- |
| Meta-Llama-3.1-8B-Instruct | 0.792/0.796 | 0.792/0.792 |
| Qwen2.5-7B-Instruct | 0.819/0.818 | 0.809/0.810 |

Note: 0.792/0.796 means 0.792 for page_size = 1, 0.796 for page_size = 128

From this we can conclude that page size should have no impact on accuracy.
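
For reference, a sketch of how the page_size = 1 vs page_size = 128 runs could be set up. The `--page-size` flag is my assumption of the relevant server argument, and the thread does not show the exact eval invocation:

```bash
# page_size = 1 (sketch; --page-size is assumed to be the relevant flag)
python -m sglang.launch_server \
  --model-path /path/to/Meta-Llama-3.1-8B-Instruct \
  --attention-backend fa3 --page-size 1

# page_size = 128
python -m sglang.launch_server \
  --model-path /path/to/Meta-Llama-3.1-8B-Instruct \
  --attention-backend fa3 --page-size 128
# The accuracy numbers above come from whichever eval harness was used;
# the thread does not specify it.
```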

hebiao064 (Collaborator, Author) commented Mar 28, 2025

Latency:

Benchmark command I used:

```bash
python -m sglang.bench_one_batch --model /path/to/Meta-Llama-3.1-8B-Instruct --batch-size 16 --input 1024 --output 512 --attention-backend fa3
```
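
The FlashInfer column presumably comes from the same command with only the backend switched (a sketch; the exact baseline invocation isn't shown in the thread):

```bash
python -m sglang.bench_one_batch --model /path/to/Meta-Llama-3.1-8B-Instruct --batch-size 16 --input 1024 --output 512 --attention-backend flashinfer
```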

After the initial PR #4680:

| Model | FA3 Latency | FlashInfer Latency |
| --- | --- | --- |
| Meta-Llama-3.1-8B-Instruct with CUDA Graph | 45328.51/1967.94 | 43511.35/1960.41 |
| Meta-Llama-3.1-8B-Instruct w/o CUDA Graph | 45648.71/1296.51 | 43664.38/1237.78 |

After a couple of changes (which introduced more logic and might have a negative impact on latency), plus one optimization PR for prefill (#4932) and one for decode (#4745):

| Model | FA3 Latency | FlashInfer Latency |
| --- | --- | --- |
| Meta-Llama-3.1-8B-Instruct with CUDA Graph | 44392.57/2477.80 | 44736.24/2415.29 |
| Meta-Llama-3.1-8B-Instruct w/o CUDA Graph | 44637.53/1335.13 | 44796.29/1244.60 |
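
The with/without CUDA Graph rows can presumably be reproduced by toggling the standard server flag (a sketch, assuming `bench_one_batch` accepts `--disable-cuda-graph` the way the server does):

```bash
# w/o CUDA Graph (sketch; assumes --disable-cuda-graph is accepted here)
python -m sglang.bench_one_batch --model /path/to/Meta-Llama-3.1-8B-Instruct --batch-size 16 --input 1024 --output 512 --attention-backend fa3 --disable-cuda-graph
```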
