# [Roadmap] FlashAttention3 Support as SGLang Attention Backend #4709
Will work on speculative decoding; ref #4686.
I'll add accuracy and latency benchmarks after each major feature is introduced in this issue.

**Accuracy**

- After the initial PR #4680
- After #4832 and #4855 (page size > 1), with CUDA Graph

Note: a result written as 0.792/0.796 means 0.792 for page_size = 1 and 0.796 for page_size = 128. From this we can conclude that page size shouldn't have an impact on accuracy.

**Latency**

Benchmark command I used: `python -m sglang.bench_one_batch --model /path/to/Meta-Llama-3.1-8B-Instruct --batch-size 16 --input 1024 --output 512 --attention-backend fa3`

- After the initial PR #4680
- After a couple of changes (which introduced more logic and might have a negative impact on latency), plus one optimization PR against prefill (#4932) and one against decode (#4745)
## Functionality

- Basic support behind `--attention-backend fa3` (see the usage sketch after this list): #4680 @hebiao064 @qingquansong
- Add integration test for Flash Attention 3 (cf. `sglang/test/srt/test_triton_attention_backend.py`): #4760 @yubofredwang
- Documentation:
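
For anyone trying the backend out, here is a minimal sketch of selecting it from Python rather than the CLI. It assumes the offline `sgl.Engine` entry point forwards `attention_backend` to the same server argument the CLI exposes as `--attention-backend`; the model path and sampling parameters are placeholders.

```python
import sglang as sgl

# Sketch only: assumes Engine accepts attention_backend as a keyword argument,
# mirroring the CLI flag --attention-backend fa3.
llm = sgl.Engine(
    model_path="/path/to/Meta-Llama-3.1-8B-Instruct",  # placeholder path
    attention_backend="fa3",                            # select the FlashAttention3 backend
)

prompts = ["The capital of France is"]
outputs = llm.generate(prompts, {"temperature": 0.0, "max_new_tokens": 16})
for out in outputs:
    print(out["text"])

llm.shutdown()
```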
## Perf Optimization and Accuracy Problems

- Avoid the `item()` device sync (see the sketch after this list): [Fix] avoid stream sync and torch compile in prefill for fa3 backend #4932 @Fridge003
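
To make the `item()` point concrete, here is a rough, hypothetical illustration (not the actual SGLang prefill code): calling `.item()` on a GPU tensor copies the scalar to the host and forces the stream to drain, while keeping the result as a device tensor lets the CPU keep enqueueing kernels. This is presumably the kind of sync that #4932 removes from the fa3 prefill path.

```python
import torch

def seq_lens_sum_sync(seq_lens: torch.Tensor) -> int:
    # .item() transfers the scalar to the host and synchronizes the CUDA stream,
    # stalling the CPU until every previously queued kernel has finished.
    return int(seq_lens.sum().item())

def seq_lens_sum_async(seq_lens: torch.Tensor) -> torch.Tensor:
    # Keeping the result on-device avoids the sync; downstream kernels can
    # consume the tensor directly without the host ever waiting on it.
    return seq_lens.sum()
```

The names (`seq_lens` as a per-request sequence-length tensor) are illustrative only; the idea is simply to keep metadata on the GPU whenever the host does not actually need the value.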
## Success Criteria

Other issues we surfaced but not scoped in this task: