Support cuda graph in the triton attention backend #1401

Merged: 4 commits into main on Sep 12, 2024

Conversation

merrymercy (Contributor) commented Sep 12, 2024

Llama 3 8B (1.3x faster)

# triton w/ cuda graph
# Decode.  median latency: 0.00706 s, median throughput:    141.63 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --attention-backend triton

# triton w/o cuda graph
# Decode.  median latency: 0.00928 s, median throughput:    107.79 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --attention-backend triton --disable-cuda-graph


# flashinfer w/ cuda graph
# Decode.  median latency: 0.00735 s, median throughput:    135.98 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --attention-backend flashinfer

# flashinfer w/o cuda graph
# Decode.  median latency: 0.00823 s, median throughput:    121.46 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --attention-backend flashinfer --disable-cuda-graph

DeepSeek-Coder-V2-Lite (4x faster)

# triton w/ cuda graph
# Decode.  median latency: 0.00622 s, median throughput:    160.82 token/s
python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --trust-remote-code --batch-size 1 --input 128 --output 8 --enable-mla

# triton w/o cuda graph
# Decode.  median latency: 0.02453 s, median throughput:     40.77 token/s
python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --trust-remote-code --batch-size 1 --input 128 --output 8 --enable-mla --disable-cuda-graph

merrymercy merged commit 3efa798 into main on Sep 12, 2024 (1 check failed)
merrymercy deleted the triton-cuda-graph branch on Sep 12, 2024, 07:36
zhyncs (Member) commented Sep 12, 2024

Significant improvement, especially in small batch latency. Accuracy is similar to before.

ref #1285 (comment)

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --enable-mla --trust-remote-code --disable-radix

lm_eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct,base_url=http://127.0.0.1:30000/v1/completions,num_concurrent=128,max_retries=3,tokenized_requests=False
# run 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7695|±  |0.0116|
|     |       |strict-match    |     5|exact_match|↑  |0.7559|±  |0.0118|

# run 2
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7801|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7688|±  |0.0116|

# run 3
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7741|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7672|±  |0.0116|

The impact on max throughput is not significant: after enabling CUDA graph, TP 1 needs --mem-fraction-static 0.85 (CUDA graph capture reserves extra GPU memory, so the fraction set aside for the KV cache pool has to be lowered), otherwise it results in OOM.

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --enable-mla --trust-remote-code --disable-radix --mem-fraction-static 0.85
python3 -m sglang.bench_serving --backend sglang --num-prompts 5000 

zhyncs (Member) commented Sep 12, 2024

python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --batch-size 1 --input 128 --output 8 --attention-backend triton --trust-remote-code
python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --batch-size 1 --input 128 --output 8 --attention-backend triton --trust-remote-code --disable-cuda-graph
Decode.  median latency: 0.00793 s, median throughput:    126.09 token/s
Decode.  median latency: 0.03645 s, median throughput:     27.44 token/s

zhyncs (Member) commented Sep 12, 2024

python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --batch-size 1 --input 128 --output 8 --attention-backend triton --trust-remote-code --enable-mla
python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --batch-size 1 --input 128 --output 8 --attention-backend triton --trust-remote-code --enable-mla --disable-cuda-graph
Decode.  median latency: 0.00621 s, median throughput:    161.09 token/s
Decode.  median latency: 0.01916 s, median throughput:     52.19 token/s

fengyang95 commented Sep 13, 2024

Hi @zhyncs @merrymercy, does this support sm_89 (L40)? It looks like the CUDA graph path relies on vLLM's fused_moe, which, as far as I can tell, does not support sm_89.

merrymercy (Contributor, Author) replied:

@fengyang95 It should support L40, but I haven't tested it. I think CUDA graph does not depend on specific ops; it just captures the existing ops.
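
For context, a minimal sketch of CUDA graph capture and replay in plain PyTorch (illustrative only, not the sglang code; the tensor shapes are arbitrary): whatever kernels run inside the capture region, whether Triton attention, fused MoE, or anything else, are recorded and replayed as-is, which is why capture itself does not require per-op support.

import torch

# Static buffers: CUDA graphs replay with fixed memory addresses, so inputs
# are copied into pre-allocated tensors before every replay.
static_x = torch.zeros(1, 4096, device="cuda")
weight = torch.randn(4096, 4096, device="cuda")

# Warm up on a side stream before capture, as PyTorch requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        _ = static_x @ weight
torch.cuda.current_stream().wait_stream(s)

# Capture: every kernel launched inside this region is recorded into the graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_y = static_x @ weight

# Replay: copy new data into the static input, then relaunch all recorded
# kernels with a single call, avoiding per-kernel launch overhead.
static_x.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()
print(static_y.sum().item())

Since capture records existing kernels rather than reimplementing them, hardware support mostly comes down to whether the underlying kernels (e.g. the fused MoE kernel) run on that architecture in the first place.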
