Benchmark performance on NVIDIA A10
Here are some preliminary Mooncake benchmark results on A10 with up to 2 RDMA NICs. We are currently having trouble benchmarking PyNcclConnector: for reasons we have not yet identified, it crashes frequently in inter-node disaggregated scenarios, so these results do not yet include PyNcclConnector.
In addition, we are coordinating resources to bring in machines with more RDMA NICs and more advanced GPUs. The official benchmark results will be released in due time.
Varying tp (input length = 1024, qps = 2, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tp = 1 | 2 | 200 | 99.47 | 201995 | 1200 | 2.01 | 12.06 | 2042.74 | 1056.76 | 635.00 | 4006.59 | 97.08 | 26.94 | 781.91 | 97.01 | 14.05 | 2205.51 |
| tp = 2 | 2 | 200 | 98.98 | 201995 | 1200 | 2.02 | 12.12 | 2052.95 | 314.87 | 231.20 | 949.40 | 25.65 | 15.56 | 129.60 | 25.62 | 15.48 | 288.06 |
| tp = 4 | 2 | 200 | 98.76 | 201995 | 1200 | 2.03 | 12.15 | 2057.44 | 198.10 | 160.03 | 461.61 | 23.52 | 18.93 | 94.38 | 23.50 | 18.01 | 187.79 |
| tp = 1 | 1 | 200 | 99.44 | 201995 | 1200 | 2.01 | 12.07 | 2043.39 | 1071.12 | 631.56 | 4361.02 | 83.93 | 26.93 | 794.75 | 83.86 | 14.13 | 1932.66 |
| tp = 2 | 1 | 200 | 98.96 | 201995 | 1200 | 2.02 | 12.13 | 2053.35 | 335.26 | 258.30 | 997.93 | 28.84 | 15.56 | 144.82 | 28.80 | 15.42 | 397.56 |
| tp = 4 | 1 | 200 | 98.78 | 201995 | 1200 | 2.02 | 12.15 | 2057.03 | 201.68 | 162.85 | 456.33 | 22.31 | 16.74 | 94.76 | 22.29 | 16.73 | 189.13 |
| tp = 1 | TCP | 200 | 99.55 | 201995 | 1200 | 2.01 | 12.05 | 2041.13 | 1414.05 | 766.23 | 6035.36 | 155.01 | 35.28 | 1191.24 | 154.91 | 14.32 | 3148.99 |
| tp = 2 | TCP | 200 | 98.97 | 201995 | 1200 | 2.02 | 12.12 | 2053.03 | 333.74 | 251.32 | 954.63 | 28.74 | 15.49 | 161.24 | 28.70 | 15.35 | 393.52 |
| tp = 4 | TCP | 200 | 98.78 | 201995 | 1200 | 2.02 | 12.15 | 2056.94 | 205.37 | 162.92 | 463.70 | 21.54 | 16.51 | 94.04 | 21.51 | 16.56 | 170.54 |

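As a reading aid, the three throughput columns appear to be derived from the request count, token totals, and duration. The sketch below recomputes them for the first row of the tp table (tp = 1, 2 RDMA NICs). The function name `derived_throughput` and the formulas are assumptions made for illustration, not taken from the benchmark script; they agree with the reported values only up to the rounding of Duration.

```python
def derived_throughput(num_requests, duration_s, input_tokens, generated_tokens):
    """Recompute the throughput columns from raw counts (assumed formulas)."""
    return {
        "req_per_s": num_requests / duration_s,
        "output_tok_per_s": generated_tokens / duration_s,
        "total_tok_per_s": (input_tokens + generated_tokens) / duration_s,
    }


# Values copied from the "tp = 1, num_rdma_nic = 2" row.
row = derived_throughput(
    num_requests=200,
    duration_s=99.47,
    input_tokens=201995,
    generated_tokens=1200,
)

for name, value in row.items():
    print(f"{name}: {value:.2f}")
```

The same check can be applied to any other row by substituting its Duration and token totals; small differences in the last digits are expected because the table reports Duration to two decimal places.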
Varying qps (input length = 1024, tp = 4, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| qps = 2 | 2 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.33 | 200.64 | 156.62 | 478.22 | 22.63 | 17.35 | 99.61 | 22.60 | 17.08 | 186.25 |
| qps = 4 | 2 | 200 | 49.75 | 201995 | 1200 | 4.02 | 24.12 | 4084.03 | 341.88 | 240.68 | 1430.54 | 38.36 | 18.39 | 313.45 | 38.31 | 17.17 | 588.80 |
| qps = 6 | 2 | 200 | 33.44 | 201995 | 1200 | 5.98 | 35.88 | 6075.54 | 851.15 | 501.59 | 3239.89 | 102.51 | 47.67 | 606.77 | 102.34 | 18.35 | 1704.79 |
| qps = 8 | 2 | 200 | 27.16 | 201995 | 1200 | 7.36 | 44.19 | 7482.52 | 4835.08 | 5733.45 | 8846.27 | 1276.59 | 1150.11 | 4401.23 | 1274.43 | 48.34 | 20682.35 |
| qps = 2 | 1 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.31 | 201.77 | 161.53 | 473.44 | 22.13 | 16.52 | 96.18 | 22.11 | 16.51 | 190.40 |
| qps = 4 | 1 | 200 | 49.76 | 201995 | 1200 | 4.02 | 24.12 | 4083.83 | 337.31 | 243.38 | 1395.85 | 39.95 | 17.61 | 325.39 | 39.88 | 17.06 | 838.68 |
| qps = 6 | 1 | 200 | 33.44 | 201995 | 1200 | 5.98 | 35.88 | 6075.99 | 820.53 | 458.84 | 3169.52 | 83.92 | 30.50 | 663.07 | 83.78 | 17.85 | 1306.32 |
| qps = 8 | 1 | 200 | 27.19 | 201995 | 1200 | 7.36 | 44.14 | 7473.44 | 5291.91 | 6160.55 | 9596.56 | 1190.36 | 1040.63 | 4418.66 | 1188.33 | 47.61 | 20815.23 |
| qps = 2 | TCP | 200 | 98.76 | 201995 | 1200 | 2.03 | 12.15 | 2057.42 | 207.22 | 160.81 | 511.01 | 22.17 | 16.59 | 94.96 | 22.15 | 16.59 | 181.82 |
| qps = 4 | TCP | 200 | 49.79 | 201995 | 1200 | 4.02 | 24.10 | 4081.06 | 355.43 | 252.63 | 1554.91 | 40.15 | 16.92 | 314.28 | 40.09 | 16.66 | 708.50 |
| qps = 6 | TCP | 200 | 33.49 | 201995 | 1200 | 5.97 | 35.83 | 6067.71 | 907.74 | 514.85 | 3253.93 | 122.75 | 45.51 | 648.40 | 122.56 | 18.09 | 2282.92 |
| qps = 8 | TCP | 200 | 28.39 | 201995 | 1200 | 7.04 | 42.26 | 7156.09 | 6714.57 | 7885.09 | 11787.51 | 1116.06 | 408.32 | 4645.25 | 1114.29 | 46.87 | 21898.03 |

Varying input length (tp = 4, qps = 2, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1024 | 2 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.32 | 195.47 | 151.55 | 482.84 | 22.83 | 19.27 | 96.55 | 22.81 | 18.12 | 158.16 |
| 2048 | 2 | 200 | 99.22 | 406707 | 1200 | 2.02 | 12.09 | 4110.95 | 723.76 | 488.67 | 2941.96 | 67.25 | 18.93 | 632.73 | 67.20 | 17.49 | 1209.54 |
| 4096 | 2 | 200 | 117.42 | 818415 | 1200 | 1.70 | 10.22 | 6979.90 | 14616.48 | 18323.82 | 23191.04 | 8042.84 | 7593.16 | 19851.11 | 8040.02 | 65.43 | 93511.26 |
| 8192 | 2 | 200 | 247.77 | 1636065 | 1200 | 0.81 | 4.84 | 6608.10 | 75783.36 | 79331.60 | 147544.42 | 16961.27 | 15140.11 | 39278.98 | 16958.32 | 90.01 | 186151.61 |
| 1024 | 1 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.31 | 201.77 | 161.53 | 473.44 | 22.13 | 16.52 | 96.18 | 22.11 | 16.51 | 190.40 |
| 2048 | 1 | 200 | 99.25 | 406707 | 1200 | 2.02 | 12.09 | 4109.96 | 719.43 | 482.02 | 3208.13 | 61.92 | 17.64 | 681.26 | 61.86 | 16.83 | 978.90 |
| 4096 | 1 | 200 | 111.88 | 818415 | 1200 | 1.79 | 10.73 | 7326.16 | 20362.10 | 22807.05 | 31853.55 | 5915.16 | 4521.51 | 18739.12 | 5913.18 | 67.03 | 81600.29 |
| 8192 | 1 | 200 | 270.01 | 1636065 | 1200 | 0.74 | 4.44 | 6063.79 | 103355.40 | 106546.65 | 172025.11 | 12894.35 | 11027.66 | 35110.13 | 12892.85 | 64.84 | 151774.68 |
| 1024 | TCP | 200 | 98.81 | 201995 | 1200 | 2.02 | 12.14 | 2056.44 | 203.32 | 160.83 | 460.90 | 21.81 | 16.96 | 95.27 | 21.78 | 16.91 | 171.80 |
| 2048 | TCP | 200 | 99.27 | 406707 | 1200 | 2.01 | 12.09 | 4108.98 | 731.60 | 484.78 | 3213.69 | 68.55 | 17.88 | 639.93 | 68.49 | 17.33 | 1257.45 |
| 4096 | TCP | 200 | 118.37 | 818415 | 1200 | 1.69 | 10.14 | 6923.89 | 23735.69 | 27101.97 | 36573.47 | 6386.62 | 5102.00 | 20032.26 | 6384.71 | 69.57 | 92811.27 |
| 8192 | TCP | 200 | 278.12 | 1636065 | 1200 | 0.72 | 4.31 | 5886.95 | 106873.23 | 109941.33 | 179781.64 | 13360.87 | 12155.24 | 36022.96 | 13359.20 | 68.01 | 156716.38 |