Connection latency significantly affects throughput #1043

Closed
adg opened this issue Jan 10, 2017 · 66 comments · Fixed by #1273
Labels: P1, Type: Performance (performance improvements: CPU, network, memory, etc)

Comments

@adg
Contributor

adg commented Jan 10, 2017

I am working on a system that uses GRPC to send 1MB blobs between clients and servers and have observed some poor throughput when connection latency is high (180ms round trip is typical between Australia and the USA).

The same throughput issues are not present when I take GRPC out of the equation. I have prepared a self-contained program that reproduces the problem on a local machine by simulating latency at the net.Listener level. It can run either using GRPC or just plain HTTP/2 POST requests. In each case the payload and latency are the same, but—as you can see from the data below—GRPC becomes several orders of magnitude slower as latency increases.

The program and related files: https://gist.github.com/adg/641d04ef335782648502cb32a03b2b07
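
For readers who don't want to open the gist, here is a minimal sketch of the kind of latency-injecting net.Listener/net.Conn wrapper the benchmark relies on (the names and structure are illustrative, not the gist's actual code; note that it simply sleeps before each Write, a detail that becomes important later in this thread):

package latency

import (
	"net"
	"time"
)

// latencyConn wraps a net.Conn and sleeps for a fixed one-way delay
// before every Write, simulating network latency in-process.
type latencyConn struct {
	net.Conn
	delay time.Duration
}

func (c *latencyConn) Write(p []byte) (int, error) {
	time.Sleep(c.delay) // inject one-way latency before the data goes out
	return c.Conn.Write(p)
}

// latencyListener wraps a net.Listener so that every accepted
// connection gets the delayed Write behavior above.
type latencyListener struct {
	net.Listener
	delay time.Duration
}

func (l *latencyListener) Accept() (net.Conn, error) {
	c, err := l.Listener.Accept()
	if err != nil {
		return nil, err
	}
	return &latencyConn{Conn: c, delay: l.delay}, nil
}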

The output of a typical run:

$ ./run.sh 
Duration	Latency	Proto

6.977221ms	0s	GRPC
4.833989ms	0s	GRPC
4.714891ms	0s	GRPC
3.884165ms	0s	GRPC
5.254322ms	0s	GRPC

8.507352ms	0s	HTTP/2.0
936.436µs	0s	HTTP/2.0
453.471µs	0s	HTTP/2.0
252.786µs	0s	HTTP/2.0
265.955µs	0s	HTTP/2.0

107.32663ms	1ms	GRPC
102.51629ms	1ms	GRPC
100.235617ms	1ms	GRPC
100.444982ms	1ms	GRPC
100.881221ms	1ms	GRPC

12.423725ms	1ms	HTTP/2.0
3.02918ms	1ms	HTTP/2.0
2.775928ms	1ms	HTTP/2.0
4.161895ms	1ms	HTTP/2.0
2.951534ms	1ms	HTTP/2.0

195.731175ms	2ms	GRPC
190.571784ms	2ms	GRPC
188.810298ms	2ms	GRPC
190.593822ms	2ms	GRPC
190.015888ms	2ms	GRPC

19.18046ms	2ms	HTTP/2.0
4.663983ms	2ms	HTTP/2.0
5.45113ms	2ms	HTTP/2.0
5.56255ms	2ms	HTTP/2.0
5.582744ms	2ms	HTTP/2.0

378.653747ms	4ms	GRPC
362.14625ms	4ms	GRPC
357.95833ms	4ms	GRPC
364.525646ms	4ms	GRPC
364.954077ms	4ms	GRPC

33.666184ms	4ms	HTTP/2.0
8.68926ms	4ms	HTTP/2.0
10.658349ms	4ms	HTTP/2.0
10.741361ms	4ms	HTTP/2.0
10.188444ms	4ms	HTTP/2.0

719.696194ms	8ms	GRPC
699.807568ms	8ms	GRPC
703.794127ms	8ms	GRPC
702.610461ms	8ms	GRPC
710.592955ms	8ms	GRPC

55.66933ms	8ms	HTTP/2.0
18.449093ms	8ms	HTTP/2.0
17.080567ms	8ms	HTTP/2.0
20.597944ms	8ms	HTTP/2.0
17.318133ms	8ms	HTTP/2.0

1.415272339s	16ms	GRPC
1.350923577s	16ms	GRPC
1.355653965s	16ms	GRPC
1.338834603s	16ms	GRPC
1.358419144s	16ms	GRPC

102.133898ms	16ms	HTTP/2.0
39.144638ms	16ms	HTTP/2.0
40.82348ms	16ms	HTTP/2.0
35.133498ms	16ms	HTTP/2.0
39.516466ms	16ms	HTTP/2.0

2.630821843s	32ms	GRPC
2.46741086s	32ms	GRPC
2.507019279s	32ms	GRPC
2.476177935s	32ms	GRPC
2.49266693s	32ms	GRPC

179.271675ms	32ms	HTTP/2.0
72.575954ms	32ms	HTTP/2.0
67.23265ms	32ms	HTTP/2.0
70.651455ms	32ms	HTTP/2.0
67.579124ms	32ms	HTTP/2.0

I theorize that there is something wrong with GRPC's flow control mechanism, but that's just a guess.

@adg
Contributor Author

adg commented Jan 10, 2017

cc @iamqizhao (and @rsc @robpike @bradfitz who may be interested)

@iamqizhao
Contributor

If you observe from 1ms onward, both grpc and http2 increase proportionally. The only issue with grpc is that it starts from a much larger value (i.e., 107ms vs 12ms). We need to figure out what the expected value there is (this counters my previous observation with http2.0...).

@apolcyn, assigning this to you; please help. Thank you.

@adg
Contributor Author

adg commented Jan 10, 2017

@iamqizhao note the relationship between the simulated latency and the request time.

For HTTP/2, with 0 latency a request takes ~1ms. With 1ms of latency (both up and down) you get requests around 3ms (after the first request, when the connection is established). With 2ms of latency, requests are ~5ms. For 4ms latency, requests take ~9ms. See the pattern? So the actual time spent making HTTP/2 requests, minus the introduced latency, is constant (around 1ms).

Contrast this to the GRPC behavior, where the introduced latency multiplies the overall request time.

We need to figure out what the expected value there is (this counters my previous observation with http2.0...).

Starting with zero latency, the GRPC requests take 4-6ms. What we should see for GRPC requests is a similar pattern, where 1ms of latency yields 8ms requests (6ms + 1ms + 1ms), 2ms latency = 10ms requests, and so on.

@iamqizhao
Contributor

iamqizhao commented Jan 10, 2017 via email

@adg
Contributor Author

adg commented Jan 10, 2017

Why would latency have any effect on proto marshaling overhead?

@adg
Contributor Author

adg commented Jan 10, 2017

12.423725ms	1ms	HTTP/2.0
3.02918ms	1ms	HTTP/2.0
2.775928ms	1ms	HTTP/2.0
4.161895ms	1ms	HTTP/2.0
2.951534ms	1ms	HTTP/2.0

The additional 8-10ms in the first request is certainly setting up the HTTP/TLS connection, which increases proportionally with latency as it involves several round trips.

@iamqizhao
Contributor

Err, I see -- I did not look into how you set up the http transport.

@iamqizhao
Contributor

Yes, it seems this is due to a flow control issue -- if I enlarge the flow control window above 1MB, the issue goes away.

@presotto

Perhaps a bit more is happening. At 0s latency, grpc is still at most 1/10th the speed of the HTTP/2 requests. Does grpc adjust its window to the RTT?

rsc added a commit to rsc/tmp that referenced this issue Jan 10, 2017
@rsc

rsc commented Jan 10, 2017

I checked in a variant of adg's code that shows the packet breakup and timing.

go get -d -u rsc.io/tmp/grpcbench
cd $GOPATH/src/rsc.io/tmp/grpcbench
bash run.sh

It only shows the output for 32ms round trip time. This is what HTTP/2 POST looks like (the actual output shows packet boundaries within a given millisecond as <-93+89+1024+1024+1024, but I've added all those up for easier reading):

0.036s
	<-184
0.079s
	->1297
0.080s
	<-93
0.117s
	->51
0.118s
	<-12470
0.119s
	<-53573
0.151s
	->56
	<-38
0.184s
	->120
181.526851ms	32ms	HTTP/2.0

0.184s
	<-46
0.217s
	->83
	<-1024
0.218s
	<-64665
0.250s
	->43
65.878218ms	32ms	HTTP/2.0

There is some connection setup in the first one, and then a whole bunch of uninterrupted data transfer. The second one omits the setup but still has the nice fast data transfer.

Compare with GRPC, again two connections:

0.070s
	->9
0.104s
	->13
0.105s
	<-65702
0.138s
	->9
0.171s
	->13
	<-16393
0.206s
	->13
	<-16393
0.243s
	->13
	<-16393
0.281s
	->13
	<-16392
0.316s
	->13
	<-16393
0.351s
	->13
	<-15360
0.352s
	<-1024
0.353s
	<-9
0.384s
	->13
	<-16393
0.417s
	->13
	<-16392
0.450s
	->13
	<-16384
0.451s
	<-9
0.483s
	->13
	<-2048
0.485s
	<-14345
0.516s
	->13
	<-16393
0.549s
	->13
	<-16392
0.582s
	->13
	<-16393
0.615s
	->13
	<-16393
0.648s
	->13
	<-16393
0.681s
	->13
0.714s
	->13
	<-16392
0.747s
	->13
	<-16393
0.780s
	->13
	<-16393
0.813s
	->13
	<-16393
0.846s
	->13
	<-16392
0.879s
	->13
	<-16393
0.912s
	->13
	<-16393
0.945s
	->13
	<-16393
0.980s
	->13
	<-16392
1.014s
	->13
	<-8192
1.015s
	<-8201
1.051s
	->13
1.052s
	<-16393
1.086s
	->13
	<-16393
1.120s
	->13
	<-1024
1.121s
	<-15368
1.154s
	->13
	<-16393
1.187s
	->13
	<-16393
1.220s
	->13
1.221s
	<-16393
1.254s
	->13
1.287s
	->13
	<-16392
1.320s
	->13
1.321s
	<-16393
1.354s
	->13
	<-16393
1.387s
	->13
	<-16393
1.420s
	->13
1.421s
	<-16392
1.458s
	->13
	<-16393
1.490s
	->13
	<-16393
1.527s
	->13
	<-7168
1.528s
	<-9225
1.560s
	->13
	<-13312
1.561s
	<-3080
1.598s
	->13
	<-16393
1.635s
	->13
	<-16393
1.671s
	->13
	<-16393
1.704s
	->13
1.705s
	<-16392
1.737s
	->13
	<-16393
1.770s
	->13
	<-16393
1.803s
	->13
	<-16393
1.838s
	->13
1.870s
	->13
1.871s
	<-16392
1.904s
	->13
	<-16393
1.938s
	->13
	<-13312
1.939s
	<-3081
1.971s
	->13
	<-16393
2.004s
	->13
	<-16392
2.039s
	->13
	<-16393
2.072s
	->13
	<-16393
2.104s
	->13
	<-16393
2.139s
	->13
2.140s
	<-16392
2.173s
	->13
	<-16393
2.205s
	->13
2.206s
	<-16393
2.239s
	->13
	<-16393
2.273s
	->13
2.274s
	<-16392
2.306s
	->13
	<-34
2.338s
	->13
2.371s
	->23
2.404s
	->13
2.437s
	->14
2.470s
	->13
2.504s
	->33
2.501003152s	32ms	GRPC

2.507s
	<-1024
2.508s
	<-64563
2.538s
	->13
2.571s
	->13
	<-16393
2.604s
	->13
	<-15360
2.605s
	<-1033
2.638s
	->13
	<-16393
2.671s
	->13
	<-16392
2.704s
	->13
	<-7168
2.705s
	<-9225
2.738s
	->13
	<-16393
2.771s
	->13
	<-16393
2.804s
	->13
	<-4096
2.805s
	<-12296
2.837s
	->13
	<-16384
2.838s
	<-9
2.871s
	->13
	<-16393
2.904s
	->13
2.905s
	<-16393
2.937s
	->13
	<-16392
2.971s
	->13
	<-16393
3.004s
	->13
3.005s
	<-16393
3.039s
	->13
3.040s
	<-16393
3.072s
	->13
3.106s
	->13
	<-16392
3.139s
	->13
	<-16393
3.171s
	->13
	<-16393
3.204s
	->13
3.205s
	<-16393
3.238s
	->13
	<-16392
3.271s
	->13
	<-16393
3.304s
	->13
3.305s
	<-16393
3.338s
	->13
	<-16393
3.371s
	->13
	<-14336
3.372s
	<-2056
3.405s
	->13
	<-16393
3.438s
	->13
	<-16393
3.471s
	->13
	<-10240
3.472s
	<-6153
3.505s
	->13
	<-16392
3.538s
	->13
	<-16393
3.571s
	->13
	<-1024
3.572s
	<-15369
3.605s
	->13
	<-16393
3.638s
	->13
3.671s
	->13
3.672s
	<-16392
3.705s
	->13
	<-16393
3.738s
	->13
	<-16393
3.771s
	->13
	<-1024
3.772s
	<-15369
3.805s
	->13
	<-16392
3.838s
	->13
	<-16393
3.871s
	->13
3.872s
	<-16393
3.905s
	->13
	<-16393
3.938s
	->13
	<-16392
3.971s
	->13
	<-1024
3.972s
	<-15369
4.004s
	->13
	<-16393
4.041s
	->13
	<-16393
4.078s
	->13
	<-16392
4.113s
	->13
	<-16393
4.146s
	->13
	<-16393
4.179s
	->13
	<-16393
4.212s
	->13
4.245s
	->13
	<-16392
4.278s
	->13
	<-16393
4.311s
	->13
	<-16393
4.344s
	->13
	<-11264
4.345s
	<-5129
4.377s
	->13
	<-16392
4.410s
	->13
	<-16393
4.443s
	->13
	<-5120
4.445s
	<-11273
4.476s
	->13
	<-16393
4.509s
	->13
	<-16392
4.542s
	->13
	<-16393
4.575s
	->13
	<-16393
4.608s
	->13
	<-16393
4.641s
	->13
	<-13312
4.642s
	<-2048
4.643s
	<-1032
4.674s
	->13
	<-34
4.707s
	->13
4.740s
	->11
4.772s
	->13
4.805s
	->14
4.838s
	->13
4.872s
	->11
2.367555959s	32ms	GRPC

It looks like there are two problems. The first one is that GRPC appears to have added its own flow control on top of HTTP/2's flow control (on top of TCP's flow control), and it has a fixed 16 kB window. The second is that it looks like HTTP/2 POST is compressing while GRPC is not: in addition to having better (no added) flow control, the posts transfer only about 64 kB of data in total. Either that or the test is broken.

@iamqizhao
Contributor

@rsc Thanks for the investigation. Nice findings. In gRPC-Go, there is no separate HTTP/2 flow control (FYI, Brad and I have not finished the migration to the standard HTTP2 package yet, and gRPC-Go still uses its own native HTTP2 implementation). And yes, gRPC does not apply compression by default.

The root cause of gRPC-Go's latency not equaling network RTT plus a constant processing time is its flow control mechanism. Currently, it takes a simplistic form -- it has a fixed 16KB window update, as Russ discovered. It does not tune according to network RTT and incoming message sizes, so it favors low-latency networks and performs poorly on long-haul networks (as shown in this issue). The current mechanism is the initial version and was consistent across languages. Recently, gRPC-C added some more intelligence to its flow control, and more is under design. I summarize the effort we will put into this area as follows:
i) window size tuning upon message size. This should fix this issue;
ii) window size tuning according to network RTT (under design);
iii) public surface to allow user to tune flow control parameters and enforce a server-scoped flow control limit (under design).

@rsc

rsc commented Jan 10, 2017

Can user client code change the fixed flow control window size?

Can user client code enable compression?

@iamqizhao
Contributor

iamqizhao commented Jan 10, 2017

Can user client code change the fixed flow control window size?

Currently they cannot. This is the 3rd effort I mentioned in the previous reply.

Can user client code enable compression?

Yes, with grpc.WithCompressor dial option.

@adg
Contributor Author

adg commented Jan 10, 2017

This change enables compression for GRPC and uses a randomly-generated payload to defeat the effect of the compression:

diff --git a/grpcbench/main.go b/grpcbench/main.go
index 101677d..c03ac0f 100644
--- a/grpcbench/main.go
+++ b/grpcbench/main.go
@@ -1,6 +1,7 @@
 package main
 
 import (
+       "crypto/rand"
        "crypto/tls"
        "flag"
        "fmt"
@@ -41,6 +42,7 @@ func main() {
                                grpc.WithBlock(),
                                grpc.WithTimeout(3 * time.Second),
                                grpc.WithInsecure(),
+                               grpc.WithCompressor(grpc.NewGZIPCompressor()),
                        }
                        conn, err := grpc.Dial(*addr, opts...)
                        if err != nil {
@@ -59,7 +61,12 @@ func main() {
 
                ctx := context.Background()
 
-               msg := strings.Repeat(" ", *msgSize)
+               randomBytes := make([]byte, *msgSize)
+               _, err := rand.Read(randomBytes)
+               if err != nil {
+                       log.Fatal(err)
+               }
+               msg := string(randomBytes)
                for i := 0; i < *numRuns; i++ {
                        t1 := time.Now()
                        var err error
@@ -86,7 +93,7 @@ func main() {
 
        var server *grpc.Server
        if *useGRPC {
-               server = grpc.NewServer()
+               server = grpc.NewServer(grpc.RPCDecompressor(grpc.NewGZIPDecompressor()))
                helloworld.RegisterGreeterServer(server, greeter{})
        }
        l, err := net.Listen("tcp", *addr)

The numbers appear to be more or less the same:

2.477774562s	32ms	GRPC
2.462765459s	32ms	GRPC

176.592079ms	32ms	HTTP/2.0
69.485653ms	32ms	HTTP/2.0

@presotto

presotto commented Jan 10, 2017 via email

@adg
Contributor Author

adg commented Jan 10, 2017

How would that change things?

@presotto

presotto commented Jan 10, 2017 via email

@adg
Contributor Author

adg commented Jan 12, 2017

On investigating this further, I found an issue with the grpcbench program. It was not sending complete HTTP/2 requests, which explains the dramatic difference between GRPC and HTTP/2. GRPC still has performance issues, but it is only about 2x slower than x/net/http2 (not orders of magnitude as first reported).

I updated grpcbench to correctly exercise HTTP/2, and also threw HTTP/1.1 into the mix. See this pull request.

Interestingly, while HTTP/2 is 2x faster than GRPC, HTTP/1.1 is more than 10x faster than HTTP/2. I talked to @bradfitz about this; it is because there's a fixed initial flow control window which never changes, with TODOs to improve this. It would make sense for GRPC to adopt x/net/http2 so that GRPC users benefit when those TODOs are done.

Here's the data from a new run of grpcbench:

Duration	Latency	Proto

9.491853ms	0s	GRPC
4.495137ms	0s	GRPC
5.391501ms	0s	GRPC
4.726007ms	0s	GRPC
5.12824ms	0s	GRPC

14.558358ms	0s	HTTP/2.0
6.382529ms	0s	HTTP/2.0
6.178314ms	0s	HTTP/2.0
5.51633ms	0s	HTTP/2.0
6.029431ms	0s	HTTP/2.0

11.587129ms	0s	HTTP/1.1
5.397509ms	0s	HTTP/1.1
4.237196ms	0s	HTTP/1.1
5.960211ms	0s	HTTP/1.1
108.997722ms	0s	HTTP/1.1

93.566806ms	1ms	GRPC
99.017055ms	1ms	GRPC
93.487803ms	1ms	GRPC
98.34198ms	1ms	GRPC
99.387167ms	1ms	GRPC

59.484852ms	1ms	HTTP/2.0
51.734641ms	1ms	HTTP/2.0
54.935499ms	1ms	HTTP/2.0
156.577757ms	1ms	HTTP/2.0
51.694602ms	1ms	HTTP/2.0

15.841433ms	1ms	HTTP/1.1
5.704361ms	1ms	HTTP/1.1
107.550483ms	1ms	HTTP/1.1
5.40012ms	1ms	HTTP/1.1
5.800495ms	1ms	HTTP/1.1

192.232563ms	2ms	GRPC
188.990258ms	2ms	GRPC
180.028459ms	2ms	GRPC
189.284675ms	2ms	GRPC
192.21624ms	2ms	GRPC

107.48685ms	2ms	HTTP/2.0
92.618437ms	2ms	HTTP/2.0
103.681046ms	2ms	HTTP/2.0
94.787457ms	2ms	HTTP/2.0
96.490496ms	2ms	HTTP/2.0

25.557564ms	2ms	HTTP/1.1
7.750258ms	2ms	HTTP/1.1
6.587036ms	2ms	HTTP/1.1
107.893995ms	2ms	HTTP/1.1
6.236492ms	2ms	HTTP/1.1

355.213791ms	4ms	GRPC
349.819232ms	4ms	GRPC
333.278482ms	4ms	GRPC
338.551819ms	4ms	GRPC
338.591875ms	4ms	GRPC

184.903654ms	4ms	HTTP/2.0
179.907175ms	4ms	HTTP/2.0
180.843634ms	4ms	HTTP/2.0
169.588748ms	4ms	HTTP/2.0
167.390974ms	4ms	HTTP/2.0

32.01951ms	4ms	HTTP/1.1
12.554588ms	4ms	HTTP/1.1
9.637975ms	4ms	HTTP/1.1
112.462356ms	4ms	HTTP/1.1
8.400449ms	4ms	HTTP/1.1

669.925997ms	8ms	GRPC
682.715267ms	8ms	GRPC
669.499247ms	8ms	GRPC
653.192967ms	8ms	GRPC
686.181873ms	8ms	GRPC

365.49212ms	8ms	HTTP/2.0
330.625694ms	8ms	HTTP/2.0
340.397932ms	8ms	HTTP/2.0
329.523659ms	8ms	HTTP/2.0
336.895354ms	8ms	HTTP/2.0

50.715689ms	8ms	HTTP/1.1
15.353749ms	8ms	HTTP/1.1
17.007966ms	8ms	HTTP/1.1
114.972155ms	8ms	HTTP/1.1
18.499317ms	8ms	HTTP/1.1

1.272841685s	16ms	GRPC
1.285137518s	16ms	GRPC
1.278880998s	16ms	GRPC
1.286303382s	16ms	GRPC
1.280546303s	16ms	GRPC

688.093256ms	16ms	HTTP/2.0
637.301763ms	16ms	HTTP/2.0
623.836108ms	16ms	HTTP/2.0
658.244228ms	16ms	HTTP/2.0
633.323672ms	16ms	HTTP/2.0

87.381827ms	16ms	HTTP/1.1
24.798529ms	16ms	HTTP/1.1
122.792087ms	16ms	HTTP/1.1
24.228707ms	16ms	HTTP/1.1
22.021362ms	16ms	HTTP/1.1

2.500501533s	32ms	GRPC
2.41936603s	32ms	GRPC
2.458976246s	32ms	GRPC
2.45729285s	32ms	GRPC
2.475112293s	32ms	GRPC

1.325008844s	32ms	HTTP/2.0
1.185842114s	32ms	HTTP/2.0
1.22962957s	32ms	HTTP/2.0
1.195622848s	32ms	HTTP/2.0
1.188314041s	32ms	HTTP/2.0

150.393105ms	32ms	HTTP/1.1
36.553453ms	32ms	HTTP/1.1
40.572305ms	32ms	HTTP/1.1
40.395134ms	32ms	HTTP/1.1
146.914462ms	32ms	HTTP/1.1

@apolcyn
Contributor

apolcyn commented Jan 18, 2017

I think this method of simulated latency might be unfair.

As is, it looks like the server is basically sleeping for the full latency (e.g., 32ms) before each write to the network. AFAICS this blocking behavior might be realistic when, e.g., sending the large 1MB message from a fast sender to a slow receiver, or possibly sending across a faulty network, but I think it's probably unfair otherwise. It looks like the 13-byte window updates coming from the server (receiver) are taking much longer than they should, which makes the effects of the small window and update threshold larger than they should be.

FWIW, looking into the logs posted by @rsc:

At first the client has a fresh 64K window, so it sends the first 64K of its 1MB message. It doesn't arrive at the server until 32ms later because of latency. The window update isn't sent until ~64ms later (it took 32ms to send the settings ack at 0.138?).

When the server receives the 64K at 0.138 (32ms after the send), it should be ready to give 4 window updates of 16K each right away (GRPC uses a 16K threshold to send an update), and then the sender should again be able to send 64K when all of these arrive (32ms later). Because each write is blocking, though, the window updates are spread out with 32ms between each.

0.105s
	<-65702
0.138s
	->9
0.171s
	->13
	<-16393
0.206s
	->13
	<-16393
0.243s
	->13
	<-16393
0.281s
	->13
	<-16392

@apolcyn
Contributor

apolcyn commented Jan 19, 2017

I tried simulating latency on my localhost by using:

tc qdisc add dev lo root handle 1:0 netem delay 100msec

Then I ran the test here with 0ms latency induced from the Go test code, and got:

Duration	Latency	Proto

3.41255655s	0s	GRPC
3.412687395s	0s	GRPC
3.411504574s	0s	GRPC
3.411832411s	0s	GRPC
3.411427349s	0s	GRPC

4.241488937s	0s	HTTP/2.0
3.413378604s	0s	HTTP/2.0
3.412536441s	0s	HTTP/2.0
3.413333573s	0s	HTTP/2.0
3.413284508s	0s	HTTP/2.0

1.637170729s	0s	HTTP/1.1
602.145344ms	0s	HTTP/1.1
601.863744ms	0s	HTTP/1.1
406.623978ms	0s	HTTP/1.1
406.15212ms	0s	HTTP/1.1

My main observation was that the network write calls can now happen much more asynchronously (not blocking the app for a full RTT on each write). The strict ping-pong between window updates and data transfers is gone (previously, gRPC's relatively small window size and frequent window updates were paying a really large penalty for this).

I printed out the sums of the transfers and their times again with this "tc-induced 100ms latency" (200ms RTT); see https://gist.github.com/apolcyn/ad0068220c74130427f0030cdc095e55

e.g., at line 220 from that log (middle of a gRPC message send):

  • Note how the server is now able to send multiple window updates within a short time period, which I believe is realistic - so the client's view of the window size stays higher.
1.994s
     <-27648
     ->13
     <-21530
     ->26
2.193s
2.193s
     <-2048
2.194s
2.194s
     <-14345
     ->13
     <-22528

@adg
Contributor Author

adg commented Jan 20, 2017 via email

@apolcyn
Contributor

apolcyn commented Jan 31, 2017

Sorry if I was unclear in the last two comments; just to summarize:

  • There are definitely issues with grpc and net/http2 vs. http1 for large messages because of the http2 flow control window size. Possible solutions were noted by @iamqizhao.
  • net/http2 and grpc look like they're both using the same fixed window size - their performance in this situation looks about the same.
  • Differences in when the receiver sends window updates to the sender should be irrelevant. Mainly, I think the latency simulation should be changed to rule out things like that.

@petermattis
Contributor

CockroachDB just stumbled upon this same issue. GRPC has severe bandwidth limitations on high-latency networks. The scenario I encountered was sending ~60MB of data between two servers with a 60ms RTT (using a streaming RPC to send the data in 1MB chunks). I naively expected this to take a few seconds, as nc-based tests were able to demonstrate 40MB/s. Instead, the transfer took ~60s, just as you would expect with the fixed 64KB stream window size.
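
(As a back-of-the-envelope check of that expectation, assuming the sender can drain one 64KB stream window per round trip: 64KB / 60ms ≈ 1.07MB/s, and 60MB / 1.07MB/s ≈ 56s, which matches the observed ~60s almost exactly.)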

iii) public surface to allow user to tune flow control parameters and enforce a server-scoped flow control limit (under design).

Is there any progress I can follow or help I can give? Perhaps a short term unsupported solution which allows configuration of initialWindowSize and initialConnWindowSize on a per-server/client basis.

@c4milo
Contributor

c4milo commented Feb 28, 2017

Is there an issue open to follow up on the migration to the standard HTTP2 package? I couldn't find it.

@apolcyn
Contributor

apolcyn commented Mar 1, 2017

@petermattis

Is there any progress I can follow or help I can give? Perhaps a short term unsupported solution which allows configuration of initialWindowSize and initialConnWindowSize on a per-server/client basis.

cc @menghanl. Maybe some temporary API options on client conns and grpc servers would work? There's also #1073 as another option.

@c4milo

Is there an issue open to follow up on the migration to the standard HTTP2 package? I couldn't find it.

cc @sailat. But no, I think this may have been brought up in some other issues, but there is no dedicated issue for it.

@rakyll

rakyll commented May 23, 2017

I re-ran rsc's grpcbench against master. You can ignore the HTTP/2 results (for the reason explained at #1043 (comment)), and compare the GRPC results with the previous ones. I think the new results confirm the improvement.

Duration	Latency	Proto

5.330994ms	0s	GRPC
4.009134ms	0s	GRPC
3.763ms	        0s	GRPC
3.527573ms	0s	GRPC
3.38252ms	0s	GRPC

12.472857ms	0s	HTTP/2.0
5.688769ms	0s	HTTP/2.0
6.409501ms	0s	HTTP/2.0
108.958873ms	0s	HTTP/2.0
6.5151ms	0s	HTTP/2.0

10.245881ms	0s	HTTP/1.1
3.852141ms	0s	HTTP/1.1
3.36458ms	0s	HTTP/1.1
3.510561ms	0s	HTTP/1.1
144.650239ms	0s	HTTP/1.1

14.371479ms	1ms	GRPC
12.052484ms	1ms	GRPC
13.174659ms	1ms	GRPC
14.036349ms	1ms	GRPC
11.440052ms	1ms	GRPC

64.310362ms	1ms	HTTP/2.0
55.559025ms	1ms	HTTP/2.0
53.085947ms	1ms	HTTP/2.0
55.087295ms	1ms	HTTP/2.0
55.589108ms	1ms	HTTP/2.0

17.123458ms	1ms	HTTP/1.1
5.117912ms	1ms	HTTP/1.1
4.775117ms	1ms	HTTP/1.1
5.077654ms	1ms	HTTP/1.1
4.713239ms	1ms	HTTP/1.1

24.798243ms	2ms	GRPC
21.212411ms	2ms	GRPC
24.497939ms	2ms	GRPC
23.300286ms	2ms	GRPC
23.151755ms	2ms	GRPC

109.657809ms	2ms	HTTP/2.0
95.343032ms	2ms	HTTP/2.0
90.149519ms	2ms	HTTP/2.0
97.416299ms	2ms	HTTP/2.0
97.81356ms	2ms	HTTP/2.0

22.973429ms	2ms	HTTP/1.1
5.949754ms	2ms	HTTP/1.1
6.118488ms	2ms	HTTP/1.1
6.083907ms	2ms	HTTP/1.1
6.043363ms	2ms	HTTP/1.1

43.104757ms	4ms	GRPC
38.983007ms	4ms	GRPC
43.200068ms	4ms	GRPC
40.585982ms	4ms	GRPC
42.068339ms	4ms	GRPC

195.434971ms	4ms	HTTP/2.0
177.185796ms	4ms	HTTP/2.0
174.232438ms	4ms	HTTP/2.0
174.821399ms	4ms	HTTP/2.0
178.38124ms	4ms	HTTP/2.0

33.23372ms	4ms	HTTP/1.1
8.903388ms	4ms	HTTP/1.1
111.990624ms	4ms	HTTP/1.1
8.636087ms	4ms	HTTP/1.1
9.185476ms	4ms	HTTP/1.1

77.66905ms	8ms	GRPC
80.779861ms	8ms	GRPC
79.943702ms	8ms	GRPC
77.576871ms	8ms	GRPC
75.829066ms	8ms	GRPC

358.309164ms	8ms	HTTP/2.0
330.676264ms	8ms	HTTP/2.0
335.010852ms	8ms	HTTP/2.0
326.09494ms	8ms	HTTP/2.0
328.849144ms	8ms	HTTP/2.0

52.560471ms	8ms	HTTP/1.1
13.88925ms	8ms	HTTP/1.1
114.867623ms	8ms	HTTP/1.1
13.123327ms	8ms	HTTP/1.1
14.003239ms	8ms	HTTP/1.1

158.107222ms	16ms	GRPC
154.017713ms	16ms	GRPC
155.516585ms	16ms	GRPC
153.444526ms	16ms	GRPC
150.978996ms	16ms	GRPC

702.038344ms	16ms	HTTP/2.0
642.113313ms	16ms	HTTP/2.0
635.178695ms	16ms	HTTP/2.0
640.926487ms	16ms	HTTP/2.0
636.483788ms	16ms	HTTP/2.0

86.444409ms	16ms	HTTP/1.1
22.05304ms	16ms	HTTP/1.1
19.950935ms	16ms	HTTP/1.1
125.511693ms	16ms	HTTP/1.1
23.579457ms	16ms	HTTP/1.1

345.991851ms	32ms	GRPC
286.761732ms	32ms	GRPC
290.791148ms	32ms	GRPC
282.957969ms	32ms	GRPC
280.404998ms	32ms	GRPC

1.30835792s	32ms	HTTP/2.0
1.183078984s	32ms	HTTP/2.0
1.196037184s	32ms	HTTP/2.0
1.187915833s	32ms	HTTP/2.0
1.189489888s	32ms	HTTP/2.0

150.990974ms	32ms	HTTP/1.1
42.193952ms	32ms	HTTP/1.1
38.220967ms	32ms	HTTP/1.1
36.277676ms	32ms	HTTP/1.1
142.413035ms	32ms	HTTP/1.1

@rsc

rsc commented May 23, 2017

Improvement, yes, but I wouldn't call it fixed. Even at a modest 8ms latency, GRPC (>=75ms) is almost 4X slower than HTTP/1.1 (>=20ms). At 32ms GRPC (>= 280ms) is 7X slower than HTTP/1.1 (>= 36ms). Those used to be 40X and 30X, so yes improvement, but still more than users are going to want to give up.

@MakMukhi
Contributor

The numbers show that we are not there yet, and we plan to keep iterating on it. Reopening the issue to track it here.

@MakMukhi MakMukhi reopened this May 23, 2017
@dfawley
Member

dfawley commented May 24, 2017

This benchmark is flawed. Every call to the throttler's Write method incurs the network latency before it returns (see lines 128, 89 & 102 in the linked file), whereas a normal net.Conn's Write function will return as soon as the data is committed to send. When the grpc server sends the response to the client, it writes control frames (to absorb the request payload), headers, data frames, and trailers in separate calls to Write. Since each call has a fixed 32ms delay, that explains why our performance appears so poor in these benchmarks.

@MakMukhi is going to see whether he can improve the throttler to only perform the sleep in the middle of transmitting the data between the writer and reader (instead of in-line in the Write call), and if so, we will provide updated numbers with this benchmark.
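
A rough sketch of that idea (hypothetical code, not @MakMukhi's or @dfawley's actual throttler): Write returns immediately and records a delivery deadline, and a background goroutine forwards each chunk onto the real connection only after the simulated one-way latency has elapsed, so window updates and data frames are no longer serialized behind blocking writes:

package throttle

import (
	"net"
	"time"
)

type chunk struct {
	data  []byte
	ready time.Time // earliest time this chunk may be delivered
}

// delayedConn simulates one-way latency without blocking the writer.
// Wrap both the client and the server net.Conn to simulate a full RTT.
type delayedConn struct {
	net.Conn
	latency time.Duration
	queue   chan chunk
}

func newDelayedConn(c net.Conn, latency time.Duration) *delayedConn {
	d := &delayedConn{Conn: c, latency: latency, queue: make(chan chunk, 1024)}
	go d.pump() // buffered queue plus a draining goroutine keeps Write from blocking
	return d
}

func (c *delayedConn) Write(p []byte) (int, error) {
	buf := append([]byte(nil), p...) // copy; the caller may reuse p
	c.queue <- chunk{data: buf, ready: time.Now().Add(c.latency)}
	return len(p), nil // report success immediately, like a kernel send buffer
}

// pump delivers queued chunks once their deadline has passed
// (write errors are ignored for brevity).
func (c *delayedConn) pump() {
	for ck := range c.queue {
		time.Sleep(time.Until(ck.ready))
		c.Conn.Write(ck.data)
	}
}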

In the process of brainstorming performance ideas, we have come up with another improvement for unary RPCs -- we will send a very large window update to the server for the stream immediately after the request is sent. This will enable the server to send the entire response once it is ready, without waiting for a window update after the first 64k is received by the client. This should eliminate one round-trip for large response messages.

@MakMukhi
Contributor

We have tried several approaches to fix the benchmarks, but none has worked yet. At this point we're inclined towards running the client and server code on VMs in different continents and comparing the numbers. I'll post those numbers by tomorrow.

@MakMukhi
Contributor

MakMukhi commented May 25, 2017

Results from running the same benchmark on VMs in different continents (server in Asia and client in North America).
Latency(RTT): 152ms

GRPC:
Duration: 1.154495933s for GRPC
Duration: 462.041517ms for GRPC
Duration: 460.35183ms for GRPC
Duration: 460.48016ms for GRPC
Duration: 460.286916ms for GRPC

HTTP 1:
Duration: 1.572714757s for HTTP 1
Duration: 305.216448ms for HTTP 1
Duration: 156.844879ms for HTTP 1
Duration: 157.044419ms for HTTP 1
Duration: 157.040266ms for HTTP 1

HTTP2:
Duration: 3.251843565s for HTTP 2
Duration: 2.596755423s for HTTP 2
Duration: 2.595688722s for HTTP 2
Duration: 2.596232422s for HTTP 2
Duration: 2.594843281s for HTTP 2

@dfawley
Member

dfawley commented May 25, 2017

I took @rsc's benchmark and copied it into a branch here. I modified the throttler to make the reader apply a delay corresponding to when Write was called, plus a latency, and I wrapped both the client and the server net.Conns. Here are my numbers with a one-way latency of 76ms, which matches up almost exactly with the real-world samples gathered by @MakMukhi:

459.528466ms	76ms	GRPC
459.264934ms	76ms	GRPC
460.386324ms	76ms	GRPC
458.850075ms	76ms	GRPC
460.898012ms	76ms	GRPC

2.904975973s	76ms	HTTP/2.0
2.595898404s	76ms	HTTP/2.0
2.595900343s	76ms	HTTP/2.0
2.594949489s	76ms	HTTP/2.0
2.594792549s	76ms	HTTP/2.0

471.882055ms	76ms	HTTP/1.1
156.190616ms	76ms	HTTP/1.1
157.006169ms	76ms	HTTP/1.1
156.707524ms	76ms	HTTP/1.1
156.684356ms	76ms	HTTP/1.1

So we're at approximately 3x the time of HTTP1, which is about twice as good as the previous measurement showed, but still more than the 2x of HTTP1 that I would expect because of flow control.

@dfawley
Member

dfawley commented May 25, 2017

Some explanation of the numbers: the reason it's 3x instead of 2x is because of the size of the request (1MB) very closely matching our connection window. Only once 25% of the connection window worth of data is received by the server does it send a window update back to the client. The timing looks like this:

 0ms - Client initiates connection.  Server conn window: 1MB, stream window: 64KB (defaults).
     - Client sends headers and the first 64k-<delta> of request data, where
       <delta> represents a tiny header we put on the data.
     - Server conn window: 960KB, stream window: 0.

32ms - Server receives first 64KB-<delta> of data.
     - Server sends 1MB+<delta>-64KB window update for the stream to let the
       client know to send the entire message.

64ms - Client receives window update.  Server conn window: 960KB, stream window: 960KB+<delta>.
     - Client sends remaining request data -- except for <delta>, before running
       out of _connection_ window quota.
     - Server conn window: 0MB, stream window: <delta>

96ms - Server receives data.
     - Server sends 1MB connection window updates (split in 4 updates)

128ms - Client receives window updates.  Server conn window: 1MB, stream window: <delta>.
      - Client sends remaining <delta> of payload.

160ms - Server receives remainder of request payload.  Sends response.
192ms - Client receives server response; done.

If we set the initial connection window size to 2MB manually (via grpc.InitialConnWindowSize(2<<20)), that removes one hop, and we see ~132ms. If we also set the initial stream window size to 2MB (via grpc.InitialWindowSize(2<<20)), then we can perform the entire RPC in a single hop.
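
For anyone who wants to experiment with those knobs, here is a minimal sketch (assuming a grpc-go version recent enough to expose these options; the server-side options are the ones named above, and grpc.WithInitialWindowSize / grpc.WithInitialConnWindowSize are the client-side DialOption counterparts):

package tuning

import (
	"log"

	"google.golang.org/grpc"
)

// newTunedServer raises the per-stream and per-connection receive
// windows to 2MB, as described in the comment above.
func newTunedServer() *grpc.Server {
	return grpc.NewServer(
		grpc.InitialWindowSize(2<<20),     // per-stream window
		grpc.InitialConnWindowSize(2<<20), // per-connection window
	)
}

// dialTuned applies the equivalent client-side settings.
func dialTuned(addr string) *grpc.ClientConn {
	conn, err := grpc.Dial(addr,
		grpc.WithInsecure(),
		grpc.WithInitialWindowSize(2<<20),
		grpc.WithInitialConnWindowSize(2<<20),
	)
	if err != nil {
		log.Fatal(err)
	}
	return conn
}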

@MakMukhi
Contributor

Following is our near-future plan to increase performance further:

  1. Further optimize unary RPCs to expect a large response from the server: this is a fairly straightforward change; however, it won't affect this benchmark.
  2. Batching: there's a bug in the transport that prevents batching from taking place. The fix for this is quite simple too. Again, this won't have a major impact on this benchmark.
  3. BDP estimation and dynamic flow control: this is the next thing we're taking up now, and we hope it will bring these benchmark numbers down further.

@robpike

robpike commented Jun 5, 2017

I don't believe this should be closed until the process of setting the window size is automatic and the process of making gRPC run fast is clearly documented.

@glycerine

I don't believe this should be closed until gRPC is as fast as http1 on the 76msec benchmark; out of the box, with no tuning required.

@presotto

presotto commented Jun 6, 2017 via email

@MakMukhi
Contributor

MakMukhi commented Jun 6, 2017

  1. We are not claiming that we'll stop working on gRPC performance after this; we're actively looking into strategies to dynamically update flow control.
  2. We acknowledge appropriate documentation should be written.
  3. However, we'd rather open a separate issue for both of them. It is important to close this issue since it contains misinformation (broken benchmarks that show gRPC's performance much worse than it is.)

@adg
Contributor Author

adg commented Jun 6, 2017

@MakMukhi I think there is benefit in creating a tracking bug specifically to address throughput issues on high-latency links. Issue #1280 does not mention latency at all (I know this is what "BDP estimation and flow control" refers to). In my experience, issues focused on actual user problems yield better solutions.

@MakMukhi
Contributor

MakMukhi commented Jun 6, 2017

@adg That's a good point. I've updated #1280 to mention that. Feel free to update it further.

@glycerine

glycerine commented Jun 6, 2017

It is important to close this issue since it contains misinformation (broken benchmarks that show gRPC's performance much worse than it is.)

The above is not a legitimate reason to close this issue--in fact, just the opposite. The most recent measurements in this thread show a 3x throughput loss that has not been addressed, so the title is still accurate, and this remains an open issue. The #1280 issue is not nearly as specific, and loses a ton of useful context for those trying to understand the defects of gRPC in its current form.

glycerine added a commit to glycerine/grpc-go that referenced this issue Jun 6, 2017
Improves throughput by 5x.

Partially addresses grpc#1043.
@glycerine

glycerine commented Jun 6, 2017

I just re-tested against tip (8de2dff) on an actual high-latency 109ms (one-way) link between New Jersey and Surry Hills, New South Wales, Australia. Changing the initial window sizes, as in pull request #1153, doubles the obtained throughput. So #1153 was also prematurely closed without being implemented, and it gives a 2x throughput boost.

@dfawley
Member

dfawley commented Jun 6, 2017

@robpike, @adg, @glycerine

I'm concerned that if we optimize performance for this one workload, we may do so at the expense of other, more realistic scenarios. Flow control exists as a way to trade off bandwidth against application memory usage and tail latency. It would be trivial to change the defaults of grpc-go to maximize bandwidth utilization at the cost of the other two, but that is not our goal. While 64MB seems like a reasonable setting for the window size when you think about it in terms of a single connection, if you consider that servers may have thousands of open connections at once, that quickly turns into a production problem waiting to happen.

At the end of the day, grpc-go needs to be a reasonable out-of-the-box solution for most users, while providing configuration knobs for advanced users to optimize for either speed or memory footprint based upon their individual needs. Also note that any changes to the default behavior have to be made very cautiously, as they will impact all our users.

To that end, our plans for the next month or so are as follows:

  • Add a collection of benchmarks to measure our performance for different types of workloads under various network conditions (high and low bandwidth and latency). Performance metrics will include throughput, QPS, and tail latency (as well as localized metrics like allocations, syscalls, and CPU usage).

  • Provide a setting to limit grpc's memory usage per connection. The default setting for this is TBD.

  • Provide a setting to limit grpc's memory usage per stream. The default setting for this is likely to be the measured BDP of the connection to optimize for maximum performance per stream (and if not, we will provide a way to configure it that way).

  • Implement other optimizations where we find opportunities, like batching pending window updates and pre-emptively sending a large window update along with a unary request.

  • Document how grpc-go manages flow control, how all the various settings impact that, and how to optimize settings for different use cases.

  • Further in the future, provide a setting to limit grpc's memory usage per-Server and per-Client.

This bug will remain closed due to the misinformation / FUD caused by the benchmark it was predicated upon. Significant improvements have already been made for these types of scenarios and settings were added to allow users to easily trade-off memory usage to achieve better performance. I'll file a few extra issues to track things more granularly than #1280 and link them here later today. Thanks for your understanding and patience.

@robpike

robpike commented Jun 7, 2017

Thanks, @dfawley, that sounds like a great plan.

@mjgarton

mjgarton commented Jan 5, 2018

Should I assume that since this bug is closed, there are no remaining issues with latency significantly affecting throughput?

I only ask because from @dfawley's comments it sounds like there is more to do.

@dfawley did you get around to creating the "few extra issues" that you mentioned? If so can you link them please?

@dfawley
Member

dfawley commented Jan 5, 2018

Sorry, I forgot to circle back and link the issues here. You should refer to our performance label for an up-to-date list of our perf issues.

Since this bug was popular, we've added auto-window-size updating based upon BDP estimation, which we believe should completely resolve this issue. Most of our performance work remaining at this point is optimizations to improve CPU utilization by reducing syscalls and avoiding allocations (thereby reducing GC), and reducing lock contention to support higher QPS on each connection.
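
(As a rough illustration of why BDP-based sizing addresses the original report: for the link @petermattis described earlier, roughly 40MB/s of available bandwidth at 60ms RTT, the bandwidth-delay product is about 40MB/s × 0.06s ≈ 2.4MB, so the receive window has to grow to at least that size before flow control stops being the bottleneck; the old fixed 64KB stream window was nearly 40x too small for that path.)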

@lock lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018