Connection latency significantly affects throughput #1043
cc @iamqizhao (and @rsc @robpike @bradfitz who may be interested) |
If you look at the numbers from 1ms onward, both grpc and http2 increase proportionally. The only issue with grpc is that it starts at a much larger value (i.e., 107ms vs 12ms). We need to figure out what the expected value there is (this counters my previous observation with http2.0...). @apolcyn, assigning this to you; please help. Thank you. |
@iamqizhao note the relationship between the simulated latency and the request time. For HTTP/2, with 0 latency a request takes ~1ms. With 1ms of latency (both up and down) you get requests around 3ms (after the first request, when the connection is established). With 2ms of latency, requests are ~5ms. For 4ms latency, requests take ~9ms. See the pattern? So the actual time spent making HTTP/2 requests, minus the introduced latency, is constant (around 1ms). Contrast this to the GRPC behavior, where the introduced latency multiplies the overall request time.
Starting with zero latency, the GRPC requests take 4-6ms. What we should see for GRPC requests is a similar pattern, where 1ms of latency yields 8ms requests (6ms + 1ms + 1ms), 2ms latency = 10ms requests, and so on. |
as I said, your code cached the string used by http2. For example,
12.423725ms 1ms HTTP/2.0
3.02918ms 1ms HTTP/2.0
2.775928ms 1ms HTTP/2.0
4.161895ms 1ms HTTP/2.0
2.951534ms 1ms HTTP/2.0
The 1st line is much larger than the rest; I believe this means the rest take advantage of caching. When I said they both increase proportionally, I was referring to the first line of the http2 results.
|
Why would latency have any effect on proto marshaling overhead? |
The additional 8-10ms in the first request is certainly setting up the HTTP/TLS connection, which increases proportionally with latency as it involves several round trips. |
Err, I see -- I did not look into how you set up the http transport. |
Yes, it seems this is due to a flow control issue -- if I enlarge the flow control window above 1M, the issue goes away. |
Perhaps a bit more is happening. At 0s latency grpc is still at most 1/10th the speed of https. Does grpc adjust its window to the RTT? |
I checked in a variant of adg's code that shows the packet breakup and timing.
It only shows the output for 32ms round trip time. This is what HTTP/2 POST looks like (the actual output shows packet boundaries within a given millisecond as
There is some connection setup in the first one, and then a whole bunch of uninterrupted data transfer. The second one omits the setup but still has the nice fast data transfer. Compare with GRPC, again two connections:
It looks like there are two problems. The first one is that GRPC appears to have added its own flow control on top of HTTP/2's flow control (on top of TCP's flow control), and it has a fixed 16 kB window. The second is that it looks like HTTP/2 POST is compressing while GRPC is not: in addition to having better (no added) flow control, the posts transfer only about 64 kB of data in total. Either that or the test is broken. |
@rsc Thanks for the investigation. Nice findings. In gRPC-Go, there is no separate HTTP/2 flow control layer (FYI, Brad and I have not finished the migration to the standard HTTP2 package yet, and gRPC-Go still uses its own native HTTP2 implementation). And yes, gRPC does not apply compression by default. The root cause of gRPC-Go's latency not being network RTT + a constant processing time is its flow control mechanism. Currently it takes a simplistic form -- a fixed 16KB window update, as Russ discovered. It does not tune itself to the network RTT or incoming message sizes, so it favors low-latency networks and performs poorly on long-haul networks (as shown in this issue). The current mechanism is the initial version and was kept consistent across languages. Recently, gRPC-C added some more intelligence to its flow control, and more is under design. I summarize the effort we will put into this area as follows: |
Can user client code change the fixed flow control window size? Can user client code enable compression? |
Currently they cannot. This is the 3rd effort I mentioned in the previous reply.
Yes, with the grpc.WithCompressor dial option. |
This change enables compression for GRPC and uses a randomly-generated payload to defeat the effect of the compression:
diff --git a/grpcbench/main.go b/grpcbench/main.go
index 101677d..c03ac0f 100644
--- a/grpcbench/main.go
+++ b/grpcbench/main.go
@@ -1,6 +1,7 @@
 package main
 import (
+	"crypto/rand"
 	"crypto/tls"
 	"flag"
 	"fmt"
@@ -41,6 +42,7 @@ func main() {
 		grpc.WithBlock(),
 		grpc.WithTimeout(3 * time.Second),
 		grpc.WithInsecure(),
+		grpc.WithCompressor(grpc.NewGZIPCompressor()),
 	}
 	conn, err := grpc.Dial(*addr, opts...)
 	if err != nil {
@@ -59,7 +61,12 @@ func main() {
 	ctx := context.Background()
-	msg := strings.Repeat(" ", *msgSize)
+	randomBytes := make([]byte, *msgSize)
+	_, err := rand.Read(randomBytes)
+	if err != nil {
+		log.Fatal(err)
+	}
+	msg := string(randomBytes)
 	for i := 0; i < *numRuns; i++ {
 		t1 := time.Now()
 		var err error
@@ -86,7 +93,7 @@ func main() {
 	var server *grpc.Server
 	if *useGRPC {
-		server = grpc.NewServer()
+		server = grpc.NewServer(grpc.RPCDecompressor(grpc.NewGZIPDecompressor()))
 		helloworld.RegisterGreeterServer(server, greeter{})
 	}
 	l, err := net.Listen("tcp", *addr)
The numbers appear to be more or less the same:
2.477774562s 32ms GRPC
2.462765459s 32ms GRPC
176.592079ms 32ms HTTP/2.0
69.485653ms 32ms HTTP/2.0
|
I believe you really want different bytes for each message, not just an
initial random setting.
|
How would that change things? |
If they are using a learned dictionary compressor, you would only be sending the data once.
|
On investigating this further, I found an issue with the grpcbench program. It was not sending complete HTTP/2 requests, which explains the dramatic difference between GRPC and HTTP/2. GRPC still has performance issues, but it is only about 2x slower than x/net/http2 (not orders of magnitude as first reported). I updated grpcbench to correctly exercise HTTP/2, and also threw HTTP/1.1 into the mix. See this pull request. Interestingly, while HTTP/2 is 2x faster than GRPC, HTTP/1.1 is more than 10x faster than HTTP/2. I talked to @bradfitz about this, and it is because there's a fixed initial flow control window which never changes, with TODOs to improve this. It would make sense for GRPC to adopt x/net/http2 so that GRPC users benefit when those TODOs are done. Here's the data from a new run of grpcbench:
|
I think this method of simulating latency might be unfair. As is, it looks like the server is basically sleeping for the full latency (e.g., 32ms) before each write to the network. AFAICS this blocking behavior might be realistic when, e.g., sending the large 1MB message from a fast sender to a slow receiver, or possibly sending across a faulty network, but I think it's probably unfair otherwise. It looks like the 13-byte window updates coming from the server (receiver) are taking much longer than they should, so the effects of the small window and update threshold are larger than they should be.
FWIW, looking into the logs posted by @rsc: at first the client has a fresh 64K window, so it sends the first 64K of its 1MB message. It doesn't arrive at the server until 32ms later because of latency. The window update isn't sent until ~64ms later (it took 32ms to send the settings ack at 0.138?). When the server receives the 64K at 0.138 (32ms after the send), it should be ready to give 4 window updates of 16K each right away (GRPC uses a 16K threshold to send an update), and then the sender should again be able to send 64K when all of these arrive (32ms later). Because each write is blocking, though, the window updates are spread out with 32ms between each.
|
I tried simulating latency on my localhost by using:
running the test here with
My main observation was that the network write calls can now happen much more asynchronously (not blocking the app for a full RTT on each write). The strict ping-pong between window updates and data transfers is gone (previously gRPC's relatively small window size and frequent window updates paid a really large penalty for this). I printed out the sums of the transfers and their times again with this "tc induced 100ms latency" (200ms RTT); see https://gist.github.com/apolcyn/ad0068220c74130427f0030cdc095e55 e.g., line 220 from that log (middle of a gRPC message send):
|
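The exact tc invocation isn't captured above; a typical way to inject this kind of latency on the loopback interface with Linux netem (illustrative only, not necessarily the command used in that test) is:
sudo tc qdisc add dev lo root netem delay 100ms   # 100ms each way, so ~200ms RTT between local peers
sudo tc qdisc del dev lo root                     # remove the delay when finished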
Interesting. The 5-6x slowdown still makes GRPC (and x/net/http2) unusable for us, unfortunately. I intend to work with @bradfitz on improving HTTP/2, because it's a codebase I understand.
|
Sorry if the last two comments were unclear; just to summarize:
|
CockroachDB just stumbled upon this same issue. GRPC has severe bandwidth limitations on high-latency networks. The scenario I encountered was sending ~60MB of data between two servers with a 60ms RTT (using a streaming RPC to send the data in 1MB chunks). I naively expected this to take a few seconds as
Is there any progress I can follow or help I can give? Perhaps a short term unsupported solution which allows configuration of |
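As a rough sanity check of why the naive expectation fails (assuming the fixed 64KB connection window discussed earlier in this thread is the bottleneck): throughput on a window-limited connection is bounded by window size divided by RTT.
64 KB / 0.060 s   ≈ 1.1 MB/s   (best case with a fixed 64 KB window and a 60 ms RTT)
60 MB / 1.1 MB/s  ≈ 55 s       (versus well under a second of raw transfer time on a fast link)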
Is there an issue open to follow up on the migration to the standard HTTP2 package? I couldn't find it. |
cc @menghanl. Maybe adding some temporary API options to client conns and grpc servers is an option? There's #1073 as another option.
cc @sailat. No, I think this might have been brought up in some other issues, but there is no dedicated issue about that.
I re-ran rsc's grpcbench against master. You can ignore the HTTP/2 results (due to the reason explained at #1043 (comment)) and compare the GRPC results with the previous ones. I think the new results confirm the improvement.
|
Improvement, yes, but I wouldn't call it fixed. Even at a modest 8ms latency, GRPC (>=75ms) is almost 4X slower than HTTP/1.1 (>=20ms). At 32ms GRPC (>= 280ms) is 7X slower than HTTP/1.1 (>= 36ms). Those used to be 40X and 30X, so yes improvement, but still more than users are going to want to give up. |
The numbers show that we are not there yet, and we plan on iterating on it. Reopening the issue to track it here. |
This benchmark is flawed: every call to the throttler sleeps in-line for the simulated latency, blocking the writer. @MakMukhi is going to see whether he can improve the throttler to only perform the sleep in the middle of transmitting the data between the writer and reader (instead of in-line in the write call).
In the process of brainstorming performance ideas, we have come up with another improvement for unary RPCs -- we will send a very large window update to the server for the stream immediately after the request is sent. This will enable the server to send the entire response once it is ready, without waiting for a window update after the first 64k is received by the client. This should eliminate one round trip for large response messages. |
We've made several attempts to fix the benchmarks, but none has worked yet. At this point we're inclined towards running the client and server code on VMs in different continents and comparing the numbers. I'll post those numbers by tomorrow.
Results from running the same benchmark on VMs in different continents (server in Asia and client in North America).
|
I took @rsc's benchmark and copied it into a branch here. I modified the throttler to make the reader apply a delay corresponding to when
So we're approximately 3x the time of HTTP1, which is about twice as good as the previous measurement showed, but still more than what I would expect, which would be 2x of HTTP1, because of flow control. |
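The branch itself isn't reproduced here. As a minimal sketch of the idea described above (delaying delivery of written data rather than blocking the writer for the full latency), one might wrap a net.Conn like this; the names and structure are hypothetical, not the actual code in the branch:
package latencysim

import (
	"net"
	"time"
)

// packet is a chunk of written data plus the earliest time it may be
// released to the underlying connection.
type packet struct {
	data    []byte
	release time.Time
}

// delayedConn simulates one-way latency without blocking the writer:
// Write returns immediately, and a background goroutine forwards the bytes
// to the real connection once the simulated delay has elapsed.
type delayedConn struct {
	net.Conn
	delay time.Duration
	queue chan packet
}

func newDelayedConn(c net.Conn, delay time.Duration) *delayedConn {
	dc := &delayedConn{Conn: c, delay: delay, queue: make(chan packet, 1024)}
	go dc.pump()
	return dc
}

// Write queues the data with a release time and returns immediately.
func (dc *delayedConn) Write(p []byte) (int, error) {
	buf := make([]byte, len(p))
	copy(buf, p)
	dc.queue <- packet{data: buf, release: time.Now().Add(dc.delay)}
	return len(p), nil
}

// pump delivers queued writes in order once their release time has passed.
func (dc *delayedConn) pump() {
	for pkt := range dc.queue {
		if d := time.Until(pkt.release); d > 0 {
			time.Sleep(d)
		}
		if _, err := dc.Conn.Write(pkt.data); err != nil {
			return
		}
	}
}
To simulate a full RTT, both directions (the client-side and server-side conns) would be wrapped, each with half the round-trip delay.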
Some explanation of the numbers: the reason it's 3x instead of 2x is that the size of the request (1MB) very closely matches our connection window. Only once 25% of the connection window's worth of data is received by the server does it send a window update back to the client. The timing looks like this:
If we set the initial connection window size to 2MB manually (via |
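For reference, grpc-go later exposed the window sizes as dial and server options; assuming those are the knobs being referred to, raising the initial windows to 2MB (the value used in the experiment above) looks roughly like this:
package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// Placeholder target for illustration.
	const addr = "localhost:50051"

	// Client: raise both the per-stream and the connection-level initial windows to 2MB.
	conn, err := grpc.Dial(addr,
		grpc.WithInsecure(),
		grpc.WithInitialWindowSize(1<<21),     // per-stream window
		grpc.WithInitialConnWindowSize(1<<21), // connection window
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Server: the corresponding server-side options.
	_ = grpc.NewServer(
		grpc.InitialWindowSize(1<<21),
		grpc.InitialConnWindowSize(1<<21),
	)
}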
Following is our near-future plan to increase performance further:
|
I don't believe this should be closed until the process of setting the window size is automatic and the process of making gRPC run fast is clearly documented. |
I don't believe this should be closed until gRPC is as fast as http1 on the 76msec benchmark; out of the box, with no tuning required. |
I agree. Everyone in Australia should.
|
|
@MakMukhi I think there is benefit in creating a tracking bug specifically to address throughput issues on high-latency links. Issue #1280 does not mention latency at all (I know this is what "BDP estimation and flow control" refers to). In my experience, issues focused on actual user problems yield better solutions. |
The above is not a legitimate reason to close this issue--in fact, just the opposite. The most recent measurements in this thread show a 3x throughput loss that has not been addressed, and thus the title is still accurate, and thus this remains, still, an open issue. The #1280 issue is not nearly as specific, and loses a ton of useful context for those trying to understand the defects of gRPC in its current form. |
Improves throughput by 5x. Partially addresses grpc#1043.
I just re-tested against tip (8de2dff) on an actual high-latency 109msec (one-way) link between New Jersey and Surry Hills, New South Wales, Australia. Changing the initial window sizes, as in pull request #1153, doubles the obtained throughput. So #1153 was also prematurely closed without being implemented, and it gives a 2x throughput boost. |
I'm concerned that if we optimize performance for this one workload, we may do so at the expense of other, more realistic scenarios. Flow control exists as a way to trade off bandwidth for application memory usage and tail latency. It would be trivial to change the defaults of grpc-go to maximize bandwidth utilization at the cost of the other two, but that is not our goal. While 64MB seems like a reasonable setting for the window size when you think about it in terms of a single connection, if you consider that servers may have thousands of open connections at once, that quickly turns into a production problem waiting to happen.
At the end of the day, grpc-go needs to be a reasonable out-of-the-box solution for most users, while providing configuration knobs for advanced users to optimize for either speed or memory footprint based upon their individual needs. Also note that any changes to the default behavior have to be made very cautiously, as they will impact all our users. To that end, our plans for the next month or so are as follows:
This bug will remain closed due to the misinformation / FUD caused by the benchmark it was predicated upon. Significant improvements have already been made for these types of scenarios, and settings were added to allow users to easily trade off memory usage to achieve better performance. I'll file a few extra issues to track things more granularly than #1280 and link them here later today. Thanks for your understanding and patience. |
Thanks, @dfawley, that sounds like a great plan. |
Should I assume that since this bug is closed, there are no remaining issues with latency significantly affecting throughput? I only ask because from @dfawley's comments it sounds like there is more to do. @dfawley, did you get around to creating the "few extra issues" that you mentioned? If so, can you link them please? |
Sorry, I forgot to circle back and link the issues here. You should refer to our performance label for an up-to-date list of our perf issues. Since this bug was popular, we've added automatic window-size updating based upon BDP estimation, which we believe should completely resolve this issue. Most of our remaining performance work at this point consists of optimizations to improve CPU utilization by reducing syscalls and avoiding allocations (thereby reducing GC), and reducing lock contention to support higher QPS on each connection. |
I am working on a system that uses GRPC to send 1MB blobs between clients and servers and have observed some poor throughput when connection latency is high (180ms round trip is typical between Australia and the USA).
The same throughput issues are not present when I take GRPC out of the equation. I have prepared a self-contained program that reproduces the problem on a local machine by simulating latency at the net.Listener level. It can run either using GRPC or just plain HTTP/2 POST requests. In each case the payload and latency are the same, but—as you can see from the data below—GRPC becomes several orders of magnitude slower as latency increases.
The program and related files: https://gist.github.com/adg/641d04ef335782648502cb32a03b2b07
The output of a typical run:
I theorize that there is something wrong with GRPC's flow control mechanism, but that's just a guess.
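The full program is in the gist above. As a rough sketch of the approach it describes (injecting latency at the net.Listener level by sleeping before each write, the behavior later criticized in this thread as overly pessimistic for flow control), the wrapper might look like this; it is illustrative only, not the gist's exact code:
package slowlink

import (
	"net"
	"time"
)

// Listener wraps a net.Listener so that every accepted connection sleeps for
// a fixed one-way delay before each write, approximating a high-latency link
// on localhost.
type Listener struct {
	net.Listener
	Delay time.Duration
}

func (l *Listener) Accept() (net.Conn, error) {
	c, err := l.Listener.Accept()
	if err != nil {
		return nil, err
	}
	return &conn{Conn: c, delay: l.Delay}, nil
}

type conn struct {
	net.Conn
	delay time.Duration
}

// Write blocks for the simulated latency before sending.
func (c *conn) Write(p []byte) (int, error) {
	time.Sleep(c.delay)
	return c.Conn.Write(p)
}
Both the gRPC server and the HTTP/2 server would listen on such a wrapped listener, so the injected latency is identical in both cases.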