
envoy.gzip compression performance is poor #8448

Open
Pluies opened this issue Oct 1, 2019 · 69 comments

@Pluies

Pluies commented Oct 1, 2019

Description:

Hello! 👋

We recently realised our outbound traffic had increased a fair amount since we switched from serving some assets directly from Akamai to serving them from Ambassador, and traced it down to our gzipped assets being much larger.

I then ran some tests, and it appears that gzipping content in Envoy results in fairly poor compression results, even when tweaking the settings to produce the smallest possible output.

For example, when gzipping html & js from Youtube (picked as a random "big website" example), Envoy-compressed html is 12% larger than nginx's output and js is 13% larger.

This isn't nginx-specific either: I've tested gzip locally, Apache httpd, and nginx, and all of their results are within a few percent of each other, while Envoy is consistently over 10% larger.

Relevant Links:

Here is my testing methodology: https://github.com/Pluies/gzip-comparison

And the relevant Envoy configuration files are:

Is this a known issue? Is there a problem with my configuration? I'm by no means an Envoy expert, so please let me know if there are any obvious issues with the config above or anything else I've missed that might explain this discrepancy.

Thanks a lot!

  • Florent
@mattklein123
Member

@dio @lizan @gsagula

@gsagula
Member

gsagula commented Oct 1, 2019

@Pluies Do you know what the delay is when compressing with Nginx and JS? In order to reduce latency we decided to compress and flush on each data frame. The downside is that it degrades the compression ratio. The second option is to let gzip compress until the buffer gets full. I think we could make it configurable.

@Pluies
Author

Pluies commented Oct 2, 2019

Do you know what the delay is when compressing with Nginx and JS?

That's a very good question; I guess we could try and measure it with the compose-based stuff. I'll give it a go and report back!

As far as I can tell, compressing in small chunks this way is something only envoy does – httpd, nginx, and Akamai (the ones I've tested so far :) ) all seem to wait until the file is fully available before compressing, if I understand correctly – yielding much better compression. Having it optional would be a great way of letting users choose between latency and compression performance, as the current results are a bit baffling.

By the way, by data frame, do you refer exclusively to HTTP/2 streams? (Sorry if it's a silly question, probably need to brush up on my HTTP knowledge).

@Pluies
Author

Pluies commented Oct 2, 2019

OK, I've been running tests on the polymer js file duplicated 10 times, bringing it to 30MB for an extreme test case, with the following curl command: curl -so /dev/null http://localhost:$PORT/$FILE --compressed -w '%{size_download}, Connect: %{time_connect} TTFB: %{time_starttransfer} Total time: %{time_total}\n' (as per this link).

This hopefully should test TTFB, which (once again, if I understand correctly) should be much lower in Envoy than in other servers.

I'm getting a fair amount of variance run-after-run, so please don't read too much into this test, but here are some numbers:

Comparing compression performance for file: assets/desktop_polymer_v2_x10.js (all sizes in bytes)
origin, uncompressed      31668307, Connect: 0.005557 TTFB: 0.020001 Total time: 0.563083
nginx, base               9584605, Connect: 0.005483 TTFB: 0.017597 Total time: 0.620351
nginx, optimised          7821304, Connect: 0.004427 TTFB: 0.021322 Total time: 1.782090
ambassador, base          10881198, Connect: 0.004422 TTFB: 0.022827 Total time: 1.050829
ambassador, optimised     10864866, Connect: 0.004758 TTFB: 0.025497 Total time: 1.316463
envoy, base               10881083, Connect: 0.005510 TTFB: 0.021910 Total time: 0.899159
envoy, optimised          8831289, Connect: 0.004534 TTFB: 0.027110 Total time: 1.713541
apache, base              7853115, Connect: 0.005130 TTFB: 0.019696 Total time: 1.157169
apache, optimised         7820515, Connect: 0.004526 TTFB: 0.023279 Total time: 2.212370
local gzip, base          7854014
local gzip, optimised     7821330

As expected, we're seeing a fair uptick in the total time between the "base" and "optimised" versions of the same server (as we're trading overall speed for data size), but I'm seeing basically no difference in TTFB between servers... So, other HTTP servers are also compressing-while-streaming, but somehow in an efficient fashion? Or is curl not telling us the full story? (e.g. headers being sent first while the body is being compressed, or other shenanigans)

@gsagula
Member

gsagula commented Oct 3, 2019

It looks like Envoy and Nginx optimised/base are very close:

nginx, base               9584605, Connect: 0.005483 TTFB: 0.017597 Total time: 0.620351
envoy, base               10881083, Connect: 0.005510 TTFB: 0.021910 Total time: 0.899159
nginx, optimised          7821304, Connect: 0.004427 TTFB: 0.021322 Total time: 1.782090
envoy, optimised          8831289, Connect: 0.004534 TTFB: 0.027110 Total time: 1.713541

It could also be the underlying compression library that we are using. I'm looking at some other more modern libraries that may improve the performance.

@Pluies
Author

Pluies commented Oct 7, 2019

@gsagula thanks for looking into it! :)

As you point out, the difference between optimised {nginx,apache,gzip} and optimised Envoy is not huge... But still ~10-15% extra, which wipes out gains from a bunch of other application-level improvements we've been working on at $CURRENT_COMPANY, for example.

Regarding the compression library, as far as I understand, both envoy and nginx use the venerable zlib? Here's the ngx_http_gzip_filter_module.c source, which refers to the exact same parameters (gzip_comp_level, gzip_window), same function calls (deflateInit2), importing zlib.h, etc.

kflynn pushed a commit to emissary-ingress/emissary that referenced this issue Oct 17, 2019
The window_bits parameter is exposed by Envoy (see [docs](https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/filter/http/gzip/v2/gzip.proto)),
and according to our testing is one of the most important settings for
the resulting compressed size.

Averaging a few JS & CSS files, we get pretty consistently (to within a few percent):
- gzip, no extra configuration: 100 (baseline)
- gzip, compression_level=BEST: ~99
- gzip, compression_level=BEST, window_bits=15: ~80

So setting this gives us 20% extra compression!

The tradeoff in memory usage is documented by Envoy, whose documentation
I've lifted in this PR.

Background: envoyproxy/envoy#8448
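
For illustration only (not part of the commit above), the kind of filter configuration being described would look roughly like this; the field names follow the v2 gzip filter docs linked in the commit, and the values are the ones the commit reports as most effective:

http_filters:
- name: envoy.filters.http.gzip
  typed_config:
    "@type": type.googleapis.com/envoy.config.filter.http.gzip.v2.Gzip
    # Trade CPU for compression ratio (worth ~1% on its own per the numbers above).
    compression_level: BEST
    # The big win: a full-size sliding window (~20% smaller output per the numbers above).
    window_bits: 15
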
@rgs1
Member

rgs1 commented Mar 20, 2020

On a related note, we are also hit by this (getting worse latency via Envoy than Nginx when compressing). I wrote #10464 as a starting point for measuring things.

@mythi
Contributor

mythi commented Mar 23, 2020

I wrote #10464 as a starting point for measuring things.

Perf annotations and flamegraphs could also help with debugging.

@rgs1
Member

rgs1 commented Mar 23, 2020

Ok, deleted my last comment because it was inaccurate -- the problem was with the mocks. After getting rid of those, I get back ~6ms. I'll keep removing the mocks/boilerplate to get more accurate measurements.

rgs1 pushed a commit to rgs1/envoy that referenced this issue Mar 24, 2020
We suspect that the gzip/compressor filters are underperforming due to excessive flushes. This should help with the investigation and support any needed changes.

Helps with envoyproxy#8448.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@rgs1
Member

rgs1 commented Mar 24, 2020

Adding stats to track flushes: #10500.

@rgs1
Member

rgs1 commented Mar 24, 2020

Ok, so I started running with the stats added in #10500, and the ratio of flush vs finish vs upstream_rq seems potentially concerning:

[screenshot: stats showing the flush, finish and upstream_rq counters]

I am wondering if the change introduced in #3025 -- which was a fix for #2909 -- didn't mean to use Z_NO_FLUSH instead of Z_SYNC_FLUSH in https://github.com/envoyproxy/envoy/blob/master/source/common/compressor/zlib_compressor_impl.cc#L40:

void ZlibCompressorImpl::compress(Buffer::Instance& buffer, State state) {
  for (const Buffer::RawSlice& input_slice : buffer.getRawSlices()) {
    zstream_ptr_->avail_in = input_slice.len_;
    zstream_ptr_->next_in = static_cast<Bytef*>(input_slice.mem_);
    // Z_NO_FLUSH tells the compressor to take the data in and compresses it as much as possible
    // without flushing it out. However, if the data output is greater or equal to the allocated
    // chunk size, process() outputs it to the end of the buffer. This is fine, since at the next
    // step, the buffer is drained from the beginning of the buffer by the size of input.
    process(buffer, Z_NO_FLUSH);
    buffer.drain(input_slice.len_);
  }

  // Was this meant to be Z_NO_FLUSH? Why should we always flush?
  process(buffer, state == State::Finish ? Z_FINISH : Z_SYNC_FLUSH);
}

Sounds like maybe we went a step too far in fixing the missing Z_FINISH call...

cc: @dio @gsagula

@dio
Member

dio commented Mar 24, 2020

@rgs1 thanks for this. I'd love to check on the potential regression (manual check if required) when you have something to test.

@rgs1
Member

rgs1 commented Mar 25, 2020

Also related to perf (but still doesn't address the excessive flush issue): #10508.

rgs1 pushed a commit to rgs1/envoy that referenced this issue Mar 25, 2020
We suspect that the gzip filter is underperforming due to excessive flushes. The changes introduced in envoyproxy#3025 -- which were an attempt to fix envoyproxy#2909 -- are causing too many flushes (one per encodeData() call) to ensure the compression is properly finished. I think at that point it was possible for encodeData() to be called with end_stream=false while encodeTrailers() wasn't handled, so Z_FINISH could be missed. This isn't the case anymore, so there's no need to flush every time we encode a chunk of the response.

We could make this configurable, but given it's a considerable
perf regression _and_ that we'd still be doing the correct thing
by just calling Z_FINISH when end_stream=true or from encodeTrailers(),
I don't think it's worth it.

This also adds stats for visibility.

Helps with envoyproxy#8448.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@rgs1
Member

rgs1 commented Mar 25, 2020

Ok, first attempt at reducing the flush calls: #10518.

@rgs1
Member

rgs1 commented Mar 25, 2020

So #10518 doesn't appear to move the needle in my initial tests...

@rgs1
Member

rgs1 commented Mar 26, 2020

FYI #10530 for better instrumentation.

rgs1 pushed a commit to rgs1/envoy that referenced this issue Mar 26, 2020
Helps with envoyproxy#8448.
Related to envoyproxy#10530.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
rojkov pushed a commit to rojkov/envoy that referenced this issue Apr 7, 2020
Since Envoy::Compressor::ZlibCompressorImpl::CompressionStrategy is
simply static_cast'ed to uint64_t, the Standard strategy (4)
becomes Z_FIXED (4 as well). This basically disables the use of
dynamic Huffman codes when the gzip filter is configured to use
default values.

Make the Standard strategy equal to 0 to translate to
Z_DEFAULT_STRATEGY.

Contributes to envoyproxy#8448

Signed-off-by: Dmitry Rozhkov <dmitry.rozhkov@linux.intel.com>
@PiotrSikora
Contributor

PiotrSikora commented May 6, 2020

It looks like GzipFilter is creating ZlibCompressorImpl using the default constructor with a 4KB chunk size. This might be the source of the issue:
https://github.com/envoyproxy/envoy/blob/master/source/common/compressor/zlib_compressor_impl.cc#L14

EDIT: Nevermind, it seems that the "worse compression ratio" part was fixed already.

@antoniovicente
Contributor

What PR fixed the "worse compression ratio" issue?

@rgs1
Member

rgs1 commented May 7, 2020

What PR fixed the "worse compression ratio" issue?

this one: #10676

@rojkov
Member

rojkov commented Jul 31, 2020

Finally I've got some time to look into it. Below is the output of the Intel profiler:

[screenshot: compression_profile - profiler output showing the compression hotspots]

It seems Envoy's code is hardly the problem. All hotspots are in zlib itself.

I tried replacing zlib with zlib-ng with all optimizations switched on, and for the majority of cases performance roughly doubled.

Zlib:

----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
compressFull/0/manual_time              12.6 ms         12.8 ms           56
compressFull/1/manual_time              6.17 ms         6.42 ms          112
compressFull/2/manual_time              4.62 ms         4.87 ms          149
compressFull/3/manual_time              13.5 ms         13.7 ms           52
compressFull/4/manual_time              8.75 ms         9.00 ms           80
compressFull/5/manual_time              14.5 ms         14.7 ms           49
compressFull/6/manual_time              13.5 ms         13.7 ms           51
compressFull/7/manual_time              8.81 ms         9.05 ms           78
compressFull/8/manual_time              14.4 ms         14.6 ms           49
compressChunks16384/0/manual_time       11.8 ms         12.0 ms           60
compressChunks16384/1/manual_time       5.86 ms         6.11 ms          121
compressChunks16384/2/manual_time       4.48 ms         4.73 ms          156
compressChunks16384/3/manual_time       12.7 ms         12.9 ms           56
compressChunks16384/4/manual_time       8.29 ms         8.54 ms           82
compressChunks16384/5/manual_time       13.6 ms         13.8 ms           52
compressChunks16384/6/manual_time       12.6 ms         12.8 ms           56
compressChunks16384/7/manual_time       8.37 ms         8.61 ms           84
compressChunks16384/8/manual_time       13.5 ms         13.7 ms           51
compressChunks8192/0/manual_time        12.8 ms         13.0 ms           52
compressChunks8192/1/manual_time        6.21 ms         6.47 ms          113
compressChunks8192/2/manual_time        4.88 ms         5.14 ms          144
compressChunks8192/3/manual_time        13.6 ms         13.9 ms           52
compressChunks8192/4/manual_time        8.87 ms         9.13 ms           77
compressChunks8192/5/manual_time        14.6 ms         14.9 ms           47
compressChunks8192/6/manual_time        13.5 ms         13.8 ms           51
compressChunks8192/7/manual_time        9.07 ms         9.33 ms           77
compressChunks8192/8/manual_time        14.5 ms         14.8 ms           48
compressChunks4096/0/manual_time        12.9 ms         13.2 ms           55
compressChunks4096/1/manual_time        6.23 ms         6.51 ms          114
compressChunks4096/2/manual_time        5.15 ms         5.44 ms          140
compressChunks4096/3/manual_time        13.6 ms         13.9 ms           51
compressChunks4096/4/manual_time        8.97 ms         9.25 ms           78
compressChunks4096/5/manual_time        14.9 ms         15.2 ms           47
compressChunks4096/6/manual_time        13.7 ms         14.0 ms           52
compressChunks4096/7/manual_time        9.11 ms         9.39 ms           78
compressChunks4096/8/manual_time        14.9 ms         15.2 ms           46
compressChunks1024/0/manual_time        13.6 ms         14.0 ms           52
compressChunks1024/1/manual_time        7.34 ms         7.74 ms           97
compressChunks1024/2/manual_time        6.19 ms         6.60 ms          115
compressChunks1024/3/manual_time        14.5 ms         14.9 ms           48
compressChunks1024/4/manual_time        10.4 ms         10.8 ms           70
compressChunks1024/5/manual_time        16.3 ms         16.7 ms           43
compressChunks1024/6/manual_time        14.4 ms         14.8 ms           48
compressChunks1024/7/manual_time        10.3 ms         10.7 ms           68
compressChunks1024/8/manual_time        16.2 ms         16.6 ms           43

Zlib-ng:

----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
compressFull/0/manual_time              3.47 ms         3.72 ms          203
compressFull/1/manual_time              3.44 ms         3.69 ms          202
compressFull/2/manual_time              3.43 ms         3.68 ms          203
compressFull/3/manual_time              13.2 ms         13.5 ms           53
compressFull/4/manual_time              6.28 ms         6.53 ms          112
compressFull/5/manual_time              5.77 ms         6.02 ms          117
compressFull/6/manual_time              12.4 ms         12.6 ms           57
compressFull/7/manual_time              5.92 ms         6.16 ms          119
compressFull/8/manual_time              6.18 ms         6.43 ms          113
compressChunks16384/0/manual_time       3.28 ms         3.53 ms          214
compressChunks16384/1/manual_time       3.26 ms         3.51 ms          218
compressChunks16384/2/manual_time       3.21 ms         3.46 ms          216
compressChunks16384/3/manual_time       12.5 ms         12.7 ms           56
compressChunks16384/4/manual_time       5.88 ms         6.13 ms          121
compressChunks16384/5/manual_time       5.49 ms         5.74 ms          125
compressChunks16384/6/manual_time       11.7 ms         12.0 ms           61
compressChunks16384/7/manual_time       5.51 ms         5.76 ms          127
compressChunks16384/8/manual_time       5.85 ms         6.11 ms          118
compressChunks8192/0/manual_time        3.48 ms         3.74 ms          202
compressChunks8192/1/manual_time        3.42 ms         3.68 ms          197
compressChunks8192/2/manual_time        3.45 ms         3.72 ms          203
compressChunks8192/3/manual_time        13.3 ms         13.6 ms           51
compressChunks8192/4/manual_time        6.26 ms         6.52 ms          109
compressChunks8192/5/manual_time        5.97 ms         6.23 ms          119
compressChunks8192/6/manual_time        12.5 ms         12.7 ms           56
compressChunks8192/7/manual_time        5.92 ms         6.17 ms          118
compressChunks8192/8/manual_time        6.44 ms         6.70 ms          110
compressChunks4096/0/manual_time        3.55 ms         3.83 ms          199
compressChunks4096/1/manual_time        3.49 ms         3.77 ms          199
compressChunks4096/2/manual_time        3.50 ms         3.78 ms          196
compressChunks4096/3/manual_time        13.9 ms         14.2 ms           50
compressChunks4096/4/manual_time        6.44 ms         6.72 ms          107
compressChunks4096/5/manual_time        6.18 ms         6.45 ms          116
compressChunks4096/6/manual_time        13.0 ms         13.3 ms           54
compressChunks4096/7/manual_time        6.03 ms         6.31 ms          115
compressChunks4096/8/manual_time        6.60 ms         6.88 ms          106
compressChunks1024/0/manual_time        3.82 ms         4.23 ms          186
compressChunks1024/1/manual_time        4.01 ms         4.44 ms          186
compressChunks1024/2/manual_time        3.80 ms         4.21 ms          184
compressChunks1024/3/manual_time        14.0 ms         14.4 ms           50
compressChunks1024/4/manual_time        7.24 ms         7.64 ms           96
compressChunks1024/5/manual_time        7.28 ms         7.69 ms           95
compressChunks1024/6/manual_time        12.9 ms         13.2 ms           54
compressChunks1024/7/manual_time        6.91 ms         7.32 ms          103
compressChunks1024/8/manual_time        7.78 ms         8.18 ms           91

Do we want to migrate to zlib-ng?

@dio
Member

dio commented Jul 31, 2020

@rojkov What do you think of adding a build flag for choosing zlib-ng (similar to `":with_moonjit": [envoy_external_dep_path("moonjit")]`)?

@rojkov
Member

rojkov commented Jul 31, 2020

That could be a good starting point.

I'm on holiday at the moment; I'll submit a patch later.

@dio
Member

dio commented Jul 31, 2020

Awesome. Enjoy your holiday!

https://github.com/dio/envoy/pull/new/zlib-ng builds, but I'm not sure how to switch the madler_zlib native.binding for gRPC to use zlib-ng via a build option. (Since we don't want to have mixed zlibs in one binary, right?)

@rgs1
Member

rgs1 commented Jul 31, 2020

@dio nice -- go ahead and create the PR so I can test it locally 👍

@mattklein123
Member

Nice! I think we can likely switch to zlib-ng, but maybe initially it should be a compile option as we discussed. Per @dio we may also need patches to other code that uses zlib.

@mattklein123
Member

@rgs1 where are we at with this issue? Is Envoy now comparable to other proxies or do we have more work to do here?

@rgs1
Member

rgs1 commented Sep 21, 2020

@rgs1 where are we at with this issue? Is Envoy now comparable to other proxies or do we have more work to do here?

Alas, the switch to zlib-ng did not move the needle for us in one of our main workloads (that is, in the only workload where it truly matters -- didn't check if it made things any better in the other ones).

@mattklein123
Member

Alas, the switch to zlib-ng did not move the needle for us in one of our main workloads (that is, in the only workload where it truly matters -- didn't check if it made things any better in the other ones).

OK, that's sad. I would love to figure this out. I think @rojkov might have some more cycles to help. @rojkov is this a perf-related investigation you could help drive to closure?

@rojkov
Member

rojkov commented Sep 22, 2020

Yes, I'll take a look. Perhaps there's a way to avoid double buffering.

@rojkov
Member

rojkov commented Oct 13, 2020

I configured Envoy and Nginx to use similar settings for zlib: compression level 1, default strategy, one thread, memory level 8, and the same sliding window size (window_bits 15). chunk_size is 4096, since Nginx also uses buffers of one memory page in size.

The test results for downloading a 157K JSON file with

$ k6 run --vus 1 --duration 30s

Envoy

    data_received..............: 175 MB 5.8 MB/s
    data_sent..................: 230 kB 7.7 kB/s
    http_req_blocked...........: avg=4.55µs  min=2.38µs   med=4.29µs  max=216.02µs p(90)=4.64µs  p(95)=4.94µs
    http_req_connecting........: avg=81ns    min=0s       med=0s      max=166.75µs p(90)=0s      p(95)=0s
    http_req_duration..........: avg=14.57ms min=9.83ms   med=14.6ms  max=20.29ms  p(90)=14.92ms p(95)=15.03ms
    http_req_receiving.........: avg=6.76ms  min=3.31ms   med=6.72ms  max=14.14ms  p(90)=6.91ms  p(95)=6.98ms
    http_req_sending...........: avg=25.27µs min=12.65µs  med=23.24µs max=115.94µs p(90)=31.99µs p(95)=34.24µs
    http_req_tls_handshaking...: avg=0s      min=0s       med=0s      max=0s       p(90)=0s      p(95)=0s
    http_req_waiting...........: avg=7.77ms  min=648.03µs med=7.89ms  max=13.94ms  p(90)=8.06ms  p(95)=8.17ms
    http_reqs..................: 2038   67.821657/s
    iteration_duration.........: avg=14.7ms  min=9.97ms   med=14.73ms max=20.51ms  p(90)=15.06ms p(95)=15.17ms
    iterations.................: 2038   67.821657/s
    vus........................: 1      min=1 max=1
    vus_max....................: 1      min=1 max=1

Nginx

    data_received..............: 232 MB 7.7 MB/s
    data_sent..................: 306 kB 10 kB/s
    http_req_blocked...........: avg=8.07µs  min=2.69µs  med=4.33µs  max=464.84µs p(90)=4.87µs  p(95)=5.37µs
    http_req_connecting........: avg=2.32µs  min=0s      med=0s      max=322.05µs p(90)=0s      p(95)=0s
    http_req_duration..........: avg=10.92ms min=6.7ms   med=10.14ms max=18.29ms  p(90)=13.22ms p(95)=13.38ms
    http_req_receiving.........: avg=8.19ms  min=4.82ms  med=7.68ms  max=13.41ms  p(90)=9.92ms  p(95)=10.05ms
    http_req_sending...........: avg=25.65µs min=13.07µs med=23.44µs max=126.91µs p(90)=31.53µs p(95)=34.1µs
    http_req_tls_handshaking...: avg=0s      min=0s      med=0s      max=0s       p(90)=0s      p(95)=0s
    http_req_waiting...........: avg=2.69ms  min=1.65ms  med=2.49ms  max=9.53ms   p(90)=3.31ms  p(95)=3.39ms
    http_reqs..................: 2709   90.18219/s
    iteration_duration.........: avg=11.06ms min=6.81ms  med=10.28ms max=18.58ms  p(90)=13.35ms p(95)=13.51ms
    iterations.................: 2709   90.18219/s
    vus........................: 1      min=1 max=1
    vus_max....................: 1      min=1 max=1

Nginx is about 30% faster overall, with TTFB almost 3 times shorter.

I tried to do three things:

  1. avoid excessive flushing with @rgs1's patch modified to fit the latest master;
  2. get rid of memcpy() when moving bytes from zlib's output to the filter's output buffer with rojkov@d084d1e;
  3. configure zlib to use custom memory management, copied mostly from Nginx, to allocate memory for zlib only once per request with rojkov@d91afc0.

The first change improved TTFB a bit (by about 1-2%), but TTLB didn't change at all. The impact of the other two approaches was not noticeable.

Flamegraphs show that most of the time is spent in zlib: Envoy and Nginx. In Envoy's case, though, the call stacks are much deeper.

Callgrind doesn't reveal excessive cache thrashing either, though I might have interpreted the numbers the wrong way (sorted by last-level cache misses for data writes, limited to 0.25%):

I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 17825792 B, 64 B, 17-way associative
Timerange: Basic block 0 - 320949664
Trigger: Program termination
Profiled target:  bazel-bin/source/exe/envoy-static --concurrency 1 -c rojkov-data/envoy-config-compression-compress.yaml (PID 1972876, part 1)
Events recorded:  Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
Events shown:     Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
Event sort order: DLmw DLmr
Thresholds:       99 0
Include dirs:
User annotated:
Auto-annotation:  on

--------------------------------------------------------------------------------
Ir                     Dr                   Dw                   I1mr               D1mr                D1mw               ILmr            DLmr            DLmw
--------------------------------------------------------------------------------
2,589,412,689 (100.0%) 811,817,761 (100.0%) 434,425,819 (100.0%) 1,873,102 (100.0%) 20,526,076 (100.0%) 3,269,251 (100.0%) 52,204 (100.0%) 88,113 (100.0%) 233,862 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir                   Dr                   Dw                   I1mr            D1mr               D1mw               ILmr         DLmr            DLmw              file:function
--------------------------------------------------------------------------------
 16,919,264 ( 0.65%)      59,215 ( 0.01%)  16,152,582 ( 3.72%)  6,525 ( 0.35%)       153 ( 0.00%)   274,275 ( 8.39%)   6 ( 0.01%)      0          102,325 (43.75%)  ???:__memset_avx2_unaligned_erms [/usr/lib64/libc-2.30.so]
    733,120 ( 0.03%)      24,734 ( 0.00%)     107,338 ( 0.02%)  3,908 ( 0.21%)     2,463 ( 0.01%)    59,212 ( 1.81%)  10 ( 0.02%)      0           48,775 (20.86%)  ???:tcmalloc::CentralFreeList::Populate() [/home/rojkov/.cache/bazel/_bazel_rojkov/c0022cd08dcf561b0cf3d59ef5423900/execroot/envoy/bazel-out/k8-opt/bin/source/exe/envoy-static]
  2,403,624 ( 0.09%)     557,483 ( 0.07%)     187,115 ( 0.04%)     53 ( 0.00%)    68,098 ( 0.33%)    26,036 ( 0.80%)  53 ( 0.10%) 67,868 (77.02%)  25,986 (11.11%)  ???:_dl_relocate_object [/usr/lib64/ld-2.30.so]
 78,525,198 ( 3.03%)  76,292,910 ( 9.40%)  76,127,463 (17.52%)  9,260 ( 0.49%) 1,085,206 ( 5.29%) 1,286,418 (39.35%)  12 ( 0.02%)  9,379 (10.64%)  13,523 ( 5.78%)  ???:__memcpy_avx_unaligned_erms [/usr/lib64/libc-2.30.so]
     81,467 ( 0.00%)          61 ( 0.00%)      46,248 ( 0.01%)     36 ( 0.00%)        12 ( 0.00%)    11,535 ( 0.35%)  10 ( 0.02%)      0           11,526 ( 4.93%)  ???:memset [/usr/lib64/libc-2.30.so]
    293,583 ( 0.01%)      56,806 ( 0.01%)     107,474 ( 0.02%)  5,004 ( 0.27%)       306 ( 0.00%)     6,025 ( 0.18%)  15 ( 0.03%)      2 ( 0.00%)   4,294 ( 1.84%)  /proc/self/cwd/external/com_google_protobuf/src/google/protobuf/descriptor.cc:google::protobuf::DescriptorBuilder::BuildFieldOrExtension(google::protobuf::FieldDescriptorProto const&, google::protobuf::Descriptor const*, google::protobuf::FieldDescriptor*, bool) [/home/rojkov/.cache/bazel/_bazel_rojkov/c0022cd08dcf561b0cf3d59ef5423900/execroot/envoy/bazel-out/k8-opt/bin/source/exe/envoy-static]
     16,383 ( 0.00%)           0               16,383 ( 0.00%)      0                  0              2,036 ( 0.06%)   0               0            2,010 ( 0.86%)  /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/tuple:void std::vector<std::unique_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::default_delete<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::allocator<std::unique_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::default_delete<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >::_M_realloc_insert<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&>(__gnu_cxx::__normal_iterator<std::unique_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::default_delete<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*, std::vector<std::unique_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::default_delete<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::allocator<std::unique_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::default_delete<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&)
    241,350 ( 0.01%)      58,582 ( 0.01%)     114,040 ( 0.03%)    819 ( 0.04%)         5 ( 0.00%)     9,413 ( 0.29%)   2 ( 0.00%)      0            1,347 ( 0.58%)  /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_vector.h:void std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >::_M_realloc_insert<>(__gnu_cxx::__normal_iterator<std::vector<int, std::allocator<int> >*, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > >)
     10,594 ( 0.00%)           0               10,594 ( 0.00%)      0                  0              1,494 ( 0.05%)   0               0            1,337 ( 0.57%)  /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/unique_ptr.h:google::protobuf::DescriptorPool::Tables::AllocateString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
     21,532 ( 0.00%)       6,152 ( 0.00%)      12,304 ( 0.00%)      1 ( 0.00%)         0              1,153 ( 0.04%)   1 ( 0.00%)      0            1,152 ( 0.49%)  ???:ares__init_list_head [/home/rojkov/.cache/bazel/_bazel_rojkov/c0022cd08dcf561b0cf3d59ef5423900/execroot/envoy/bazel-out/k8-opt/bin/source/exe/envoy-static]
808,853,382 (31.24%) 318,609,688 (39.25%) 172,031,372 (39.60%)  4,595 ( 0.25%) 9,794,728 (47.72%) 1,130,816 (34.59%)  20 ( 0.04%)      6 ( 0.01%)   1,133 ( 0.48%)  ???:deflate_fast [/home/rojkov/.cache/bazel/_bazel_rojkov/c0022cd08dcf561b0cf3d59ef5423900/execroot/envoy/bazel-out/k8-opt/bin/source/exe/envoy-static]
  1,023,828 ( 0.04%)     132,908 ( 0.02%)     165,792 ( 0.04%)  2,320 ( 0.12%)         0              7,705 ( 0.24%)   5 ( 0.01%)      0            1,130 ( 0.48%)  /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_vector.h:re2::Prog::MarkSuccessors(re2::SparseArray<int>*, re2::SparseArray<int>*, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >*, re2::SparseSetT<void>*, std::vector<int, std::allocator<int> >*)
      3,600 ( 0.00%)           0                2,400 ( 0.00%)      0                  0                866 ( 0.03%)   0               0              838 ( 0.36%)  /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/basic_string.h:re2::RE2::RE2(char const*)
     27,140 ( 0.00%)       6,796 ( 0.00%)      10,216 ( 0.00%)  1,638 ( 0.09%)       572 ( 0.00%)       759 ( 0.02%)   4 ( 0.01%)      0              610 ( 0.26%)  ???:tcmalloc::NewSpan(unsigned long, unsigned long) [/home/rojkov/.cache/bazel/_bazel_rojkov/c0022cd08dcf561b0cf3d59ef5423900/execroot/envoy/bazel-out/k8-opt/bin/source/exe/envoy-static]
     13,864 ( 0.00%)       6,932 ( 0.00%)       5,199 ( 0.00%)      0                  0                720 ( 0.02%)   0               0              589 ( 0.25%)  /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/regex_automaton.h:std::__detail::_State<char>* std::__uninitialized_copy<false>::__uninit_copy<std::__detail::_State<char> const*, std::__detail::_State<char
I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 17825792 B, 64 B, 17-way associative
Timerange: Basic block 0 - 246335785
Trigger: Program termination
Profiled target:  ./objs/nginx -p /home/rojkov/work/nginx -c nginx.conf (PID 1167, part 1)
Events recorded:  Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
Events shown:     Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
Event sort order: DLmw DLmr
Thresholds:       99 0
Include dirs:
User annotated:
Auto-annotation:  on

--------------------------------------------------------------------------------
Ir                     Dr                   Dw                   I1mr             D1mr                D1mw               ILmr           DLmr           DLmw
--------------------------------------------------------------------------------
1,762,195,422 (100.0%) 493,892,267 (100.0%) 259,362,338 (100.0%) 234,414 (100.0%) 20,945,147 (100.0%) 1,916,199 (100.0%) 4,309 (100.0%) 4,705 (100.0%) 16,219 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir                   Dr                   Dw                   I1mr           D1mr               D1mw             ILmr        DLmr           DLmw            file:function
--------------------------------------------------------------------------------
804,345,960 (45.64%) 206,282,040 (41.77%) 136,108,800 (52.48%) 5,530 ( 2.36%) 9,517,993 (45.44%) 893,536 (46.63%) 32 ( 0.74%)    10 ( 0.21%) 3,585 (22.10%)  ./deflate.c:deflate_fast [/lib/x86_64-linux-gnu/libz.so.1.2.11]
     27,862 ( 0.00%)       4,169 ( 0.00%)       7,203 ( 0.00%)    22 ( 0.01%)        18 ( 0.00%)   3,464 ( 0.18%) 21 ( 0.49%)     0          3,455 (21.30%)  /home/rojkov/work/nginx/src/event/ngx_event.c:ngx_event_process_init [/home/rojkov/work/nginx/objs/nginx]
 15,072,968 ( 0.86%)  12,770,738 ( 2.59%)  12,744,964 ( 4.91%) 1,763 ( 0.75%)   629,332 ( 3.00%) 650,770 (33.96%)  8 ( 0.19%)   518 (11.01%) 2,738 (16.88%)  /build/glibc-2ORdQG/glibc-2.27/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/lib/x86_64-linux-gnu/libc-2.27.so]
  7,893,689 ( 0.45%)         127 ( 0.00%)   7,892,800 ( 3.04%)   121 ( 0.05%)       120 ( 0.00%) 123,336 ( 6.44%)  1 ( 0.02%)     0          2,492 (15.36%)  /build/glibc-2ORdQG/glibc-2.27/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_erms [/lib/x86_64-linux-gnu/libc-2.27.so]
     51,195 ( 0.00%)      12,202 ( 0.00%)       6,215 ( 0.00%)    30 ( 0.01%)     1,612 ( 0.01%)   1,153 ( 0.06%) 30 ( 0.70%) 1,546 (32.86%) 1,126 ( 6.94%)  /build/glibc-2ORdQG/glibc-2.27/elf/../sysdeps/x86_64/dl-machine.h:_dl_relocate_object
440,540,640 (25.00%) 103,362,480 (20.93%)  54,240,360 (20.91%) 4,694 ( 2.00%)   306,295 ( 1.46%) 159,945 ( 8.35%) 16 ( 0.37%)     4 ( 0.09%)   510 ( 3.14%)  ./trees.c:compress_block [/lib/x86_64-linux-gnu/libz.so.1.2.11]
    124,780 ( 0.01%)       5,427 ( 0.00%)      37,950 ( 0.01%) 1,107 ( 0.47%)         0           10,525 ( 0.55%)  5 ( 0.12%)     0            462 ( 2.85%)  /build/glibc-2ORdQG/glibc-2.27/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms [/lib/x86_64-linux-gnu/libc-2.27.so]
    232,352 ( 0.01%)      59,896 ( 0.01%)      14,149 ( 0.01%)    22 ( 0.01%)        65 ( 0.00%)     229 ( 0.01%) 21 ( 0.49%)     0            226 ( 1.39%)  /home/rojkov/work/nginx/src/core/ngx_hash.c:ngx_hash_init [/home/rojkov/work/nginx/objs/nginx]
      1,481 ( 0.00%)          28 ( 0.00%)         680 ( 0.00%)     7 ( 0.00%)         0              158 ( 0.01%)  4 ( 0.09%)     0            158 ( 0.97%)  /build/glibc-2ORdQG/glibc-2.27/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:memset [/lib/x86_64-linux-gnu/ld-2.27.so]
    541,741 ( 0.03%)      72,321 ( 0.01%)      88,277 ( 0.03%) 6,981 ( 2.98%)     3,400 ( 0.02%)   3,187 ( 0.17%) 44 ( 1.02%)     1 ( 0.02%)   141 ( 0.87%)  /build/glibc-2ORdQG/glibc-2.27/malloc/malloc.c:_int_malloc [/lib/x86_64-linux-gnu/libc-2.27.so]
    115,560 ( 0.01%)      14,040 ( 0.00%)      69,240 ( 0.03%)   638 ( 0.27%)     5,631 ( 0.03%)   7,931 ( 0.41%)  5 ( 0.12%)     0            102 ( 0.63%)  /usr/include/x86_64-linux-gnu/bits/string3.h:deflate_fast
     50,302 ( 0.00%)       4,844 ( 0.00%)       4,418 ( 0.00%)     3 ( 0.00%)       104 ( 0.00%)      96 ( 0.01%)  1 ( 0.02%)    20 ( 0.43%)    95 ( 0.59%)  /home/rojkov/work/nginx/src/core/ngx_string.c:ngx_strlow [/home/rojkov/work/nginx/objs/nginx]
     48,873 ( 0.00%)      21,014 ( 0.00%)       6,523 ( 0.00%)    13 ( 0.01%)         3 ( 0.00%)      81 ( 0.00%) 13 ( 0.30%)     0             81 ( 0.50%)  /home/rojkov/work/nginx/src/http/ngx_http_variables.c:ngx_http_add_variable [/home/rojkov/work/nginx/objs/nginx]
     36,028 ( 0.00%)       7,698 ( 0.00%)       3,162 ( 0.00%)     8 ( 0.00%)        24 ( 0.00%)      79 ( 0.00%)  8 ( 0.19%)     3 ( 0.06%)    77 ( 0.47%)  /home/rojkov/work/nginx/src/core/ngx_hash.c:ngx_hash_add_key [/home/rojkov/work/nginx/objs/nginx]
     71,462 ( 0.00%)      14,731 ( 0.00%)      13,620 ( 0.01%)     7 ( 0.00%)         0               73 ( 0.00%)  7 ( 0.16%)     0             73 ( 0.45%)  /home/rojkov/work/nginx/src/http/ngx_http_core_module.c:ngx_http_core_type [/home/rojkov/work/nginx/objs/nginx]
    288,195 ( 0.02%)      93,095 ( 0.02%)      37,549 ( 0.01%)    16 ( 0.01%)     1,135 ( 0.01%)      89 ( 0.00%)  7 ( 0.16%)    91 ( 1.93%)    68 ( 0.42%)  /home/rojkov/work/nginx/src/core/ngx_conf_file.c:ngx_conf_parse'2 [/home/rojkov/work/nginx/objs/nginx]
      1,529 ( 0.00%)         297 ( 0.00%)         298 ( 0.00%)    15 ( 0.01%)         8 ( 0.00%)      55 ( 0.00%) 15 ( 0.35%)     8 ( 0.17%)    55 ( 0.34%)  /build/glibc-2ORdQG/glibc-2.27/elf/dl-object.c:_dl_new_object [/lib/x86_64-linux-gnu/ld-2.27.so]
      9,369 ( 0.00%)       3,263 ( 0.00%)         866 ( 0.00%)    20 ( 0.01%)        67 ( 0.00%)      48 ( 0.00%) 20 ( 0.46%)    57 ( 1.21%)    44 ( 0.27%)  /build/glibc-2ORdQG/glibc-2.27/elf/dl-version.c:_dl_check_map_versions [/lib/x86_64-linux-gnu/ld-2.27.so]
      2,206 ( 0.00%)         314 ( 0.00%)         245 ( 0.00%)    10 ( 0.00%)        56 ( 0.00%)      44 ( 0.00%) 10 ( 0.23%)    56 ( 1.19%)    44 ( 0.27%)  /build/glibc-2ORdQG/glibc-2.27/elf/get-dynamic-info.h:_dl_map_object_from_fd

Looks like Nginx moves data more effectively though.

Nginx uses adaptive settings for zlib if Content-Length is known: it tries to narrow the sliding window for small data blobs. We could do the same. It should improve memory consumption in some cases, but won't improve request duration.

Now I've run out of ideas. Any help is appreciated :)

@rojkov
Member

rojkov commented Oct 29, 2020

So, I took one more shot at solving this mystery. It turned out I was measuring the wrong numbers: flamegraphs and profilers show where CPU cycles are spent, not where wall time is spent.

I used PerfAnnotationContext to measure exactly how long it takes to deflate() 157k of JSON in a running Envoy instance. On my machine the average is 6.6ms (out of a >14ms connection life span). The same measurement for nginx shows 6.2ms (out of a ~10ms connection life span) - still better, but not dramatically. When the same JSON is compressed in compressor_filter_speed_test it takes less than 3ms (avg). The explanation is that when running compressor_filter_speed_test the CPU "turboboosts" to 3.8GHz, whereas when deflate()'ing in a running Envoy instance the CPU core runs at 1.6GHz on average, a clue that the load is not CPU bound.

As a result, in Envoy's case writing data to the socket takes as long as calling deflate(): after having written ~85k of compressed data to the socket in one chunk, the next "write ready" event comes in ~5ms. Then the connection is finally closed. With compression switched off and writing 157k in one chunk, the "write ready" event comes in <1ms. It looks like these "ready" events are scheduled with a delay if writes are not done often enough. To make them more frequent I set per_connection_buffer_limit_bytes to 8192 for clusters and listeners. Here is the result:

    data_received..............: 295 MB 9.8 MB/s
    data_sent..................: 454 kB 15 kB/s
    http_req_blocked...........: avg=309.09µs min=149.04µs med=310.8µs  max=989.34µs p(90)=376.11µs p(95)=389.95µs
    http_req_connecting........: avg=209.28µs min=92.84µs  med=208.41µs max=801.8µs  p(90)=263.33µs p(95)=275.31µs
    http_req_duration..........: avg=8.3ms    min=4.84ms   med=8.19ms   max=14.56ms  p(90)=9.17ms   p(95)=9.31ms
    http_req_receiving.........: avg=7.02ms   min=4ms      med=6.97ms   max=11.35ms  p(90)=7.61ms   p(95)=7.73ms
    http_req_sending...........: avg=93.04µs  min=41.76µs  med=94.11µs  max=624.63µs p(90)=111.93µs p(95)=117.71µs
    http_req_tls_handshaking...: avg=0s       min=0s       med=0s       max=0s       p(90)=0s       p(95)=0s
    http_req_waiting...........: avg=1.18ms   min=424.74µs med=1.17ms   max=7.49ms   p(90)=1.5ms    p(95)=1.57ms
    http_reqs..................: 3439   114.499003/s
    iteration_duration.........: avg=8.71ms   min=5.11ms   med=8.61ms   max=15.06ms  p(90)=9.56ms   p(95)=9.73ms
    iterations.................: 3439   114.499003/s
    vus........................: 1      min=1 max=1
    vus_max....................: 1      min=1 max=1

Now envoy is ~25% faster than nginx.

@rgs1 What were the zlib settings in your setup? Could you check if you set them equal to nginx's?

In my setup they are

memory_level: 8
compression_level: compression_level_1
window_bits: 15
chunk_size: 4096
compression_strategy: default_strategy

Nginx doesn't document how to specify memlevel and wbits (because they are calculated dynamically for a known Content-Length), but in the case of 157k they are 8 and 15 respectively.
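
For anyone trying to reproduce this, here is a sketch of where those knobs live when using the compressor filter with the gzip compressor library; the exact proto paths and enum spellings depend on the Envoy version, so treat this as illustrative rather than the literal config used above:

- name: envoy.filters.http.compressor
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.compressor.v3.Compressor
    compressor_library:
      name: gzip
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.compression.gzip.compressor.v3.Gzip
        # Settings chosen to mirror the zlib parameters nginx ends up using.
        memory_level: 8
        compression_level: COMPRESSION_LEVEL_1
        compression_strategy: DEFAULT_STRATEGY
        window_bits: 15
        chunk_size: 4096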

@dio
Member

dio commented Oct 29, 2020

Wow. So awesome!

@jmarantz
Contributor

Great work!

The only change you made was to set per_connection_buffer_limit_bytes to 8192? What was it previously?

@rojkov
Member

rojkov commented Oct 29, 2020

The only change you made was to set per_connection_buffer_limit_bytes to 8192? What was it previously?

Yes, only this change. Initially there was nothing in my config and the default is 1M.
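
To make the change concrete, a minimal sketch of what this looks like in the bootstrap config (names are illustrative; the field is set on both the listener and the cluster and defaults to 1MiB when omitted):

static_resources:
  listeners:
  - name: ingress_listener                    # illustrative name
    per_connection_buffer_limit_bytes: 8192
    # ... address, filter_chains with the HTTP connection manager and the gzip/compressor filter ...
  clusters:
  - name: upstream_backend                    # illustrative name
    per_connection_buffer_limit_bytes: 8192
    # ... connect_timeout, load_assignment ...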

@rgs1
Member

rgs1 commented Oct 29, 2020

@rojkov hmm, interesting. Could it be a problem that we have different limits for listeners and clusters:

clusters.json:    "per_connection_buffer_limit_bytes": 16384,
listeners.json:    "per_connection_buffer_limit_bytes": 8192

?

This is all using zlib-ng, right?

@rgs1
Member

rgs1 commented Oct 29, 2020

Our settings:

                  "name": "envoy.filters.http.gzip",
                  "typed_config": {
                    "@type": "type.googleapis.com/envoy.config.filter.http.gzip.v2.Gzip",
                    "compression_level": "SPEED",
                    "compressor": {
                      "content_length": 1100,
                      "runtime_enabled": {
                        "default_value": true,
                        "runtime_key": "gzip.filter_enabled"
                      }
                    },
                    "memory_level": 9,
                    "window_bits": 15
                  }

So mostly the same; I wouldn't expect memory_level: 9 vs memory_level: 8 to make such a big difference...

@rojkov
Member

rojkov commented Oct 29, 2020

This is all using zlib-ng, right?

No, I compiled Envoy with the same old zlib used by Nginx.

I don't think the different limits matter that much, nor does memlevel 9. So I don't understand why your results are so different :(

@rgs1
Member

rgs1 commented Oct 29, 2020

@rojkov let me dig a bit more on our side; regardless of our case this is a great finding -- thanks! Shall we document this somewhere?

@rojkov
Member

rojkov commented Oct 29, 2020

Yeah, adding some notes to compressor_filter.rst wouldn't hurt. I'll think about how best to formulate it.

By the way, at some point I experimented with CPU frequency a bit: while benchmarking compression with k6 I artificially loaded all the cores to 100% (with a full envoy rebuild). This made Envoy serve twice as fast, and Nginx showed similar dynamics, which can be counterintuitive yet is explainable. Just a side note on how important it is to run experiments under the same environmental conditions.

@mattklein123
Member

Thanks @rojkov this is amazing work and a really great finding around how buffer sizes affect the overall throughput.

Looks like these "ready" events are scheduled with a delay if writes are done not often enough. To make them more frequent I set per_connection_buffer_limit_bytes to 8192 for clusters and listeners.

After having written ~85k of compressed data to the socket in one chunk, the next "write ready" event comes in ~5ms. Then the connection is finally closed. With compression switched off and writing 157k in one chunk, the "write ready" event comes in <1ms. It looks like these "ready" events are scheduled with a delay if writes are not done often enough.

I'm a little confused and have a few questions that we can follow up on:

  1. In the compressed case, are we basically doing a single writev with a single slice which is giant?
  2. In the uncompressed case, are we doing a single writev with multiple slices? Or is the data spread over more writes because it is streamed?
  3. If we know all data has been written, what does connection closure have to do with the overall latency? I would assume TTLB on the downstream client would be the same regardless of closure as long as content-length is set? Or is the issue we are chunk encoding and the last chunk is not written until 5ms later?
  4. Is there a way we can split up the writes in the compressed case such that max buffer bytes doesn't come into play? I'm a little confused about how decreasing the max connection buffer bytes is helping. Is it causing back pressure so that we read in smaller chunks, which are then compressed in smaller blocks?

cc @antoniovicente for event and buffer expertise. cc @ggreenway as this very likely relates to the perf work you have been doing.

I used PerfAnnotationContext to measure

As an aside, this is really neat. We should document this approach, and/or would it be worthwhile to add these hooks into the code base under some macro guard so that we can more easily use these annotations for perf investigations?

@rojkov
Member

rojkov commented Oct 29, 2020

In the compressed case, are we basically doing a single writev with a single slice which is giant?
In the uncompressed case, are we doing a single writev with multiple slices? Or is the data spread over more writes because it is streamed?

My understanding is that there is a single write with a single slice in both cases. This is how it looks in the logs:

  1. compressed
[2020-10-29 17:56:19.346][1539332][trace][http] [source/common/http/conn_manager_impl.cc:1486] [C2][S34169261113250019] encoding data via codec (size=85429 end_stream=false)
[2020-10-29 17:56:19.346][1539332][trace][connection] [source/common/network/connection_impl.cc:428] [C2] writing 85438 bytes, end_stream false
[2020-10-29 17:56:19.346][1539332][trace][http] [source/common/http/http1/codec_impl.cc:1262] [C1] message complete
[2020-10-29 17:56:19.346][1539332][debug][client] [source/common/http/codec_client.cc:109] [C1] response complete
[2020-10-29 17:56:19.346][1539332][trace][http] [source/common/http/conn_manager_impl.cc:1486] [C2][S34169261113250019] encoding data via codec (size=10 end_stream=true)
[2020-10-29 17:56:19.346][1539332][trace][connection] [source/common/network/connection_impl.cc:316] [C2] readDisable: disable=false disable_count=1 state=0 buffer_length=0
[2020-10-29 17:56:19.347][1539332][trace][http] [source/common/http/http1/codec_impl.cc:599] [C1] parsed 160253 bytes
[2020-10-29 17:56:19.347][1539332][trace][connection] [source/common/network/connection_impl.cc:510] [C1] socket event: 2, ts: 19705
[2020-10-29 17:56:19.347][1539332][trace][connection] [source/common/network/connection_impl.cc:611] [C1] write ready
[2020-10-29 17:56:19.347][1539332][trace][connection] [source/common/network/connection_impl.cc:510] [C2] socket event: 2, ts: 19750
[2020-10-29 17:56:19.347][1539332][trace][connection] [source/common/network/connection_impl.cc:611] [C2] write ready
[2020-10-29 17:56:19.347][1539332][trace][connection] [source/common/network/raw_buffer_socket.cc:77] [C2] write returns: 85738 ts: 19887us <- single write
[2020-10-29 17:56:19.351][1539332][trace][connection] [source/common/network/connection_impl.cc:510] [C2] socket event: 3, ts: 24264us
[2020-10-29 17:56:19.351][1539332][trace][connection] [source/common/network/connection_impl.cc:611] [C2] write ready
  2. uncompressed
[2020-10-29 17:58:31.455][1539478][trace][http] [source/common/http/conn_manager_impl.cc:1486] [C2][S13951305510064282760] encoding data via codec (size=160003 end_stream=false)
[2020-10-29 17:58:31.455][1539478][trace][connection] [source/common/network/connection_impl.cc:428] [C2] writing 160003 bytes, end_stream false
[2020-10-29 17:58:31.455][1539478][trace][http] [source/common/http/http1/codec_impl.cc:1262] [C1] message complete
[2020-10-29 17:58:31.455][1539478][debug][client] [source/common/http/codec_client.cc:109] [C1] response complete
[2020-10-29 17:58:31.455][1539478][trace][http] [source/common/http/conn_manager_impl.cc:1486] [C2][S13951305510064282760] encoding data via codec (size=0 end_stream=true)
[2020-10-29 17:58:31.455][1539478][trace][connection] [source/common/network/connection_impl.cc:316] [C2] readDisable: disable=false disable_count=1 state=0 buffer_length=0
[2020-10-29 17:58:31.455][1539478][trace][http] [source/common/http/http1/codec_impl.cc:599] [C1] parsed 160253 bytes
[2020-10-29 17:58:31.456][1539478][trace][connection] [source/common/network/connection_impl.cc:510] [C1] socket event: 2, ts: 5064
[2020-10-29 17:58:31.456][1539478][trace][connection] [source/common/network/connection_impl.cc:611] [C1] write ready
[2020-10-29 17:58:31.456][1539478][trace][connection] [source/common/network/connection_impl.cc:510] [C2] socket event: 2, ts: 5125
[2020-10-29 17:58:31.456][1539478][trace][connection] [source/common/network/connection_impl.cc:611] [C2] write ready
[2020-10-29 17:58:31.456][1539478][trace][connection] [source/common/network/raw_buffer_socket.cc:77] [C2] write returns: 160256 ts: 5322us <- single write
[2020-10-29 17:58:31.457][1539478][trace][connection] [source/common/network/connection_impl.cc:510] [C2] socket event: 3, ts: 6376us
[2020-10-29 17:58:31.457][1539478][trace][connection] [source/common/network/connection_impl.cc:611] [C2] write ready

ts is a timestamp in microseconds since the ActiveTcpConnection creation.
C2 is the downstream connection.

If we know all data has been written, what does connection closure have to do with the overall latency? I would assume TTLB on the downstream client would be the same regardless of closure as long as content-length is set? Or is the issue we are chunk encoding and the last chunk is not written until 5ms later?

A-ha. I should have asked this question myself... So, I did closed-loop testing, that is, a new connection to the listener is not created until the current one is closed. TTLB should be shorter for a real client. I guess I need to use something other than k6 for benchmarking.

I've just tried to benchmark with 10 virtual simultaneous users connecting to a single-threaded Envoy and got
280 reqs/s (default 1M buffer) vs 371 reqs/s (8k per connection buffer).

With 20 virtual users I got 306 reqs/s vs 364 reqs/s respectively.

With 80 VUs I got 420 reqs/s vs 311 reqs/s.

Nginx, by contrast, consistently gives ~420 reqs/s starting from 10 VUs; increasing VUs further doesn't increase throughput.

@mattklein123
Member

I actually do wonder if some of the fixes @ggreenway is working on may impact what we are seeing. Something still doesn't add up to me here.

@yongzhang

The discussion stopped in 2020; is this still a problem now?
