100% nginx usage #80
@eustas Does this mean that you've officially taken over maintenance of this at Google? Great news if so. I think some other developers might have been waiting for the "official" repo to be considered active before progressing. |
Yes, finally, the brotli team is able to support ngx_brotli =) Sorry for the super-late response, I thought I had answered already, but perhaps that was in another issue =) |
@kapouer @HansVanEijsden @hroost does it still happen after #83? |
Currently I'm testing it on two production servers. Compiled and enabled it 15 minutes ago. Up & running with no issues, yet. I'll let it run for some hours and I'll report it back here. Thnx! |
Unfortunately, I still have the problems. After ~18 and ~30 minutes Nginx was hanging on one of the four cores with 100%. And as a consequence 25% of the web requests were hanging too, without getting any response from Nginx on that core. After 40 seconds the CPU usage goes down to normal. My Nginx error log shows this:
The 100% CPU problems appeared exactly on those timestamps. My Nginx Brotli config:
|
Do you have core dumps enabled? It can be easily achieved by using
|
Yes I have, thanks for the suggestion. It's strange: no core dumps are being generated.
Nothing. Also, do you have any other suggestions? |
@chipitsine @HansVanEijsden
( |
That's good to know! |
It depends: if nginx is started as root, it can write dumps to /root; if nginx is started as another user, another folder should be chosen |
My Nginx runs as |
Another update. Monit was running on my servers and was automatically restarting Nginx after those ~40 seconds. That was probably throwing the My syslog:
So, I disabled Monit and tried again. Nginx keeps hanging forever at 100%, without segfaults, without core dumps and without any other log entries. |
is it production environment ? if "no" (i.e. test env), can you provide an ssh access to it ? |
We have the same issue. Nginx 1.16.1 and libbrotli 1.0.7. We deployed ngx_brotli on ~150 servers (with 70k domains) and noticed 100% CPU usage on 3 servers. All servers have the same configuration. We tried to find the domain / request responsible (the problem usually starts 1-2 hours after a restart) and managed to identify the domain; the last request processed by the hanging process was always a 404. If I try to repeat the query that resulted in the error, it never happens; it happens on a random 404 after some time. This suggests the problem is linked to 404 handling. We are using custom error pages generated with SSI using the "include" directive (debug logs show SSI processed just before brotli). Removing the "include" directive from the error page fixed the problem on all 3 servers; the other 147 servers are fine with the "include" directive and ngx_brotli. The error log during the problem (debug log level) shows, repeated:
I can provide full nginx error log if needed. |
@chipitsine I cannot reproduce it in a test environment. It needs some traffic to trigger, as far as I know. Currently I'm reproducing it on a VPS with only one website (a local news website). I managed to get an Nginx debug log while one of the processes was hanging at 100% CPU. I can't find anything special in that log though. Just the normal epoll and thread things. The only thing which I could find, right before it went to 100% CPU, was this (
I don't know if it's something usable though. Nginx 1.17.3 (and also 1.17.4, today's release) and libbrotli 1.0.7. An up-to-date Debian Stretch system.
|
@HansVanEijsden , please contact me in private (either email or skype: chipitsine@gmail.com). |
@chipitsine done! |
@HansVanEijsden 100% CPU won't result in a segfault (so those are 2 separate issues), but usually indicates that there is an infinite loop somewhere. If you run NGINX under |
Does upgrading to nginx 1.16.1 help, see |
I tried using ngx_brotli half a year ago (from the eustas repo) and ran into this bug with 100% CPU. Can you clarify whether this has been resolved? |
AFAICT the bug is still in nginx 1.16.1. |
Same here. Nginx 1.17.6, brotli 1.0.7. |
I have run into this with the fancyindex module in a couple of large directories on my server. On those directories it reproduces 100% of the time with some clients and HTTP/2, and the partial truncated responses the browser receives always end at the same point regardless of specific compression settings. Firefox is the only client I've found that works with brotli compression and HTTP/2. Chromium and curl do not, even when curl is sending the same headers as Firefox. When downgrading to HTTP/1.1 (whether on the client or by disabling it on the server) all clients work in every case I can test. |
I took a stab at it but what I thought was an incompatibility between filters turned out to be a trickier bug. The filter doesn't handle the case where the input is too small to be compressed by brotli, and assumes that flushes are guaranteed to produce output. It ended up not being a simple or clean fix so I'll leave it to the maintainer. I've seen this same bug crop up in other wrappers around brotli, now that I know the cause and symptoms. The |
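For illustration, the failure mode described above can be sketched against the brotli encoder C API. This is only a sketch of the general wrapper pattern, not ngx_brotli's actual code; emit_buffer is a hypothetical callback standing in for "hand a finished buffer downstream":

#include <brotli/encode.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: how a wrapper can spin if it assumes a FLUSH
 * always yields compressed bytes. emit_buffer is a hypothetical callback
 * that passes finished output downstream. */
static int
flush_encoder(BrotliEncoderState *enc,
              void (*emit_buffer)(const uint8_t *p, size_t n))
{
    uint8_t out[4096];

    for (;;) {
        size_t          avail_in = 0;      /* no new input, just a flush */
        const uint8_t  *next_in = NULL;
        size_t          avail_out = sizeof(out);
        uint8_t        *next_out = out;
        size_t          total_out = 0;

        if (!BrotliEncoderCompressStream(enc, BROTLI_OPERATION_FLUSH,
                                         &avail_in, &next_in,
                                         &avail_out, &next_out, &total_out)) {
            return -1;
        }

        if (sizeof(out) - avail_out > 0) {
            emit_buffer(out, sizeof(out) - avail_out);
        }

        /* Bug pattern: looping "until some output was produced" never
         * terminates when the encoder has nothing buffered. The safe exit
         * condition is the encoder reporting no more pending output. */
        if (!BrotliEncoderHasMoreOutput(enc)) {
            return 0;
        }
    }
}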
Thanks for the investigation. Will try to reproduce / fix when get back from vacation. |
@eustas Any news on this one? I was thinking on putting this in out server, but this is a bit scary. Thanks! |
Hi. I'm finally back =) Will work on that this week |
Added basic http2 tests. Still can not reproduce (with |
The only consistent repro I have happens to involve too much private data. Just throwing random data of sufficient length should be enough to hit it eventually; the bug is such that any server running this on dynamic data can expect to run into it. Code inspection is enough to tell that the code does not behave properly when a flush is requested but |
Hmmm... Brotli obeys flushing, so if there was some input (after pushing the last output block) it will produce output again. On the other hand, it won't produce more output if no more input has arrived and just another flush is requested... Will try to reproduce that. Thanks. |
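As a quick way to check that behavior, a minimal standalone program (a sketch assuming libbrotli 1.0.x, linked with -lbrotlienc and -lbrotlicommon) could flush twice and compare the byte counts; the second flush, issued with no new input, is the case in question:

#include <brotli/encode.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    BrotliEncoderState *enc = BrotliEncoderCreateInstance(NULL, NULL, NULL);
    const uint8_t *next_in = (const uint8_t *) "hello";
    size_t avail_in = 5;
    uint8_t out[256];

    for (int i = 0; i < 2; i++) {
        uint8_t *next_out = out;
        size_t avail_out = sizeof(out);
        size_t total_out = 0;

        /* First flush has 5 bytes of pending input and should produce
         * output; the second flush arrives with no new input and is
         * expected to produce 0 bytes. */
        BrotliEncoderCompressStream(enc, BROTLI_OPERATION_FLUSH,
                                    &avail_in, &next_in,
                                    &avail_out, &next_out, &total_out);
        printf("flush %d produced %zu bytes\n", i, sizeof(out) - avail_out);
    }

    BrotliEncoderDestroyInstance(enc);
    return 0;
}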
So, brotli should be a second-level compressor then? ngx_brotli marks a buffer with flush only if the corresponding input buffer was marked so. So what nginx module could help me with simulating this? (Alternatively, I could try to activate some scripting, e.g. php, and try to force a flush out there.) |
Was able to set up http2 + ssl + perl in docker. Will try to set up PS:
hello.pl:

package hello;
use nginx;

# Minimal nginx embedded-perl handler: emits a head line, then ten body
# lines each followed by an explicit flush, then a tail line.
sub handler {
    my $r = shift;
    $r->send_http_header("text/html");
    return OK if $r->header_only;
    $r->print("head\n<br/>");
    for (my $i = 0; $i <= 9; $i++) {
        $r->print("body\n<br/>");
        $r->flush();
    }
    $r->print("tail\n<br/>");
    return OK;
}
1;
__END__ |
(though I believe filters below compression should not affect the workflow, as we've seen with perl; most likely it is something above the compression; @awused, what other filters / modules are configured?) |
@eustas thanks for looking into it.
System (uname -a): |
I played around with it some more. The particular location block where I have a reproduction is also running http_image_filter. If I disable that filter it changes where in the response it happens to spin but doesn't change the nature of the spin. The only other filters running are fancyindex and brotli. What I do see is that brotli is calling ngx_http_next_body_filter while holding an out_buf it believes to be of non-zero size but http_write_filter_module thinks it has been called with a chain of size 0. This is repeated in a tight loop.
I still believe there's a problem with empty output compressor flushes, but that might not be the cause of this error. |
That is interesting: |
NB: also, there is no place in the code that logs like " |
That was something I added immediately before the call to ngx_http_next_body_filter. |
Please add flags to that output: Tried https + h2 + fancy index. Works well on folders with 10-20-30-40-50k empty files with uuid names. |
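As a point of reference, logging those flags just before the chain is handed to the next body filter could look roughly like this. It is a sketch only; ngx_http_brotli_log_chain is a hypothetical helper, not part of the module:

#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>

/* Hypothetical helper: dump the size and flags of every buffer in a chain
 * right before it is passed to ngx_http_next_body_filter. */
static void
ngx_http_brotli_log_chain(ngx_http_request_t *r, ngx_chain_t *in)
{
    ngx_chain_t  *cl;

    for (cl = in; cl; cl = cl->next) {
        ngx_buf_t  *b = cl->buf;

        ngx_log_debug5(NGX_LOG_DEBUG_HTTP, r->connection->log, 0,
                       "brotli out buf size:%O flush:%d sync:%d "
                       "last_buf:%d last_in_chain:%d",
                       ngx_buf_size(b), b->flush, b->sync,
                       b->last_buf, b->last_in_chain);
    }
}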
If it is an infinite loop with If that is the case, we could check which filter in the filter stack holds the buffer. If those cannot be fixed, then we could request a flush in situations like that and see if this would help. |
I've done some testing and:
Setup for testing: openresty + brotli (over http 1.1). brunzip compiled in but not enabled. @eustas it sounds like you were on to something on Mar 6. Do you have anything I can test? I'd be happy to help in that way. |
Thank you @eustas I will test in the next few days (tomorrow maybe) and report back. Crazy busy at the moment (in part due to trying to fix similar issues with my ngx_brunzip) but I promise I'll get back to you ASAP. |
Pushing it to production now. Will keep you up-to-date! 👍🏻 Thanks! |
Woo-hoo =) |
Nice moment to cut a release #90 =) |
Does not fix the issue. Perf results after the upgrade on a spinning worker show that it's still brotli doing the spinning.
Our workload typically means output pressure as the backends are fast. |
nginx/1.17.10 with the latest ngx_brotli in filter mode – 3 days running without problems. |
@gzzz I can also confirm in my case: all the servers in production (even without restarts or reloads of nginx) are running continuously without problems. |
Actually, ignore my report; I incorrectly cherry-picked the commit. Everything has been good in testing for 12 hours now. |
See eustas#30