Error instead of warning in logs when sending SIGTERM since v21 #3050

Open · feliixx opened this issue Aug 7, 2023 · 10 comments
feliixx commented Aug 7, 2023

Hi @benoitc,

Thanks a lot for all the work you've done on Gunicorn!

We recently upgraded from version 20.1.0 to 21.2.0 and noticed in our logs when restarting our application that a warning has been replaced by an error when sending SIGTERM to a worker:

version 20.1.0:

systemd: Stopping App...
app: gunicorn WARNING Worker with pid 2542594 was terminated due to signal 15
app: gunicorn INFO Handling signal: term
app: gunicorn INFO Worker exiting (pid: 2542537)
app: gunicorn INFO Worker exiting (pid: 2542545)
app: gunicorn INFO Shutting down: Master
systemd: Stopped App.
systemd: Started App.

version 21.2.0:

systemd: Stopping App...
app: gunicorn ERROR Worker (pid:1340527) was sent SIGTERM!
app: gunicorn INFO Handling signal: term
app: gunicorn INFO Worker exiting (pid: 1340420)
app: gunicorn INFO Worker exiting (pid: 1340424)
app: gunicorn INFO Shutting down: Master
systemd: Stopped App.
systemd: Started App.

This is a bit annoying for us as we carefully monitor any error sent by our app, and this seems to be a false positive.

The commits introducing this change are 76f8da2 and 28df992; however, it's not clear from the commit messages why the log level was changed. The documentation about signals does say that sending SIGTERM is the proper way to do a graceful shutdown.

Is it possible to change the log level back to warning, at least for SIGTERM? We'll be happy to send a PR if you agree.
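
For illustration, a minimal sketch of the change being requested (plain Python, not Gunicorn's actual code; the helper name and its surroundings are hypothetical): demote the message to a warning when the signal is SIGTERM, and keep the error level for abnormal signals.

import signal

def log_worker_signal(log, pid, signum):
    # Hypothetical helper: log the signal that terminated a reaped worker.
    # SIGTERM is the documented way to stop a worker gracefully, so it is
    # logged as a warning; anything else (SIGSEGV, SIGBUS, ...) stays an error.
    sig_name = signal.Signals(signum).name
    if signum == signal.SIGTERM:
        log.warning("Worker (pid:%s) was sent %s!", pid, sig_name)
    else:
        log.error("Worker (pid:%s) was sent %s!", pid, sig_name)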

@Venture200

Hello, any update on this?

@pauljeannot

Hello, also affected by this issue. Any update on this?

feliixx (Author) commented Oct 1, 2024

For other people affected by this: we've managed to sidestep the issue by using KillMode=mixed in our systemd service config:

[Service]
KillMode=mixed
...

We've been using this config in production for over a year with Gunicorn 21.2.0 and above without issues.
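
For reference, an illustrative (not the poster's actual) unit file showing where the directive goes; the names and paths below are placeholders:

[Unit]
Description=App (gunicorn)
After=network.target

[Service]
# With KillMode=mixed, systemd delivers SIGTERM only to the main (master)
# process on stop; per the discussion below, remaining workers are killed
# with SIGKILL instead of receiving their own SIGTERM, so the arbiter no
# longer logs the false-positive error.
KillMode=mixed
ExecStart=/usr/local/bin/gunicorn --workers 4 app.wsgi:application
Restart=on-failure

[Install]
WantedBy=multi-user.target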

@pauljeannot

Just to check my understanding: using KillMode=mixed means that while the master process receives SIGTERM for a graceful shutdown, the workers are immediately sent SIGKILL, which prevents them from completing any pending tasks or writing logs (so no false positives in your monitoring).

Am I right?

feliixx (Author) commented Oct 2, 2024

Just to check my understanding: using KillMode=mixed means that while the master process receives SIGTERM for a graceful shutdown, the workers are immediately sent SIGKILL, which prevents them from completing any pending tasks or writing logs

Exactly. As you noted, that's not ideal: the workers don't shut down gracefully but are killed immediately, which might be problematic in some cases.

benoitc (Owner) commented Oct 2, 2024

The log level is indeed wrong. This will be fixed in the coming release.

@pauljeannot

@benoitc Thanks! I just opened a PR for it, in case it helps.

sylt added a commit to sylt/gunicorn that referenced this issue Oct 5, 2024
This is to prevent them from getting double SIGTERM signals (once the main loop runs self.manage_workers again and notices that there are too many workers spawned). It seems as if they exit due to signal termination when they are sent SIGTERM but are not yet reaped.

Hopefully solves benoitc#3050.

sylt commented Oct 5, 2024

I'm going to go against the stream here, but I honestly don't think the log level is wrong. Instead, I think we should ask: why do we even hit this code path in the first place? If SIGTERM is caught by the worker as a means of doing a graceful shutdown (meaning the worker should exit normally with exit code 0), why are we executing a code path for handling workers terminated by a signal? I would argue that if the worker exited normally, we shouldn't be in that code path at all.

I've tried to address what I consider to be the root cause in #3312. It's more intricate than changing the log level, of course, but I think it's more correct. The benefit of keeping the log level as is would be that other signals (such as SIGSEGV and SIGKILL) are still logged as errors.
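
The distinction sylt is drawing is the standard POSIX one: when a parent reaps a child, the wait status says whether the child exited on its own or was killed by a signal. A self-contained sketch (plain Python, not Gunicorn code):

import os
import signal
import sys
import time

# Child: install a SIGTERM handler and exit normally, like a gunicorn
# worker doing a graceful shutdown.
pid = os.fork()
if pid == 0:
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
    time.sleep(60)
    sys.exit(1)

time.sleep(0.1)                 # give the child time to install its handler
os.kill(pid, signal.SIGTERM)
_, status = os.waitpid(pid, 0)

if os.WIFSIGNALED(status):      # killed by an *unhandled* signal
    print("terminated by signal", os.WTERMSIG(status))
elif os.WIFEXITED(status):      # exited normally -- the graceful case
    print("exited with code", os.WEXITSTATUS(status))

Because the child catches SIGTERM and exits with code 0, WIFSIGNALED() is false here; on sylt's reading, a gracefully stopped worker should therefore never reach the "terminated by signal" branch at all.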

violuke commented Oct 5, 2024

@sylt I disagree; sorry. We don't use Gunicorn's settings to automatically restart workers; instead, we do so manually (well, automatically) based on a number of factors that we monitor ourselves (e.g., pod movements, memory spikes, deploys). We first gracefully move new traffic away to other Gunicorn pods and then manually send SIGTERM to the master process. This leaves errors in our logs that we don't want, as nothing is wrong and there hasn't been any error.

For more context on why: if we let Gunicorn restart workers as it pleases (e.g. based on max_requests), Caddy (which sits in front of Gunicorn) returns 502 errors for a very brief period, as there are no workers available for new requests. Sure, we can let Caddy auto-retry, which we do for GET requests, but we cannot blindly retry requests like PUT/POST, as Caddy doesn't know why no Gunicorn worker was available, and we might process the same request multiple times.

I hope this makes sense and shows why for us this is a problem.

sylt commented Oct 5, 2024

@violuke No problem :) I don't think we're really disagreeing; I'm just questioning the implementation of how to prevent the error from happening! I totally agree that you shouldn't be seeing an error in the scenario you describe. My main point is that I wonder whether we should be hitting the code path that produces the error at all if the worker exited normally. With that background, lowering the log level didn't seem like the obvious solution from an implementation point of view.

In the PR I posted, what I had in mind fixing was the SIGHUP case (when the arbiter restarts all of its workers), as that is where I noticed these errors myself. That is what I thought systemd was doing, but this is most likely where I was mistaken, no? In your case, where you send SIGTERM manually to the main process only, I honestly can't figure out why the error would occur. I would have understood it if all child processes were sent SIGTERM at the same time, as that is easy to reproduce; in that case, though, I'd argue it's user error.

With all that said, lowering the log level is perhaps the easiest/best way to solve the issue for now; I just think it should be done for the right reasons.
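
To make the reproduction sylt describes concrete, a hypothetical sketch (the PID is a placeholder) contrasting signalling only the master with signalling the whole process group:

import os
import signal

master_pid = 1234  # placeholder for the gunicorn master's PID

# Graceful: only the master receives SIGTERM and shuts its workers down
# itself; this is the documented way to stop gunicorn.
os.kill(master_pid, signal.SIGTERM)

# Group-wide: the master *and* every worker receive SIGTERM at once --
# the easy-to-reproduce case sylt argues is user error.
# (In practice you would do one or the other, not both.)
os.killpg(os.getpgid(master_pid), signal.SIGTERM)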
