Error instead of warning in logs when sending SIGTERM since v21 #3050

Open · feliixx opened this issue Aug 7, 2023 · 10 comments
feliixx commented Aug 7, 2023

Hi @benoitc,

Thanks a lot for all the work you've done on Gunicorn!

We recently upgraded from version 20.1.0 to 21.2.0 and noticed in our logs when restarting our application that a warning has been replaced by an error when sending SIGTERM to a worker:

version 20.1.0:

systemd: Stopping App...
app: gunicorn WARNING Worker with pid 2542594 was terminated due to signal 15
app: gunicorn INFO Handling signal: term
app: gunicorn INFO Worker exiting (pid: 2542537)
app: gunicorn INFO Worker exiting (pid: 2542545)
app: gunicorn INFO Shutting down: Master
systemd: Stopped App.
systemd: Started App.

version 21.2.0:

systemd: Stopping App...
app: gunicorn ERROR Worker (pid:1340527) was sent SIGTERM!
app: gunicorn INFO Handling signal: term
app: gunicorn INFO Worker exiting (pid: 1340420)
app: gunicorn INFO Worker exiting (pid: 1340424)
app: gunicorn INFO Shutting down: Master
systemd: Stopped App.
systemd: Started App.

This is a bit annoying for us as we carefully monitor any error sent by our app, and this seems to be a false positive.

The commits introducing this change are 76f8da2 and 28df992; however, it's not clear from the commit messages why the log level was changed. The documentation about signals does say that sending SIGTERM is the proper way to do a graceful shutdown.

Is it possible to change the log level back to warning, at least for SIGTERM? We'll be happy to send a PR if you agree.
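
For illustration, a minimal sketch of the change being requested (plain Python, not Gunicorn's actual code; the helper name and its surroundings are hypothetical): demote the message to a warning when the signal is SIGTERM, and keep the error level for abnormal signals.

import signal

def log_worker_signal(log, pid, signum):
    # Hypothetical helper: log the signal that terminated a reaped worker.
    # SIGTERM is the documented way to stop a worker gracefully, so it is
    # logged as a warning; anything else (SIGSEGV, SIGBUS, ...) stays an error.
    sig_name = signal.Signals(signum).name
    if signum == signal.SIGTERM:
        log.warning("Worker (pid:%s) was sent %s!", pid, sig_name)
    else:
        log.error("Worker (pid:%s) was sent %s!", pid, sig_name)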

@Venture200

Hello, any update on this?

@pauljeannot

Hello, also affected by this issue. Any update on this?

feliixx (Author) commented Oct 1, 2024

For other people affected by this: we've managed to sidestep the issue by using KillMode=mixed in our systemd service config:

[Service]
KillMode=mixed
...

We've been using this config in production for over a year with Gunicorn 21.2.0 and above without issues.
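
For reference, an illustrative (not the poster's actual) unit file showing where the directive goes; the names and paths below are placeholders:

[Unit]
Description=App (gunicorn)
After=network.target

[Service]
# With KillMode=mixed, systemd delivers SIGTERM only to the main (master)
# process on stop; per the discussion below, remaining workers are killed
# with SIGKILL instead of receiving their own SIGTERM, so the arbiter no
# longer logs the false-positive error.
KillMode=mixed
ExecStart=/usr/local/bin/gunicorn --workers 4 app.wsgi:application
Restart=on-failure

[Install]
WantedBy=multi-user.target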

@pauljeannot

Just to check my understanding: using KillMode=mixed means that while the master process receives SIGTERM for a graceful shutdown, the workers are immediately sent SIGKILL, which prevents them from completing any pending tasks or writing logs (so no false positives in your monitoring).

Am I right?

feliixx (Author) commented Oct 2, 2024

Just to check my understanding: using KillMode=mixed means that while the master process receives SIGTERM for a graceful shutdown, the workers are immediately sent SIGKILL, which prevents them from completing any pending tasks or writing logs

Exactly. As you noted, that's not ideal: the workers don't shut down gracefully but are killed immediately, which might be problematic in some cases.

benoitc (Owner) commented Oct 2, 2024

The log level is indeed wrong. This will be fixed in the coming release.

@pauljeannot

@benoitc Thanks! I just opened a PR for it, in case it helps.

sylt added a commit to sylt/gunicorn that referenced this issue Oct 5, 2024
This is to prevent them from getting double SIGTERM signals (once the main loop runs self.manage_workers again and notices that there are too many workers spawned). It seems as if they exit due to signal termination when they are sent SIGTERM but are not yet reaped.

Hopefully solves benoitc#3050.

sylt commented Oct 5, 2024

I'm going to go against the stream here, but I honestly don't think the log level is wrong. Instead, I think we should ask: why do we even hit this code path in the first place? If SIGTERM is caught by the worker as a means of doing a graceful shutdown (meaning the worker should exit normally with exit code 0), why are we executing a code path for handling workers terminated by a signal? I would argue that if the worker exited normally, we shouldn't be in that code path at all.

I've tried to address what I consider to be the root cause in #3312. It's more intricate than changing the log level, of course, but I think it's more correct. The benefit of keeping the log level as is would be that other signals (such as SIGSEGV and SIGKILL) are still logged as errors.
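
The distinction sylt is drawing is the standard POSIX one: when a parent reaps a child, the wait status says whether the child exited on its own or was killed by a signal. A self-contained sketch (plain Python, not Gunicorn code):

import os
import signal
import sys
import time

# Child: install a SIGTERM handler and exit normally, like a gunicorn
# worker doing a graceful shutdown.
pid = os.fork()
if pid == 0:
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
    time.sleep(60)
    sys.exit(1)

time.sleep(0.1)                 # give the child time to install its handler
os.kill(pid, signal.SIGTERM)
_, status = os.waitpid(pid, 0)

if os.WIFSIGNALED(status):      # killed by an *unhandled* signal
    print("terminated by signal", os.WTERMSIG(status))
elif os.WIFEXITED(status):      # exited normally -- the graceful case
    print("exited with code", os.WEXITSTATUS(status))

Because the child catches SIGTERM and exits with code 0, WIFSIGNALED() is false here; on sylt's reading, a gracefully stopped worker should therefore never reach the "terminated by signal" branch at all.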

violuke commented Oct 5, 2024

@sylt I disagree; sorry. We don't use Gunicorn's settings to automatically restart workers; instead, we do so manually (well, automatically) based on a number of factors that we monitor ourselves (e.g., pod movements, memory spikes, deploys). We first gracefully move new traffic away to other Gunicorn pods and then manually send SIGTERM to the master process. This leaves errors in our logs that we don't want, as nothing is wrong and there hasn't been any error.

For more context on why: if we let Gunicorn restart workers as it pleases (e.g. based on max_requests), Caddy (which sits in front of Gunicorn) returns 502 errors for a very brief period, as there are no workers available for new requests. Sure, we can let Caddy auto-retry, which we do for GET requests, but we cannot blindly retry requests like PUT/POST, as Caddy doesn't know why no Gunicorn worker was available, and we might process the same request multiple times.

I hope this makes sense and shows why for us this is a problem.

sylt commented Oct 5, 2024

@violuke No problem :) I don't think we're really disagreeing; I'm just questioning the implementation of how to prevent the error from happening! I totally agree that you shouldn't be seeing an error in the scenario you describe. My main point is that I wonder whether we should be hitting the code path that produces the error at all if the worker exited normally. With that background, lowering the log level didn't seem like the obvious solution from an implementation point of view.

In the PR I posted, what I had in mind fixing was the SIGHUP case (when the arbiter restarts all of its workers), as that is where I noticed these errors myself. That is what I thought systemd was doing, but this is most likely where I was mistaken, no? In your case, where you send SIGTERM manually to the main process only, I honestly can't figure out why the error would occur. I would have understood it if all child processes were sent SIGTERM at the same time, as that is easy to reproduce; in that case, though, I'd argue it's user error.

With all that said, lowering the log level is perhaps the easiest/best way to solve the issue for now; I just think it should be done for the right reasons.
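
To make the reproduction sylt describes concrete, a hypothetical sketch (the PID is a placeholder) contrasting signalling only the master with signalling the whole process group:

import os
import signal

master_pid = 1234  # placeholder for the gunicorn master's PID

# Graceful: only the master receives SIGTERM and shuts its workers down
# itself; this is the documented way to stop gunicorn.
os.kill(master_pid, signal.SIGTERM)

# Group-wide: the master *and* every worker receive SIGTERM at once --
# the easy-to-reproduce case sylt argues is user error.
# (In practice you would do one or the other, not both.)
os.killpg(os.getpgid(master_pid), signal.SIGTERM)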
