Re-add and improve the warning for workers terminated due to a signal #2908
Re-add the warning for workers terminated due to a signal
The warnings are now printed later, in the main loop, rather than directly in the signal handler. This should fix the RuntimeError reported in #2564 ("RuntimeError due to a warning being logged on SIGHUP"). See https://stackoverflow.com/q/45680378.
The order of events in the log might be a bit different from before.
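To illustrate the idea of deferring the warning, here is a minimal sketch. It is not the actual code in this PR: the class `DeferredWarnings` and its methods are hypothetical names made up for the example. The signal handler only records what happened, and the main loop does the logging later.

```python
import logging
import signal
from collections import deque

log = logging.getLogger("deferred-warnings-sketch")


class DeferredWarnings:
    """Hypothetical helper: queue warnings in the handler, log them later."""

    def __init__(self):
        self._pending = deque()

    def handle_signal(self, signum, frame):
        # Runs in signal-handler context: do no logging and no I/O here,
        # because the handler may have interrupted a write to the same stream.
        self._pending.append(signum)

    def flush(self):
        # Called from the main loop, where logging is safe.
        messages = []
        while self._pending:
            signum = self._pending.popleft()
            message = "received signal %s" % signal.Signals(signum).name
            log.warning(message)
            messages.append(message)
        return messages
```

In a real master process, `handle_signal` would be registered with `signal.signal()` and `flush()` would be called once per iteration of the main loop. This is also why log lines can appear slightly later than the event that caused them.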
Fix false positive "terminated due to signal 15"
During a reload (i.e. SIGHUP sent to the master), gunicorn used to send multiple SIGTERMs to each worker. Every time the master got a SIGCHLD from one worker exiting, it sent a SIGTERM to every old worker that was still alive.
If a worker receives another SIGTERM when it's just about to shut down and the Python interpreter is already deinitialized, the signal won't be caught. The process exit status will then be WIFSIGNALED with SIGTERM, which triggers the "terminated due to signal 15" warning even though the worker was shutting down normally.
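For reference, this is roughly how a wait status can be decoded on a POSIX system. It is a sketch with a hypothetical helper name (`describe_wait_status`), not gunicorn's own code; it only uses the standard `os.WIFSIGNALED`, `os.WTERMSIG`, `os.WIFEXITED` and `os.WEXITSTATUS` functions.

```python
import os
import signal


def describe_wait_status(status):
    """Describe a raw status value as returned by os.waitpid() (POSIX only)."""
    if os.WIFSIGNALED(status):
        # The process was killed by a signal it did not catch.
        signum = os.WTERMSIG(status)
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = str(signum)
        return "terminated due to signal {}".format(name)
    if os.WIFEXITED(status):
        # The process exited normally, possibly with a non-zero code.
        return "exited with code {}".format(os.WEXITSTATUS(status))
    return "unknown status {}".format(status)
```

A worker that handles SIGTERM and then exits on its own falls into the `WIFEXITED` branch. A worker hit by a late SIGTERM after interpreter shutdown falls into the `WIFSIGNALED` branch, which is the false positive described above.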
Explanation:
This PR limits manage_workers() to sending at most one SIGTERM per worker. I believe this is safe because handle_exit() in workers/base.py just sets self.alive = False anyway, so sending multiple SIGTERMs never helps to get a worker "unstuck". If a worker doesn't react to the first SIGTERM because it's in some kind of infinite loop, murder_workers() will notice it eventually and send it a SIGABRT and then a SIGKILL.
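The "at most one SIGTERM per worker" rule can be sketched as follows. This is an illustration under assumptions, not the diff in this PR: `WorkerTerminator`, `terminate_once` and `forget` are hypothetical names, and the `kill` parameter exists only so the sketch can be exercised without real processes.

```python
import os
import signal


class WorkerTerminator:
    """Hypothetical helper that remembers which workers already got SIGTERM."""

    def __init__(self, kill=os.kill):
        self._kill = kill
        self._already_signaled = set()

    def terminate_once(self, pid):
        # Skip workers that were already asked to shut down.
        if pid in self._already_signaled:
            return False
        self._already_signaled.add(pid)
        self._kill(pid, signal.SIGTERM)
        return True

    def forget(self, pid):
        # Call this once the worker has been reaped, since pids can be reused.
        self._already_signaled.discard(pid)
```

With this bookkeeping, repeated calls triggered by each SIGCHLD no longer re-signal workers that are already shutting down. A worker that ignores its single SIGTERM is still handled by the existing timeout path.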
The SIGTERMs sent by handle_winch() and stop() are not changed in this PR.
I get the impression Gunicorn maintenance is mostly dead, so I don't have high expectations of getting any reply, but I'll hold out hope that somebody will review this PR one day. :)
Testing
Compare what happens on these three commits. Use any WSGI app.
I got the warnings and errors on the first try, but if you can't reproduce the problem, increasing the number of workers might help.