Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arbiter: Wait for all old workers to terminate on SIGHUP #3312

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sylt
Copy link

@sylt sylt commented Oct 5, 2024

This is to prevent them getting double SIGTERM signals sent to them (once the main loop runs self.manage_workers again and notices that there are too many workers spawned). It seems as if they exit due to signal termination in case they are sent SIGTERM, but are not yet reaped.

Hopefully solves #3050.

This is to prevent them getting double SIGTERM signals sent to them
(once the main loop runs self.manage_workers again and notices that
there are too many workers spawned). It seems as if they exit due to
signal termination in case they are sent SIGTERM, but are not yet
reaped.

Hopefully solves benoitc#3050.
@benoitc
Copy link
Owner

benoitc commented Oct 7, 2024

thanks for the pr. I will check it.

however the need for this looks odd for me, this should have been handled bh the engine already. I will review recent changes to check.

@sylt
Copy link
Author

sylt commented Oct 7, 2024

Yeah, maybe it is odd. Perhaps there are better ways to fix it? I can agree that this is a bit of a band-aid.

Just for the record, in order for you to make an independent analysis of the problem, this is how I've reproduced the ERROR logs:

# In one console, start the example standalone app:
python examples/standalone_app.py

# In another console, issue a HUP to the main gunicorn process:
pkill -f --oldest --signal HUP '.*python.*examples/standalone_app.py'

I think this "error" has existed for ages, but it has never been caught due to that it wasn't logged as an error before. E.g., it couldn't be reproduced 20.0.0, but if you add the commit (28df992) which just adds the logging of the error, you'll see it.

There are other, more obscure, ways to achieve this error as well. Here is an example using the standalone app again:

# Rapidly, make one worker exit, and one worker start:
for sig in TTOU TTIN; do 
  pkill -f --oldest --signal $sig '.*python.*examples/standalone_app.py'; 
done

And to be clear, this PR will not fix this case above. I have tried to fix the more general problem in sylt#1, but that unfortunately relies on work done in #3148, even thought I guess it would be portable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants