
Reap workers in the main loop #2314

Open · tilgovi wants to merge 1 commit into master from safe-reaping
Conversation

tilgovi (Collaborator) commented Apr 20, 2020

Handle SIGCHLD like every other signal, waking up the arbiter and handling the signal in the main loop rather than in the signal handler.

Take special care to reinstall the signal handler in case Python avoided doing so to prevent infinite recursion.

Clean up workers and call the worker_exit hook in only one place. When killing a worker, do not clean it up. The arbiter will now clean up the worker and invoke the hook when it reaps the worker.

With reaping handled in the main loop and kill_worker delegating responsibility for cleanup to the reaping loop, iterate over the workers dictionary everywhere else without concern for concurrent modification.
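A minimal, self-contained sketch of the pattern the description above outlines (not Gunicorn's actual code; the `SIG_QUEUE` name and self-pipe wakeup mirror the arbiter's design, everything else is illustrative): the SIGCHLD handler only records the signal and wakes the loop, and `os.waitpid` runs in the main loop.

```python
import errno
import os
import select
import signal

# Self-pipe: the handler writes a byte here to wake the main loop.
PIPE = os.pipe()
SIG_QUEUE = []

def handler(signum, frame):
    # Signal context: do almost nothing, just record and wake up.
    SIG_QUEUE.append(signum)
    os.write(PIPE[1], b".")

def reap_workers():
    # Main-loop context: reap every exited child without blocking.
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError as e:
            if e.errno == errno.ECHILD:  # no children left
                break
            raise
        if pid == 0:  # children exist but none have exited
            break
        reaped.append(pid)
    return reaped

signal.signal(signal.SIGCHLD, handler)

# Demo: fork a throwaway "worker" and handle its exit in the main loop.
child = os.fork()
if child == 0:
    os._exit(0)

select.select([PIPE[0]], [], [])  # sleep until the handler wakes us
os.read(PIPE[0], 1)
if SIG_QUEUE and SIG_QUEUE.pop(0) == signal.SIGCHLD:
    reaped = reap_workers()
print(reaped)
```

The key property is that the handler body is async-signal-cheap; all the real work, including the `worker_exit` hook in the actual patch, happens in ordinary main-loop code.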

@tilgovi tilgovi force-pushed the safe-reaping branch 2 times, most recently from 0a2653c to 0b5d41b Compare April 20, 2020 23:14
@tilgovi tilgovi marked this pull request as draft April 20, 2020 23:26
tilgovi (Collaborator, Author) commented Apr 20, 2020

This is not safe as is due to this code that might cause Gunicorn to miss signals:

    def signal(self, sig, frame):
        if len(self.SIG_QUEUE) < 5:
            self.SIG_QUEUE.append(sig)
            self.wakeup()

tilgovi (Collaborator, Author) commented Apr 21, 2020

Ahh, I think I may have identified why we handled SIGCHLD separately.

Is it because some systems exhibit a behavior where not calling os.waitpid in the signal execution context causes infinite recursion?

http://poincare.matf.bg.ac.rs/~ivana/courses/ps/sistemi_knjige/pomocno/apue/APUE/0201433079/ch10lev1sec7.html

tilgovi (Collaborator, Author) commented Apr 21, 2020

Python actually checks for the situation I linked to, taking care not to re-install the signal handler for SIGCHLD automatically if it might cause infinite recursion.

https://github.com/python/cpython/blob/62183b8d6d49e59c6a98bbdaa65b7ea1415abb7f/Modules/signalmodule.c#L334

I've modified the first commit to address that and commented the code to indicate this case.
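For illustration, a hedged sketch of the workaround being described (not the commit's actual code): re-install the handler explicitly from inside the handler, since CPython may decline to re-install a SIGCHLD handler automatically on platforms where doing so during delivery could recurse.

```python
import signal

def handle_chld(signum, frame):
    # CPython may skip auto-re-installing the SIGCHLD handler on systems
    # where re-installation inside signal delivery could recurse
    # infinitely (see the linked check in Modules/signalmodule.c), so
    # re-install it explicitly. Re-installing a handler that is already
    # installed is harmless.
    signal.signal(signal.SIGCHLD, handle_chld)
    # ... wake the main loop here; actual reaping happens there ...

signal.signal(signal.SIGCHLD, handle_chld)
print(signal.getsignal(signal.SIGCHLD) is handle_chld)
```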

@tilgovi tilgovi marked this pull request as ready for review April 21, 2020 01:33
@tilgovi tilgovi marked this pull request as draft May 3, 2020 22:42
benoitc (Owner) left a review:

Made some comments, but I'm still unsure I understand the need for this patch. Can you clarify what it is trying to solve? Do we have a way to reproduce the issue?

tilgovi (Collaborator, Author) commented Aug 27, 2020

Can you clarify what is it trying to solve? Do we have a way to reproduce the issue?

Just code cleanup. By removing the duplicated cleanup code, it eliminates concurrent modification of the workers dictionary, so we no longer need to call list() on it when we loop over it.
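An illustrative before/after with toy data (not Gunicorn's code) of what that cleanup buys:

```python
# Toy stand-in for the arbiter's WORKERS dict (pid -> worker object).
WORKERS = {1001: "worker-a", 1002: "worker-b"}

# Before: a defensive copy was needed, because the SIGCHLD handler could
# remove a worker mid-iteration and raise
# "RuntimeError: dictionary changed size during iteration".
pids_before = [pid for pid in list(WORKERS)]

# After: only the main-loop reaping code mutates WORKERS, so no other
# code can observe concurrent modification and plain iteration is safe.
pids_after = [pid for pid in WORKERS]

assert pids_before == pids_after == [1001, 1002]
```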

tilgovi (Collaborator, Author) commented Sep 13, 2020

I reworked this with a separate indicator flag that the arbiter uses to mark that it needs to reap workers on the next main-loop iteration. This change ensures that the arbiter will always reap workers, even if the signal queue is full. I also added a better explanation of why it is good to re-install the handler.

Please take another look.
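A runnable sketch of that flag-based approach (class and attribute names here are hypothetical, not the patch's actual identifiers):

```python
import errno
import os

class MiniArbiter:
    """Toy arbiter: SIGCHLD sets a flag; the main loop does the reaping."""

    def __init__(self):
        self.reap_needed = False
        self.pipe = os.pipe()  # self-pipe for waking the main loop

    def handle_chld(self, signum, frame):
        # Unlike a bounded signal queue, a boolean flag can never be
        # "full", so a reap request is never dropped.
        self.reap_needed = True
        os.write(self.pipe[1], b".")  # wake the main loop

    def run_once(self):
        # One main-loop iteration: reap only if the handler asked us to.
        if not self.reap_needed:
            return []
        self.reap_needed = False
        reaped = []
        while True:
            try:
                pid, status = os.waitpid(-1, os.WNOHANG)
            except OSError as e:
                if e.errno == errno.ECHILD:  # no children left
                    break
                raise
            if pid == 0:
                break
            reaped.append(pid)
        return reaped
```

Coalescing many SIGCHLD deliveries into one flag is safe because the reaping loop drains every exited child with `WNOHANG` each time it runs.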

@tilgovi tilgovi marked this pull request as ready for review September 13, 2020 20:26
benoitc (Owner) commented Sep 17, 2020

The use of list(self.WORKERS.items()) was mainly there because .items() was not considered safe: when a worker dies, the dictionary is updated concurrently.

Anyway the patch looks fine. I need to test it on BSDs systems, in particularly OpenBSD where I remember the special handling of the CHLD signal was added for.

Can it wait next release (let's make it next month?) ?

tilgovi (Collaborator, Author) commented Sep 19, 2020

The use of list(self.WORKERS.items()) was mainly there because .items() was not considered safe: when a worker dies, the dictionary is updated concurrently.

Not any more because that's what this patch changes!

Anyway the patch looks fine. I need to test it on BSDs systems, in particularly OpenBSD where I remember the special handling of the CHLD signal was added for.

I don't think any BSD has this problem. From what I read, it's most likely some versions of Solaris. In any case, setting the handler when it is already set is not a problem, so I'm not worried about a regression here. The only possible change here would be to fix behavior on a system where Gunicorn was previously broken.

Can it wait next release (let's make it next month?) ?

Absolutely.

@tilgovi tilgovi requested a review from benoitc March 25, 2023 01:18
@tilgovi tilgovi changed the title Safe Worker Reaping Reap workers in the main loop Dec 27, 2023
@tilgovi tilgovi force-pushed the safe-reaping branch 2 times, most recently from 4387671 to 7e279d2 Compare December 27, 2023 22:36
tilgovi (Collaborator, Author) commented Dec 27, 2023

@benoitc would you take another look here, please? I think it's a good idea to reap on the main thread. I think this approach is easier to understand, safer, reduces duplicate code, and eliminates concurrent modification of the workers list.

Handle SIGCHLD like every other signal, waking up the arbiter and
handling the signal in the main loop rather than in the signal handler.
Take special care to reinstall the signal handler since Python may not.

Clean up workers and call the worker_exit hook in only one place. When
killing a worker, do not clean it up. The arbiter will now clean up the
worker and invoke the hook when it reaps the worker.

Ensure that all workers have their temporary watchdog files closed and
that the arbiter does not exit or log about other child processes dying.

With reaping handled in the main loop and kill_worker delegating
responsibility for cleanup to the reaping loop, iterate over the workers
dictionary everywhere else without concern for concurrent modification.
    if not worker:
        continue

    worker.tmp.close()
tilgovi (Collaborator, Author) commented:
I thought it was best to ensure that we only do anything below here if this is actually our worker. I can make this a separate PR if that's desired.

@tilgovi tilgovi added this to the 22.0 milestone Dec 29, 2023
    SIG_NAMES = dict(
        (getattr(signal, name), name[3:].lower()) for name in dir(signal)
    -   if name[:3] == "SIG" and name[3] != "_"
    +   if name[:3] == "SIG" and name[3] != "_" and name[3:] != "CLD"
    )
piskvorky commented Feb 2, 2024

This condition gave me pause; it's not obvious.

Suggested change:

    -   if name[:3] == "SIG" and name[3] != "_" and name[3:] != "CLD"
    +   if name[:3] == "SIG" and name[3] != "_" and name[3:] != "CLD"  # SIGCLD is an obsolete name for SIGCHLD

At least according to https://man7.org/linux/man-pages/man7/signal.7.html

What was the motivation for singling it out here?

tilgovi (Collaborator, Author) replied:

I thought I encountered an error without this (on Linux) but I'll double check.


@sylt's solution in #3148 seems a bit cleaner.

benoitc (Owner) commented Feb 2, 2024

@tilgovi I will check this patch. But one of the reasons we handled CHLD differently is that monitoring is separate from handling signals targeting the process itself. Having it all in the same signal loop means that this signal will be in the same queue.

sylt commented Feb 2, 2024

I wasn't aware of this PR until now, but FYI, I've posted #3148, which also aims to handle SIGCHLD on the main thread, although in a slightly different way.

piskvorky commented Feb 2, 2024

I've been running & testing #3148 (@sylt's PR), so far without problems. It seems more complete – perhaps the two PRs could be combined, gaining the benefits of both?

And then there's #2908 too of course.

tilgovi (Collaborator, Author) commented Feb 3, 2024

@tilgovi I will check this patch. But one of the reasons we handled CHLD differently is that monitoring is separate from handling signals targeting the process itself. Having it all in the same signal loop means that this signal will be in the same queue.

We could put the SIGCHLD signal on the front of the queue, if what you want is just to handle it with higher priority, but I think it's a good idea to make the handler itself short and defer reaping to the main thread.
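That front-of-queue idea might look like this (a hypothetical sketch, not code anyone has proposed in the patch; `SIG_QUEUE` mirrors the arbiter's attribute, the rest is illustrative):

```python
import signal

SIG_QUEUE = []
MAX_QUEUE = 5

def enqueue(sig):
    # Hypothetical variant: SIGCHLD is always accepted and jumps the
    # queue; other signals are dropped once the bounded queue is full.
    if sig == signal.SIGCHLD:
        SIG_QUEUE.insert(0, sig)
    elif len(SIG_QUEUE) < MAX_QUEUE:
        SIG_QUEUE.append(sig)
    # the real handler would call wakeup() here

enqueue(signal.SIGTERM)
enqueue(signal.SIGCHLD)
print(SIG_QUEUE[0] == signal.SIGCHLD)
```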

I also had a version of this with just a boolean flag for whether to reap on wakeup, if for some reason you really don't want SIGCHLD in the queue at all. But I don't think I understand in what way this signal is for "monitoring" or different from "signals targeting the process itself."

Can you explain more about what concerns you?

tilgovi (Collaborator, Author) commented Feb 3, 2024

I see on the other PR that one hesitation you have is that you don't want to wait for children when exiting Gunicorn, but that already happens: while we reap children in the signal context, handling of other signals is blocked.

4 participants