Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved Worker Reliability #83

Closed
wants to merge 4 commits into from

Conversation

danhunsaker
Copy link
Contributor

Specifically, this improves the multi-worker forking process by having the
parent process stick around and monitor the children. If a child process
dies, and the supervisor has not been instructed to shut them all down, it
will spawn a new child to replace it. It will pass signals down to all
children. Currently, it only passes the signals explicitly watched by the
Worker class. Children will be told that they have a parent process, and
check periodically whether the supervisor is still running. This check is
performed at the beginning of reserve() to ensure orphaned workers handle
their condition properly. That is, if the supervisor is gone, the child will
exit. This is intended to allow users to kill the supervisor and
realistically expect that no new jobs will be processed until a new
supervisor (or single-worker instance) is started.

The other change, not related to worker-supervisor relationships,
nevertheless still involves forking. Specifically, when a job process exits,
the exit code is already being handled if it is greater than zero. However,
if it is zero, the code prior to this commit assumed the exit was performed
by PHP Resque itself. Usually, this is correct. The trouble comes when jobs
exit without specifying an exit status - PHP defaults these to 0, for
whatever reason. In these cases, the job status is not properly updated.
Generally speaking, if you're writing jobs yourself - in their entirety -
they will responsibly not use exit() at any point. But if your jobs use a
framework, the chances of an unexpected exit() causing issues go up immensely.
So, if the exit code is zero, and the job status hasn't been updated from
queued or running, this commit's code causes the status to be updated to
completed. This assumes that such jobs are properly written to only use exit
status 0 for success, so some users might wish to change it to failed, but
that's not best practice in the slightest, so I wouldn't encourage that.

Signed-off-by: Daniel Hunsaker danhunsaker@gmail.com

@tomschlick
Copy link

Any word on the possibility of this getting merged? Seems like a very good patch and could help solve the painful issue of restarting many workers when you deploy...

@ShonM
Copy link

ShonM commented Feb 25, 2013

The first part seems pretty close to https://github.com/ebernhardson/php-resque-pool

Rebooting all workers would be as simple as kill -HUP ps -A|awk '/[r]esque-pool/{print $1}'``

@danhunsaker
Copy link
Contributor Author

@ShonM: It is certainly similar, but there are some differences, some of which are significant. For one, php-resque-pool acts as a wrapper process around the PHP Resque library, while this pull request integrates this kind of functionality into the PHP Resque library itself. This is significant because PHP Resque already did most of this anyway - the changes I've introduced here simply complete this functionality.

This in no way affects the usefulness of php-resque-pool, however, because it still does a couple of things that PHP Resque still doesn't natively handle. Specifically, the ability to use a YAML document to configure a worker pool, and the ability to specify multiple queue groups in the same pool (for example, 1 worker on one queue, another 1 on a second queue, and 4 on every queue - this is three separate queue groups in a single worker pool). Each pool created using only the native support can only operate on one queue group at a time, and the configuration can only be set via the environment.

I could also comment on how variable the definition of "as simple as" tends to be, but that would just be trolling, so I'll refrain from further comment on that. :)

Ultimately, though, the decision of whether to include this patch will probably come down to whether PHP Resque is intended to operate identically to its Ruby counterpart, or if it is seen as independent of that project and should stand on its own. (And probably also on whether or not the Ruby version handles multiple workers natively...) That's a factor I can't speak to - only @chrisboulton can make that call.

@jbrower
Copy link

jbrower commented Mar 4, 2013

I'd like to see this get merged in. While php-resque-pool works, it's a distinct project that may not be as well supported as this one.

@danhunsaker
Copy link
Contributor Author

Had a bit of a scare earlier today when updating our internal version of PHP-Resque to the latest while still preserving the modifications here. (We also have reasons to forbid the use of Composer, so I had to add the require_once() logic manually, but that part wasn't too difficult...) It seems one of our working branches got corrupted with a prior attempt at what eventually became this branch, and it was still firing and responding to SIGALRM.

This is a problem because PHP doesn't resume execution at the point where the alarm signal was received. At least, it doesn't do so properly. So the child forks created to execute jobs in a safe space ended up never terminating - or, indeed, completing their assigned tasks - and very quickly things were so broken they were useless. Took a while to track down the reversion, but once removed, things resumed their former solid operation.

Just to be safe, I checked that the same reversion hadn't made its way into this PR. Luckily, it hadn't, but I had forgotten to remove an output coloration artifact. I have removed that now, and updated the PR to merge in the latest content of master. Hopefully, this will make approving the merge possible without any conflicts, as it shouldn't be more than a fast-forward merge. 😄

@danhunsaker
Copy link
Contributor Author

Updated again with latest changes upstream. Had to resolve a merge conflict involving the PREFIX env var addition, so I figured I'd push the merge commit to keep the pull simple, should this PR get merged.

@danhunsaker
Copy link
Contributor Author

Following up again so this doesn't get lost completely in the shuffle of issues and PRs. This code has been in use, in production, for months now - at least as long as the PR has existed, and IIRC, somewhat longer - with no negative effects. So, yeah, just curious what the verdict on this kind of functionality will be.

@Dynom
Copy link

Dynom commented Aug 30, 2013

I'd like to see this in too, what do you think @chrisboulton ?

👍

@Aeon
Copy link

Aeon commented Oct 15, 2013

👍 please consider merging this...

@damncabbage
Copy link

A side-benefit of getting this merged would be the possibility of being able to run this via Foreman.

(Foreman, like Monit, thinks the main process has gone away and bails immediately on a foreman start.)

@mjsilva
Copy link

mjsilva commented Apr 7, 2014

Hey guys I just stumble into this problem when I tried to supervise php resque consumers with threads (COUNT=>1).

Is there any plans to merge this? Or any interim solution?

Thanks!

@parkan
Copy link

parkan commented Jan 3, 2015

+1 on this, trying to use php-resque on heroku and experiencing the same problem

@parkan
Copy link

parkan commented Jan 3, 2015

Hmm I am seeing

PHP Parse error: syntax error, unexpected '{' in [..]/php-resque/bin/resque on line 127

bin/ doesn't seem to be tested, so the ci/unit tests pass even though this fails

problem is at HERE--> file_put_contents($PIDFILE, getmypid()) or HERE-->{...

I don't think "or {..." is valid syntax in any recent php version.

@danhunsaker
Copy link
Contributor Author

Hrm. Not sure what I was thinking, there... Must have recently been working with JavaScript, I suppose.

As to bin/resque not being tested, it was always intended to be an example script upon which to build your own, so testing it does feel a little ... superfluous. Of course, it's almost never been treated as an example by users of the project, so we submit patches to it that make it more robust for most uses, and reduce its example status ever further with each one. So it might be a good idea to have tests for it as well, though it would likely need to be reworked some to be made more (or even actually) testable.

@robertheydenrych
Copy link

Any update on this? Would really love to see this merged!

@dkcwd
Copy link

dkcwd commented Jul 23, 2015

It would really be great to see this merged.
👍

@sleenen
Copy link

sleenen commented Sep 19, 2015

@chrisboulton I would love this merged too. Having the ability to correctly monitor multiple spawned workers is just what I need. Can you please have a look?

@pierswarmers
Copy link

+1 this would be awesome to merge.

danhunsaker and others added 4 commits October 14, 2016 19:25
Specifically, this improves the multi-worker forking process by having the
parent process stick around and monitor the children.  If a child process
dies, and the supervisor has not been instructed to shut them all down, it
will spawn a new child to replace it.  It will pass signals down to all
children.  Currently, it only passes the signals explicitly watched by the
Worker class.  Children will be told that they have a parent process, and
check periodically whether the supervisor is still running.  This check is
performed at the beginning of reserve() to ensure orphaned workers handle
their condition properly.  That is, if the supervisor is gone, the child will
exit.  This is intended to allow users to kill the supervisor and
realistically expect that no new jobs will be processed until a new
supervisor (or single-worker instance) is started.

The other change, not related to worker-supervisor relationships,
nevertheless still involves forking.  Specifically, when a job process exits,
the exit code is already being handled if it is greater than zero.  However,
if it is zero, the code prior to this commit assumed the exit was performed
by PHP Resque itself.  Usually, this is correct.  The trouble comes when jobs
exit without specifying an exit status - PHP defaults these to 0, for
whatever reason.  In these cases, the job status is not properly updated.
Generally speaking, if you're writing jobs yourself - in their entirety -
they will responsibly *not* use exit() at any point.  But if your jobs use a
framework, the chances of an unexpected exit() causing issues go up immensely.
So, if the exit code is zero, and the job status hasn't been updated from
queued or running, this commit's code causes the status to be updated to
completed.  This assumes that such jobs are properly written to only use exit
status 0 for success, so some users might wish to change it to failed, but
that's not best practice in the slightest, so I wouldn't encourage that.

Signed-off-by: Daniel Hunsaker <danhunsaker@gmail.com>
Signed-off-by: Daniel Hunsaker <danhunsaker@gmail.com>
- Use `if` instead of logic operators
- Check for `false` explicitly
@danhunsaker danhunsaker closed this Dec 2, 2018
@danhunsaker danhunsaker deleted the improved-forking branch December 2, 2018 01:09
@danhunsaker
Copy link
Contributor Author

Applied in the new repo, resque/php-resque.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.