Improved Worker Reliability #83

danhunsaker · 2013-01-24T21:46:51Z

Specifically, this improves the multi-worker forking process by having the
parent process stick around and monitor the children. If a child process
dies, and the supervisor has not been instructed to shut them all down, it
will spawn a new child to replace it. It will pass signals down to all
children. Currently, it only passes the signals explicitly watched by the
Worker class. Children will be told that they have a parent process, and
check periodically whether the supervisor is still running. This check is
performed at the beginning of reserve() to ensure orphaned workers handle
their condition properly. That is, if the supervisor is gone, the child will
exit. This is intended to allow users to kill the supervisor and
realistically expect that no new jobs will be processed until a new
supervisor (or single-worker instance) is started.

The other change, not related to worker-supervisor relationships,
nevertheless still involves forking. Specifically, when a job process exits,
the exit code is already being handled if it is greater than zero. However,
if it is zero, the code prior to this commit assumed the exit was performed
by PHP Resque itself. Usually, this is correct. The trouble comes when jobs
exit without specifying an exit status - PHP defaults these to 0, for
whatever reason. In these cases, the job status is not properly updated.
Generally speaking, if you're writing jobs yourself - in their entirety -
they will responsibly not use exit() at any point. But if your jobs use a
framework, the chances of an unexpected exit() causing issues go up immensely.
So, if the exit code is zero, and the job status hasn't been updated from
queued or running, this commit's code causes the status to be updated to
completed. This assumes that such jobs are properly written to only use exit
status 0 for success, so some users might wish to change it to failed, but
that's not best practice in the slightest, so I wouldn't encourage that.

Signed-off-by: Daniel Hunsaker danhunsaker@gmail.com

tomschlick · 2013-02-25T17:49:53Z

Any word on the possibility of this getting merged? Seems like a very good patch and could help solve the painful issue of restarting many workers when you deploy...

ShonM · 2013-02-25T17:59:25Z

The first part seems pretty close to https://github.com/ebernhardson/php-resque-pool

Rebooting all workers would be as simple as kill -HUP ps -A|awk '/[r]esque-pool/{print $1}'``

danhunsaker · 2013-02-27T21:30:08Z

@ShonM: It is certainly similar, but there are some differences, some of which are significant. For one, php-resque-pool acts as a wrapper process around the PHP Resque library, while this pull request integrates this kind of functionality into the PHP Resque library itself. This is significant because PHP Resque already did most of this anyway - the changes I've introduced here simply complete this functionality.

This in no way affects the usefulness of php-resque-pool, however, because it still does a couple of things that PHP Resque still doesn't natively handle. Specifically, the ability to use a YAML document to configure a worker pool, and the ability to specify multiple queue groups in the same pool (for example, 1 worker on one queue, another 1 on a second queue, and 4 on every queue - this is three separate queue groups in a single worker pool). Each pool created using only the native support can only operate on one queue group at a time, and the configuration can only be set via the environment.

I could also comment on how variable the definition of "as simple as" tends to be, but that would just be trolling, so I'll refrain from further comment on that. :)

Ultimately, though, the decision of whether to include this patch will probably come down to whether PHP Resque is intended to operate identically to its Ruby counterpart, or if it is seen as independent of that project and should stand on its own. (And probably also on whether or not the Ruby version handles multiple workers natively...) That's a factor I can't speak to - only @chrisboulton can make that call.

jbrower · 2013-03-04T18:02:24Z

I'd like to see this get merged in. While php-resque-pool works, it's a distinct project that may not be as well supported as this one.

danhunsaker · 2013-03-07T03:44:44Z

Had a bit of a scare earlier today when updating our internal version of PHP-Resque to the latest while still preserving the modifications here. (We also have reasons to forbid the use of Composer, so I had to add the require_once() logic manually, but that part wasn't too difficult...) It seems one of our working branches got corrupted with a prior attempt at what eventually became this branch, and it was still firing and responding to SIGALRM.

This is a problem because PHP doesn't resume execution at the point where the alarm signal was received. At least, it doesn't do so properly. So the child forks created to execute jobs in a safe space ended up never terminating - or, indeed, completing their assigned tasks - and very quickly things were so broken they were useless. Took a while to track down the reversion, but once removed, things resumed their former solid operation.

Just to be safe, I checked that the same reversion hadn't made its way into this PR. Luckily, it hadn't, but I had forgotten to remove an output coloration artifact. I have removed that now, and updated the PR to merge in the latest content of master. Hopefully, this will make approving the merge possible without any conflicts, as it shouldn't be more than a fast-forward merge. 😄

danhunsaker · 2013-05-12T19:25:18Z

Updated again with latest changes upstream. Had to resolve a merge conflict involving the PREFIX env var addition, so I figured I'd push the merge commit to keep the pull simple, should this PR get merged.

danhunsaker · 2013-08-14T18:42:08Z

Following up again so this doesn't get lost completely in the shuffle of issues and PRs. This code has been in use, in production, for months now - at least as long as the PR has existed, and IIRC, somewhat longer - with no negative effects. So, yeah, just curious what the verdict on this kind of functionality will be.

Dynom · 2013-08-30T07:12:04Z

I'd like to see this in too, what do you think @chrisboulton ?

👍

Aeon · 2013-10-15T21:39:57Z

👍 please consider merging this...

damncabbage · 2013-10-25T15:47:58Z

A side-benefit of getting this merged would be the possibility of being able to run this via Foreman.

(Foreman, like Monit, thinks the main process has gone away and bails immediately on a foreman start.)

mjsilva · 2014-04-07T02:43:04Z

Hey guys I just stumble into this problem when I tried to supervise php resque consumers with threads (COUNT=>1).

Is there any plans to merge this? Or any interim solution?

Thanks!

parkan · 2015-01-03T22:36:45Z

+1 on this, trying to use php-resque on heroku and experiencing the same problem

parkan · 2015-01-03T23:30:16Z

Hmm I am seeing

PHP Parse error: syntax error, unexpected '{' in [..]/php-resque/bin/resque on line 127

bin/ doesn't seem to be tested, so the ci/unit tests pass even though this fails

problem is at HERE--> file_put_contents($PIDFILE, getmypid()) or HERE-->{...

I don't think "or {..." is valid syntax in any recent php version.

danhunsaker · 2015-01-04T06:22:07Z

Hrm. Not sure what I was thinking, there... Must have recently been working with JavaScript, I suppose.

As to bin/resque not being tested, it was always intended to be an example script upon which to build your own, so testing it does feel a little ... superfluous. Of course, it's almost never been treated as an example by users of the project, so we submit patches to it that make it more robust for most uses, and reduce its example status ever further with each one. So it might be a good idea to have tests for it as well, though it would likely need to be reworked some to be made more (or even actually) testable.

robertheydenrych · 2015-07-08T13:51:29Z

Any update on this? Would really love to see this merged!

dkcwd · 2015-07-23T04:03:40Z

It would really be great to see this merged.
👍

sleenen · 2015-09-19T11:45:27Z

@chrisboulton I would love this merged too. Having the ability to correctly monitor multiple spawned workers is just what I need. Can you please have a look?

pierswarmers · 2016-02-12T00:16:16Z

+1 this would be awesome to merge.

Specifically, this improves the multi-worker forking process by having the parent process stick around and monitor the children. If a child process dies, and the supervisor has not been instructed to shut them all down, it will spawn a new child to replace it. It will pass signals down to all children. Currently, it only passes the signals explicitly watched by the Worker class. Children will be told that they have a parent process, and check periodically whether the supervisor is still running. This check is performed at the beginning of reserve() to ensure orphaned workers handle their condition properly. That is, if the supervisor is gone, the child will exit. This is intended to allow users to kill the supervisor and realistically expect that no new jobs will be processed until a new supervisor (or single-worker instance) is started. The other change, not related to worker-supervisor relationships, nevertheless still involves forking. Specifically, when a job process exits, the exit code is already being handled if it is greater than zero. However, if it is zero, the code prior to this commit assumed the exit was performed by PHP Resque itself. Usually, this is correct. The trouble comes when jobs exit without specifying an exit status - PHP defaults these to 0, for whatever reason. In these cases, the job status is not properly updated. Generally speaking, if you're writing jobs yourself - in their entirety - they will responsibly *not* use exit() at any point. But if your jobs use a framework, the chances of an unexpected exit() causing issues go up immensely. So, if the exit code is zero, and the job status hasn't been updated from queued or running, this commit's code causes the status to be updated to completed. This assumes that such jobs are properly written to only use exit status 0 for success, so some users might wish to change it to failed, but that's not best practice in the slightest, so I wouldn't encourage that. Signed-off-by: Daniel Hunsaker <danhunsaker@gmail.com>

Signed-off-by: Daniel Hunsaker <danhunsaker@gmail.com>

- Use `if` instead of logic operators - Check for `false` explicitly

danhunsaker · 2018-12-11T03:27:18Z

Applied in the new repo, resque/php-resque.

danhunsaker mentioned this pull request Jan 24, 2013

Process exited with status 0 , running COUNT > 1 on heroku worker process #55

Open

jbrower mentioned this pull request Jan 28, 2013

init.d script #15

Closed

danhunsaker mentioned this pull request Oct 14, 2013

Running multiple workers with monit #138

Closed

danhunsaker mentioned this pull request Nov 25, 2013

Potentially unexpected behavior in tearDown with exit #148

Open

This was referenced May 14, 2014

Write child worker PID to PIDFILE for multiple workers #187

Open

Worker pcntl_wait to non-blocking loop #189

Open

parkan mentioned this pull request Jan 3, 2015

Improved forking parkan/php-resque#1

Merged

danhunsaker force-pushed the improved-forking branch from 789364f to 0a374d8 Compare October 14, 2016 23:21

danhunsaker and others added 4 commits October 14, 2016 19:25

Removed output coloration artifact from custom use case implementation.

24cd185

Signed-off-by: Daniel Hunsaker <danhunsaker@gmail.com>

Fix {/( block syntax error

7b28567

Cleaner PIDFILE success check

8968e8a

- Use `if` instead of logic operators - Check for `false` explicitly

danhunsaker force-pushed the improved-forking branch from 0a374d8 to 8968e8a Compare October 15, 2016 01:35

jradtilbrook mentioned this pull request Oct 23, 2017

Danhunsaker improved forking #356

Closed

danhunsaker closed this Dec 2, 2018

danhunsaker deleted the improved-forking branch December 2, 2018 01:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved Worker Reliability #83

Improved Worker Reliability #83

danhunsaker commented Jan 24, 2013

tomschlick commented Feb 25, 2013

ShonM commented Feb 25, 2013

danhunsaker commented Feb 27, 2013

jbrower commented Mar 4, 2013

danhunsaker commented Mar 7, 2013

danhunsaker commented May 12, 2013

danhunsaker commented Aug 14, 2013

Dynom commented Aug 30, 2013

Aeon commented Oct 15, 2013

damncabbage commented Oct 25, 2013

mjsilva commented Apr 7, 2014

parkan commented Jan 3, 2015

parkan commented Jan 3, 2015

danhunsaker commented Jan 4, 2015

robertheydenrych commented Jul 8, 2015

dkcwd commented Jul 23, 2015

sleenen commented Sep 19, 2015

pierswarmers commented Feb 12, 2016

danhunsaker commented Dec 11, 2018

Improved Worker Reliability #83

Improved Worker Reliability #83

Conversation

danhunsaker commented Jan 24, 2013

tomschlick commented Feb 25, 2013

ShonM commented Feb 25, 2013

danhunsaker commented Feb 27, 2013

jbrower commented Mar 4, 2013

danhunsaker commented Mar 7, 2013

danhunsaker commented May 12, 2013

danhunsaker commented Aug 14, 2013

Dynom commented Aug 30, 2013

Aeon commented Oct 15, 2013

damncabbage commented Oct 25, 2013

mjsilva commented Apr 7, 2014

parkan commented Jan 3, 2015

parkan commented Jan 3, 2015

danhunsaker commented Jan 4, 2015

robertheydenrych commented Jul 8, 2015

dkcwd commented Jul 23, 2015

sleenen commented Sep 19, 2015

pierswarmers commented Feb 12, 2016

danhunsaker commented Dec 11, 2018