
Better subprocess handling #272

Merged 2 commits into burke:master on Mar 3, 2013

Conversation

antifuchs (Collaborator)

This addresses #252 and another issue where the zeus slave would leave zombie runner processes around (one per command ever run, until a restart happened).

I'm pretty confident that the fix for #252 is what we want.

The parent commit of that (waiting for runner processes) should also be desirable, but the technique it uses trades precision for implementation simplicity: while this no longer leaves O(number of times a command was run) zombies around, it now leaves one zombie per distinct command per slave (because it reaps only after a new command is read from the command pipe). I'm pretty sure this is still better than what was there before, but would love a suggestion for a better technique that preserves the simplicity of that command loop and ends up working better than this fix (-:

When running a command, the runner process correctly waits for
termination of that command, but the slave also needs to wait for the
runner process. This adds a set of child pids that get waitpid'd
on (with WNOHANG) every time a command is read.
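
A minimal sketch of that reaping scheme, assuming a hypothetical `tracked_pids` set kept by the slave (illustrative names, not the actual Zeus code):

```ruby
require 'set'

# Hypothetical bookkeeping for runner pids; names are illustrative only.
tracked_pids = Set.new

# Called once per command read from the command pipe.
reap_runners = lambda do
  tracked_pids.to_a.each do |pid|
    begin
      # WNOHANG makes waitpid return immediately: nil while the runner is
      # still alive, the pid once it has exited (and is thereby reaped).
      tracked_pids.delete(pid) if Process.waitpid(pid, Process::WNOHANG)
    rescue Errno::ECHILD
      tracked_pids.delete(pid) # already reaped elsewhere
    end
  end
end

# In the command loop (sketch):
#   command = command_pipe.readline
#   reap_runners.call
#   tracked_pids << fork { run(command) }
```

Because reaping only happens when a command arrives, the most recent runner for a command can linger as a zombie until the next command is read, which is the trade-off described above.
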
Previously, the runner would kill the client process at_exit - this
introduced a race condition where the client would sometimes get the
async TERM signal before it could read the exit status, and exit with
a failure status.

While the reasoning behind killing the client is sound (if the runner
dies without the client noticing, the client could hang around
forever), it should only happen if there's an exceptional condition.
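
A rough illustration of that last point (a sketch under assumed names, not the code in this pull request): the runner's at_exit hook would signal the client only when the runner is exiting abnormally.

```ruby
# Hypothetical helper; `client_pid` is an illustrative name for the pid of
# the connected client process.
def install_client_cleanup(client_pid)
  at_exit do
    # $! holds the exception (if any) driving this exit. Only TERM the client
    # on an abnormal exit, so a clean shutdown lets the client read the exit
    # status before any signal arrives.
    if $! && !$!.is_a?(SystemExit)
      begin
        Process.kill('TERM', client_pid)
      rescue Errno::ESRCH
        # client already gone; nothing to do
      end
    end
  end
end
```
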
turadg added a commit that referenced this pull request on Mar 3, 2013
turadg merged commit 383b443 into burke:master on Mar 3, 2013
@turadg (Collaborator) commented on Mar 3, 2013

@antifuchs I just tested this locally (unfortunately after merging the request) and zeus start doesn't work for me anymore. It puts all commands into the [crash] state and says to see their backtraces, but each command runs fine without Zeus.

Does the current master branch work for you?

@antifuchs (Collaborator, Author)

I just tested current master, and it works for me without any problems. Argh, I think I know what the real problem is. Try inserting

require 'set'

at the top of your zeus.rb; I suspect your code base doesn't load that file early enough (and I forgot to add that require). Sorry sorry sorry.
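
For context (assuming the new child-pid tracking builds a Set), Ruby of that era does not autoload Set, so the missing require surfaces as a NameError at the first use:

```ruby
Set.new          # NameError: uninitialized constant Set

require 'set'
Set.new          # => #<Set: {}>
```
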

@lastobelus (Contributor)

It is worse for me after the change. Before, I could usually go a day or two before zeus zombied a test runner and I had to hard reset my Mac before being able to run guard and/or rspec again.

Now it happens within a few minutes.

Is there any way to recover and be able to run rspec again without hard resetting my machine? So far everything I've tried ends up with zombies owned by launchd, which don't get reaped even if I send launchd a HUP, and running rspec (or guard) by itself hangs and creates another zombie. The only way to recover is a hard reset of the machine; a normal restart hangs with a spinner on the gray shutdown screen, presumably because OS X can't clean up the zombies either.

@lastobelus (Contributor)

I haven't been able to associate it with any particular trigger or action of mine so far.

@antifuchs (Collaborator, Author)

@lastobelus when you figure out how to (more or less reliably) trigger this, can you send me steps to reproduce this on an empty rails app? I don't see that behavior in my (non-rails) app, and don't know what to do to trigger it in a rails one. )-:

@lastobelus (Contributor)

I THINK it has only happened after a git pull, i.e. it probably is related to changing several files in a short period of time.

@lastobelus (Contributor)

Now I don't think it's associated with anything in particular; it just happens randomly.

@antifuchs (Collaborator, Author)

@lastobelus I've seen this happen on our own project a couple of times too, and mostly in connection with many files changing, especially files changing while a reload is happening - most steps will get stuck in "connecting" state and then just go incommunicado until you kill Zeus.

I believe there might be some race condition here where zeus tries to restart a process it hasn't been able to properly set up yet. This might not be related to this pull request, but I'll debug it further this week.

@lastobelus (Contributor)

When it happens to you, does it leave permanent zombies? I can't even do a machine restart: OS X hangs after logging out, on the gray screen, because it can't kill the Zeus zombie, and I have to do a hard reset.

@antifuchs (Collaborator, Author)

@lastobelus - are you referring to zombies as processes that exited but haven't been reaped by their parent yet? If so, I can't really make sense of that problem description. )-:

What does pstree show you?

@lastobelus (Contributor)

Maybe zombie isn't the right word; I'm not much of a systems guy. This happens in conjunction with guard, and either zeus or both zeus & guard become unkillable and unreachable, so unkillable that OS X hangs on a restart. Next time it happens I'll capture the output of pstree.

@antifuchs (Collaborator, Author)

Thanks for that, I was really confused about which zombies you meant there for a while. (-:

One other thing that might be interesting would be to check what killing each of the "zeus" processes does - the zeus master process should exit cleanly when killed; I'm less sure about the "zeus slave" and "zeus client" processes though (they're ruby). Killing them with signal 9 should definitely clean them up, though. When this happens to you next time, can you try a pkill -9 -f zeus as well?

Also, I am currently testing a fix for #320, which will probably affect your experience with any hung processes. I'll ping you when I submit it for review.
