
Better subprocess handling #272

Merged 2 commits into burke:master on Mar 3, 2013

Conversation

antifuchs (Collaborator)

This addresses #252 and another issue where the zeus slave would leave zombie runner processes around (one per command ever run, until a restart happened).

I'm pretty confident that the fix for #252 is what we want.

The parent commit of that (waiting for runner processes) should also be desirable, but the technique it uses trades precision for implementation simplicity: while this no longer leaves O(number of times a command was run) zombies around, it now leaves one zombie per distinct command per slave (because it reaps only after a new command is read from the command pipe). I'm pretty sure this is still better than what was there before, but would love a suggestion for a better technique that preserves the simplicity of that command loop and ends up working better than this fix (-:

When running a command, the runner process correctly waits for
termination of that command, but the slave also needs to wait for the
runner process. This adds a set of child pids that get waitpid'd
on (with WNOHANG) every time a command is read.
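
A minimal sketch of that reaping scheme, assuming a hypothetical `tracked_pids` set kept by the slave (illustrative names, not the actual Zeus code):

```ruby
require 'set'

# Hypothetical bookkeeping for runner pids; names are illustrative only.
tracked_pids = Set.new

# Called once per command read from the command pipe.
reap_runners = lambda do
  tracked_pids.to_a.each do |pid|
    begin
      # WNOHANG makes waitpid return immediately: nil while the runner is
      # still alive, the pid once it has exited (and is thereby reaped).
      tracked_pids.delete(pid) if Process.waitpid(pid, Process::WNOHANG)
    rescue Errno::ECHILD
      tracked_pids.delete(pid) # already reaped elsewhere
    end
  end
end

# In the command loop (sketch):
#   command = command_pipe.readline
#   reap_runners.call
#   tracked_pids << fork { run(command) }
```

Because reaping only happens when a command arrives, the most recent runner for a command can linger as a zombie until the next command is read, which is the trade-off described above.
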
Previously, the runner would kill the client process at_exit - this
introduced a race condition where the client would sometimes get the
async TERM signal before it could read the exit status, and exit with
a failure status.

While the reasoning behind killing the client is sound (if the runner
dies without the client noticing, the client could hang around
forever), it should only happen if there's an exceptional condition.
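
A rough illustration of that last point (a sketch under assumed names, not the code in this pull request): the runner's at_exit hook would signal the client only when the runner is exiting abnormally.

```ruby
# Hypothetical helper; `client_pid` is an illustrative name for the pid of
# the connected client process.
def install_client_cleanup(client_pid)
  at_exit do
    # $! holds the exception (if any) driving this exit. Only TERM the client
    # on an abnormal exit, so a clean shutdown lets the client read the exit
    # status before any signal arrives.
    if $! && !$!.is_a?(SystemExit)
      begin
        Process.kill('TERM', client_pid)
      rescue Errno::ESRCH
        # client already gone; nothing to do
      end
    end
  end
end
```
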
turadg added a commit that referenced this pull request on Mar 3, 2013
turadg merged commit 383b443 into burke:master on Mar 3, 2013
@turadg (Collaborator) commented on Mar 3, 2013

@antifuchs I just tested this locally (unfortunately after merging the request) and zeus start doesn't work for me anymore. It puts all commands into the [crash] state and says to see their backtraces, but each command runs fine without Zeus.

Does the current master branch work for you?

@antifuchs (Collaborator, Author)

I just tested current master, and it works for me without any problems. Argh, I think I know what the real problem is. Try inserting

require 'set'

at the top of your zeus.rb; I suspect your code base doesn't load that file early enough (and I forgot to add that require). Sorry sorry sorry.
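
For context (assuming the new child-pid tracking builds a Set), Ruby of that era does not autoload Set, so the missing require surfaces as a NameError at the first use:

```ruby
Set.new          # NameError: uninitialized constant Set

require 'set'
Set.new          # => #<Set: {}>
```
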

@lastobelus (Contributor)

It is worse for me after the change. Before, I could usually go a day or two before zeus zombied a test runner and I had to hard reset my Mac before being able to run guard and/or rspec again.

Now it happens within a few minutes.

Is there any way to recover and be able to run rspec again without hard resetting my machine? So far everything I've tried ends up with zombies owned by launchd, which don't get reaped even if I send launchd a HUP, and running rspec (or guard) by itself hangs and creates another zombie. The only way to recover is a hard reset of the machine; a normal restart hangs with a spinner on the gray shutdown screen, presumably because OS X can't clean up the zombies either.

@lastobelus (Contributor)

I haven't been able to associate it with any particular trigger or action of mine so far.

@antifuchs (Collaborator, Author)

@lastobelus when you figure out how to (more or less reliably) trigger this, can you send me steps to reproduce this on an empty rails app? I don't see that behavior in my (non-rails) app, and don't know what to do to trigger it in a rails one. )-:

@lastobelus (Contributor)

I THINK it has only happened after a git pull, i.e. it probably is related to changing several files in a short period of time.

@lastobelus (Contributor)

Now I don't think it's associated with anything in particular; it just happens randomly.

@antifuchs (Collaborator, Author)

@lastobelus I've seen this happen on our own project a couple of times too, and mostly in connection with many files changing, especially files changing while a reload is happening - most steps will get stuck in "connecting" state and then just go incommunicado until you kill Zeus.

I believe there might be some race condition here where zeus tries to restart a process it hasn't been able to properly set up yet. This might not be related to this pull request, but I'll debug it further this week.

@lastobelus (Contributor)

When it happens to you, does it leave permanent zombies? I can't even do a machine restart: OS X hangs after logging out, on the gray screen, because it can't kill the Zeus zombie, and I have to do a hard reset.

@antifuchs (Collaborator, Author)

@lastobelus - are you referring to zombies as processes that exited but haven't been reaped by their parent yet? If so, I can't really make sense of that problem description. )-:

What does pstree show you?

@lastobelus (Contributor)

Maybe zombie isn't the right word; I'm not much of a systems guy. This happens in conjunction with guard, and either zeus or both zeus & guard become unkillable and unreachable, so unkillable that OS X hangs on a restart. Next time it happens I'll capture the output of pstree.

@antifuchs (Collaborator, Author)

Thanks for that, I was really confused about which zombies you meant there for a while. (-:

One other thing that might be interesting would be to check what killing each of the "zeus" processes does - the zeus master process should exit cleanly when killed; I'm less sure about the "zeus slave" and "zeus client" processes though (they're ruby). Killing them with signal 9 should definitely clean them up, though. When this happens to you next time, can you try a pkill -9 -f zeus as well?

Also, I am currently testing a fix for #320, which will probably affect your experience with any hung processes. I'll ping you when I submit it for review.
