-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better subprocess handling #272
Conversation
When running a command, the runner process correctly waits for termination of that command, but the slave also needs to wait for the runner process. This adds a set of child pids that get waitpid'd on (with WNOHANG) every time a command is read.
Previously, the runner would kill the client process at_exit - this introduced a race condition where the client would sometimes get the async TERM signal before it could read the exit status, and exit with a failure status. While the reasoning behind killing the client is sound (if the runner dies without the client noticing, the client could hang around forever), it should only happen if there's an exceptional condition.
@antifuchs I just tested this locally (unfortunately after merging the request) and Does the current master branch work for you? |
I just tested current master, and it works for me without any problems. Argh, I think I know what the real problem is. Try inserting
on top of your zeus.rb - I suspect your code base doesn't load that file early enough (and I forgot to add that require). Sorry sorry sorry. |
It is worse for me after the change. Before, I would generally get in a day or two before zeus zombied a test runner and I would have to hard reset my mac before being able to run guard and/or rspec again. Now it happens within a few minutes. Is there any way to recover and be able to run rspec again without hard reseting my machine? So far everything I've tried ends up with zombies owned by launchd, which don't reap if I send launchd a HUP, and running rspec (or guard) by itself hangs and creates another zombie. The only way to recover is hard reset of machine, a normal restart hangs with a spinner on the gray shutdown screen, presumably because os x can't clean up the zombies either. |
I haven't been able to associate it with any particular trigger or action of mine so far. |
@lastobelus when you figure out how to (more or less reliably) trigger this, can you send me steps to reproduce this on an empty rails app? I don't see that behavior in my (non-rails) app, and don't know what to do to trigger it in a rails one. )-: |
I THINK it has only happened after a git pull, i.e. it probably is related to changing several files in a short period of time. |
Now I don't think it is associated with anything in particular, just happens randomly |
@lastobelus I've seen this happen on our own project a couple of times too, and mostly in connection with many files changing, especially files changing while a reload is happening - most steps will get stuck in "connecting" state and then just go incommunicado until you kill Zeus. I believe there might be some race condition here where zeus tries to restart a process it hasn't been able to properly set up yet. This might not be related to this pull request, but I'll debug it further this week. |
when it happens to you, does it leave permanent zombies? I can't even do a machine restart -- os x hangs after logging out, on the gray screen, because it can't kill the zeus zombie and I have to do a hard reset. |
@lastobelus - are you referring to zombies as processes that exited but haven't been reaped by their parent yet? If so, I can't really make sense of that problem description. )-: What does pstree show you? |
maybe zombie isn't the right word, I'm not much of a systems guy. This happens in conjunction with guard, and either zeus or both zeus & guard become unkillable and unreachable, so unkillable that os x hangs on a restart. Next time it happens I'll capture the output of pstree. |
Thanks for that, I was really confused about which zombies you meant there for a while. (-: One other thing that might be interesting would be to check what killing each of the "zeus" processes does - the zeus master process should exit cleanly when killed; I'm less sure about the "zeus slave" and "zeus client" processes though (they're ruby). Killing them with signal 9 should definitely clean them up, though. When this happens to you next time, can you try a Also, I am currently testing a fix for #320, which probably will affect your experience re. any hung processes. I'll ping you when I submit it for review. |
This addresses #252 and another issue where the zeus slave would leave zombie runner processes around (one per command ever having been run, until a restart happened).
I'm pretty confident that the fix for #252 is what we want.
The parent commit of that (waiting for runner processes) should also be desirable, but the technique it uses trades off implementation simplicity for precision: While this no longer leaves around O(times a command was run) zombies around, it now leaves around 1 zombie per distinct command per slave (because it reaps only after a new command was read from the command pipe). I'm pretty sure this is still better than what was there before, but would love a suggestion for a better technique that preserves the simplicity of that command loop and ends up working better than this fix (-: