Reap runner and sub-slave processes in slaves.
When running a command, the runner process correctly waits for
termination of that command, but the slave also needs to wait for the
runner process. This adds a set of child pids that get waitpid'd
on (with WNOHANG) every time a command is read.
antifuchs committed Mar 3, 2013
1 parent 7a0de42 commit fed0652
Showing 1 changed file with 13 additions and 2 deletions.
rubygem/lib/zeus.rb
@@ -58,15 +58,26 @@ def go(identifier=:boot)
Thread.new { notify_features(feature_pipe_w, features) }

# We are now 'connected'. From this point, we may receive requests to fork.
children = Set.new
loop do
messages = local.recv(2**16)

# Reap any child runners or slaves that might have exited in
# the meantime. Note that reaping them like this can leave <=1
# zombie process per slave around while the slave waits for a
# new command.
children.each do |pid|
children.delete(pid) if Process.waitpid(pid, Process::WNOHANG)
end

messages.split("\0").each do |new_identifier|
new_identifier =~ /^(.):(.*)/
code, ident = $1, $2
pid = nil
if code == "S"
- fork { go(ident.to_sym) }
+ children << fork { go(ident.to_sym) }
else
- fork { command(ident.to_sym, local) }
+ children << fork { command(ident.to_sym, local) }
end
end
end
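
The reaping pass in this diff can be tried in isolation. The following is a minimal sketch, not code from the commit: `reap_exited` is a hypothetical helper name, `delete_if` is swapped in for the `each`/`delete` pair, and the `Errno::ECHILD` rescue is an added assumption about how one might handle a pid that is no longer a child of the process.

```ruby
require 'set'

# Hypothetical helper, not part of the commit: forget every tracked pid whose
# process has already exited, without blocking on those still running.
def reap_exited(children)
  children.delete_if do |pid|
    begin
      # With WNOHANG, waitpid returns nil while the child is still running
      # and the child's pid once it has exited (which also reaps it).
      Process.waitpid(pid, Process::WNOHANG)
    rescue Errno::ECHILD
      # No such child (for example, already reaped elsewhere): stop tracking.
      true
    end
  end
end

children = Set.new
children << fork { exit!(0) } # stand-in for a runner that finishes at once

# Poll until the child has been reaped; a real slave would instead make one
# pass each time it reads a command.
until children.empty?
  reap_exited(children)
  sleep 0.05
end
```

Because the pass only runs when a command arrives, a child that exits between two commands stays unreaped until the next one, which is the leftover zombie the code comment in the diff mentions.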

2 comments on commit fed0652

@metcalf
Collaborator

@metcalf metcalf commented on fed0652 Aug 1, 2016

@antifuchs, I was digging around Zeus a bit more and got confused by this change. Why do we want the runner waiting on its children to exit? The comment also mentions "reaping" but I don't see anything getting killed or otherwise cleaned up.

@antifuchs
Collaborator Author

I have a hazy memory of this, but on shared machines, we'd run out of PIDs with long-running zeuses that left zombie processes around. This is because the various processes get killed by the Go portion of the code as reloads happen, but the Ruby processes never retrieved their children's corpses (via the nohang-ed waitpid there). Those stick around and continue to consume one process table entry per child, until you kill the entire zeus process tree (or the boot entry gets restarted, which ~never happens, AIUI).
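
To make the distinction between reaping and killing concrete, here is a small sketch that is not taken from Zeus. `process_entry_exists?` and `demonstrate_reaping` are hypothetical names. It assumes a POSIX system on which signal 0 can still be delivered to an exited but unreaped child. The child exits by itself; the parent's `wait2` call only collects its exit status, and that is what frees the process table entry.

```ruby
# Hypothetical helper: signal 0 delivers nothing, it only checks that the pid
# still has a process table entry we are allowed to signal.
def process_entry_exists?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false
end

# Hypothetical demo: returns [entry_before_reap, exit_status, entry_after_reap].
def demonstrate_reaping
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    exit!(3) # the child ends by itself; nobody kills it
  end
  writer.close
  reader.read # returns at EOF, once the exiting child's write end is closed
  reader.close

  before = process_entry_exists?(pid)

  # Non-blocking reap: wait2 returns nil until the child has exited, then
  # [pid, status]. Collecting the status releases the table entry.
  result = nil
  until (result = Process.wait2(pid, Process::WNOHANG))
    sleep 0.05
  end
  _, status = result

  [before, status.exitstatus, process_entry_exists?(pid)]
end

p demonstrate_reaping
```

The first element reports whether the entry was still present before the reap, the second is the exit status the parent collected, and the third reports whether the entry remained afterwards.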