janitor leaks processes, destabilizing the node #5877

Closed
BenTheElder opened this issue Dec 8, 2017 · 18 comments · Fixed by #5883
Labels: area/boskos, priority/critical-urgent, ¯\_(ツ)_/¯

Comments

@BenTheElder (Member) commented Dec 8, 2017

see: #5700 (comment)
additionally a previous issue: #4892

I'm looking into this.
/assign
/priority critical-urgent
/area boskos

@k8s-ci-robot added the priority/critical-urgent and area/boskos labels on Dec 8, 2017
@BenTheElder (Member, Author)

ps aux on the node gives a large number of entries like:
root 32762 0.0 0.0 0 0 ? Zs 14:08 0:01 [python2] <defunct>

which trace back to:

$ ps -p 32762 -o ppid
 PPID
31580
$ ps -p 31580
  PID TTY          TIME CMD
31580 ?        00:00:22 janitor

@BenTheElder (Member, Author)

    ├─pod86ecb194-da3b-11e7-962d-42010a800055
    │ ├─6ad84509b179480cb2cfe4d06430e44193e0d05db1eca530a7026f6a2ffdca0e
    │ │ └─31580 /janitor --service-account=/etc/service-account/service-account.json --resource-type=gce-project,gke-project,gci-qa-project,gpu-project --pool-size=20
    │ └─bc0a8f92ba6ecb05b6ea13c52e3dcc5cf9b1b74bbf78a128f2bfaceba60fb05c
    │   └─31362 /pause

@BenTheElder (Member, Author)

For context on how excessive this is:

$ ps aux | grep defunct | wc -l
20640

Still looking for the root cause; we call gcloud a lot, which I believe is also Python, so it's possible that gcloud has a bug instead...

@BenTheElder (Member, Author)

This node is now tainted (I should have done this earlier); we will need to un-taint it after debugging.

$ kubectl taint nodes gke-prow-pool-n1-highmem-8-81ce4395-83j6 key=value:NoSchedule
node "gke-prow-pool-n1-highmem-8-81ce4395-83j6" tainted

@BenTheElder (Member, Author)

What's weirder is that the zombie processes all seem to have the boskos janitor binary as their parent, which, as far as I can tell, only execs with .Run() and .CombinedOutput(...); both should ultimately call Wait(), so we shouldn't be seeing zombies. 😕
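For reference, the two patterns in question look roughly like this (the commands and flags below are placeholders, not the janitor's actual invocations); both Run() and CombinedOutput() block until the direct child exits and call Wait() internally, so the janitor's own children should be reaped:

package main

import (
    "log"
    "os/exec"
)

func main() {
    // Run() starts the command and blocks until it exits, reaping the
    // direct child in the process.
    if err := exec.Command("python", "janitor.py", "--help").Run(); err != nil {
        log.Printf("janitor.py failed: %v", err)
    }

    // CombinedOutput() likewise calls Wait() internally after collecting
    // stdout and stderr from the direct child.
    out, err := exec.Command("gcloud", "version").CombinedOutput()
    if err != nil {
        log.Printf("gcloud failed: %v", err)
    }
    log.Printf("gcloud version output:\n%s", out)
}

Note that waiting here only covers the direct children; any grandchildren that outlive their own parent are a different story.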

@krzyzacy (Member) commented Dec 9, 2017

hummmm, xref kubernetes/kubernetes#54313 - probably because of the million gcloud foo calls in the actual janitor script

@krzyzacy (Member) commented Dec 9, 2017

I can probably add a health check for each janitor pod

@BenTheElder (Member, Author) commented Dec 9, 2017

@krzyzacy we specifically need to avoid process exhaustion on the host, so we should make sure these child processes get cleaned up.
edit: in fact the health check is probably unnecessary since the pod gets restarted; the problem is that in the meantime this can hurt the node.

@krzyzacy (Member) commented Dec 9, 2017

@BenTheElder the node is still around? My current idea is that after each sync loop in the janitor, I can look for leaked gcloud processes and clean them up

@BenTheElder (Member, Author) commented Dec 9, 2017

@krzyzacy the node itself is currently fine, but the leak caused docker to restart, which broke jobs. I have the node quarantined for debugging since prow demand is currently low anyhow.

We can look for leaked processes and clean them up pretty easily, but this is a bit of a band-aid; they shouldn't be leaking to begin with.

@BenTheElder (Member, Author) commented Dec 9, 2017

Also, the rate at which they accumulate is pretty bad:

ps aux | wc -l
895
ps aux | grep defunct | wc -l
736

(this is after the pod has been restarted this evening)

@krzyzacy (Member) commented Dec 9, 2017

/shrug
obviously ssh doesn't work here :-(

@k8s-ci-robot added the ¯\_(ツ)_/¯ label on Dec 9, 2017
@krzyzacy (Member) commented Dec 9, 2017

hummmm, in janitor.py we are only using subprocess.check_call and subprocess.check_output, both of which should be blocking. Trying to find some reference online.

@BenTheElder (Member, Author)

> hummmm, in janitor.py we are only using subprocess.check_call and subprocess.check_output, both of which should be blocking. Trying to find some reference online.

Yeah, I'm starting to think gcloud might be leaking processes internally; I'm starting to look at building the "band-aid" in the Go janitor :(

@BenTheElder (Member, Author)

There is actually a small library specifically for this use case: https://github.com/ramr/go-reaper
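For illustration, a minimal sketch of how go-reaper could be wired into the janitor binary, based on my reading of that project's README (the import alias and Reap() entry point are assumptions taken from the README, not code we ship today):

package main

import (
    reaper "github.com/ramr/go-reaper"
)

func main() {
    // Run a background reaper that listens for SIGCHLD and wait()s on any
    // exited children, including orphans re-parented to us as PID 1.
    go reaper.Reap()

    // ... the janitor's normal sync loop would continue here ...
}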

@krzyzacy (Member) commented Dec 9, 2017

maybe we can also recycle the janitor pods (e.g. set a max lifetime of 1h for each instance), if that's easier

@BenTheElder (Member, Author)

"re-fixed" in #5885

@BenTheElder (Member, Author)

Additionally, this is not particularly new, but for reference the complete diagnosis and solution are:

  • zombie processes pile up, filling the PID space; this causes docker ps to fail, so the kubelet health check fails and decides to restart docker, which interrupts our jobs on the node.
  • we see lots of python2 <defunct> piling up on the machine, and suspect the janitor, since it appears to be the PPID of these zombie processes and we know it spawns janitor.py subprocesses with Python 2

BUT:

  • the janitor Go binary was previously the ENTRYPOINT in our image, making it PID 1 inside the container, where a normal Linux system (not in Docker) would have an init daemon instead
  • when the janitor spawns janitor.py processes, it seems to wait on them properly
  • these in turn spawn gcloud processes, which are waited on properly (as far as I can tell)
  • Fun fact: gcloud / the Google Cloud SDK runs as python2 as well
  • when a parent process aborts before its children exit, those children become orphans, which the kernel re-parents to PID 1
  • when an orphaned process exits, its entry remains in the kernel's process table because no parent ever wait(...)ed on it or set SIGCHLD to SIG_IGN, so these entries fill up on the machine; they also don't give us much to diagnose with, because the kernel isn't required to keep everything for them, just the pid -> exit status
  • all this means that if gcloud spawns a subprocess and then aborts, that subprocess is re-parented to the janitor binary, and we see dead python2 commands parented to janitor, causing our original suspicion
  • we wasted a little time searching for a bug in the janitor, not immediately realizing these could come from gcloud, because I didn't remember that gcloud is also python2.x

This is the "PID1 problem" for docker.

You can solve this by making sure your ENTRYPOINT is a suitable init daemon of varying sophistication; in our case we just need to make sure orphaned processes are reaped (wait(...)ed), so a shell is sufficient.
We do have to be careful, though, to make sure the shell doesn't outsmart us and directly exec the subprocess, replacing the shell (bash will do this if you run bash -c 'cmd').
On Alpine's /bin/sh this doesn't seem to be an issue, but for good measure I tossed in /bin/sh -c 'echo started && /bin/janitor "$@"' to make sure it can't just exec the command.

For other use cases, a real init process of varying complexity might be desired to cleanly shut down children (see the PID1 problem link), but we're already otherwise resilient to failed subprocess calls, so we really just want these dead process entries cleaned up. #5885 does this.
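To make the reaping step concrete, here is a minimal sketch (not the actual change in #5885, which just wraps the ENTRYPOINT in a shell) of what a PID 1 process needs to do to keep the process table clean: catch SIGCHLD and wait(...) on every exited child, including re-parented orphans.

package main

import (
    "os"
    "os/signal"
    "syscall"
)

// reapChildren collects any exited children so their entries leave the
// kernel process table instead of lingering as <defunct> zombies.
func reapChildren() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGCHLD)
    for range sigs {
        for {
            var status syscall.WaitStatus
            // WNOHANG: stop once no more exited children remain.
            pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
            if pid <= 0 || err != nil {
                break
            }
        }
    }
}

func main() {
    go reapChildren()
    // ... start and supervise the real workload here ...
    select {} // placeholder: block forever in this sketch
}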
