janitor leaks processes, destabilizing the node #5877

Closed
BenTheElder opened this issue Dec 8, 2017 · 18 comments · Fixed by #5883
Labels: area/boskos, priority/critical-urgent, ¯\_(ツ)_/¯

Comments

@BenTheElder (Member) commented Dec 8, 2017

see: #5700 (comment)
additionally a previous issue: #4892

I'm looking into this.
/assign
/priority critical-urgent
/area boskos

@k8s-ci-robot added the priority/critical-urgent and area/boskos labels on Dec 8, 2017
@BenTheElder (Member, Author)

ps aux on the node gives a large number of entries like:
root 32762 0.0 0.0 0 0 ? Zs 14:08 0:01 [python2] <defunct>

which trace back to:

$ ps -p 32762 -o ppid
 PPID
31580
$ ps -p 31580
  PID TTY          TIME CMD
31580 ?        00:00:22 janitor

@BenTheElder (Member, Author)

    ├─pod86ecb194-da3b-11e7-962d-42010a800055
    │ ├─6ad84509b179480cb2cfe4d06430e44193e0d05db1eca530a7026f6a2ffdca0e
    │ │ └─31580 /janitor --service-account=/etc/service-account/service-account.json --resource-type=gce-project,gke-project,gci-qa-project,gpu-project --pool-size=20
    │ └─bc0a8f92ba6ecb05b6ea13c52e3dcc5cf9b1b74bbf78a128f2bfaceba60fb05c
    │   └─31362 /pause

@BenTheElder (Member, Author)

For context on how excessive this is:

$ ps aux | grep defunct | wc -l
20640

Still looking for the root cause; we call gcloud a lot, which I believe is also Python, so it's possible that gcloud has a bug instead...

@BenTheElder (Member, Author)

This node is now tainted (I should have done this earlier); we will need to un-taint it after debugging.

$ kubectl taint nodes gke-prow-pool-n1-highmem-8-81ce4395-83j6 key=value:NoSchedule
node "gke-prow-pool-n1-highmem-8-81ce4395-83j6" tainted

@BenTheElder (Member, Author)

What's weirder is that the zombie processes all seem to have the boskos janitor binary as their parent, which, as far as I can tell, only execs with .Run() and .CombinedOutput(...); both should ultimately call Wait(), so we shouldn't be seeing zombies. 😕
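For reference, the two patterns in question look roughly like this (the commands and flags below are placeholders, not the janitor's actual invocations); both Run() and CombinedOutput() block until the direct child exits and call Wait() internally, so the janitor's own children should be reaped:

package main

import (
    "log"
    "os/exec"
)

func main() {
    // Run() starts the command and blocks until it exits, reaping the
    // direct child in the process.
    if err := exec.Command("python", "janitor.py", "--help").Run(); err != nil {
        log.Printf("janitor.py failed: %v", err)
    }

    // CombinedOutput() likewise calls Wait() internally after collecting
    // stdout and stderr from the direct child.
    out, err := exec.Command("gcloud", "version").CombinedOutput()
    if err != nil {
        log.Printf("gcloud failed: %v", err)
    }
    log.Printf("gcloud version output:\n%s", out)
}

Note that waiting here only covers the direct children; any grandchildren that outlive their own parent are a different story.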

@krzyzacy (Member) commented Dec 9, 2017

hummmm, xref kubernetes/kubernetes#54313 - probably because of the million gcloud foo calls in the actual janitor script

@krzyzacy (Member) commented Dec 9, 2017

I can probably add a health check for each janitor pod

@BenTheElder (Member, Author) commented Dec 9, 2017

@krzyzacy we specifically need to avoid process exhaustion on the host, so we should make sure these child processes get cleaned up.
edit: in fact the health check is probably unnecessary since the pod gets restarted; the problem is that in the meantime this can hurt the node.

@krzyzacy (Member) commented Dec 9, 2017

@BenTheElder the node is still around? My current idea is that after each sync loop in the janitor, I can look for leaked gcloud processes and clean them up

@BenTheElder (Member, Author) commented Dec 9, 2017

@krzyzacy the node itself is currently fine, but the leak caused docker to restart, which broke jobs. I have the node quarantined for debugging since prow demand is currently low anyhow.

We can look for leaked processes and clean them up pretty easily, but this is a bit of a band-aid; they shouldn't be leaking to begin with.

@BenTheElder (Member, Author) commented Dec 9, 2017

Also, the rate at which they accumulate is pretty bad:

ps aux | wc -l
895
ps aux | grep defunct | wc -l
736

(this is after the pod has been restarted this evening)

@krzyzacy (Member) commented Dec 9, 2017

/shrug
obviously ssh doesn't work here :-(

@k8s-ci-robot added the ¯\_(ツ)_/¯ label on Dec 9, 2017
@krzyzacy (Member) commented Dec 9, 2017

hummmm, in janitor.py we are only using subprocess.check_call and subprocess.check_output, both of which should be blocking. Trying to find some reference online.

@BenTheElder (Member, Author)

> hummmm, in janitor.py we are only using subprocess.check_call and subprocess.check_output, both of which should be blocking. Trying to find some reference online.

Yeah, I'm starting to think gcloud might be leaking processes internally; I'm starting to look at building the "band-aid" in the Go janitor :(

@BenTheElder (Member, Author)

There is actually a small library specifically for this use case: https://github.com/ramr/go-reaper
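For illustration, a minimal sketch of how go-reaper could be wired into the janitor binary, based on my reading of that project's README (the import alias and Reap() entry point are assumptions taken from the README, not code we ship today):

package main

import (
    reaper "github.com/ramr/go-reaper"
)

func main() {
    // Run a background reaper that listens for SIGCHLD and wait()s on any
    // exited children, including orphans re-parented to us as PID 1.
    go reaper.Reap()

    // ... the janitor's normal sync loop would continue here ...
}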

@krzyzacy (Member) commented Dec 9, 2017

maybe we can also recycle the janitor pods (e.g. set a max lifetime of 1h for each instance), if that's easier

@BenTheElder (Member, Author)

"re-fixed" in #5885

@BenTheElder (Member, Author)

Additionally, this is not particularly new, but for reference the complete diagnosis and solution are:

  • zombie processes pile up, filling the PID space; this causes docker ps to fail, so the kubelet health check fails and decides to restart docker, which interrupts our jobs on the node.
  • we see lots of python2 <defunct> piling up on the machine, and suspect the janitor, since it appears to be the PPID of these zombie processes and we know it spawns janitor.py subprocesses with Python 2

BUT:

  • the janitor Go binary was previously the ENTRYPOINT in our image, making it PID 1 inside the container, where a normal Linux system (not in Docker) would have an init daemon instead
  • when the janitor spawns janitor.py processes, it seems to wait on them properly
  • these in turn spawn gcloud processes, which are waited on properly (as far as I can tell)
  • Fun fact: gcloud / the Google Cloud SDK runs as python2 as well
  • when a parent process aborts before its children exit, those children become orphans, which the kernel re-parents to PID 1
  • when an orphaned process exits, its entry remains in the kernel's process table because no parent ever wait(...)ed on it or set SIGCHLD to SIG_IGN, so these entries fill up on the machine; they also don't give us much to diagnose with, because the kernel isn't required to keep everything for them, just the pid -> exit status
  • all this means that if gcloud spawns a subprocess and then aborts, that subprocess is re-parented to the janitor binary, and we see dead python2 commands parented to janitor, causing our original suspicion
  • we wasted a little time searching for a bug in the janitor, not immediately realizing these could come from gcloud, because I didn't remember that gcloud is also python2.x

This is the "PID1 problem" for docker.

You can solve this by making sure your ENTRYPOINT is a suitable init daemon of varying sophistication; in our case we just need to make sure orphaned processes are reaped (wait(...)ed), so a shell is sufficient.
We do have to be careful, though, to make sure the shell doesn't outsmart us and directly exec the subprocess, replacing the shell (bash will do this if you run bash -c 'cmd').
On Alpine's /bin/sh this doesn't seem to be an issue, but for good measure I tossed in /bin/sh -c 'echo started && /bin/janitor "$@"' to make sure it can't just exec the command.

For other use cases, a real init process of varying complexity might be desired to cleanly shut down children (see the PID1 problem link), but we're already otherwise resilient to failed subprocess calls, so we really just want these dead process entries cleaned up. #5885 does this.
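To make the reaping step concrete, here is a minimal sketch (not the actual change in #5885, which just wraps the ENTRYPOINT in a shell) of what a PID 1 process needs to do to keep the process table clean: catch SIGCHLD and wait(...) on every exited child, including re-parented orphans.

package main

import (
    "os"
    "os/signal"
    "syscall"
)

// reapChildren collects any exited children so their entries leave the
// kernel process table instead of lingering as <defunct> zombies.
func reapChildren() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGCHLD)
    for range sigs {
        for {
            var status syscall.WaitStatus
            // WNOHANG: stop once no more exited children remain.
            pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
            if pid <= 0 || err != nil {
                break
            }
        }
    }
}

func main() {
    go reapChildren()
    // ... start and supervise the real workload here ...
    select {} // placeholder: block forever in this sketch
}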
