janitor leaks processes, destabilizing the node #5877
which trace back to:
For context on how excessive this is:
still looking for the root cause, we call …
This node is now tainted (should have done this earlier); we will need to un-taint it after debugging.
What's weirder is that the zombie processes all seem to have the boskos janitor binary as their parent, which only execs with …
hummmm, xref kubernetes/kubernetes#54313 - probably because of the million …
I can probably add some health check for each janitor pod.
@krzyzacy we specifically need to avoid process exhaustion on the host, so we should make sure these child processes get cleaned up.
@BenTheElder is the node still around? My current idea: after each sync loop in the janitor, I can look for leaked gcloud processes and clean them up.
@krzyzacy the node itself is currently fine, but it caused docker to restart, which broke jobs. I have the node quarantined for debugging since prow demand is low right now anyhow. We can look for leaked processes and clean them up pretty easily, but that is a bit of a bandaid; they shouldn't be leaking in the first place.
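For illustration only, a minimal Go sketch of that kind of sweep, assuming a Linux /proc filesystem (the function name and logging are made up here, and this is not what the janitor actually does): find direct children of the current process that are in the zombie ("Z") state and reap them with a non-blocking wait.

```go
// A hedged sketch of sweeping for leaked child processes after each sync
// loop. Assumes a Linux /proc filesystem; not the actual janitor code.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

// sweepLeakedChildren looks for direct children of this process that are
// zombies ("Z" state) and reaps them with a non-blocking wait.
func sweepLeakedChildren() {
	self := os.Getpid()
	statFiles, err := filepath.Glob("/proc/[0-9]*/stat")
	if err != nil {
		return
	}
	for _, statPath := range statFiles {
		data, err := os.ReadFile(statPath)
		if err != nil {
			continue // process may have exited already
		}
		// /proc/<pid>/stat looks like: "pid (comm) state ppid ..."
		// comm may contain spaces, so split after the closing paren.
		s := string(data)
		end := strings.LastIndex(s, ")")
		if end < 0 {
			continue
		}
		fields := strings.Fields(s[end+1:])
		if len(fields) < 2 {
			continue
		}
		state, ppid := fields[0], fields[1]
		if state != "Z" || ppid != strconv.Itoa(self) {
			continue
		}
		pid, _ := strconv.Atoi(strings.TrimPrefix(filepath.Dir(statPath), "/proc/"))
		var ws syscall.WaitStatus
		// Non-blocking reap of the zombie entry.
		if _, err := syscall.Wait4(pid, &ws, syscall.WNOHANG, nil); err == nil {
			fmt.Printf("reaped zombie child pid=%d\n", pid)
		}
	}
}

func main() {
	// In a real janitor this would run after each sync loop.
	sweepLeakedChildren()
}
```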
Also, the rate at which they accumulate is pretty bad:
(this is after the pod was restarted this evening)
/shrug
hummmm, in janitor.py we are only using …
Yeah, I'm starting to think gcloud might be leaking processes internally; I'm looking at building the "bandaid" in the Go janitor :(
There is actually a small library specifically for this use case: https://github.com/ramr/go-reaper |
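For reference, go-reaper's usage (per its README, roughly; check the library's current API before relying on it) is a one-liner from main. A sketch, assuming the janitor runs as PID 1 inside its container:

```go
package main

import (
	reaper "github.com/ramr/go-reaper"
)

func main() {
	// Reap orphaned/zombie children in the background. This only matters
	// when this process runs as PID 1 inside the container.
	go reaper.Reap()

	// ... the janitor's normal sync loop would run here ...
}
```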
Maybe we can also recycle the janitor pods (e.g. set a max lifetime of ~1h for each instance), if that's easier.
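If the recycle route were taken, one low-tech sketch (an assumption, not what was implemented) is to have the process exit cleanly after a maximum lifetime and rely on the container restart policy to bring it back:

```go
package main

import (
	"log"
	"os"
	"time"
)

func main() {
	// Hypothetical max lifetime; the container restart policy brings the
	// janitor back up after we exit.
	const maxLifetime = 1 * time.Hour

	time.AfterFunc(maxLifetime, func() {
		log.Print("max lifetime reached, exiting so the container restarts")
		os.Exit(0)
	})

	// ... sync loop ...
	select {} // placeholder to keep this sketch alive until the timer fires
}
```

Restarting the container tears down its PID namespace, which also clears any zombie entries it accumulated, though it papers over rather than fixes the leak.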
"re-fixed" in #5885 |
Additionally, this is not particularly new, but for reference the complete diagnosis and solution are:
BUT:
This is the "PID1 problem" for docker. You can solve this by making sure your For other use cases a real init process of varying complexity might be desired to cleanly exit children (see the PID1 problem link), but we're already otherwise resilient to failed subprocess calls so we really just want these dead process entries cleaned up. #5885 does this. |
see: #5700 (comment)
additionally a previous issue: #4892
I'm looking into this.
/assign
/priority critical-urgent
/area boskos