Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose rollout for Docker shared PID namespace #207

Merged
merged 4 commits into from
Jan 26, 2017

Conversation

verb
Copy link
Contributor

@verb verb commented Dec 21, 2016

This PR is a redirect from kubernetes/kubernetes#37404 where it was reviewed by @euank (LGTM comment) and @brendandburns (LGTM comment)

@liggitt
Copy link
Member

liggitt commented Dec 21, 2016

Does shared PID namespace mean there is new visibility to processes in other containers in the pod? It seems like there are use cases for side car containers that would not want that behavior (not PID 1 as much as process isolation where cross container contact is limited to things in shared volume mounts)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 21, 2016
@euank
Copy link
Contributor

euank commented Dec 21, 2016

@liggitt

Yes, it adds more visibility/shared-ness within a pod.t Per the discussion here, this fits with the pod model. Pods are bound together, they share resources, networking, etc.

IMO, if you need process isolation between two things, then they shouldn't be in the same pod. It happens that due to how mount namespacing works and how containers are packaged, there's no option but to isolate disk resources to some degree, but I don't think that should be a justification for sharing other things.

Do you have specific examples or reasons that you think a pod should not share PID namespacing in some case? I'd rather not add complexity based on "what if some theoretical thing could maybe potentially possibly need some other thing".

@liggitt
Copy link
Member

liggitt commented Dec 21, 2016

they share resources

I thought resource consumption was specified and enforced per container

Do you have specific examples or reasons that you think a pod should not share PID namespacing in some case?

Containers in the same pod can have mixed trust levels, especially if some of the side car containers are doing "infrastructurish" things (like interacting with software like Vault in order to obtain credentials/certs, etc, to provide to the main container). It seems useful to retain the ability to isolate the processes as much as possible between containers.

@verb
Copy link
Contributor Author

verb commented Dec 21, 2016

@liggitt According to pod-resource-management.md the desire is to move pod-based resource management.

Isolated PID namespaces may not be possible in some runtimes (e.g. hypervisors, rkt). We probably want to be consistent across run times when possible, so it might take some extra thought. Would it be reasonable to poll the community for use cases not supported by a shared PID namespace once it's available?

@liggitt
Copy link
Member

liggitt commented Dec 21, 2016

According to pod-resource-management.md the desire is to move pod-based resource management.

Maybe I'm misreading, but I thought that was proposing adding pod level cgroups in addition to the existing container level ones

Would it be reasonable to poll the community for use cases not supported by a shared PID namespace once it's available?

Possibly, sure

@derekwaynecarr
Copy link
Member

derekwaynecarr commented Dec 21, 2016 via email

@verb
Copy link
Contributor Author

verb commented Dec 21, 2016

Whoops, I didn't notice the instructions to kubernetes-dev about moving proposals from the main repo until now. I've added links in the description to the previous PR's LGTMs.

@derekwaynecarr You're correct, this was discussed in SIG Node and was approved pending a rollout plan, which is this doc.

@philips
Copy link
Contributor

philips commented Dec 21, 2016

+1 to the entire proposal. Reviewed and no objections.

@metral
Copy link

metral commented Dec 22, 2016

Wanted to point out that I stumbled upon moby/moby#25348, which identifies that in a shared PID namespace, zombie processes are not being reparented to pid 1 in Docker v1.12.

Per moby/moby#25348 (comment), a fix doesn't seem like it will land until Docker v1.13+

Not sure if this is pertinent to this proposal's execution or not, but wanted to call it out given that the current plan is to enable this in v1.6 with Docker >= v1.12

`docker.kubernetes.io/shared-pid: true` (i.e. opt-in) when running with
Docker >= 1.12. Pods with this annotation will fail to start with older
Docker versions rather than failing to meet a user's expectation.
2. Release 1.7: Enable the shared PID namespace for pods unless annotated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disable ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enable. The first release will default to a disabled shared namespace, which is the current behavior. The second release will default to an enabled shared namespace, changing the default behavior. The third release will require shared namespace unless we discover a compelling use case and choose to formalize the option.

have built upon this assumption so we should change the default behavior over
the course of multiple releases.

1. Release 1.6: Enable the shared PID namespace for pods annotated with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will serve as pid 1 in this mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The infra container. kubernetes/kubernetes#36853 adds functionality to pause and kubernetes/kubernetes#1615 tracks kubelet changes & tests (e.g. ensuring infra container starts first).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth mentioning this inline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SIG node specifically requested a short rollout plan rather than a full proposal. I didn't do a good job of calling that out, so I've added a section at the top.

I can write up a full proposal for this change if you'd like, but I think it's overkill for a small change which fits neatly into a single SIG.

@verb
Copy link
Contributor Author

verb commented Dec 22, 2016

@metral That particularly issue isn't quite as dire as it seems since it only causes orphaned zombies to be reaped by the node's init rather than the pod's init, but I agree that there are several issues that could cause a schedule to slip.

It's also possible the feature won't see sufficient opt-in in the initial release (e.g. because GKE is still on Docker 1.11.x) and we'll opt to delay changing the default behavior.

@philips
Copy link
Contributor

philips commented Dec 22, 2016

cc cc @ethernetdan @lucab @jonboulle

Copy link
Contributor

@jonboulle jonboulle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments, looks good overall!


Pods share many namespaces, but the ability to share a PID namespace was not
supported by Docker until version 1.12. This document proposes how to roll out
support for sharing the PID namespace in the docker runtime.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a mention here about why this is Docker specific - i.e. the other runtime impls like cri-o, rkt, hyper all already support this mode (assuming they do!)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, mind renaming the file to shared-pid-namespace-docker

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, one other point this raises: I think we should have a corollary change in the CRI clarifying that all new runtime implementers must use a shared PID namespace. Otherwise even after this change goes through we might have other implementations that have (inadvertently or otherwise) implemented the old/undesirable behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to comments from SIG Node it's the case that cri-o & rkt already have the shared namespace. I agree that CRI should require the shared PID namespace, however others have expressed concern in kubernetes/kubernetes#37404.

@timothysc @dchen1107 for thoughts on enforcing shared PID in the CRI (which would make this change easier as well). I'll hold off on the rename in case this morphs into a broader mandate for all runtimes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@verb To my reading there aren't any substantive or unaddressed concerns in that thread. You're correct that rkt and cri-o support a shared namespace, but that's incidental rather than something that CRI has already requested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonboulle we're in agreement. I'll bring it up at the next SIG node meeting on 1/3.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ran out of time at SIG node to discuss this. I've added it to the agenda for next week, but I don't want to block this any longer and will constrain it to docker. We can increase its scope later if appropriate.

## Goals and Non-Goals

Goals include:
- Change default behavior in the Kubernetes Docker runtime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, changing

have built upon this assumption so we should change the default behavior over
the course of multiple releases.

1. Release 1.6: Enable the shared PID namespace for pods annotated with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth mentioning this inline.

@verb
Copy link
Contributor Author

verb commented Jan 4, 2017

Integrated additional feedback. I'm not sure what the next step is, I'm assuming someone will let me know if I need to squash.

@yujuhong
Copy link
Contributor

yujuhong commented Jan 5, 2017

It's also possible the feature won't see sufficient opt-in in the initial release (e.g. because GKE is still on Docker 1.11.x) and we'll opt to delay changing the default behavior.

I think this is the main issue I have with the proposal. If the majority of k8s-docker users are still on Docker 1.11 in 1.6, this schedule is less meaningful.

Another, perhaps more minor issue, is that based on the rollout plan, pods may have shared or non-shared namespaces based on what version of docker you use in release 1.8 (assuming k8s still supports docker <=1.11). This is not so evident from looking at the pod spec, and may causer user/admin/support confusion.

/cc @kubernetes/sig-node-misc @dchen1107

@jonboulle
Copy link
Contributor

you use in release 1.8 (assuming k8s still supports docker <=1.11)

that's a question this raises for me: what defines the policy around what versions of docker are supported?

@verb
Copy link
Contributor Author

verb commented Jan 5, 2017

I would expect SIG Node to define that policy, but that's outside the scope of this PR so it might be something we should raise on a mailing list instead.

@verb
Copy link
Contributor Author

verb commented Jan 10, 2017

There was general agreement in SIG node today to that the CRI should eventually require shared PID namespace of all run times (where feasible). I'm going to give some thought to how this should affect rollout of shared pid namespace (e.g. whether to couple with rollout of CRI) and either update or abandon this proposal.

@verb verb changed the title Propose rollout for Docker shared PID namespace WIP: Propose rollout for Docker shared PID namespace Jan 10, 2017
@dchen1107 dchen1107 self-assigned this Jan 13, 2017
@verb verb force-pushed the sharedpid-rollout branch from 5cb7c4f to d4789e1 Compare January 19, 2017 01:31
@verb verb changed the title WIP: Propose rollout for Docker shared PID namespace Propose rollout for Docker shared PID namespace Jan 19, 2017

Other changes that must be made to support this change:

1. Ensure all containers restart if the infra container responsible for the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what this means. kubelet already restarts all containers if the sandbox/infra container is dead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we should verify this behavior and maybe add a test. I added this item because a comment in another PR says differently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a change in the code itself. Suggest making it clear by rewording it as "Adding a test to verify"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to "add a test", but maybe we can drop this entirely if it's inherent in the concept of the sandbox.

## Rollout Plan

SIG Node is planning to switch to the CRI as a default in 1.6, at which point
users with Docker >= 1.12 will be able to test Shared namespaces. Switching
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Shared namespaces/shared PID namespaces

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks.


SIG Node is planning to switch to the CRI as a default in 1.6, at which point
users with Docker >= 1.12 will be able to test Shared namespaces. Switching
back to isolated PID namespaces will require disabling the CRI.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly concerned about bundling the PID namespace with the CRI rollout, i.e., there are no way to disable shared PID namespaces in CRI. The majority of the CRI implementation should not change the behavior of the applications, but this (the shared pid namespace) does not fall into this category. This may affect the adoption of CRI in 1.6.

@dchen1107, can we discuss more whether we want to bundle the two, as this is currently not included in the CRI rollout plan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's very safe to assume SIG Node will discuss this more. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire proposal revolves around the idea of bundling rollout, hence it deserves more discussion.

back to isolated PID namespaces will require disabling the CRI.

At some point, say 1.7, SIG Node will remove support for disabling the CRI.
After this point users must roll back to a previous version of Kubernetes or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why rolling back? It seems to me that the users won't even choose to upgrade their kubernetes cluster if this is the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose not to upgrade in order to avoid the additional shared namespace?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I didn't get why users would upgrade and then roll back for the shared pid namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they would roll back if they found their container images were incompatible with a new version of Kubernetes.

After this point users must roll back to a previous version of Kubernetes or
Docker to achieve PID namespace isolation. This is acceptable because:

* No one has been able to identify a concrete use case requiring isolated PID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm....I am not sure what has been done to reach a broader audience. Could you elaborate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No attempt has been made to reach a broader audience. I was referring to reviewers of this proposal.

namespaces.
* The lack of use cases means we can't justify the complexity required to make
PID namespace type configurable.
* Users will already be looking for issues due to the major version upgrade and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "major version upgrade" mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.X -> 1.Y

* Users will already be looking for issues due to the major version upgrade and
prepared for a rollback to the previous release.

Alternatively, we could create a flag in the kublet to disable shared PID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/kublet/kubelet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks.

prepared for a rollback to the previous release.

Alternatively, we could create a flag in the kublet to disable shared PID
namespace, but this wouldn't be especially useful to users of a hosted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? The maintainer of the hosted clusters can still choose to disable this feature....

Copy link
Contributor Author

@verb verb Jan 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There would need a mechanism for user's to affect flags of the running kubelet. I'd be surprised if this was common.

Or perhaps you mean they would choose to disable shared PID namespaces for all users on principle, which seems a stretch for a feature that so far no one opposes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having an escape hatch is better. The other option is to make a new patch release which can incur a lot of latency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There would need a mechanism for user's to affect flags of the running kubelet. I'd be surprised if this was common.

I can't comment on behalf of all the provider of hosted kubernetes clusters...

Or perhaps you mean they would choose to disable shared PID namespaces for all users on principle, which seems a stretch for a feature that so far no one opposes.

From past experiences, it's hard to gather feedback from users. Erring on the side of caution might be good. Is making shared pid namespace configurable for the transitional period really such a big hurdle?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it's not a big hurdle. It can be done with an annotation and adding an option to the CRI.

@@ -86,7 +86,7 @@ container setup that are not currently trackable as Pod constraints, e.g.,
filesystem setup, container image pulling, etc.*

A container in a PodSandbox maps to an application in the Pod Spec. For Linux
containers, they are expected to share at least network and IPC namespaces,
containers, they are expected to share at least network, PID and IPC namespaces,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think we should change an old design proposal. This should be documented somewhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the change should be reflected somewhere, and SIG Node pointed me here.

I guess the broader question is whether proposals are living documents, but that's out of scope for this PR. If you want me to make the change somewhere else please suggest a place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do update old proposals to reflect new features or changes to design originally proposed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's impossible to keep all the proposals up-to-date though. Adding a separate spec for CRI anyway has to be done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, since CRI v1alpha1 is not out yet, and shared pid namespace is what we agreed for v1alpha1, updating the proposal might be the least effort we could do?


## Rollout Plan

SIG Node is planning to switch to the CRI as a default in 1.6, at which point
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We agreed to have shared pid namespace by default for CRI v1alpha1 interface. In this case, if the user use docker 1.12 or above, enabling shared pid namespace means early exposure to the end users is good thing for us to catch any potential issues.

On another side, I think there is a real concern that this is a behavior change for Kubernetes container runtime in production. Even other runtime implementations are having shared pid namespace by default today, but in production, the majority of Kubernetes users are not using such features, and we are not sure if this might cause any user issues. If any problem involving with shared pid namespace resulted in a CRI disabling might be too dramatic.

Can we introduce a flag to disable the shared pid namespace instead of disabling entire CRI?

cc/ @yujuhong @verb

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, I'll update the proposal.

@dchen1107
Copy link
Member

LGTM

@dchen1107 dchen1107 added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2017
@yujuhong
Copy link
Contributor

yujuhong commented Jan 24, 2017

LGTM

@dchen1107 dchen1107 merged commit 160e9b6 into kubernetes:master Jan 26, 2017
sameo pushed a commit to sameo/cri-o that referenced this pull request Feb 22, 2017
And not only the network and IPC ones.
This is following a recent kubernetes change:
kubernetes/community#207

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
sameo pushed a commit to sameo/cri-o that referenced this pull request Feb 22, 2017
And not only the network and IPC ones.
This is following a recent kubernetes change:
kubernetes/community#207

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
sameo pushed a commit to sameo/cri-o that referenced this pull request Feb 22, 2017
And not only the network and IPC ones.
This is following a recent kubernetes change:
kubernetes/community#207

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
@yujuhong
Copy link
Contributor

yujuhong commented Feb 23, 2017

Let's hold on to this one for a little bit longer.

I was testing the shared PID namespace using docker 1.12.6, and found an issue (which I believe has been fixed in docker 1.13, which we haven't qualified yet for kubernetes).

Docker 1.12.6 sends sigkill only to the init process. This is not a problem if each container has its own PID namespace because all processes in the namespace will terminate with the PID 1. This, however, is a problem for containers sharing the PID namespace because killing one process (that's not PID 1) doesn't kill all its children processes.

E.g.,

$ sudo docker run -d busybox sleep 10000
a317a86a95abea36a2d91510f659da6500f10196828d3026e61e49c42e5a822d
$ sudo docker run -d --pid=container:a317a86a95 busybox /bin/sh -c "sleep 9999"
476186c4c3bc1d04b847431f76989748d5035f6713e4980e9d6bef5dfd961e78
$ sudo docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS              PORTS               NAMES
476186c4c3bc        busybox             "/bin/sh -c 'sleep 99"   13 seconds ago       Up 12 seconds                           focused_mccarthy
a317a86a95ab        busybox             "sleep 10000"            About a minute ago   Up About a minute                       nostalgic_curran
$ sudo docker stop -t=5 476186c4c3bc

The final docker stop command hangs forever, and the sleep 9999 process never terminates.

I can reliably reproduce this every time on ubuntu trusty with docker 1.12.6 (and 1.12.3), but docker 1.13.0 doesn't have this problem.

@mrunalp pointed me to the runc fix (opencontainers/runc#1180), where signals were sent to all the processes in the container. This could've been included in docker 1.13, but I haven't checked closely.

/cc @dchen1107

@verb
Copy link
Contributor Author

verb commented Feb 23, 2017

@yujuhong wow, that's a bad one. I can reproduce it as well, though I only get the infinite hang when the in-container parent process is a shell, strangely.

This isn't the only problem with the shared PID namespace implementation in 1.12.*, so I think we should bump the minimum docker version for shared PID to 1.13.

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Apr 29, 2017
Automatic merge from submit-queue (batch tested with PRs 41583, 45117, 45123)

Implement shared PID namespace in the dockershim

**What this PR does / why we need it**: Defaults the Docker CRI to using a shared PID namespace for pods. Implements proposal in kubernetes/community#207 tracked by #1615.

//cc @dchen1107 @vishh @timstclair 

**Special notes for your reviewer**: none

**Release note**:
```release-note
Some container runtimes share a process (PID) namespace for all containers in a pod. This will become the default for Docker in a future release of Kubernetes. You can preview this functionality if running with the CRI and Docker 1.13.1 by enabling the --experimental-docker-enable-shared-pid kubelet flag.
```
@yujuhong
Copy link
Contributor

@yujuhong wow, that's a bad one. I can reproduce it as well, though I only get the infinite hang when the in-container parent process is a shell, strangely.

The shell doesn't propagate signals to its child processes on its own, hence the problem.

@euank
Copy link
Contributor

euank commented Jun 14, 2017

Docker 1.12.6 sends sigkill only to the init process. This is not a problem if each container has its own PID namespace because all processes in the namespace will terminate with the PID 1. This, however, is a problem for containers sharing the PID namespace because killing one process (that's not PID 1) doesn't kill all its children processes.

I don't think docker stop's semantics need be a real blocker.

The kubelet controls the parent cgroup(s) of the pod, so it has a convenient way to iterate all pids and terminate them. This behaviour matches what many process-managers do. The kubelet is a process manager to a degree. It already controls the lifecycle of a pod, so it doesn't seem too crazy for it to also handle the death of it more explicitly than via 'docker stop'.

@yujuhong
Copy link
Contributor

@euank with the current abstraction of CRI, kubelet is pretty much isolated from the actual processes. The containers could be run in a hypervisor-based runtime. The only possible place to implement this is in the docker integration code, and I find it hard to justify the work to enable the new feature.

@euank
Copy link
Contributor

euank commented Jun 14, 2017

kubelet is pretty much isolated from the actual processes

One of the core features of K8s is its resource constraints. For those to work, the kubelet must have purview over the actual process at least to the degree of resource usage.
Even with hypervisor-based runtimes it must be true that the pod's processes all reside in that cgroup (even if the processes are just VMs). SIGKILL after the stop timeout works just fine on VMs too.
Any runtime that has pod processes sitting outside the cgroup arguably isn't really a K8s runtime because it's breaking resource constraints.

hard to justify the work to enable the new feature

I can agree with that; it does seem like needless work when we could instead only support runtimes that do the right thing here.
However, if there's no plan to drop support for docker 1.12 in the next few release cycles, it might be worth solving it.

@verb
Copy link
Contributor Author

verb commented Jun 14, 2017

@euank I feel like I might have missed part of the conversation here, so apologies if I'm off topic, but shared pid isn't supported for docker 1.12. Docker version has to be at least 1.13.1 for kubernetes to use shared pid.

@yujuhong
Copy link
Contributor

Even with hypervisor-based runtimes it must be true that the pod's processes all reside in that cgroup (even if the processes are just VMs).

@euank in my example above (#207 (comment)), I simply wanted to kill one container in the pod and did not want to nuke the entire pod. How is kubelet going to reap the children of one specific process in that pod? That's why I said only the runtime (shim) had enough information to fix this.

I can agree with that; it does seem like needless work when we could instead only support runtimes that do the right thing here.
However, if there's no plan to drop support for docker 1.12 in the next few release cycles, it might be worth solving it.

Sure, there is value in fixing this, but in practice, users can install docker 1.13 if they really want the shared pid namespace. There is also need to communicate with the users for changing the default behavior for docker 1.12, in case anyone's interested.

@euank
Copy link
Contributor

euank commented Jun 15, 2017

I simply wanted to kill one container in the pod and did not want to nuke the entire pod

The Kubernetes API doesn't let you do that; you only get to create and kill things at a pod granularity :)

When the kubelet needs to do container level operations, it still of course needs to go through the runtime, but for the final "kill" resulting from a pod being marked for deletion is always pod level, and thus sweeping leftover tasks from the whole pod in the cgroup is sufficient granularity.

but in practice, users can install docker 1.13

Users only get to do that if they ignore the recommendations of the Kubernetes project though, right? Seems like a rock and a hard place.


I am convinced that hackily working around this in the kubelet isn't the right approach, and rolling forwards to only supporting a newer docker version is probably the sane option.

@yujuhong
Copy link
Contributor

When the kubelet needs to do container level operations, it still of course needs to go through the runtime, but for the final "kill" resulting from a pod being marked for deletion is always pod level, and thus sweeping leftover tasks from the whole pod in the cgroup is sufficient granularity.

Yes, I was mainly talking about the per-container cleanup. Kubelet can handle the pod-level cleanup like you said.

Users only get to do that if they ignore the recommendations of the Kubernetes project though, right? Seems like a rock and a hard place.

We've seen early adopters trying out new docker versions.

I am convinced that hackily working around this in the kubelet isn't the right approach, and rolling forwards to only supporting a newer docker version is probably the sane option.

:-)

MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
Propose rollout for Docker shared PID namespace
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023
fabiand pushed a commit to fabiand/community that referenced this pull request Mar 1, 2024
Signed-off-by: Andrew Burden <aburden@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.