-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Propose rollout for Docker shared PID namespace #207
Conversation
Does shared PID namespace mean there is new visibility to processes in other containers in the pod? It seems like there are use cases for side car containers that would not want that behavior (not PID 1 as much as process isolation where cross container contact is limited to things in shared volume mounts) |
Yes, it adds more visibility/shared-ness within a pod.t Per the discussion here, this fits with the pod model. Pods are bound together, they share resources, networking, etc. IMO, if you need process isolation between two things, then they shouldn't be in the same pod. It happens that due to how mount namespacing works and how containers are packaged, there's no option but to isolate disk resources to some degree, but I don't think that should be a justification for sharing other things. Do you have specific examples or reasons that you think a pod should not share PID namespacing in some case? I'd rather not add complexity based on "what if some theoretical thing could maybe potentially possibly need some other thing". |
I thought resource consumption was specified and enforced per container
Containers in the same pod can have mixed trust levels, especially if some of the side car containers are doing "infrastructurish" things (like interacting with software like Vault in order to obtain credentials/certs, etc, to provide to the main container). It seems useful to retain the ability to isolate the processes as much as possible between containers. |
@liggitt According to pod-resource-management.md the desire is to move pod-based resource management. Isolated PID namespaces may not be possible in some runtimes (e.g. hypervisors, rkt). We probably want to be consistent across run times when possible, so it might take some extra thought. Would it be reasonable to poll the community for use cases not supported by a shared PID namespace once it's available? |
Maybe I'm misreading, but I thought that was proposing adding pod level cgroups in addition to the existing container level ones
Possibly, sure |
I agree with Euan and we should default to shared pid namespace to promote
communication among containers in a pod. I thought this was discussed in
sig-node previously.
Looks like I need to update that pod resource management document. Right
now, we have a pod level cgroup in addition to existing container level
ones. Resource requirements are still at the container level, the pod
cgroup is just the sum of container resources. That said, I am +1 on pod
level resource requirements in future. Right now, two containers in a pod
can share memory and CPU resources (i.e. Pod where Container 1 makes CPU
and memory request, container 2 makes none, container 2 scavenges from
container 1 but is bounded by pod cgroup)
|
Whoops, I didn't notice the instructions to kubernetes-dev about moving proposals from the main repo until now. I've added links in the description to the previous PR's LGTMs. @derekwaynecarr You're correct, this was discussed in SIG Node and was approved pending a rollout plan, which is this doc. |
+1 to the entire proposal. Reviewed and no objections. |
Wanted to point out that I stumbled upon moby/moby#25348, which identifies that in a shared PID namespace, zombie processes are not being reparented to pid 1 in Docker v1.12. Per moby/moby#25348 (comment), a fix doesn't seem like it will land until Docker v1.13+ Not sure if this is pertinent to this proposal's execution or not, but wanted to call it out given that the current plan is to enable this in v1.6 with Docker >= v1.12 |
`docker.kubernetes.io/shared-pid: true` (i.e. opt-in) when running with | ||
Docker >= 1.12. Pods with this annotation will fail to start with older | ||
Docker versions rather than failing to meet a user's expectation. | ||
2. Release 1.7: Enable the shared PID namespace for pods unless annotated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disable ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enable. The first release will default to a disabled shared namespace, which is the current behavior. The second release will default to an enabled shared namespace, changing the default behavior. The third release will require shared namespace unless we discover a compelling use case and choose to formalize the option.
have built upon this assumption so we should change the default behavior over | ||
the course of multiple releases. | ||
|
||
1. Release 1.6: Enable the shared PID namespace for pods annotated with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will serve as pid 1 in this mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The infra container. kubernetes/kubernetes#36853 adds functionality to pause and kubernetes/kubernetes#1615 tracks kubelet changes & tests (e.g. ensuring infra container starts first).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth mentioning this inline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SIG node specifically requested a short rollout plan rather than a full proposal. I didn't do a good job of calling that out, so I've added a section at the top.
I can write up a full proposal for this change if you'd like, but I think it's overkill for a small change which fits neatly into a single SIG.
@metral That particularly issue isn't quite as dire as it seems since it only causes orphaned zombies to be reaped by the node's init rather than the pod's init, but I agree that there are several issues that could cause a schedule to slip. It's also possible the feature won't see sufficient opt-in in the initial release (e.g. because GKE is still on Docker 1.11.x) and we'll opt to delay changing the default behavior. |
cc cc @ethernetdan @lucab @jonboulle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments, looks good overall!
|
||
Pods share many namespaces, but the ability to share a PID namespace was not | ||
supported by Docker until version 1.12. This document proposes how to roll out | ||
support for sharing the PID namespace in the docker runtime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a mention here about why this is Docker specific - i.e. the other runtime impls like cri-o, rkt, hyper all already support this mode (assuming they do!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, mind renaming the file to shared-pid-namespace-docker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, one other point this raises: I think we should have a corollary change in the CRI clarifying that all new runtime implementers must use a shared PID namespace. Otherwise even after this change goes through we might have other implementations that have (inadvertently or otherwise) implemented the old/undesirable behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to comments from SIG Node it's the case that cri-o & rkt already have the shared namespace. I agree that CRI should require the shared PID namespace, however others have expressed concern in kubernetes/kubernetes#37404.
@timothysc @dchen1107 for thoughts on enforcing shared PID in the CRI (which would make this change easier as well). I'll hold off on the rename in case this morphs into a broader mandate for all runtimes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@verb To my reading there aren't any substantive or unaddressed concerns in that thread. You're correct that rkt and cri-o support a shared namespace, but that's incidental rather than something that CRI has already requested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonboulle we're in agreement. I'll bring it up at the next SIG node meeting on 1/3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We ran out of time at SIG node to discuss this. I've added it to the agenda for next week, but I don't want to block this any longer and will constrain it to docker. We can increase its scope later if appropriate.
## Goals and Non-Goals | ||
|
||
Goals include: | ||
- Change default behavior in the Kubernetes Docker runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, changing
have built upon this assumption so we should change the default behavior over | ||
the course of multiple releases. | ||
|
||
1. Release 1.6: Enable the shared PID namespace for pods annotated with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth mentioning this inline.
Integrated additional feedback. I'm not sure what the next step is, I'm assuming someone will let me know if I need to squash. |
I think this is the main issue I have with the proposal. If the majority of k8s-docker users are still on Docker 1.11 in 1.6, this schedule is less meaningful. Another, perhaps more minor issue, is that based on the rollout plan, pods may have shared or non-shared namespaces based on what version of docker you use in release 1.8 (assuming k8s still supports docker <=1.11). This is not so evident from looking at the pod spec, and may causer user/admin/support confusion. /cc @kubernetes/sig-node-misc @dchen1107 |
that's a question this raises for me: what defines the policy around what versions of docker are supported? |
I would expect SIG Node to define that policy, but that's outside the scope of this PR so it might be something we should raise on a mailing list instead. |
There was general agreement in SIG node today to that the CRI should eventually require shared PID namespace of all run times (where feasible). I'm going to give some thought to how this should affect rollout of shared pid namespace (e.g. whether to couple with rollout of CRI) and either update or abandon this proposal. |
Also rename file to be docker specific.
5cb7c4f
to
d4789e1
Compare
|
||
Other changes that must be made to support this change: | ||
|
||
1. Ensure all containers restart if the infra container responsible for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what this means. kubelet already restarts all containers if the sandbox/infra container is dead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means we should verify this behavior and maybe add a test. I added this item because a comment in another PR says differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a change in the code itself. Suggest making it clear by rewording it as "Adding a test to verify"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to "add a test", but maybe we can drop this entirely if it's inherent in the concept of the sandbox.
## Rollout Plan | ||
|
||
SIG Node is planning to switch to the CRI as a default in 1.6, at which point | ||
users with Docker >= 1.12 will be able to test Shared namespaces. Switching |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Shared namespaces/shared PID namespaces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks.
|
||
SIG Node is planning to switch to the CRI as a default in 1.6, at which point | ||
users with Docker >= 1.12 will be able to test Shared namespaces. Switching | ||
back to isolated PID namespaces will require disabling the CRI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm slightly concerned about bundling the PID namespace with the CRI rollout, i.e., there are no way to disable shared PID namespaces in CRI. The majority of the CRI implementation should not change the behavior of the applications, but this (the shared pid namespace) does not fall into this category. This may affect the adoption of CRI in 1.6.
@dchen1107, can we discuss more whether we want to bundle the two, as this is currently not included in the CRI rollout plan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's very safe to assume SIG Node will discuss this more. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The entire proposal revolves around the idea of bundling rollout, hence it deserves more discussion.
back to isolated PID namespaces will require disabling the CRI. | ||
|
||
At some point, say 1.7, SIG Node will remove support for disabling the CRI. | ||
After this point users must roll back to a previous version of Kubernetes or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why rolling back? It seems to me that the users won't even choose to upgrade their kubernetes cluster if this is the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Choose not to upgrade in order to avoid the additional shared namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I didn't get why users would upgrade and then roll back for the shared pid namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they would roll back if they found their container images were incompatible with a new version of Kubernetes.
After this point users must roll back to a previous version of Kubernetes or | ||
Docker to achieve PID namespace isolation. This is acceptable because: | ||
|
||
* No one has been able to identify a concrete use case requiring isolated PID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm....I am not sure what has been done to reach a broader audience. Could you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No attempt has been made to reach a broader audience. I was referring to reviewers of this proposal.
namespaces. | ||
* The lack of use cases means we can't justify the complexity required to make | ||
PID namespace type configurable. | ||
* Users will already be looking for issues due to the major version upgrade and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "major version upgrade" mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.X -> 1.Y
* Users will already be looking for issues due to the major version upgrade and | ||
prepared for a rollback to the previous release. | ||
|
||
Alternatively, we could create a flag in the kublet to disable shared PID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/kublet/kubelet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks.
prepared for a rollback to the previous release. | ||
|
||
Alternatively, we could create a flag in the kublet to disable shared PID | ||
namespace, but this wouldn't be especially useful to users of a hosted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? The maintainer of the hosted clusters can still choose to disable this feature....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There would need a mechanism for user's to affect flags of the running kubelet. I'd be surprised if this was common.
Or perhaps you mean they would choose to disable shared PID namespaces for all users on principle, which seems a stretch for a feature that so far no one opposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having an escape hatch is better. The other option is to make a new patch release which can incur a lot of latency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There would need a mechanism for user's to affect flags of the running kubelet. I'd be surprised if this was common.
I can't comment on behalf of all the provider of hosted kubernetes clusters...
Or perhaps you mean they would choose to disable shared PID namespaces for all users on principle, which seems a stretch for a feature that so far no one opposes.
From past experiences, it's hard to gather feedback from users. Erring on the side of caution might be good. Is making shared pid namespace configurable for the transitional period really such a big hurdle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, it's not a big hurdle. It can be done with an annotation and adding an option to the CRI.
@@ -86,7 +86,7 @@ container setup that are not currently trackable as Pod constraints, e.g., | |||
filesystem setup, container image pulling, etc.* | |||
|
|||
A container in a PodSandbox maps to an application in the Pod Spec. For Linux | |||
containers, they are expected to share at least network and IPC namespaces, | |||
containers, they are expected to share at least network, PID and IPC namespaces, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think we should change an old design proposal. This should be documented somewhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the change should be reflected somewhere, and SIG Node pointed me here.
I guess the broader question is whether proposals are living documents, but that's out of scope for this PR. If you want me to make the change somewhere else please suggest a place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do update old proposals to reflect new features or changes to design originally proposed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's impossible to keep all the proposals up-to-date though. Adding a separate spec for CRI anyway has to be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, since CRI v1alpha1 is not out yet, and shared pid namespace is what we agreed for v1alpha1, updating the proposal might be the least effort we could do?
|
||
## Rollout Plan | ||
|
||
SIG Node is planning to switch to the CRI as a default in 1.6, at which point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We agreed to have shared pid namespace by default for CRI v1alpha1 interface. In this case, if the user use docker 1.12 or above, enabling shared pid namespace means early exposure to the end users is good thing for us to catch any potential issues.
On another side, I think there is a real concern that this is a behavior change for Kubernetes container runtime in production. Even other runtime implementations are having shared pid namespace by default today, but in production, the majority of Kubernetes users are not using such features, and we are not sure if this might cause any user issues. If any problem involving with shared pid namespace resulted in a CRI disabling might be too dramatic.
Can we introduce a flag to disable the shared pid namespace instead of disabling entire CRI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, I'll update the proposal.
LGTM |
LGTM |
And not only the network and IPC ones. This is following a recent kubernetes change: kubernetes/community#207 Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
And not only the network and IPC ones. This is following a recent kubernetes change: kubernetes/community#207 Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
And not only the network and IPC ones. This is following a recent kubernetes change: kubernetes/community#207 Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Let's hold on to this one for a little bit longer. I was testing the shared PID namespace using docker 1.12.6, and found an issue (which I believe has been fixed in docker 1.13, which we haven't qualified yet for kubernetes). Docker 1.12.6 sends sigkill only to the init process. This is not a problem if each container has its own PID namespace because all processes in the namespace will terminate with the PID 1. This, however, is a problem for containers sharing the PID namespace because killing one process (that's not PID 1) doesn't kill all its children processes. E.g., $ sudo docker run -d busybox sleep 10000
a317a86a95abea36a2d91510f659da6500f10196828d3026e61e49c42e5a822d
$ sudo docker run -d --pid=container:a317a86a95 busybox /bin/sh -c "sleep 9999"
476186c4c3bc1d04b847431f76989748d5035f6713e4980e9d6bef5dfd961e78
$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
476186c4c3bc busybox "/bin/sh -c 'sleep 99" 13 seconds ago Up 12 seconds focused_mccarthy
a317a86a95ab busybox "sleep 10000" About a minute ago Up About a minute nostalgic_curran
$ sudo docker stop -t=5 476186c4c3bc The final I can reliably reproduce this every time on ubuntu trusty with docker 1.12.6 (and 1.12.3), but docker 1.13.0 doesn't have this problem. @mrunalp pointed me to the runc fix (opencontainers/runc#1180), where signals were sent to all the processes in the container. This could've been included in docker 1.13, but I haven't checked closely. /cc @dchen1107 |
@yujuhong wow, that's a bad one. I can reproduce it as well, though I only get the infinite hang when the in-container parent process is a shell, strangely. This isn't the only problem with the shared PID namespace implementation in 1.12.*, so I think we should bump the minimum docker version for shared PID to 1.13. |
Automatic merge from submit-queue (batch tested with PRs 41583, 45117, 45123) Implement shared PID namespace in the dockershim **What this PR does / why we need it**: Defaults the Docker CRI to using a shared PID namespace for pods. Implements proposal in kubernetes/community#207 tracked by #1615. //cc @dchen1107 @vishh @timstclair **Special notes for your reviewer**: none **Release note**: ```release-note Some container runtimes share a process (PID) namespace for all containers in a pod. This will become the default for Docker in a future release of Kubernetes. You can preview this functionality if running with the CRI and Docker 1.13.1 by enabling the --experimental-docker-enable-shared-pid kubelet flag. ```
The shell doesn't propagate signals to its child processes on its own, hence the problem. |
I don't think The kubelet controls the parent cgroup(s) of the pod, so it has a convenient way to iterate all pids and terminate them. This behaviour matches what many process-managers do. The kubelet is a process manager to a degree. It already controls the lifecycle of a pod, so it doesn't seem too crazy for it to also handle the death of it more explicitly than via 'docker stop'. |
@euank with the current abstraction of CRI, kubelet is pretty much isolated from the actual processes. The containers could be run in a hypervisor-based runtime. The only possible place to implement this is in the docker integration code, and I find it hard to justify the work to enable the new feature. |
One of the core features of K8s is its resource constraints. For those to work, the kubelet must have purview over the actual process at least to the degree of resource usage.
I can agree with that; it does seem like needless work when we could instead only support runtimes that do the right thing here. |
@euank I feel like I might have missed part of the conversation here, so apologies if I'm off topic, but shared pid isn't supported for docker 1.12. Docker version has to be at least 1.13.1 for kubernetes to use shared pid. |
@euank in my example above (#207 (comment)), I simply wanted to kill one container in the pod and did not want to nuke the entire pod. How is kubelet going to reap the children of one specific process in that pod? That's why I said only the runtime (shim) had enough information to fix this.
Sure, there is value in fixing this, but in practice, users can install docker 1.13 if they really want the shared pid namespace. There is also need to communicate with the users for changing the default behavior for docker 1.12, in case anyone's interested. |
The Kubernetes API doesn't let you do that; you only get to create and kill things at a pod granularity :) When the kubelet needs to do container level operations, it still of course needs to go through the runtime, but for the final "kill" resulting from a pod being marked for deletion is always pod level, and thus sweeping leftover tasks from the whole pod in the cgroup is sufficient granularity.
Users only get to do that if they ignore the recommendations of the Kubernetes project though, right? Seems like a rock and a hard place. I am convinced that hackily working around this in the kubelet isn't the right approach, and rolling forwards to only supporting a newer docker version is probably the sane option. |
Yes, I was mainly talking about the per-container cleanup. Kubelet can handle the pod-level cleanup like you said.
We've seen early adopters trying out new docker versions.
:-) |
Propose rollout for Docker shared PID namespace
Signed-off-by: Andrew Burden <aburden@redhat.com>
This PR is a redirect from kubernetes/kubernetes#37404 where it was reviewed by @euank (LGTM comment) and @brendandburns (LGTM comment)