
KEP-2568: Run control-plane as non-root in kubeadm. #2569

Merged — 1 commit, Apr 12, 2021

Conversation

@vinayakankugoyal (Contributor) commented Mar 13, 2021

This KEP proposes that the control-plane in kubeadm be run as non-root. If containers run as root, an escape from a container may result in escalation to root on the host. CVE-2019-5736 is an example of a container-escape vulnerability that can be mitigated by running containers/pods as non-root.

xref #2568

@k8s-ci-robot (Contributor)

Welcome @vinayakankugoyal!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 13, 2021
@k8s-ci-robot (Contributor)

Hi @vinayakankugoyal. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/security Categorizes an issue or PR as relevant to SIG Security. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 13, 2021
@neolit123 (Member)

the KEP should be under sig-cluster-lifecycle/kubeadm

@vinayakankugoyal vinayakankugoyal force-pushed the kep branch 2 times, most recently from 22b931b to 24d44b1 Compare March 14, 2021 01:12
@k8s-ci-robot k8s-ci-robot added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Mar 14, 2021
@vinayakankugoyal (Contributor, Author)

/cc @BenTheElder

@vinayakankugoyal vinayakankugoyal marked this pull request as draft March 17, 2021 18:59
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2021
@pacoxu (Member) commented Mar 18, 2021

/cc

@k8s-ci-robot k8s-ci-robot requested a review from pacoxu March 18, 2021 02:38
@vinayakankugoyal (Contributor, Author) commented Mar 19, 2021

@neolit123 and @BenTheElder - I would love to hear your thoughts on the draft.

@vinayakankugoyal vinayakankugoyal force-pushed the kep branch 2 times, most recently from 26bffbb to 2cca427 Compare March 22, 2021 04:47
@vinayakankugoyal vinayakankugoyal marked this pull request as ready for review March 22, 2021 04:47
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 22, 2021

**Stuff that we care about for this KEP:**
| file/directory | Components | File permission |
| -------------------------------------------------|------------|-----------------|
Member:

Is there a reason to limit the file permissions to owner read/write and group read-only, i.e. 640? Would it break anything if we further limited them to 600?

Contributor (Author):

I don't think this will break anything. You are right, this should be 600. We should go with the strictest permissions.

| file/directory | Components | File permission |
| -------------------------------------------------|------------|-----------------|
| /etc/kubernetes/pki/etcd/server.crt | etcd | 640 etcd etcd |
| /etc/kubernetes/pki/etcd/server.key | etcd | 640 etcd etcd |
Member:

If I understand correctly, this table implies that all the components, i.e. etcd, kube-apiserver, kube-controller-manager, etc., will have their own unique UIDs and GIDs, named after the component itself?

e.g. component: etcd, user = etcd, group=etcd

If this is correct, it might be worth highlighting this important caveat, that all UIDs and GIDs for all components will be different and that they share the name with their component.

@neolit123 (Member) — Apr 7, 2021:

maybe we should prefix the names with kubeadm- too.

Contributor (Author):

Yup, your understanding is correct. I'll call it out here to make it clearer. I do call it out in the first point of the Proposal section.

In `kubeadm` the control-plane components are run as [static-pods](https://kubernetes.io/docs/concepts/workloads/pods/#static-pods), i.e. pods directly managed by kubelet. We can use the `runAsUser` and `runAsGroup` fields in [SecurityContext](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.20/#securitycontext-v1-core) to run the containers in these Pods with a unique `UID` and `GID` and the `capabilities` field to drop all capabilities other than the ones required.
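As a minimal sketch of how those fields fit into a static-pod manifest (the UID/GID values and image tag below are placeholders for illustration, not values chosen by this KEP):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - name: kube-scheduler
    image: k8s.gcr.io/kube-scheduler:v1.21.0   # placeholder tag
    command:
    - kube-scheduler
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    securityContext:
      runAsUser: 2001    # placeholder UID; how real values are assigned is discussed below
      runAsGroup: 2001   # placeholder GID
      capabilities:
        drop: ["ALL"]    # kube-scheduler needs no extra capabilities
```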

### Assigning UID and GID
There are 3 options for setting the `UID`/`GID` of the control-plane components:-
Member:

Another option, assuming we name the UID and GID after the component they own, is to generate them such that UID == GID == process ID when these components are first run as non-root. This is not a perfect solution and has the following pros and cons:

Pros:

  1. Allows mapping of UID and GID to components via process ID, which might be helpful and intuitive during production troubleshooting.
  2. Ensures uniqueness at a point in time, as no two processes at any given moment will have the same process ID.
  3. Will most likely never yield UID = 0, since a component getting process ID 0 is highly unlikely — something (e.g. the swapper or scheduler) is typically already running with process ID 0.

Cons:

  1. Might need some statefulness to pick up the same UID and GID the next time any of these components is started or restarted, as the process ID could differ.
  2. Might overlap with existing UIDs or usernames, e.g. the process ID of the api-server is 1001, and a user with UID 1001 exists and is running a monitoring service on the same node as the api-server. This could lead to some interesting privilege-escalation attacks.

Contributor (Author):

We could also ensure uniqueness if we used adduser and groupadd.
This would add complications on restarts: restarts, which are completely managed by the kubelet today, would now need to be managed by kubeadm somehow to update not only the manifest with the new UID but also all the file permissions; this would be fairly complicated.

Member:

That makes sense. If it is not too much trouble, could you add this as a 4th option in the KEP markdown and share why it was not chosen? I worry that these comments will get lost after the KEP is merged and we may end up revisiting this in the future. Putting this in the KEP will save some future time for all of us :) Happy to make the edits to the KEP myself, but I am afraid I do not have the permissions to do so.

@vinayakankugoyal (Contributor, Author) — Apr 9, 2021:

Happy to add it as an option. Can you suggest it as an edit? That way it's worded the way you want.

@randomvariable (Member) — Apr 9, 2021:

We should say that the reasoning for following PAM SYS_UID_MIN and SYS_UID_MAX is to fall in line with distro standards, which will definitely avoid conflicts with actual user accounts. I linked the relevant Debian and Fedora packaging docs in a previous comment.

Contributor (Author):

Good call, done!


2. **Use constant values for uid/gid:** The `UID` and `GID` for each of the control-plane components would be set to some predetermined value in https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/constants/constants.go.

3. **Create system users and let users override the defaults:** Use `adduser --system` or equivalent to create `UID`s in the SYS_UID_MIN - SYS_UID_MAX range and `groupadd --system` or equivalent to create `GID`s in the SYS_GID_MIN - SYS_GID_MAX range. Additionally, if users want to specify their own `UID`s or `GID`s, we will support that through the `kubeadm` v1beta3 API.
Member:

We could offset the UID and GID used here by using some really high value as SYS_UID_MIN, to prevent chances of overlap. However, for some flavors of Kubernetes this may become a problem, as they have their own ways of assigning unique user IDs for running pods, e.g. OpenShift: https://www.openshift.com/blog/a-guide-to-openshift-and-uids

Contributor (Author):

My 2 cents would be to keep this logic simple, as fitting everyone's requirements by default would be extremely challenging. Giving users the ability to pick their own IDs through the API would solve the problem for these flavors.

Member:

You are right. I am definitely in favor of simple and secure.

2. Drop all capabilities from the `kube-controller-manager`, `kube-scheduler`, and `etcd` Pods. Drop all but `CAP_NET_BIND_SERVICE` from the `kube-apiserver` Pod; this allows us to run as non-root while still being able to bind to ports below 1024. This can be achieved by using the `capabilities` field in the [securityContext](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.20/#securitycontext-v1-core) of the containers in the Pods.

3. Set the seccomp profile to `runtime/default`; this can be done by setting the `seccompProfile` field in the `securityContext`.
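Points 2 and 3 combined can be sketched as the following container `securityContext` for `kube-apiserver` (the UID/GID here are placeholders, not values from the KEP):

```yaml
securityContext:
  runAsUser: 2000    # placeholder UID
  runAsGroup: 2000   # placeholder GID
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]   # only kube-apiserver keeps this, to bind ports < 1024
  seccompProfile:
    type: RuntimeDefault        # the "runtime/default" profile
```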

Member:

It would be extremely valuable to set a default SELinux or AppArmor profile here as well. Some conformance tests, or disclaimers on whether this works in SELinux- and AppArmor-enforced environments, would be helpful before admins enable this feature.

@neolit123 (Member) — Apr 7, 2021:

we found handling SELinux / AppArmor defaults from the side of kubeadm difficult and distro-specific in prior discussions with users that are somewhat experienced with these security solutions.
for some users something worked, while for others it didn't, thus at some point we decided that kubeadm should not try to manage these.

unless there are sane things that we can setup that work for everyone in a distro agnostic way (kubeadm should be distro agnostic), maybe we shouldn't.

in today's SIG Auth breakout meeting about the PSP replacement, the same problem was highlighted.

Member:

Having worked extensively on SELinux and containers, happy to connect offline on this @neolit123 to know more.

Regardless, would it make sense to either add a non-goal or, under testing, clarify that this feature works (or does not work) with SELinux or AppArmor, or that it has not been tested in SELinux or AppArmor environments?

Member:

maybe we can enable something by default for the beta.

my vote would be to add it as a non-goal for the time being.
also, advanced users can always customize the default manifest files; kubeadm has a patches mechanism for that.
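For illustration, that patches mechanism works by dropping files into a directory that kubeadm reads (via its patches flag/config); a hypothetical strategic-merge patch overriding the scheduler's securityContext might look like the sketch below (file name follows the `target+patchtype` convention; the contents are an assumed example, not part of this KEP):

```yaml
# kube-scheduler+strategic.yaml — hypothetical patch file in kubeadm's patches directory
spec:
  containers:
  - name: kube-scheduler
    securityContext:
      seccompProfile:
        type: RuntimeDefault
```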

Contributor (Author):

I'll add this as a non-goal for now, but let's reconsider this for beta.

Member:

This might also relate to allowing declarative patching to the pod spec @neolit123 , which might be a way to enable this but in a way that's out of scope of this proposal.

Member:

we spoke about enabling inline patches as part of v1beta3 but @fabriziopandini had some reservations.
overall i'm +1 to allow patches to be both inline in the config or read from files.

Contributor (Author):

As long as there is a way for customers to configure it to meet their needs I am +1, it doesn't necessarily have to be through the API.

Contributor (Author):

Ok, I have updated option 3 to remove the mention of the API in favor of letting users patch these in.

- Run `etcd` as non-root in `kubeadm`.

### Non-Goals

Member:

Perhaps compatibility with user namespace enabled environments is a non-goal here? If yes, would be good to call this out.

Contributor (Author):

Agreed, I'll call that out.

| /etc/kubernetes/pki/apiserver-kubelet-client.key | kas | 640 kas kas |
| /etc/kubernetes/pki/front-proxy-client.crt | kas | 640 kas kas |
| /etc/kubernetes/pki/front-proxy-client.key | kas | 640 kas kas |
| /etc/kubernetes/pki/front-proxy-ca.crt | kas, kcm | 640 root front_proxy_ca_crt_readers |
Member:

Pardon me if I missed this, but could you clarify what the front_proxy_ca_crt_readers, ca_crt_readers, and sa_key_readers groups are? Are these groups shared between components?

Contributor (Author):

These are groups that the components will specify as supplemental groups so that they can read the file/cert/credential.
For example, /etc/kubernetes/pki/front-proxy-ca.crt is read by both kube-apiserver and kube-controller-manager, so we will create a group called front_proxy_ca_crt_readers, and both kube-apiserver and kube-controller-manager will specify this group's ID as a supplemental group. That way they can both read the file.
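A sketch of how that could be wired up in the pod-level securityContext of one of the readers (all IDs here are placeholders for illustration):

```yaml
# Pod-level securityContext for kube-controller-manager (sketch; all IDs are placeholders)
securityContext:
  runAsUser: 2002        # placeholder kcm UID
  runAsGroup: 2002       # placeholder kcm GID
  supplementalGroups:
  - 2100                 # placeholder GID of the shared front_proxy_ca_crt_readers group
```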

- Run `kube-controller-manager` as non-root in `kubeadm`.
- Run `kube-scheduler` as non-root in `kubeadm`.
- Run `etcd` as non-root in `kubeadm`.

Member:

Is kubelet out of scope for this KEP?

Member:

for Linux (with the exception of OpenRC on Alpine) kubeadm manages the kubelet as a systemd service that requires root to stop/start.
all files written under /var/lib/kubelet are owned by root too.

i don't think there is much we can do for the kubelet case unless we stop using systemd.

Contributor (Author):

Agreed, I'll explicitly call that out.

Member:

Makes sense to me. Thanks for the input @neolit123

@tallclair (Member)

Are there log files that also need their permissions updated?

@neolit123 (Member) commented Apr 8, 2021

no, i don't think kubeadm writes log files.

i just remembered that we do write a temporary static-pod manifest / etcd snapshot backup under /etc/kubernetes in case a control-plane node upgrade fails (kubeadm upgrade), but these should be copies of existing files.
i think we just have to ensure that the migration from root to non-root during a mutable kubeadm upgrade runs "chown" after a CP node is considered healthy.
EDIT: NVM, the new non-root manifest files will just override the existing ones that require root.

@vinayakankugoyal vinayakankugoyal force-pushed the kep branch 2 times, most recently from c085a2a to 5940618 Compare April 8, 2021 19:36
readOnly: true
... # omitted to save space.
```
<br/>
Member:

nit: i noticed these BR tags in a few places. can we omit them?

Contributor (Author):

Done.

@vinayakankugoyal (Contributor, Author)

@neolit123 - I was thinking that we merge this PR as provisional and then iterate on it? That way others will also get an opportunity to contribute. It is also the recommended way of operating, as highlighted in the KEP template comments: "Merge early and iterate." WDYT?

- /usr/share/ca-certificates
- /etc/ssl/certs
- /etc/ca-certificates

Member:

Suggested change
### Multi-OS support
A Windows control plane is out of scope for this proposal for the time being.
OS specific implementations for Linux, would be carefully abstracted behind helper
utilities in kubeadm to not block the support for a Windows control plane in the future.

^ we should add this.

we had a discussion at SIG Windows about the future of a Windows control plane.
a couple of companies contributing to k8s have customers that want this today, but the demand has been relatively low.

i can see kubeadm supporting a Windows control plane in the future; therefore, i would like us to include a note in the KEP about OS differentiation.

one example that i mentioned on the POC PR is that `Chown` from Go's standard library just errors out on Windows.

Contributor (Author):

Good call! Done.

@neolit123 (Member) commented Apr 9, 2021

> @neolit123 - I was thinking that we merge this PR as provisional and then iterate on it? That way others will also get an opportunity to contribute to it. It is also the recommended way of operation as highlighted in the comments as Merge early and iterate. WDYT?

SGTM, please add the above "windows" change before that.

we should note that we can start with PRs and their reviews only after the KEP is marked as implementable though.

@vinayakankugoyal (Contributor, Author)

> we should note that we can start with PRs and their reviews only after the KEP is marked as implementable though.

Yeah I understand that. I just want to merge what we have and then iterate on the KEP some more, by focusing on the crucial aspects.

@vinayakankugoyal (Contributor, Author)

@neolit123 - If the changes look good to you, please approve so that I can merge this PR with the KEP in provisional state. Thanks!

@neolit123 (Member) left a comment:

/lgtm
/approve

thanks @vinayakankugoyal
merging this as provisional.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 12, 2021
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: neolit123, vinayakankugoyal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 12, 2021
@neolit123 (Member)

we also have to assign someone to do the PRR (production readiness review) at some point.
the state of things around PRR is a bit fuzzy to me ATM.

@k8s-ci-robot k8s-ci-robot merged commit d4cb03d into kubernetes:master Apr 12, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone Apr 12, 2021
@vinayakankugoyal (Contributor, Author)

> we also have to assign someone to do the PRR (production readiness review) at some point.
> the state of things around PRR are a bit fuzzy to me, ATM.

I know someone who recently went through the PRR process for another KEP. I'll talk to them and get more details on the process.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/security Categorizes an issue or PR as relevant to SIG Security. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
9 participants