Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chown cgroup to process uid in container namespace #3057

Merged
merged 1 commit into from
Dec 7, 2021

Conversation

frasertweedale
Copy link
Contributor

@frasertweedale frasertweedale commented Jul 2, 2021

Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads. The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

@kolyshkin
Copy link
Contributor

In general this kinda makes sense, although I'm afraid systemd might overwrite that at some random moment.

This also needs test(s).

@frasertweedale
Copy link
Contributor Author

frasertweedale commented Jul 2, 2021

In general this kinda makes sense, although I'm afraid systemd might overwrite that at some random moment.

This also needs test(s).

You can use systemd transient unit API to set a different user. Unfortunately, the uid needs to
have a corresponding passwd entity or else systemd refuses to proceed.

I filed an issue against systemd to relax this requirement. It was provisionally rejected:
systemd/systemd#19781. Another approach is to ship an NSSwitch
module to synthesise passwd entites for UIDs in recognised subuid ranges. But with changes
to support subuid ranges themselves coming from NSSwitch, I don't think it's a good idea
(right now). See this blog post for more detail: https://frasertweedale.github.io/blog-redhat/posts/2021-06-09-systemd-cgroups-subuid.html#discussion-and-next-steps .

I would like if we can push back and convince systemd project (i.e. Lennart) to just relax the checks
in systemd. Then runc could achieve correct cgroup ownership via systemd transient unit API. But I
don't think that can be hoped for.

Copy link
Member

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For security reason, the default behavior should not be changed.
We can probably have a feature gate as an annotation

https://github.com/containers/crun/blob/main/crun.1.md#extensions-to-oci

https://github.com/opencontainers/runc/blob/master/docs/systemd.md#auxiliary-properties

@frasertweedale
Copy link
Contributor Author

For security reason, the default behavior should not be changed.
We can probably have a feature gate as an annotation

https://github.com/containers/crun/blob/main/crun.1.md#extensions-to-oci

https://github.com/opencontainers/runc/blob/master/docs/systemd.md#auxiliary-properties

User namespaces is already behind an annotation (io.kubernetes.cri-o.userns-mode). And user namespaces are somewhat hobbled without the ability to manage cgroups (for the workloads that I care about anyway). So, do we really need a separate annotation to gate this feature? I don't mind if the answer is yes, but something to consider...

@AkihiroSuda
Copy link
Member

I'm talking about OCI Runtime Spec annotations, not about Kubernetes annotations.

CRI-O may set the suggest OCI Runtime Spec annotation by default at their own risk.

@frasertweedale
Copy link
Contributor Author

@AkihiroSuda thanks for the clarification. Fair enough. I'll gate it. I'll try to make all the suggested changes next week.

@frasertweedale
Copy link
Contributor Author

@AkihiroSuda what would be an appropriate key/value for the annotation? "org.opencontainers.runc.chown-cgroup": "{true,false}"

@AkihiroSuda
Copy link
Member

"org.opencontainers.runc.chown-cgroup": "{true,false}"

SGTM, but in the long term we should have proper JSON entry in OCI Runtime Spec https://github.com/opencontainers/runtime-spec

@frasertweedale
Copy link
Contributor Author

"org.opencontainers.runc.chown-cgroup": "{true,false}"

SGTM, but in the long term we should have proper JSON entry in OCI Runtime Spec https://github.com/opencontainers/runtime-spec

Better would be to adjust the semantics such that the container process can create scopes/slices in its own cgroup, and manage its child processes/threads. But that is a discussion to be had over there, not here :)

@frasertweedale
Copy link
Contributor Author

frasertweedale commented Jul 15, 2021

Thanks @kolyshkin and @AkihiroSuda for your reviews. I implemented all of your requested changes, and added a test.

By the way, for testing with OpenShift, do not use 4.8.0, there is a regression in CRI-O 1.21 that prevents k8s annotations from being propagated to the OCI configuration. CRI-O 1.20 (OpenShift 4.7.x) is fine.

@frasertweedale frasertweedale force-pushed the feature/chown-cgroup branch 3 times, most recently from 43e5c14 to 0f4ffc4 Compare July 16, 2021 04:40
@frasertweedale
Copy link
Contributor Author

Ping @kolyshkin and @AkihiroSuda - are you able to re-review this PR soon?

@frasertweedale frasertweedale changed the title chown cgroup to uid 0 in container namespace chown cgroup to process uid in container namespace Nov 29, 2021
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good and ready; only left a single suggestion.

libcontainer/cgroups/systemd/v2.go Show resolved Hide resolved
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@frasertweedale
Copy link
Contributor Author

Thanks to all who have reviewed this PR and given feedback. We need more approvals and then someone can press the merge button. Ping @cyphar, @AkihiroSuda, @giuseppe, @jfroy.

@frasertweedale
Copy link
Contributor Author

Any takers? @mrunalp?

Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@cyphar cyphar merged commit cdce249 into opencontainers:master Dec 7, 2021
@frasertweedale
Copy link
Contributor Author

Many thanks, @cyphar and all reviewers.

@frasertweedale
Copy link
Contributor Author

I created #3311 to backport this feature to the release-1.0 branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants