Finding the minimal set of privileges for a docker container to spawn rootless containers #1456

ggoodman · 2017-05-19T19:33:30Z

I've been flailing away at the idea of running a pool of rootless containers as children of a docker container. My intent is to have the docker container run a web server that spins up a pool of rootless child containers to which requests can be proxied. These children would be isolated from each other, and the host system would be protected from the side effects of running untrusted code.

I need to pass additional file descriptors to these children, which precludes running them as siblings using the host docker daemon. So here I am, and I hope I'm not overstepping my bounds by asking for guidance via an issue.

Set up

Create a root filesystem tgz:

$ docker export $(docker create alpine) > rootfs.tgz

Dockerfile with runc, libseccomp2 and the rootfs:

FROM buildpack-deps

RUN apt-get update && apt-get install -y --no-install-recommends \
		libseccomp2 \
	&& rm -rf /var/lib/apt/lists/*

ADD rootfs.tgz /child/rootfs
ADD runc /usr/local/sbin/runc

WORKDIR /child/rootfs

RUN runc spec --rootless

CMD ["runc", "run", "child"]

False starts:

Build and run the container, adding CAP_SYS_ADMIN:

$ docker run --rm -it --cap-add SYS_ADMIN $(docker build -q .)
container_linux.go:265: starting container process caused "process_linux.go:261: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/cpuset/child: read-only file system\""
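The "read-only file system" error suggests the cgroup hierarchy is mounted read-only inside the container. A minimal sketch for confirming that from inside the container, assuming a cgroup v1 layout where each controller has its own mount under /sys/fs/cgroup; the helper name `mount_mode` is hypothetical:

```shell
# Print "ro" or "rw" for the mount point $1, reading mount data from $2
# (defaults to /proc/mounts). Prints nothing if the mount point is absent.
mount_mode() {
    awk -v target="$1" '
        $2 == target {
            n = split($4, opts, ",")
            mode = "rw"
            for (i = 1; i <= n; i++) if (opts[i] == "ro") mode = "ro"
            result = mode   # a later entry for the same path shadows earlier ones
        }
        END { if (result != "") print result }
    ' "${2:-/proc/mounts}"
}

# Example (run inside the container):
#   mount_mode /sys/fs/cgroup/cpuset
```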

Same, but mount /sys/fs/cgroup as rw:

$ docker run --rm -it --cap-add SYS_ADMIN -v /sys/fs/cgroup:/sys/fs/cgroup:rw $(docker build -q .)
container_linux.go:265: starting container process caused "process_linux.go:339: container init caused \"could not create session key: operation not permitted\""

Same, but invoke runc with --no-new-keyring:

$ docker run --rm -it --cap-add SYS_ADMIN -v /sys/fs/cgroup:/sys/fs/cgroup:rw $(docker build -q .) runc run --no-new-keyring child
container_linux.go:265: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:104: jailing process inside rootfs caused \\\"pivot_root operation not permitted\\\"\""

Finally 'working':

Same, but also add --no-pivot:

$ docker run --rm -it --cap-add SYS_ADMIN -v /sys/fs/cgroup:/sys/fs/cgroup:rw $(docker build -q .) runc run --no-new-keyring --no-pivot child
/ #

Disclaimer: I'm still wrapping my head around all of the complexity and nuances of all the technologies we call 'containers' so please correct me if I'm wrong.

Removing pivot_root seems like a bad idea given my objectives so I created a copy of the default seccomp profile and added the pivot_root syscall to the big list of SCMP_ACT_ALLOW calls. This let me drop --no-pivot.
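For anyone wanting to reproduce the profile change, here is a rough sketch of one way to do it. It assumes jq is installed and that the profile uses the layout where each entry in "syscalls" has a "names" array; the function name `allow_syscall` and the file names are hypothetical.

```shell
# Print a copy of the seccomp profile $1 with an extra allow rule for syscall $2.
# Assumes the profile has a top-level "syscalls" array of
# {"names": [...], "action": ..., "args": [...]} entries.
allow_syscall() {
    profile=$1
    syscall=$2
    jq --arg name "$syscall" \
        '.syscalls += [{"names": [$name], "action": "SCMP_ACT_ALLOW", "args": []}]' \
        "$profile"
}

# Example (file names are placeholders):
#   allow_syscall default.json pivot_root > custom.json
#   docker run --rm -it --security-opt seccomp=custom.json ...
```

Appending a separate rule rather than editing the existing allow list keeps the diff against the upstream profile small, which makes it easier to re-apply when the default profile changes.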

What kind of exposure am I creating by whitelisting the pivot_root syscall?

Also, I'm past my abilities in trying to figure out how I might avoid --no-new-keyring.

What kind of exposure am I creating by using the --no-new-keyring flag?


cyphar · 2017-05-24T16:52:04Z

docker run --rm -it --cap-add SYS_ADMIN

You are already running privileged containers at this point. You'd need to do --cap-drop all and a few other flags to entirely drop privileges inside Docker. Basically you're running as mostly-root if you add that capability (and Docker has a bunch enabled by default).
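To see how much privilege a container actually holds, one option is to decode the effective capability mask that the kernel reports in /proc/self/status. This is only a sketch: the helper names are hypothetical, and the bit number 21 for CAP_SYS_ADMIN comes from linux/capability.h.

```shell
# Return success if capability bit $2 is set in the hexadecimal mask $1.
has_cap() {
    mask=$1
    bit=$2
    [ $(( (0x$mask >> $bit) & 1 )) -eq 1 ]
}

# Print the effective capability mask of the current process (Linux only).
current_cap_eff() {
    awk '/^CapEff:/ { print $2 }' /proc/self/status
}

CAP_SYS_ADMIN=21  # bit number from linux/capability.h

# Example (run inside the container):
#   has_cap "$(current_cap_eff)" "$CAP_SYS_ADMIN" && echo "CAP_SYS_ADMIN is present"
```

Running the check once with --cap-add SYS_ADMIN and once with --cap-drop all makes the difference between the two configurations visible.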

What kind of exposure am I creating by whitelisting the pivot_root syscall?

None really. pivot_root is a more secure chroot. Docker disables it because it involves messing with mount namespaces (which normal containers shouldn't be doing) but it's not a security issue in principle (maybe @jessfraz might remember why it was added).

What kind of exposure am I creating by using the --no-new-keyring flag?

Processes inside the container can access the host's kernel keyring directly (which contains various crypto stuff that some system components use) if a process is running with privileges. Though processes inside the inner container (if you're using user namespaces) might be blocked (I haven't read that kernel code in a while though).

In general I would discourage it, but to be honest if you don't specify --no-new-keyring I would wager that there's a kernel-level hole that just hasn't been solved yet from the kernel side.

cyphar · 2017-05-24T16:53:32Z

Oh sorry, this too

Same, but mount /sys/fs/cgroup as rw

Rootless containers most definitely cannot do this. The reason that cgroups actually work in the inner container is because your container has CAP_SYS_ADMIN, which is basically the majority of root functionality.

ggoodman · 2017-05-24T19:36:11Z

@cyphar thank you for taking the time. I'm super excited about the whole rootless container story and all the work you've been doing.

I've now been able to get a working prototype of the concept described above. The key differences are that I have been unable to set up networking without adding CAP_NET_ADMIN to the parent docker container and running the container's process as root. Despite running as root, I still seem to need to mount the host cgroup fs and run the child sandboxes with the --no-new-keyring option.

Would it be fair to say that the machinery is not yet in place to be able to run rootless containers with networking as children of unprivileged containers themselves?

To provide network access, I've picked up @jessfraz' netns to great, frictionless effect and run this as a prestart hook. This also meant mounting the host's /lib/modules (from the alpine-based xhyve vm used by docker-for-mac) and installing iptables and kmod (for modprobe) in the debian-based parent docker container.

Given the lack of granularity in CAP_SYS_ADMIN, are we forced to play with seccomp filters as a mechanism to limit access to the components of this flag's capabilities?

Please feel free to close at any time given that this isn't so much of an issue as it is a discussion. I hope that it helps someone else in their wacky experimentation down the road.

jessfraz · 2017-05-25T16:40:30Z

What kind of exposure am I creating by whitelisting the pivot_root syscall?

This was merely not included to prevent people from shooting themselves in the foot; it should be a privileged operation. Also, if people were using pivot_root they were probably using other things that were blocked as well, like mount or chroot, so it kind of just went hand in hand. There wasn't an exact reason.

frezbo · 2017-12-27T05:40:55Z

@ggoodman @cyphar Good to see the awesome discussion. I'm also trying to run runc inside a container as a normal user (rootless), and I'm facing many issues. I do not want to run the base container with all the privileges. I was able to get it working with udocker (https://github.com/indigo-dc/udocker), but I was hoping I could use runc and avoid the extra Python dependency. The problem with udocker is that it does not support OCI-compliant images, so I would like to get your opinions on what the best approach is and what hurdles you came across while running them. Thanks.

cyphar · 2017-12-27T06:02:14Z

@frezbo You can use skopeo to convert a Docker image into an OCI image (I've added support to skopeo to allow you to convert a local file that you would generate from docker save). If you then want to unpack the OCI image into an OCI runtime configuration that can be used by runc, you can use umoci, which is a tool I wrote. umoci has a --rootless flag which generates a rootless OCI runtime configuration and root filesystem without needing root (please read the relevant section in umoci's documentation for caveats).
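The workflow described above could be sketched roughly as follows. The function names, the scratch directory layout, and the DRY_RUN switch are hypothetical conveniences; the sketch assumes docker, skopeo, and umoci are on the PATH, and umoci's documentation should be checked for the rootless caveats mentioned above.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Convert Docker image $1 into a rootless runc bundle at $2,
# using $3 (default: current directory) as scratch space.
docker_image_to_bundle() {
    image=$1
    bundle=$2
    workdir=${3:-.}

    # 1. Export the image from the local Docker daemon as a tar archive.
    run docker save -o "$workdir/image.tar" "$image"
    # 2. Convert the archive into an OCI image layout.
    run skopeo copy "docker-archive:$workdir/image.tar" "oci:$workdir/oci-image:latest"
    # 3. Unpack the OCI image into a runtime bundle without needing root.
    run umoci unpack --rootless --image "$workdir/oci-image:latest" "$bundle"
}

# Example:
#   docker_image_to_bundle alpine:latest bundle
#   (cd bundle && runc run child)
```

Setting DRY_RUN=1 prints the three commands instead of running them, which is handy for checking the paths before touching the image store.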

frezbo · 2017-12-27T06:11:55Z

@cyphar I have used both of your awesome tools, and the relevant issue is here: indigo-dc/udocker#111. udocker can only understand the v1 spec and skopeo copies from OCI to v2, so I'm planning to use runc inside a container, with the least possible privilege escalation.

rutsky mentioned this issue Nov 20, 2017

Rootless containers don't work from unprivileged non-root Docker container (operation not permitted for mounting procfs) #1658

Closed

kenorb mentioned this issue Oct 8, 2020

Not able to run Docker with cgroups being read-only in rootless mode docker/for-linux#1124

Closed

2 tasks

chenk008 mentioned this issue Jul 13, 2021

[RFC] k8s-native worker pool ray-project/ray#14077

Open
