Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runc v1.2.1 appears to break Rootless BuildKit (cgroup: open /sys/fs/cgroup/snschvixiy3s74w74fjantrdg: no such file or directory) #4518

Closed
AkihiroSuda opened this issue Nov 9, 2024 · 8 comments · Fixed by #4523

Comments

@AkihiroSuda
Copy link
Member

It looks like PR #5443 has broken rootless builds. The problem seems related to issue #4483.

Is a container being removed twice à la double free? These two cases illustrate the issue.

$ cat Dockerfile
FROM alpine
RUN mkdir /tmp/empty_directory

Case 1: working with 0.17.0-rootless

$ docker run \
  --name buildkitd-v17 \
  -d \
  --security-opt seccomp=unconfined \
  --security-opt apparmor=unconfined \
  moby/buildkit:v0.17.0-rootless --oci-worker-no-process-sandbox

$ buildctl --addr docker-container://buildkitd-v17 build --frontend dockerfile.v0 --local context=. --local dockerfile=.
[+] Building 7.9s (5/5) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                                                                   0.6s
 => => transferring dockerfile: 80B                                                                                                                                                                    0.2s
 => [internal] load metadata for docker.io/library/alpine:latest                                                                                                                                       2.2s
 => [internal] load .dockerignore                                                                                                                                                                      0.3s
 => => transferring context: 2B                                                                                                                                                                        0.1s
 => [1/2] FROM docker.io/library/alpine:latest@sha256:beefdbd8a1da6d2915566fde36db9db0b524eb737fc57cd1367effd16dc0d06d                                                                                 2.9s
 => => resolve docker.io/library/alpine:latest@sha256:beefdbd8a1da6d2915566fde36db9db0b524eb737fc57cd1367effd16dc0d06d                                                                                 0.1s
 => => sha256:43c4264eed91be63b206e17d93e75256a6097070ce643c5e8f0379998b44f170 3.62MB / 3.62MB                                                                                                         0.5s
 => => extracting sha256:43c4264eed91be63b206e17d93e75256a6097070ce643c5e8f0379998b44f170                                                                                                              2.0s
 => [2/2] RUN mkdir /tmp/empty_directory                                                                                                                                                               1.0s

Case 2: regression with master-rootless

$ docker run \
  --name buildkitd \
  -d \
  --security-opt seccomp=unconfined \
  --security-opt apparmor=unconfined \
  moby/buildkit:master-rootless --oci-worker-no-process-sandbox

$ buildctl --addr docker-container://buildkitd build --frontend dockerfile.v0 --local context=. --local dockerfile=.
[+] Building 3.4s (5/5) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                                                                   0.2s
 => => transferring dockerfile: 80B                                                                                                                                                                    0.1s
 => [internal] load metadata for docker.io/library/alpine:latest                                                                                                                                       1.0s
 => [internal] load .dockerignore                                                                                                                                                                      0.3s
 => => transferring context: 2B                                                                                                                                                                        0.2s
 => CACHED [1/2] FROM docker.io/library/alpine:latest@sha256:beefdbd8a1da6d2915566fde36db9db0b524eb737fc57cd1367effd16dc0d06d                                                                          0.2s
 => => resolve docker.io/library/alpine:latest@sha256:beefdbd8a1da6d2915566fde36db9db0b524eb737fc57cd1367effd16dc0d06d                                                                                 0.1s
 => ERROR [2/2] RUN mkdir /tmp/empty_directory                                                                                                                                                         0.9s
------
 > [2/2] RUN mkdir /tmp/empty_directory:
------
Dockerfile:2
--------------------
   1 |     FROM alpine
   2 | >>> RUN mkdir /tmp/empty_directory
   3 |
--------------------
error: failed to solve: process "/bin/sh -c mkdir /tmp/empty_directory" did not complete successfully: buildkit-runc did not terminate successfully: exit status 1: unable to destroy container: unable to remove container's cgroup: open /sys/fs/cgroup/snschvixiy3s74w74fjantrdg: no such file or directory

Originally posted by @samiam in moby/buildkit#5491

@kolyshkin
Copy link
Contributor

The current code flow of RemovePath is:

  1. it calls rmdir which returns EROFS ("read-only file system");
  2. it calls os.ReadDir which returns ENOENT ("no such file or directory);
  3. returns ENOENT (from os.ReadDir -- this is a bug).

I see that the container's /sys/fs/cgroup is a read-only bind mount from the host's /sys/fs/cgroup/system.slice/docker-$ID.scope. So, inside the container, runc can neither create nor destroy any cgroups.

So, the question is

  • what should runc delete do wrt cgroup if /sys/fs/cgroup is read-only?

Or, perhaps,

  • should cgroup managers ignore EROFS in case Rootless is set in cgroup config?

@AkihiroSuda ^^^ PTAL

@AkihiroSuda
Copy link
Member Author

should cgroup managers ignore EROFS in case Rootless is set in cgroup config?

The role of Rootless in cgroup config isn't well-defined and it is kinda misnomer (TODO: rename), so we shouldn't introduce a new code path that depends on this field.

Probably EROFS can be always safely ignored? (at least when running in UserNS)

@kolyshkin
Copy link
Contributor

Found out runc prints the following warning:

{
  "level": "warning",
  "msg": "Creating a rootless container with no cgroup and no private pid namespace. Such configuration is strongly discouraged (as it is impossible to properly kill all container's processes) and will result in an error in a future runc version.",
  "time": "2024-11-12T00:25:07Z"
}

This happens when this codepath is hit:

// Do this before syncing with child so that no children can escape the
// cgroup. We don't need to worry about not doing this and not being root
// because we'd be using the rootless cgroup manager in that case.
if err := p.manager.Apply(p.pid()); err != nil {
if errors.Is(err, cgroups.ErrRootless) {
// ErrRootless is to be ignored except when
// the container doesn't have private pidns.
if !p.config.Config.Namespaces.IsPrivate(configs.NEWPID) {
// TODO: make this an error in runc 1.3.
logrus.Warn("Creating a rootless container with no cgroup and no private pid namespace. " +
"Such configuration is strongly discouraged (as it is impossible to properly kill all container's processes) " +
"and will result in an error in a future runc version.")
}

In turn, ErrRootless is returned from fs2.manager.Apply here:

func (m *Manager) Apply(pid int) error {
if err := CreateCgroupPath(m.dirPath, m.config); err != nil {
// Related tests:
// - "runc create (no limits + no cgrouppath + no permission) succeeds"
// - "runc create (rootless + no limits + cgrouppath + no permission) fails with permission error"
// - "runc create (rootless + limits + no cgrouppath + no permission) fails with informative error"
if m.config.Rootless {
if m.config.Path == "" {
if blNeed, nErr := needAnyControllers(m.config.Resources); nErr == nil && !blNeed {
return cgroups.ErrRootless
}
return fmt.Errorf("rootless needs no limits + no cgrouppath when no permission is granted for cgroups: %w", err)

This means we haven't created the cgroup, and probably should not try to remove it at all; yet we try (and fail.

Also, rmdir(2) returns EROFS instead of ENOENT when the cgroup is mounted read-only.

$ rmdir /sys/fs/cgroup/non-existent
rmdir: '/sys/fs/cgroup/123489763456': Read-only file system

So, we have 3 issues here:

  1. We try to remove a cgroup path which we did not create.
  2. We do not handle EROFS properly.
  3. We return ENOENT from os.ReadDir while we should not.

So, I guess, the biggest issue here is runc is not working correctly with cgroup-less containers. What it should do, I guess, if p.manager.Apply fails with ErrRootless, we should basically disable cgroupManager.

@lifubang
Copy link
Member

lifubang commented Nov 12, 2024

In turn, ErrRootless is returned from fs2.manager.Apply here:

One question, we returned ErrRootless, it means that m.config.Path=="", so the cgroup path is not provided in config.json? So the cgroup path is created by runc? But it's a ro mount for cgroupfs, how we can create a cgroup path?

EDIT: the cgroup path /sys/fs/cgroup/snschvixiy3s74w74fjantrdg is created by whom?

@lifubang
Copy link
Member

In turn, ErrRootless is returned from fs2.manager.Apply here:

One question, we returned ErrRootless, it means that m.config.Path=="", so the cgroup path is not provided in config.json? So the cgroup path is created by runc? But it's a ro mount for cgroupfs, how we can create a cgroup path?

EDIT: the cgroup path /sys/fs/cgroup/snschvixiy3s74w74fjantrdg is created by whom?

Oh, I have test it, the cgroup path is not exist.
But unix.Rmdir will fail with EROFS if the target dir is not exist in a ro mount point.
So we should check the dir exist or not before we remove it.

root@acmubuntu:/opt/bb# mkdir from to
root@acmubuntu:/opt/bb# touch from/test
root@acmubuntu:/opt/bb# mount -o bind,ro from to
root@acmubuntu:/opt/bb# ls to
test
root@acmubuntu:/opt/bb# touch to/test1
touch: cannot touch 'to/test1': Read-only file system
root@acmubuntu:/opt/bb# mkdir to/test1
mkdir: cannot create directory ‘to/test1’: Read-only file system
root@acmubuntu:/opt/bb# ls to/test1
ls: cannot access 'to/test1': No such file or directory
root@acmubuntu:/opt/bb# rmdir to/test1
rmdir: failed to remove 'to/test1': Read-only file system
root@acmubuntu:/opt/bb#

@lifubang
Copy link
Member

What it should do, I guess, if p.manager.Apply fails with ErrRootless, we should basically disable cgroupManager.

I think it's hard to do, maybe not only ErrRootless, we should consider the cgroup path is provided by the user or created by runc.
Then we should serialize them to state.json.

@kolyshkin
Copy link
Contributor

In turn, ErrRootless is returned from fs2.manager.Apply here:

One question, we returned ErrRootless, it means that m.config.Path=="", so the cgroup path is not provided in config.json? So the cgroup path is created by runc? But it's a ro mount for cgroupfs, how we can create a cgroup path?

EDIT: the cgroup path /sys/fs/cgroup/snschvixiy3s74w74fjantrdg is created by whom?

If cgroup path is not set in any way, it is set to container id (in case of fs cgroup driver). First, cgroup.Name is set to container id here:

CgroupName: id,

and then here:

@kolyshkin
Copy link
Contributor

But unix.Rmdir will fail with EROFS if the target dir is not exist in a ro mount point.
So we should check the dir exist or not before we remove it.

Yes, I noted that earlier in #4518 (comment)

I think that the most common code flow is when rmdir succeeds, and we should optimize for that. Meaning, try rmdir first, and if we got an error* try checking if the directory exists. This is actually what os.ReadDir could also do.

In general, I think, it's better to not even try to remove what we haven't even created.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants