Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsexec: cloned_binary: remove bindfd logic entirely #3931

Merged
merged 2 commits into from
Aug 5, 2023
Merged

nsexec: cloned_binary: remove bindfd logic entirely #3931

merged 2 commits into from
Aug 5, 2023

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Jul 7, 2023

While the ro-bind-mount trick did eliminate the memory overhead of
copying the runc binary for each "runc init" invocation, on machines
with very significant container churn, creating a temporary mount
namespace on every container invocation can trigger severe lock
contention on namespace_sem that makes containers fail to spawn.

The only reason we added bindfd in commit 16612d7 ("nsenter:
cloned_binary: try to ro-bind /proc/self/exe before copying") was due to
a Kubernetes e2e test failure where they had a ridiculously small memory
limit. It seems incredibly unlikely that real workloads are running
without 10MB to spare for the very short time that runc is interacting
with the container.

In addition, since the original cloned_binary implementation, cgroupv2
is now almost universally used on modern systems. Unlike cgroupv1, the
cgroupv2 memcg implementation does not migrate memory usage when
processes change cgroups (even cgroupv1 only did this if you had
memory.move_charge_at_immigrate enabled). In addition, because we do the
/proc/self/exe clone before synchronising the bootstrap data read, we
are guaranteed to do the clone before "runc init" is moved into the
container cgroup -- meaning that the memory used by the /proc/self/exe
clone is charged against the root cgroup, and thus container workloads
should not be affected at all with memfd cloning.

The long-term fix for this problem is to block the /proc/self/exe
re-opening attack entirely in-kernel, which is something I'm working
on
. Though it should also be noted that because the memfd is
completely separate to the host binary, even attacks like Dirty COW
against the runc binary can be defended against with the memfd approach.
Of course, once we have in-kernel protection against the /proc/self/exe
re-opening attack, we won't have that protection anymore...

Signed-off-by: Aleksa Sarai cyphar@cyphar.com

@113xiaoji
Copy link

Is it possible to implement Exec with a separate container sharing the same rootfs and mount namespace with the original container. The advantage is that the Exec container could have it's own sub-cgroup, so that it will not consume the resource of application container and user could specify dedicated resource for it.

@cyphar
Copy link
Member Author

cyphar commented Jul 11, 2023

Is it possible to implement Exec with a separate container sharing the same rootfs and mount namespace with the original container.

Is it possible? Yes. Is it something we will do? No.

runc's resource allocation is based on the idea that the main container process and runc exec processes are all in the same cgroups and same namespaces. This is mandated in the OCI specification and all of the users of runc expect this to be the case.

@113xiaoji
Copy link

Is there a plan to merge into the main branch? Looking forward to it.

@cyphar
Copy link
Member Author

cyphar commented Jul 31, 2023

I've just done some tests and looked at the memcg source code. It seems that the memfd cloning has no effect if you have cgroupv2 memcg -- the reason is that with v2 memcg, they do not migrate memory usage to the cgroup when the process is migrated.

When we first added the memfd code, everyone was still using v1 memcg (which does do migrations if you have move_charge_at_immigrate enabled) which caused issues -- the memfd_create memory would be applied to the container. However, based on my testing I suspect there's actually no need for memfd-bind on modern cgroupv2 systems -- the memfd_create usage will be accounted in the root cgroup no matter what under cgroupv2 (because we do the memfd re-exec before we join the cgroup).

From the horse's mouth:

static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
	/* ... */
	/* charge immigration isn't supported on the default hierarchy */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return 0;
	/* ... */
}

And this behaviour has existed since the introduction of cgroupv2 memcg (due to the fact that move_charge_at_immigrate is 0 on the default hierarchy).

@113xiaoji I'll clean up this PR a little bit (given this new information) and mark this as read for review. Of course I plan for this to be merged, depending on review from the other maintainers.

@cyphar cyphar marked this pull request as ready for review August 1, 2023 03:44
@cyphar
Copy link
Member Author

cyphar commented Aug 1, 2023

@opencontainers/runc-maintainers This is ready for review. It turns out on cgroupv2 (and even cgroupv1 by default -- unless memory.move_charge_at_immigrate is enabled) there is no memory overhead inside containers when using a memfd-backed cloned binary. In principle there is some memory churn to using memfd_create(2) -- but the overhead of creating a mount namespace and doing bind-mounts just for /proc/self/exe is almost certainly worse. In addition, doing it this way protects runc against Dirty COW -style attacks.

@113xiaoji
Copy link

The fact that a memfd-backed cloned binary doesn't consume container memory is great news. I can't wait for it to be merged and released in a stable version.

@cyphar cyphar mentioned this pull request Aug 1, 2023
4 tasks
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cyphar cyphar added this to the 1.2.0 milestone Aug 2, 2023
@cyphar
Copy link
Member Author

cyphar commented Aug 4, 2023

(I should point out that this code is going to be migrated to Go in #3953.)

While the ro-bind-mount trick did eliminate the memory overhead of
copying the runc binary for each "runc init" invocation, on machines
with very significant container churn, creating a temporary mount
namespace on every container invocation can trigger severe lock
contention on namespace_sem that makes containers fail to spawn.

The only reason we added bindfd in commit 16612d7 ("nsenter:
cloned_binary: try to ro-bind /proc/self/exe before copying") was due to
a Kubernetes e2e test failure where they had a ridiculously small memory
limit. It seems incredibly unlikely that real workloads are running
without 10MB to spare for the very short time that runc is interacting
with the container.

In addition, since the original cloned_binary implementation, cgroupv2
is now almost universally used on modern systems. Unlike cgroupv1, the
cgroupv2 memcg implementation does not migrate memory usage when
processes change cgroups (even cgroupv1 only did this if you had
memory.move_charge_at_immigrate enabled). In addition, because we do the
/proc/self/exe clone before synchronising the bootstrap data read, we
are guaranteed to do the clone before "runc init" is moved into the
container cgroup -- meaning that the memory used by the /proc/self/exe
clone is charged against the root cgroup, and thus container workloads
should not be affected at all with memfd cloning.

The long-term fix for this problem is to block the /proc/self/exe
re-opening attack entirely in-kernel, which is something I'm working
on[1]. Though it should also be noted that because the memfd is
completely separate to the host binary, even attacks like Dirty COW
against the runc binary can be defended against with the memfd approach.
Of course, once we have in-kernel protection against the /proc/self/exe
re-opening attack, we won't have that protection anymore...

[1]: https://lwn.net/Articles/934460/

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
@cyphar
Copy link
Member Author

cyphar commented Aug 4, 2023

@AkihiroSuda ptal

@cyphar cyphar requested a review from AkihiroSuda August 4, 2023 11:45
/*
* Try to seal with newer seals, but we ignore errors because older
* kernels don't support some of them. For container security only
* RUNC_MEMFD_MIN_SEALS are strictly required, but the rest are
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seals seem added in Linux 3.17, but we still need to support kernel 3.10 at least until the EOL of CentOS 7.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Slightly off-topic, but we have to have an official documentation to clarify the minimum supported kernel version)

Copy link
Member Author

@cyphar cyphar Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memfd_create() was added in 3.17, the seals were part of the initial implementation. We handle older kernel versions with O_TMPFILE (3.11 and later) or mkostemp(3) (glibc) for ancient kernels.

This PR doesn't change any of this behaviour, it just removes bindfd and passes MFD_EXEC as well as applying the two newer seals but ignores errors when adding them because they're not necessary. IOW the behaviour on older kernels is unchanged.

With the new vm.memfd_noexec sysctl, we need to make sure we explicitly
request MFD_EXEC, otherwise an admin could inadvertently break
containers in a somewhat-annoying-to-debug fashion.

It should be noted that vm.memfd_noexec=2 is broken on Linux 6.4
(MFD_EXEC works even in the most restrictive mode) and the most severe
breakage is going to be fixed in Linux 6.6[1].

[1]: https://lore.kernel.org/20230705063315.3680666-2-jeffxu@google.com/

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
if (fd >= 0)
return fd;
if (errno != ENOSYS && errno != EINVAL)
if (!is_memfd_unsupported_error(errno))
goto error;

#ifdef O_TMPFILE
Copy link
Member

@lifubang lifubang Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😭 Maybe there is a bug for many years? In some old kernels, we will use O_TMPFILE or mkostemp, but at this time, the runc state dir has not been created yet. So we will got an error in these old kernels:
FATA[0000] nsexec[18089]: could not ensure we are a cloned binary: Permission denied

So, we should deal with it.

  1. use /tmp directly;
    or
  2. continue to use getenv("_LIBCONTAINER_STATEDIR"), but we should create first, and delete it when got an error, I think it's very complex.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I think we should split this function to 3 functions: make_memfd, make_o_tmpfile_fd, make_ostemp_fd, it's convenient to write unit test cases.

Copy link
Member Author

@cyphar cyphar Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's pretty embarrassing 😅.

The logic for using the state directory is that we are guaranteed it is writable -- if we try to use some other directory it might be mounted as ro or the user might not have write permission. I guess it's okay to use /tmp though.

It's very odd that this wasn't caught before -- I tested the O_TMPFILE logic pretty extensively when I first implemented this code (on some SLES versions we don't have memfds). Some aspect of the statedir code must've changed after we added bindfd (which masked the breakage). Permission denied seems like a strange error for a missing directory -- I would expect ENOENT instead...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I think we should split this function to 3 functions: make_memfd, make_o_tmpfile_fd, make_ostemp_fd, it's convenient to write unit test cases.

All of this code is going to be deleted and moved to Go in #3953. I'd prefer to keep this patch as simple as possible, and we can talk about how to break up the implementation in that PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep this patch as simple as possible, and we can talk about how to break up the implementation in that PR.

I quite agree with you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bug is not relate to this PR, I think this PR could be merged, and open a new PR to fix this error or waiting you fix it in Go code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix the O_TMPFILE bug here. Let me take a look...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lifubang I can't reproduce the issue you mentioned -- if I delete the memfd code, ensure_cloned_binary() still works. If I delete the O_TMPFILE code too, it also still works.

Since you're getting -EACCES it seems likely this is an LSM-related issue. Can you open a separate bug report about that, since it seems unlikely this is a common problem. I think we can merge this as-is. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's merge this and open an new issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened here: #3965

@cyphar cyphar mentioned this pull request Aug 4, 2023
@lifubang lifubang merged commit acab6f6 into opencontainers:main Aug 5, 2023
36 checks passed
@cyphar cyphar deleted the remove-bindfd branch August 5, 2023 15:48
@cyphar cyphar mentioned this pull request Aug 16, 2023
@cyphar cyphar mentioned this pull request Mar 14, 2024
@lifubang lifubang added the backport/1.1-done A PR in main branch which has been backported to release-1.1 label Sep 3, 2024
@lifubang
Copy link
Member

The only reason we added bindfd in commit 16612d7 ("nsenter:
cloned_binary: try to ro-bind /proc/self/exe before copying") was due to
a Kubernetes e2e test failure where they had a ridiculously small memory
limit. It seems incredibly unlikely that real workloads are running
without 10MB to spare for the very short time that runc is interacting
with the container.

As we discussed in #4439, maybe there is another reason we need bindfd, it's the speed of container start time.
The memfd solution is 30%-40% slower than bindfd logic if we are really caring about the memory cgroup accounting.

lifubang added a commit to lifubang/runc that referenced this pull request Oct 15, 2024
we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lifubang <lifubang@acmcoder.com>
@lifubang lifubang mentioned this pull request Oct 15, 2024
@cyphar
Copy link
Member Author

cyphar commented Oct 15, 2024

@lifubang

The reason I wrote it was the Kubernetes e2e issue. The problem is that bindfd is actually not safe against CAP_SYS_ADMIN containers (even with user namespaces, because we don't do the extra unshares necessary to produce locked mount flags) and so I've wanted to remove it for a long time but couldn't because I thought the memory usage issue was insurmountable. But it turns out it wasn't (though to be fair -- I didn't realise we did have an impact due to our synchronisation until #4439 -- oops...).

There are some other possible options I have in mind like creating an overlayfs (which cannot be "unsealed" by the container like MS_RDONLY can) and then executing that (we can just use the new mount API so we don't need to worry about mount table issues with systemd). Unfortunately you can't use overlayfs with a single file so this will be a little ugly in practice. EDIT: The nicest way is to create a two-layer lowerdir-only overlayfs.

@cyphar
Copy link
Member Author

cyphar commented Oct 15, 2024

We might also be able to skip copying if using user namespaces and /usr/bin/runc is a read-only bind-mount. However, there is no way of verifying whether a mount flag is locked so I'm worried this could be brittle and we might end up with insecure containers as a result.

lifubang added a commit to lifubang/runc that referenced this pull request Oct 15, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.
lifubang added a commit to lifubang/runc that referenced this pull request Oct 15, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lifubang lifubang@acmcoder.com
lifubang added a commit to lifubang/runc that referenced this pull request Oct 15, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lifubang lifubang@acmcoder.com
Signed-off-by: lfbzhm <lifubang@acmcoder.com>
lifubang added a commit to lifubang/runc that referenced this pull request Oct 15, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lfbzhm <lifubang@acmcoder.com>
lifubang added a commit to lifubang/runc that referenced this pull request Oct 18, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lfbzhm <lifubang@acmcoder.com>
lifubang added a commit to lifubang/runc that referenced this pull request Oct 18, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lfbzhm <lifubang@acmcoder.com>
kolyshkin pushed a commit to lifubang/runc that referenced this pull request Oct 21, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lfbzhm <lifubang@acmcoder.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.1-done A PR in main branch which has been backported to release-1.1 status/4-merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants