Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsenter: cloned_binary: "memfd" cleanups #1984

Merged
merged 5 commits into from
Mar 7, 2019
Merged

nsenter: cloned_binary: "memfd" cleanups #1984

merged 5 commits into from
Mar 7, 2019

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Feb 15, 2019

For a variety of reasons, sendfile(2) can end up doing a short-copy so
we need to just loop until we hit the binary size. Since /proc/self/exe
is tautologically our own binary, there's no chance someone is going to
modify it underneath us (or changing the size).

In order to get around the memfd_create(2) requirement, 0a8e411
("nsenter: clone /proc/self/exe to avoid exposing host binary to
container") added an O_TMPFILE fallback. However, this fallback was
flawed in two ways:

  • It required O_TMPFILE which is relatively new (having been added to
    Linux 3.11).

  • The fallback choice was made at compile-time, not runtime. This
    results in several complications when it comes to running binaries
    on different machines to the ones they were built on.

The easiest way to resolve these things is to have fallbacks work in a
more procedural way (though it does make the code unfortunately more
complicated) and to add a new fallback that uses mkotemp(3).

The usage of memfd_create(2) and other copying techniques is quite
wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR.
memfd_create(2) added ~10M of memory usage to the cgroup associated with
the container, which can result in some setups getting OOM'd (or just
hogging the hosts' memory when you have lots of created-but-not-started
containers sticking around).

The easiest way of solving this is by creating a read-only bind-mount of
the binary, opening that read-only bindmount, and then umounting it to
ensure that the host won't accidentally be re-mounted read-write. This
avoids all copying and cleans up naturally like the other techniques
used. Unfortunately, like the O_TMPFILE fallback, this requires being
able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting
over the most obvious path -- /proc/self/exe -- is a very bad idea).

Unfortunately detecting this isn't fool-proof -- on a system with a
read-only root filesystem (that might become read-write during "runc
init" execution), we cannot tell whether we have already done an ro
remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY
environment variable which is checked alongside the protection being
present.

Fixes #1979
Fixes #1980
Fixes #1988
Signed-off-by: Aleksa Sarai asarai@suse.de

@cyphar
Copy link
Member Author

cyphar commented Feb 15, 2019

This new setup would be an opportunity to fix #1980, by making the O_TMPFILE be the statedir of runc. But we'd have to allow you to disable memfd_create(2) usage at runtime which is just going to be ugly.

libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved
libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved
libcontainer/nsenter/cloned_binary.c Show resolved Hide resolved
@thaJeztah
Copy link
Member

/cc @kolyshkin

libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved
libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved
libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved
@kolyshkin
Copy link
Contributor

kolyshkin commented Feb 18, 2019

Thanks for writing this, you beat me to it :) Left some minor comments in the code. The only issue remaining is memfd_create() is not used in a scenario when compiled with older includes (i.e. no __NR_memfd_create) and later run under a kernel that supports it. The fix would be to add something like

/* Use our own wrapper for memfd_create. */
#if !defined(__NR_memfd_create)
#  ifdef __i386__
#    define __NR_memfd_create 356
#  elif __x86_64__
#    define __NR_memfd_create 319
#  elif __powerpc64__
#    define __NR_memfd_create 360
#  elif __aarch64__
#    define __NR_memfd_create 279
#  elif __arm__
#    define __NR_memfd_create 385
/* FIXME more arches here */
#  endif
#endif /* !__NR_memfd_create */

but this is borderline ugly, and in the light of #1988 and #1980 maybe it makes sense to just not use memfd_create() at all, only leaving other methods? I'm not proposinig it, just an idea to discuss...

@cyphar
Copy link
Member Author

cyphar commented Feb 19, 2019

@kolyshkin We don't do that for any other syscall and I really don't like the idea of hard-coding syscall numbers. Yeah, it's part of the Linux ABI and thus won't break, but I'm really not a fan of it.

@brauner
Copy link
Contributor

brauner commented Feb 19, 2019

@cyphar, did you think about dropping O_TMPFILE? We now have merged the old kernel pieces and now have

https://github.com/lxc/lxc/blob/2d8bd1db234462356c73b12f238a615b2d1cb550/src/lxc/rexec.c#L106

@cyphar
Copy link
Member Author

cyphar commented Feb 19, 2019

@brauner I don't have a strong opinion about whether I should keep O_TMPFILE. On the one hand having an extra fallback just means that the mkostemp path won't be tested as well, but on the other hand O_TMPFILE is in principle the "better" way of doing it.

@cyphar
Copy link
Member Author

cyphar commented Feb 20, 2019

Any strong opinions on this one (in particular, should we keep O_TMPFILE or drop it). Or even drop memfd_create entirely?

/cc @opencontainers/runc-maintainers

@thaJeztah
Copy link
Member

ping @justincormack as well; PTAL

@mrunalp
Copy link
Contributor

mrunalp commented Feb 20, 2019

@cyphar I am in favor of retaining O_TMPFILE.

@mrunalp
Copy link
Contributor

mrunalp commented Feb 20, 2019

I like the approach you have taken in the PR. I am on the fence on whether we should remove memfd_create right away as you provide a way to disable it.

@mrunalp
Copy link
Contributor

mrunalp commented Feb 20, 2019

I am approving this but will be happy to re-approve even if majority thinks that we should drop memfd_create. Thanks!

@mrunalp
Copy link
Contributor

mrunalp commented Feb 20, 2019

LGTM

Approved with PullApprove

@kolyshkin
Copy link
Contributor

kolyshkin commented Feb 21, 2019

TL;DR: it looks like memfd_create() is not properly ported to RHEL7 kernels (including the latest and greatest RHEL 7.6). Tested on CentOS, not real RHEL.

All the gory details

I tested `memfd_create` and friends, using memfd_test.c from the kernel sources.

On CentOS 7.2 kernel (3.10.0-327.el7.x86_64) and it fails right there, giving EINVAL on MFD_ALLOW_SEALING flag.

I tested it on latest CentOS 7.6 kernel (3.10.0-957.5.1.el7.x86_64) and it fails mid-way on SHARE_OPEN test in

https://github.com/torvalds/linux/blob/f6163d67cc31b8f2a946c4df82be3c6dd918412d/tools/testing/selftests/memfd/memfd_test.c#L881

with:

memfd: SHARE-OPEN 
open(/proc/self/fd/3) failed: Text file busy
Aborted

Here's a relevant portion of strace:

write(1, "memfd: SHARE-OPEN \n", 19memfd: SHARE-OPEN 
)    = 19
memfd_create("kern_memfd_share_open", MFD_CLOEXEC|MFD_ALLOW_SEALING) = 3
ftruncate(3, 8192)                      = 0
fcntl(3, F_GET_SEALS)                   = 0
openat(AT_FDCWD, "/proc/self/fd/3", O_RDWR) = 4
fcntl(3, F_GET_SEALS)                   = 0
fcntl(3, F_ADD_SEALS, F_SEAL_WRITE)     = 0
fcntl(3, F_GET_SEALS)                   = 0x8 (seals F_SEAL_WRITE)
fcntl(4, F_GET_SEALS)                   = 0x8 (seals F_SEAL_WRITE)
fcntl(4, F_GET_SEALS)                   = 0x8 (seals F_SEAL_WRITE)
fcntl(4, F_ADD_SEALS, F_SEAL_SHRINK)    = 0
fcntl(3, F_GET_SEALS)                   = 0xa (seals F_SEAL_SHRINK|F_SEAL_WRITE)
fcntl(4, F_GET_SEALS)                   = 0xa (seals F_SEAL_SHRINK|F_SEAL_WRITE)
close(3)                                = 0
openat(AT_FDCWD, "/proc/self/fd/4", O_RDONLY) = 3
fcntl(3, F_GET_SEALS)                   = 0xa (seals F_SEAL_SHRINK|F_SEAL_WRITE)
fcntl(3, F_ADD_SEALS, F_SEAL_SEAL)      = -1 EPERM (Operation not permitted)
fcntl(3, F_GET_SEALS)                   = 0xa (seals F_SEAL_SHRINK|F_SEAL_WRITE)
fcntl(4, F_GET_SEALS)                   = 0xa (seals F_SEAL_SHRINK|F_SEAL_WRITE)
close(4)                                = 0
openat(AT_FDCWD, "/proc/self/fd/3", O_RDWR) = -1 ETXTBSY (Text file busy)
write(1, "open(/proc/self/fd/3) failed: Te"..., 45open(/proc/self/fd/3) failed: Text file busy
) = 45

Note that the above test case was part of initial implementation of memfd_create(). I have retried test with selinux disabled -- same failure.

I didn't look whether this is the way that this runc patch uses memfd_create (probably not).

All this gives me an impression that memfd_create() and friends are in bad shape even in latest RHEL7 kernels.

@kolyshkin
Copy link
Contributor

Now, for my great surprise, I ran the same test binary on Debian Jessie (which is 3.16 but have memfd stuff backported), and it passes with flying colors:

root@kir-deb-jessie:~# cat /etc/debian_version 
8.10
root@kir-deb-jessie:~# uname -a
Linux kir-deb-jessie 3.16.0-5-amd64 #1 SMP Debian 3.16.51-3+deb8u1 (2018-01-08) x86_64 GNU/Linux
root@kir-deb-jessie:~# ./memfd_test 
memfd: CREATE
memfd: BASIC
memfd: SEAL-WRITE
memfd: SEAL-SHRINK
memfd: SEAL-GROW
memfd: SEAL-RESIZE
memfd: SHARE-DUP 
memfd: SHARE-MMAP 
memfd: SHARE-OPEN 
memfd: SHARE-FORK 
memfd: SHARE-DUP (shared file-table)
memfd: SHARE-MMAP (shared file-table)
memfd: SHARE-OPEN (shared file-table)
memfd: SHARE-FORK (shared file-table)
memfd: DONE

@kolyshkin
Copy link
Contributor

Filed a bug to Red Hat: https://bugzilla.redhat.com/show_bug.cgi?id=1679829

@lifubang
Copy link
Member

lifubang commented Feb 22, 2019

When use _LIBCONTAINER_DISABLE_MEMFD_CLONE=1 runc create test1
It will get an error:

nsenter: could not ensure we are a cloned binary: Permission denied
container_linux.go:344: starting container process caused "process_linux.go:293: copying bootstrap data to pipe caused \"write init-p: broken pipe\""

The error is raised in here:

fexecve(execfd, argv, environ);
return -ENOEXEC;

I use root permission to run runc. May be need some special permission?
Or is still on working for _LIBCONTAINER_DISABLE_MEMFD_CLONE=1?

@cyphar
Copy link
Member Author

cyphar commented Feb 23, 2019

nsenter: could not ensure we are a cloned binary: Permission denied

I'm really not sure, can you please provide some more information about your setup -- an strace log would also be super useful. I am aware of an AppArmor bug which results in writing to deleted files being blocked (either that or you have noexec on the mount point for the runc state directory in /run).

I'm currently working on moving the cgroup setup code so that it's not necessary to disable memfd_create (and so I will remove the ability to disable it).

libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved
libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved
For a variety of reasons, sendfile(2) can end up doing a short-copy so
we need to just loop until we hit the binary size. Since /proc/self/exe
is tautologically our own binary, there's no chance someone is going to
modify it underneath us (or changing the size).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
@cyphar
Copy link
Member Author

cyphar commented Mar 8, 2019

Ah dammit, this wasn't meant to be merged -- the bind-mount approach actually isn't fool-proof (if you can do a mount, you can get around it). However, the default profiles actually disallow mounting so this isn't the end of the world. I'll submit a follow-up PR that uses overlayfs rather than bind-mounts.

@Ace-Tang
Copy link
Contributor

Ace-Tang commented Mar 8, 2019

Bad news... we plan to appy this patch to our production environment, looks like still need to wait

@AkihiroSuda
Copy link
Member

I'll submit a follow-up PR that uses overlayfs rather than bind-mounts.

It won't work in userns?

@cyphar
Copy link
Member Author

cyphar commented Mar 8, 2019

It won't work in userns?

No, but that's why we have fallbacks. I've submitted #2006.

However I've come to the conclusion that running a privileged container with CAP_SYS_ADMIN is simply an insecure setup, so maybe it's better if we just leave it in this state -- where arguably it's not secure with CAP_SYS_ADMIN (but there's almost certainly other issues with running CAP_SYS_ADMIN). CAP_SYS_ADMIN allows you to disable seccomp protections, do privileged device ioctls, among other things.

{
switch (fdtype) {
case EFD_MEMFD:
return fcntl(*fd, F_ADD_SEALS, RUNC_MEMFD_SEALS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about modify fallback like try_bindfd? For example:
try_memfd()
try_otmpfile()
try_oldschooltemp()

Is there a fail situation like this: memfd_create success, but F_ADD_SEALS fail?

Copy link
Member Author

@cyphar cyphar Mar 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to submit a patch to do that, sure.

However, I spent several weeks working on it or answering questions about it, and given painfully trivial the fix for the CVE is ("make a copy and execute that"), working on it for any more time feels like a waste of time to me.

If F_ADD_SEALS fails (with memfd_create and MFD_ALLOW_SEALING working) then the kernel is broken -- and while that can happen with some distributions I'd prefer to actually see it be an existing problem before trying to fix it.

tianon added a commit to infosiftr/snapd that referenced this pull request Jul 10, 2019
Newer runC applied further improvements to their CVE-2019-5736 mitigation in opencontainers/runc#1984 which change the nature of our apparmor denial from `/` to `/bin/runc` (which I have also commented on https://bugs.launchpad.net/apparmor/+bug/1820344 about).

See also canonical#6610.

Signed-off-by: Tianon Gravi <tianon@debian.org>
tianon added a commit to infosiftr/snapd that referenced this pull request Jul 11, 2019
Newer runC applied further improvements to their CVE-2019-5736 mitigation in opencontainers/runc#1984 which change the nature of our apparmor denial from `/` to `/bin/runc` (which I have also commented on https://bugs.launchpad.net/apparmor/+bug/1820344 about).

See also canonical#6610.

Signed-off-by: Tianon Gravi <tianon@debian.org>
tianon added a commit to infosiftr/snapd that referenced this pull request Jul 12, 2019
Newer runC applied further improvements to their CVE-2019-5736 mitigation in opencontainers/runc#1984 which change the nature of our apparmor denial from `/` to `/bin/runc` (which I have also commented on https://bugs.launchpad.net/apparmor/+bug/1820344 about).

See also canonical#6610.

Signed-off-by: Tianon Gravi <tianon@debian.org>
anonymouse64 added a commit to anonymouse64/snapd that referenced this pull request Jul 12, 2019
Newer runC applied further improvements to their CVE-2019-5736 mitigation in opencontainers/runc#1984 which change the nature of our apparmor denial from `/` to `/bin/runc` (which I have also commented on https://bugs.launchpad.net/apparmor/+bug/1820344 about).

See also canonical#6610.

(originally from Tianon Gravi, but re-committed due to CLA issues with the PR checks)

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
mvo5 pushed a commit to canonical/snapd that referenced this pull request Jul 15, 2019
Newer runC applied further improvements to their CVE-2019-5736 mitigation in opencontainers/runc#1984 which change the nature of our apparmor denial from `/` to `/bin/runc` (which I have also commented on https://bugs.launchpad.net/apparmor/+bug/1820344 about).

See also #6610.

(originally from Tianon Gravi, but re-committed due to CLA issues with the PR checks)

Signed-off-by: Ian Johnson <ian.johnson@canonical.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet