Skip to content
This repository has been archived by the owner on Jun 11, 2020. It is now read-only.

[17.06 backport] remove hot-fix, and apply latest upstream patch for CVE-2019-5736 #9

Merged
merged 9 commits into from
Mar 28, 2019

Conversation

thaJeztah
Copy link
Member

@thaJeztah thaJeztah commented Feb 13, 2019

The previous fix did not yet have the alternative approach for 3.13 kernels

This removes the old patch, and applies the latest upstream path that should also support 3.13 kernels

git checkout -b 17.06_backport_runc_cve docker/17.06
git revert -m1 -s -S fc48a25bde6fb041aae0977111ad8141ff396438
git cherry-pick -s -S -x 0a8e4117e7f715d5fbeef398405813ce8e88558b

thaJeztah and others added 2 commits February 14, 2019 00:14
This reverts commit fc48a25, reversing
changes made to 519d2ac.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
There are quite a few circumstances where /proc/self/exe pointing to a
pretty important container binary is a _bad_ thing, so to avoid this we
have to make a copy (preferably doing self-clean-up and not being
writeable).

We require memfd_create(2) -- though there is an O_TMPFILE fallback --
but we can always extend this to use a scratch MNT_DETACH overlayfs or
tmpfs. The main downside to this approach is no page-cache sharing for
the runc binary (which overlayfs would give us) but this is far less
complicated.

This is only done during nsenter so that it happens transparently to the
Go code, and any libcontainer users benefit from it. This also makes
ExtraFiles and --preserve-fds handling trivial (because we don't need to
worry about it).

Fixes: CVE-2019-5736
Co-developed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
(cherry picked from commit 0a8e411)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@thaJeztah
Copy link
Member Author

ping @seemethere @kolyshkin @andrewhsu PTAL

@andrewhsu andrewhsu requested a review from kolyshkin February 14, 2019 03:56
@kolyshkin
Copy link

There is one thing that might be problematic here. The version of cloned_binary.c in this PR detects memfd_create() availability during compile time, based on what's available in <sys/syscall.h> header. This can lead to two problematic scenarios:

  1. memfd_create syscall number is available from the header during compilation, but runtime kernel is old so everything will fall apart.
  2. memfd_create is not available during compile time, so the code compiled will not use it, even if the runtime kernel supports it.

Scenario 1 is a total disaster. Scenario 2 is potentially a problem (the alternative fix, which is not using memfd_create(), might be less robust or more error-prone, or something along the lines).

Given the variety of distros and kernels we support, maybe it makes sense to do runtime detection (call memfd_create(), if it returns ENOSYS, fall back to alternative method.

@thaJeztah
Copy link
Member Author

Scenario 2 is potentially a problem (the alternative fix, which is not using memfd_create(), might be less robust or more error-prone, or something along the lines).

Yes; wondering if we can enable the fallback only for Ubuntu 14.04 and Debian Jessie (the only distros we support that have a kernel that's older than 3.17)

Given the variety of distros and kernels we support, maybe it makes sense to do runtime detection (call memfd_create(), if it returns ENOSYS, fall back to alternative method.

That would be the alternative (although likely only needed for the few old distros that we still support). If you have a suggestion for such a patch, should we upstream that patch?

@seemethere
Copy link

This is what I get whenever I try to run it on ubuntu trusty with 3.13.x:

vagrant@vagrant-ubuntu-trusty-64:~/go/src/github.com/opencontainers/runc$ sudo docker run --rm -it busybox echo hello
docker: Error response from daemon: OCI runtime create failed: target os  mismatch with current os linux: unknown.
ERRO[0000] error waiting for container: context canceled

@thaJeztah
Copy link
Member Author

Looks to be related to opencontainers@e114618 (opencontainers#1495)

@seemethere
Copy link

@thaJeztah is this something we should add in?

@thaJeztah
Copy link
Member Author

Cherry-picking it while we speak; looks like we cherry-picked the other changes of that PR, but left this one out.

This was never used, just validated, so was removed from spec.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
(cherry picked from commit e114618)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@thaJeztah
Copy link
Member Author

cherry-picked opencontainers@e114618

conflict looks to be due to opencontainers@854b41d not being in this branch

diff --cc spec.go
index a15c84e6,876937d2..00000000
--- a/spec.go
+++ b/spec.go
@@@ -131,10 -130,7 +130,14 @@@ func loadSpec(cPath string) (spec *spec
        if err = json.NewDecoder(cf).Decode(&spec); err != nil {
                return nil, err
        }
++<<<<<<< HEAD
 +      if err = validatePlatform(&spec.Platform); err != nil {
 +              return nil, err
 +      }
 +      return spec, validateProcessSpec(&spec.Process)
++=======
+       return spec, validateProcessSpec(spec.Process)
++>>>>>>> e1146182... Remove Platform as no longer in OCI spec
  }
  
  func createLibContainerRlimit(rlimit specs.LinuxRlimit) (configs.Rlimit, error) {

Relevant change there is

return spec, validateProcessSpec(&spec.Process)

versus

return spec, validateProcessSpec(spec.Process)

@thaJeztah
Copy link
Member Author

ping @tiborvass @justincormack @crosbymichael PTAL

@thaJeztah
Copy link
Member Author

not sure though why it would fail now on that error; perhaps depends on the image being pulled?

@thaJeztah
Copy link
Member Author

perhaps we should update with opencontainers#1984 (either in this PR, or as a follow-up)

@justincormack
Copy link

yes I think updating to the latest version is better now its merged.

Christian Brauner and others added 6 commits March 12, 2019 20:23
My first attempt to simplify this and make it less costly focussed on
the way constructors are called. I was under the impression that the ELF
specification mandated that arg, argv, and actually even envp need to be
passed to functions located in the .init_arry section (aka
"constructors"). Actually, the specifications is (cf. [2]):

SHT_INIT_ARRAY
This section contains an array of pointers to initialization functions,
as described in ``Initialization and Termination Functions'' in Chapter
5. Each pointer in the array is taken as a parameterless procedure with
a void return.

which means that this becomes a libc specific decision. Glibc passes
down those args, musl doesn't. So this approach can't work. However, we
can at least remove the environment parsing part based on POSIX since
[1] mandates that there should be an environ variable defined in
unistd.h which provides access to the environment. See also the relevant
Open Group specification [1].

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/
[2]: http://www.sco.com/developers/gabi/latest/ch4.sheader.html#init_array

Fixes: CVE-2019-5736
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
(cherry picked from commit bb7d8b1)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
For a variety of reasons, sendfile(2) can end up doing a short-copy so
we need to just loop until we hit the binary size. Since /proc/self/exe
is tautologically our own binary, there's no chance someone is going to
modify it underneath us (or changing the size).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
(cherry picked from commit 5b775bf)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
In order to get around the memfd_create(2) requirement, 0a8e411
("nsenter: clone /proc/self/exe to avoid exposing host binary to
container") added an O_TMPFILE fallback. However, this fallback was
flawed in two ways:

 * It required O_TMPFILE which is relatively new (having been added to
   Linux 3.11).

 * The fallback choice was made at compile-time, not runtime. This
   results in several complications when it comes to running binaries
   on different machines to the ones they were built on.

The easiest way to resolve these things is to have fallbacks work in a
more procedural way (though it does make the code unfortunately more
complicated) and to add a new fallback that uses mkotemp(3).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
(cherry picked from commit 2429d59)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Writing a file to tmpfs actually incurs a memcg penalty, and thus the
benefit of being able to disable memfd_create(2) with
_LIBCONTAINER_DISABLE_MEMFD_CLONE is fairly minimal -- though it should
be noted that quite a few distributions don't use tmpfs for /tmp (and
instead have it as a regular directory or subvolume of the host
filesystem).

Since runc must have write access to the state directory anyway (and the
state directory is usually not on a tmpfs) we can use that instead of
/tmp -- avoiding potential memcg costs with no real downside.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
(cherry picked from commit af9da0a)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The usage of memfd_create(2) and other copying techniques is quite
wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR.
memfd_create(2) added ~10M of memory usage to the cgroup associated with
the container, which can result in some setups getting OOM'd (or just
hogging the hosts' memory when you have lots of created-but-not-started
containers sticking around).

The easiest way of solving this is by creating a read-only bind-mount of
the binary, opening that read-only bindmount, and then umounting it to
ensure that the host won't accidentally be re-mounted read-write. This
avoids all copying and cleans up naturally like the other techniques
used. Unfortunately, like the O_TMPFILE fallback, this requires being
able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting
over the most obvious path -- /proc/self/exe -- is a *very bad idea*).

Unfortunately detecting this isn't fool-proof -- on a system with a
read-only root filesystem (that might become read-write during "runc
init" execution), we cannot tell whether we have already done an ro
remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY
environment variable which is checked *alongside* the protection being
present.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
(cherry picked from commit 16612d7)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
There are some circumstances where sendfile(2) can fail (one example is
that AppArmor appears to block writing to deleted files with sendfile(2)
under some circumstances) and so we need to have a userspace fallback.
It's fairly trivial (and handles short-writes).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
(cherry picked from commit 2d4a37b)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@thaJeztah
Copy link
Member Author

Updated; added opencontainers#1982 and opencontainers#1984

@kolyshkin
Copy link

LGTM

@andrewhsu andrewhsu merged commit dbf862c into docker-archive:17.06 Mar 28, 2019
@thaJeztah thaJeztah deleted the 17.06_backport_runc_cve branch March 28, 2019 15:46
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants