
dmz: use overlayfs to write-protect /proc/self/exe if possible #4448

Merged · 2 commits merged into opencontainers:main on Oct 20, 2024

Conversation

@cyphar cyphar commented Oct 16, 2024

Commit b999376 ("nsenter: cloned_binary: remove bindfd logic
entirely") removed the read-only bind-mount logic from our cloned binary
code because it wasn't really safe: a container with CAP_SYS_ADMIN could
clear the MS_RDONLY bit and gain write access to /proc/self/exe (even
with user namespaces this could have been an issue, because it's not
clear whether the mount flags are locked).

However, copying a binary does seem to have a minor performance impact.
The only way to have no performance impact would be for the kernel to
block these write attempts, but barring that we could try to reduce the
overhead by coming up with a mount that cannot have its read-only bits
cleared.

The "simplest" solution is to create a temporary overlayfs using
fsopen(2) which uses the directory where runc exists as a lowerdir,
ensuring that the container cannot access the underlying file -- and we
don't have to do any copies.

While fsopen(2) is not free (mount namespace cloning is usually
expensive, so one might expect the savings to be marginal), some basic
performance testing indicates a ~60% improvement over memfd copying,
and effectively no overhead even when compared to just using
/proc/self/exe directly:

% hyperfine --warmup 50 \
>           "./runc-noclone run -b bundle ctr" \
>           "./runc-overlayfs run -b bundle ctr" \
>           "./runc-memfd run -b bundle ctr"

Benchmark 1: ./runc-noclone run -b bundle ctr
  Time (mean ± σ):      13.7 ms ±   0.9 ms    [User: 6.0 ms, System: 10.9 ms]
  Range (min … max):    11.3 ms …  16.1 ms    184 runs

Benchmark 2: ./runc-overlayfs run -b bundle ctr
  Time (mean ± σ):      13.9 ms ±   0.9 ms    [User: 6.2 ms, System: 10.8 ms]
  Range (min … max):    11.8 ms …  16.0 ms    180 runs

Benchmark 3: ./runc-memfd run -b bundle ctr
  Time (mean ± σ):      22.6 ms ±   1.3 ms    [User: 5.7 ms, System: 20.7 ms]
  Range (min … max):    19.9 ms …  26.5 ms    114 runs

Summary
  ./runc-noclone run -b bundle ctr ran
    1.01 ± 0.09 times faster than ./runc-overlayfs run -b bundle ctr
    1.65 ± 0.15 times faster than ./runc-memfd run -b bundle ctr

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
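
For readers unfamiliar with the new mount API, here is a minimal sketch of the mechanism described above, written against the golang.org/x/sys/unix wrappers. This is illustrative only -- it is not the PR's actual sealedOverlayfs implementation; in particular, sealedExe is a hypothetical helper, and the empty scratch lowerdir is an assumption made to satisfy overlayfs's requirement of at least two layers when no upperdir is given.

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// sealedExe returns a read-only handle to the current binary, opened
// through a detached overlayfs mount whose read-only state the
// container cannot undo. scratchDir must be an empty directory.
func sealedExe(scratchDir string) (*os.File, error) {
	exePath, err := os.Readlink("/proc/self/exe")
	if err != nil {
		return nil, err
	}
	exeDir, exeName := filepath.Dir(exePath), filepath.Base(exePath)

	// fsopen(2) creates a filesystem context attached to an anonymous
	// mount namespace -- the mount never appears in the host mount
	// table, so systemd (or anything else) cannot see it.
	fsfd, err := unix.Fsopen("overlay", unix.FSOPEN_CLOEXEC)
	if err != nil {
		return nil, err
	}
	defer unix.Close(fsfd)

	// With no upperdir the overlay is inherently read-only, but
	// overlayfs wants at least two lowerdirs in that case -- hence the
	// empty scratch directory as a second layer (an assumption here).
	if err := unix.FsconfigSetString(fsfd, "lowerdir", exeDir+":"+scratchDir); err != nil {
		return nil, err
	}
	if err := unix.FsconfigCreate(fsfd); err != nil {
		return nil, err
	}

	// Materialise the superblock as a detached mount file descriptor,
	// additionally forcing MOUNT_ATTR_RDONLY on it.
	mntfd, err := unix.Fsmount(fsfd, unix.FSMOUNT_CLOEXEC, unix.MOUNT_ATTR_RDONLY)
	if err != nil {
		return nil, err
	}
	defer unix.Close(mntfd)

	// Re-open the binary through the detached mount via its fd's
	// magic-link; the open file keeps the mount pinned after mntfd
	// is closed.
	return os.Open(fmt.Sprintf("/proc/self/fd/%d/%s", mntfd, exeName))
}

func main() {
	tmp, err := os.MkdirTemp("", "runc-seal")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(tmp)

	f, err := sealedExe(tmp)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	fmt.Println("sealed binary opened as", f.Name())
}

Because the mount only ever exists as a file descriptor, nothing is written to any mount table -- which is also why the systemd concern discussed below does not apply.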

@cyphar cyphar force-pushed the cloned-binary-overlayfs branch 3 times, most recently from a16141f to 581d923 Compare October 16, 2024 07:07
cyphar commented Oct 16, 2024

FWIW, I did some tests and this is about the same performance as 1.1.14, while 1.1.15 and all of 1.2.x have the memfd ~60% performance hit. Seems like this might be the way to go...

logrus.Debugf("using overlayfs for /proc/self/exe sealing")
return overlayFile, nil
}
logrus.Debugf("could not use overlayfs for /proc/self/exe sealing (%v) -- falling back to standard memfd copy", err)
Member

Needs comment lines to compare overlayfs with standard memfd copy (and also with bind-mount)

Member Author

I added a short paragraph or two. Let me know if you think it needs more information.

@cyphar cyphar marked this pull request as ready for review October 16, 2024 13:39
@kolyshkin
Contributor

I suggest using the benchmark I added in #4432; it's closer to runc init, so to speak.

@kolyshkin kolyshkin (Contributor) left a comment

Have yet to take a closer look, but my biggest concern is that creating a new mount will trigger systemd, which is subscribed to mount/umount events and re-reads /proc/self/mountinfo to sync the internal state of mount units. With many quick runc exec invocations this degrades system performance a lot.

Or is this done in a mount namespace which host systemd can't see?

Comment on lines 97 to 99
// The only reasonable option would be to hash both files and compare them,
// but this would require fully reading both files which would produce a
// similar performance overhead to memfd cloning.
Contributor

Would kcmp(2) work here?

Member Author

Unfortunately no, the two files are completely different from the kernel's perspective (if they were the same we could just check the inode number and device).
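
To illustrate the check being referred to (a hypothetical snippet, not code from this PR): two paths name the same kernel object iff their (device, inode) pair matches, which is exactly what os.SameFile compares on Linux. The overlayfs file gets its own device and inode, so this cheap test can never confirm that the sealed binary matches the original.

package main

import (
	"fmt"
	"os"
)

// sameKernelFile reports whether two paths resolve to the same inode
// on the same device.
func sameKernelFile(a, b string) (bool, error) {
	fa, err := os.Stat(a)
	if err != nil {
		return false, err
	}
	fb, err := os.Stat(b)
	if err != nil {
		return false, err
	}
	return os.SameFile(fa, fb), nil
}

func main() {
	// Comparing a path against itself: trivially true. Comparing
	// /proc/self/exe against its overlayfs view would print false.
	same, err := sameKernelFile("/proc/self/exe", "/proc/self/exe")
	fmt.Println(same, err)
}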


cyphar commented Oct 17, 2024

@kolyshkin The mount is created with fsopen and is never placed on the filesystem, so systemd can't see it (nor can any other process). The main benefit of using fsopen is that we don't need to configure a custom mount namespace because the "new" mount infrastructure creates an anonymous mount namespace internally.

I have a feeling that benchmarking that way would give unrealistically good performance numbers for this approach, and comparing a full container creation/shutdown with the actual runc binary is more indicative of the impact on a real workload. But I'll run the benchmarks and post them here.


cyphar commented Oct 17, 2024

Using #4432, here are the performance numbers for my machine:
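
(Presumably these numbers come from repeated runs of that benchmark, along the lines of the invocation below; the exact flags and the per-variant build tags are an assumption on my part:)

  go test -v -bench 'BenchmarkExecTrue' -count 5 ./libcontainer/integration/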

baseline (no /proc/self/exe protection)

goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/integration
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkExecTrue
BenchmarkExecTrue-16                1430           4196755 ns/op
BenchmarkExecTrue-16                1430           4198300 ns/op
BenchmarkExecTrue-16                1416           4205281 ns/op
BenchmarkExecTrue-16                1408           4212323 ns/op
BenchmarkExecTrue-16                1432           4217341 ns/op
PASS

memfd

!runc_nodmz

goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/integration
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkExecTrue
BenchmarkExecTrue-16                 460          12835217 ns/op
BenchmarkExecTrue-16                 470          12892909 ns/op
BenchmarkExecTrue-16                 464          12985902 ns/op
BenchmarkExecTrue-16                 469          12892199 ns/op
BenchmarkExecTrue-16                 464          13099146 ns/op
PASS

runc_nodmz

goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/integration
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkExecTrue
BenchmarkExecTrue-16                 458          12993715 ns/op
BenchmarkExecTrue-16                 463          13001347 ns/op
BenchmarkExecTrue-16                 463          12999835 ns/op
BenchmarkExecTrue-16                 468          13006262 ns/op
BenchmarkExecTrue-16                 463          12969887 ns/op
PASS

runc-dmz

goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/integration
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkExecTrue
BenchmarkExecTrue-16                1393           4247870 ns/op
BenchmarkExecTrue-16                1404           4271359 ns/op
BenchmarkExecTrue-16                1406           4274319 ns/op
BenchmarkExecTrue-16                1414           4250608 ns/op
BenchmarkExecTrue-16                1392           4227626 ns/op
PASS

memfd-bind

goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/integration
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkExecTrue
BenchmarkExecTrue-16                1441           4192545 ns/op
BenchmarkExecTrue-16                1423           4238741 ns/op
BenchmarkExecTrue-16                1418           4201058 ns/op
BenchmarkExecTrue-16                1405           4216821 ns/op
BenchmarkExecTrue-16                1438           4207067 ns/op
PASS

overlayfs

goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/integration
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkExecTrue
BenchmarkExecTrue-16                1374           4349093 ns/op
BenchmarkExecTrue-16                1372           4363986 ns/op
BenchmarkExecTrue-16                1354           4351277 ns/op
BenchmarkExecTrue-16                1386           4382160 ns/op
BenchmarkExecTrue-16                1336           4368590 ns/op
PASS

The key takeaways are that overlayfs has ~2% overhead (with some error bars) compared to memfd-bind or runc-dmz, but without the downsides (memfd-bind has administrative complexity and limits which users can run the runc binary, and runc-dmz has all of the fun capability issues). memfd-bind and runc-dmz have ~1% overhead compared to doing nothing. That being said, on my machine the overhead of copying appears to be >300% with this benchmark, which seems a little suspect.

My previous testing seems to indicate that all of these single-digit-percentage overheads basically become noise when you actually run runc as a binary.

@cyphar cyphar force-pushed the cloned-binary-overlayfs branch 2 times, most recently from fb01a0c to d3ea5aa Compare October 17, 2024 09:17
// effectively no performance overhead (it is on par with both
// MS_BIND+MS_RDONLY and no binary cloning at all) while memfd copying adds
// around ~60% overhead during container startup.
overlayFile, err := sealedOverlayfs("/proc/self/exe", tmpDir)
@rata rata (Member) Oct 17, 2024

This doesn't need the kernel locks that affected clusters with churn when using the bind-mount logic before? I mean, the ones that were causing this issue: #2532

Member Author

That particular issue was because of systemd's mount tracking, which doesn't apply here (the mounts are internal to the process -- systemd will never see them).

However, to be honest I expected it to have an impact, but it doesn't have a noticeable one. Looking into it, the main reason is that CLONE_NEWNS actually requires taking namespace_lock() to make a copy of the existing mount table, but the anonymous allocation of a private mount namespace doesn't and so there is no lock contention on global locks. So there is no locking issue.

It's a bit hard to compare the performance of bind-mount because the code was removed in 1.2.x, so I'd need to write a new version, but if you compare the performance with 1.1.14 this patch is ~5% faster (using the hyperfine benchmark). I can try to benchmark against a synthetic version of bindfd on 1.2.x.

Member

Nah, I don't think we need it in 1.2 just to compare. IMHO it is more interesting to create a way to repro #2532 and make sure this doesn't cause similar consequences.

@cyphar cyphar (Member Author) Oct 17, 2024

We probably can't reproduce that exact issue, but we could have a daemon in our tests that tracks whether any new mounts are created during testing (which is what that issue is really about) and errors out if runc ever creates mounts on the host.

Unfortunately, there is still no notification API for mounts (we discussed designing one based on fanotify at LSF/MM this year but that won't be in a kernel we can use in CI for a while) so we would need to either loop over /proc/self/mountinfo or preferably use listmount(2) to just get a list of mount IDs once CI supports it (Linux 6.8). We might not be able to catch races though, and I can't see an obvious way of avoiding that (mount propagation could be used to copy the mount elsewhere but MS_SLAVE wouldn't stop the umount from propagating). We might need to run runc several hundred times to try to catch a race.

A slightly crazy idea would be to use bpftrace to detect it, but that would be far too brittle for CI.

@rata rata (Member) Oct 17, 2024

Thanks. Yeah, I think CI for this sounds complicated, at least for now. I'll check if I can repro locally and then give this a try; that sounds like enough for now. I'll update if I manage to do it :)

Contributor

Unfortunately, there is still no notification API for mounts

The one we have (and which systemd is using) is epoll on /proc/self/mountinfo.

Caveats:

  • still have to read the file (or use listmount(2)) once an event is received;
  • easy to miss a short-lived mount.

Member Author

Ah, right! In that case we can just listen for any epoll event and error out if we see one. If we assume the system is not doing other mounts on the host during tests (a reasonable assumption) we should not see any mount events at all during the test.
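
A minimal sketch of such a watcher (not part of this PR; it assumes the golang.org/x/sys/unix wrappers): the kernel signals mount-table changes on /proc/self/mountinfo as priority/error poll events, so if the host is otherwise quiet, a single epoll_wait doubles as a "did runc leak a mount?" detector.

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	epfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(epfd)

	// Mount-table changes are reported as EPOLLPRI|EPOLLERR on this
	// file (the same mechanism systemd relies on).
	ev := unix.EpollEvent{Events: unix.EPOLLPRI | unix.EPOLLERR, Fd: int32(f.Fd())}
	if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, int(f.Fd()), &ev); err != nil {
		log.Fatal(err)
	}

	// Block until the first change; during a test run, any event at
	// all means a mount appeared (or disappeared) on the host.
	events := make([]unix.EpollEvent, 1)
	if _, err := unix.EpollWait(epfd, events, -1); err != nil {
		log.Fatal(err)
	}
	log.Fatal("unexpected mount event on the host")
}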


rata commented Oct 17, 2024

A completely different idea that I want to experiment with is using Landlock (not all kernels support it, but it has been supported for a while now). Landlock works with magic-links too, which is great.

I'm not sure it would be easy to integrate for this protection, though. But if we can make it work, it would probably be a very lightweight protection (when the kernel supports it).


cyphar commented Oct 17, 2024

I'm not sure Landlock can protect against every possible way of accessing /proc/self/exe. Unless I'm missing something, you would need to restrict /proc access in pretty serious ways for the entire container (in ways that would break container programs that use /proc/self/fd for re-opening files, or that open /proc/self/exe). Landlock does have the ability to track some restrictions on file descriptors, but not the ones we need (and I'm not entirely sure it would be as fool-proof as overlayfs).

@kolyshkin kolyshkin (Contributor) left a comment

LGTM. This is nice 👍🏻

@rata rata (Member) left a comment

LGTM, nice idea! The code LGTM and the tests I did seem to work just fine too :)

I did some stress-tests for possible regressions on #2532. This is what I used to test:

  1. configure containerd to use systemd cgroups
  2. start a k8s cluster
  3. create a pod with 100 replicas and a readinessProbe that execs into the container every 1 second

I did this with runc 1.1.14, 1.1.15 and runc main with this PR applied.

With runc 1.1.15 and runc main, only containerd/kubelet showed high CPU usage (20-40%); after the containers were created, udisksd was not using significant CPU (during creation it uses more, but we are doing mounts for the new containers then), and I don't see any mount-related entries in "journalctl -f".

With runc 1.1.14, the udisksd process constantly takes ~40% CPU, with peaks of ~55%, kubelet/containerd take 20-40% CPU, and even after all the containers are running I see entries like the following in the journalctl -f output:

Oct 18 14:15:50 lindsay systemd[1]: run-containerd-runc-k8s.io-b1deab2f119eccb4447dd9afbca48c615f7a24246e76840725cb3b7e895b6c5e-runc.zSElNB.mount: Deactivated successfully.

Therefore, I don't see any regressions here for that issue. This is great :)

pod.yaml and containerd config:

pod.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
  namespace: default
spec:
  replicas: 100
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - image: registry.k8s.io/hpa-example
        imagePullPolicy: Always
        livenessProbe:
          exec:
            command:
            - curl
            - http://localhost:80
          failureThreshold: 1
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 10
        name: php-apache
        ports:
        - containerPort: 80
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - curl
            - http://localhost:80
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 30m

containerd config:

version = 3
root = '/var/lib/containerd-rata'
state = '/run/containerd-rata'
temp = ''
plugin_dir = ''
disabled_plugins = []
required_plugins = []
oom_score = 0
imports = []

[grpc]
  address = '/run/containerd-rata/containerd.sock'
  tcp_address = ''
  tcp_tls_ca = ''
  tcp_tls_cert = ''
  tcp_tls_key = ''
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[ttrpc]
  address = ''
  uid = 0
  gid = 0

[debug]
  address = ''
  uid = 0
  gid = 0
  level = ''
  format = ''

[metrics]
  address = ''
  grpc_histogram = false

[plugins]
  [plugins.'io.containerd.cri.v1.images']
    snapshotter = 'overlayfs'
    disable_snapshot_annotations = true
    discard_unpacked_layers = false
    max_concurrent_downloads = 3
    image_pull_progress_timeout = '5m0s'
    image_pull_with_sync_fs = false
    stats_collect_period = 10

    [plugins.'io.containerd.cri.v1.images'.pinned_images]
      sandbox = 'registry.k8s.io/pause:3.10'

    [plugins.'io.containerd.cri.v1.images'.registry]
      config_path = ''

    [plugins.'io.containerd.cri.v1.images'.image_decryption]
      key_model = 'node'

  [plugins.'io.containerd.cri.v1.runtime']
    enable_selinux = false
    selinux_category_range = 1024
    max_container_log_line_size = 16384
    disable_apparmor = false
    restrict_oom_score_adj = false
    disable_proc_mount = false
    unset_seccomp_profile = ''
    tolerate_missing_hugetlb_controller = true
    disable_hugetlb_controller = true
    device_ownership_from_security_context = false
    ignore_image_defined_volumes = false
    netns_mounts_under_state_dir = false
    enable_unprivileged_ports = true
    enable_unprivileged_icmp = true
    enable_cdi = true
    cdi_spec_dirs = ['/etc/cdi', '/var/run/cdi']
    drain_exec_sync_io_timeout = '0s'
    ignore_deprecation_warnings = []

    [plugins.'io.containerd.cri.v1.runtime'.containerd]
      default_runtime_name = 'runc'
      ignore_blockio_not_enabled_errors = false
      ignore_rdt_not_enabled_errors = false

      [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes]
        [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc]
          runtime_type = 'io.containerd.runc.v2'
          runtime_path = ''
          pod_annotations = []
          container_annotations = []
          privileged_without_host_devices = false
          privileged_without_host_devices_all_devices_allowed = false
          base_runtime_spec = ''
          cni_conf_dir = ''
          cni_max_conf_num = 0
          snapshotter = ''
          sandboxer = 'podsandbox'
          io_type = ''

          [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc.options]
            BinaryName = ''
            CriuImagePath = ''
            CriuWorkPath = ''
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            Root = ''
            ShimCgroup = ''
            SystemdCgroup = true

    [plugins.'io.containerd.cri.v1.runtime'.cni]
      bin_dir = '/opt/cni/bin'
      conf_dir = '/etc/cni/net.d'
      max_conf_num = 1
      setup_serially = false
      conf_template = ''
      ip_pref = ''
      use_internal_loopback = false

  [plugins.'io.containerd.gc.v1.scheduler']
    pause_threshold = 0.02
    deletion_threshold = 0
    mutation_threshold = 100
    schedule_delay = '0s'
    startup_delay = '100ms'

  [plugins.'io.containerd.grpc.v1.cri']
    disable_tcp_service = true
    stream_server_address = '127.0.0.1'
    stream_server_port = '0'
    stream_idle_timeout = '4h0m0s'
    enable_tls_streaming = false

    [plugins.'io.containerd.grpc.v1.cri'.x509_key_pair_streaming]
      tls_cert_file = ''
      tls_key_file = ''

  [plugins.'io.containerd.image-verifier.v1.bindir']
    bin_dir = '/opt/containerd/image-verifier/bin'
    max_verifiers = 10
    per_verifier_timeout = '10s'

  [plugins.'io.containerd.internal.v1.opt']
    path = '/opt/containerd'

  [plugins.'io.containerd.internal.v1.tracing']

  [plugins.'io.containerd.metadata.v1.bolt']
    content_sharing_policy = 'shared'

  [plugins.'io.containerd.monitor.container.v1.restart']
    interval = '10s'

  [plugins.'io.containerd.monitor.task.v1.cgroups']
    no_prometheus = false

  [plugins.'io.containerd.nri.v1.nri']
    disable = false
    socket_path = '/var/run/nri/nri.sock'
    plugin_path = '/opt/nri/plugins'
    plugin_config_path = '/etc/nri/conf.d'
    plugin_registration_timeout = '5s'
    plugin_request_timeout = '2s'
    disable_connections = false

  [plugins.'io.containerd.runtime.v2.task']
    platforms = ['linux/amd64']

  [plugins.'io.containerd.service.v1.diff-service']
    default = ['walking']
    sync_fs = false

  [plugins.'io.containerd.service.v1.tasks-service']
    blockio_config_file = ''
    rdt_config_file = ''

  [plugins.'io.containerd.shim.v1.manager']
    env = []

  [plugins.'io.containerd.snapshotter.v1.blockfile']
    root_path = ''
    scratch_file = ''
    fs_type = ''
    mount_options = []
    recreate_scratch = false

  [plugins.'io.containerd.snapshotter.v1.btrfs']
    root_path = ''

  [plugins.'io.containerd.snapshotter.v1.devmapper']
    root_path = ''
    pool_name = ''
    base_image_size = ''
    async_remove = false
    discard_blocks = false
    fs_type = ''
    fs_options = ''

  [plugins.'io.containerd.snapshotter.v1.native']
    root_path = ''

  [plugins.'io.containerd.snapshotter.v1.overlayfs']
    root_path = ''
    upperdir_label = false
    sync_remove = false
    slow_chown = false
    mount_options = []

  [plugins.'io.containerd.tracing.processor.v1.otlp']

  [plugins.'io.containerd.transfer.v1.local']
    max_concurrent_downloads = 3
    max_concurrent_uploaded_layers = 3
    config_path = ''

[cgroup]
  path = ''

[timeouts]
  'io.containerd.timeout.bolt.open' = '0s'
  'io.containerd.timeout.metrics.shimstats' = '2s'
  'io.containerd.timeout.shim.cleanup' = '5s'
  'io.containerd.timeout.shim.load' = '5s'
  'io.containerd.timeout.shim.shutdown' = '3s'
  'io.containerd.timeout.task.state' = '2s'

[stream_processors]
  [stream_processors.'io.containerd.ocicrypt.decoder.v1.tar']
    accepts = ['application/vnd.oci.image.layer.v1.tar+encrypted']
    returns = 'application/vnd.oci.image.layer.v1.tar'
    path = 'ctd-decoder'
    args = ['--decryption-keys-path', '/etc/containerd/ocicrypt/keys']
    env = ['OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf']

  [stream_processors.'io.containerd.ocicrypt.decoder.v1.tar.gzip']
    accepts = ['application/vnd.oci.image.layer.v1.tar+gzip+encrypted']
    returns = 'application/vnd.oci.image.layer.v1.tar+gzip'
    path = 'ctd-decoder'
    args = ['--decryption-keys-path', '/etc/containerd/ocicrypt/keys']
    env = ['OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf']

@lifubang (Member)

Need rebase.


cyphar commented Oct 18, 2024

Btw, this approach can also work for rootless containers on new enough kernels (overlayfs was enabled inside user namespaces in 5.11), but it would require doing some awful fork+CGo stuff, so we can punt on that for now.

# use a temporary overlayfs instead of making a memfd clone of
# /proc/self/exe.
[[ "$output" = *"runc-dmz: using overlayfs for sealed /proc/self/exe"* ]]
fi
Member

This doesn't seem to work when running as UID 0 in a userns on a vanilla kernel (5.1-5.10) that lacks support for mounting overlayfs in a userns.

@cyphar cyphar (Member Author) Oct 20, 2024

Right, but does anyone run the whole test suite inside a userns (maybe rootlesskit?)? I can change this test to check for that, but runc itself will gracefully fall back to using memfds in that case.

@cyphar cyphar force-pushed the cloned-binary-overlayfs branch 2 times, most recently from cf0baa2 to c098b3e Compare October 20, 2024 08:58
@AkihiroSuda AkihiroSuda merged commit ca8ca3c into opencontainers:main Oct 20, 2024
42 checks passed
@cyphar cyphar deleted the cloned-binary-overlayfs branch October 20, 2024 23:45