dmz: use overlayfs to write-protect /proc/self/exe if possible #4448

Merged 2 commits on Oct 20, 2024

Commits on Oct 20, 2024

  1. tests: integration: add helper to check if we're in a userns

    Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
    cyphar committed Oct 20, 2024
    8cfbccb
  2. dmz: use overlayfs to write-protect /proc/self/exe if possible

    Commit b999376 ("nsenter: cloned_binary: remove bindfd logic
    entirely") removed the read-only bind-mount logic from our cloned binary
    code because it wasn't really safe: a container with CAP_SYS_ADMIN
    could clear the MS_RDONLY flag and get write access to /proc/self/exe
    (even with user namespaces this could've been an issue, because it's
    not clear whether the flags are locked).
    
    However, copying a binary does seem to have a minor performance impact.
    The only way to have no performance impact would be for the kernel to
    block these write attempts, but barring that we could try to reduce the
    overhead by coming up with a mount that cannot have its read-only bits
    cleared.
    
    The "simplest" solution is to create a temporary overlayfs using
    fsopen(2) which uses the directory where runc exists as a lowerdir,
    ensuring that the container cannot access the underlying file -- and we
    don't have to do any copies.
    
    While fsopen(2) is not free because mount namespace cloning is usually
    expensive (and so it seems like the difference would be marginal), some
    basic performance testing seems to indicate that doing it this way is
    roughly 1.6x faster than copying the binary into a memfd, and that it
    has effectively no overhead even when compared to just using
    /proc/self/exe directly:
    
      % hyperfine --warmup 50 \
      >           "./runc-noclone run -b bundle ctr" \
      >           "./runc-overlayfs run -b bundle ctr" \
      >           "./runc-memfd run -b bundle ctr"
    
      Benchmark 1: ./runc-noclone run -b bundle ctr
        Time (mean ± σ):      13.7 ms ±   0.9 ms    [User: 6.0 ms, System: 10.9 ms]
        Range (min … max):    11.3 ms …  16.1 ms    184 runs
    
      Benchmark 2: ./runc-overlayfs run -b bundle ctr
        Time (mean ± σ):      13.9 ms ±   0.9 ms    [User: 6.2 ms, System: 10.8 ms]
        Range (min … max):    11.8 ms …  16.0 ms    180 runs
    
      Benchmark 3: ./runc-memfd run -b bundle ctr
        Time (mean ± σ):      22.6 ms ±   1.3 ms    [User: 5.7 ms, System: 20.7 ms]
        Range (min … max):    19.9 ms …  26.5 ms    114 runs
    
      Summary
        ./runc-noclone run -b bundle ctr ran
          1.01 ± 0.09 times faster than ./runc-overlayfs run -b bundle ctr
          1.65 ± 0.15 times faster than ./runc-memfd run -b bundle ctr
    
    Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
    cyphar committed Oct 20, 2024
    515f09f
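
The overlayfs idea from the second commit message can be illustrated with a rough Python sketch. This is not runc's implementation (runc is written in Go); it only shows the general shape of the sequence: create an overlayfs context with fsopen(2), set `lowerdir` to the directory holding the binary, mount it read-only with fsmount(2), and open the binary through that mount. The helper names (`escape_lowerdir`, `split_binary_path`, `open_sealed_binary`) and the `dummy_lower` parameter are hypothetical. The sketch assumes Linux 5.2 or newer, x86_64 or arm64 syscall numbers, CAP_SYS_ADMIN, and that a lower-only overlayfs needs a second (empty) lowerdir.

```python
import ctypes
import os

# Syscall numbers for the new mount API (identical on x86_64 and arm64).
SYS_FSOPEN, SYS_FSCONFIG, SYS_FSMOUNT = 430, 431, 432
FSOPEN_CLOEXEC = 0x1
FSMOUNT_CLOEXEC = 0x1
FSCONFIG_SET_STRING = 1
FSCONFIG_CMD_CREATE = 6
MOUNT_ATTR_RDONLY = 0x1

_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


def _syscall(nr, *args):
    """Invoke a raw syscall through libc, raising OSError on failure."""
    # Widen ints to C long so the variadic syscall() gets full-width values.
    cargs = [ctypes.c_long(a) if isinstance(a, int) else a for a in args]
    ret = _libc.syscall(ctypes.c_long(nr), *cargs)
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def escape_lowerdir(path):
    """Backslash-escape '\\' and ':' so overlayfs reads path as one lowerdir."""
    return path.replace("\\", "\\\\").replace(":", "\\:")


def split_binary_path(exe_path):
    """Resolve symlinks and split the path into (directory, file name)."""
    real = os.path.realpath(exe_path)
    return os.path.dirname(real), os.path.basename(real)


def open_sealed_binary(exe_path, dummy_lower):
    """Return an O_PATH fd for exe_path as seen through a read-only overlayfs.

    dummy_lower is an empty directory used as a second lowerdir (assumption:
    overlayfs without an upperdir requires at least two lowerdirs).
    """
    lower, name = split_binary_path(exe_path)
    # The binary's directory goes first so nothing in dummy_lower can mask it.
    lowerdir = escape_lowerdir(lower) + ":" + escape_lowerdir(dummy_lower)
    ctx = _syscall(SYS_FSOPEN, b"overlay", FSOPEN_CLOEXEC)
    try:
        _syscall(SYS_FSCONFIG, ctx, FSCONFIG_SET_STRING,
                 b"lowerdir", os.fsencode(lowerdir), 0)
        _syscall(SYS_FSCONFIG, ctx, FSCONFIG_CMD_CREATE, None, None, 0)
        mnt = _syscall(SYS_FSMOUNT, ctx, FSMOUNT_CLOEXEC, MOUNT_ATTR_RDONLY)
    finally:
        os.close(ctx)
    try:
        # The mount is never attached to the filesystem tree; the returned
        # fd is the only handle to the binary through the overlay.
        return os.open(name, os.O_PATH | os.O_NOFOLLOW | os.O_CLOEXEC,
                       dir_fd=mnt)
    finally:
        os.close(mnt)
```

Because the overlay has no upperdir, writes through the returned fd should be refused by overlayfs itself rather than by a mount flag that a privileged container could clear, which is the property the commit message is after.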
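
As a quick check of the benchmark figures quoted in the second commit message, the ratios can be recomputed from the reported means. This sketch uses only the mean times from the hyperfine output and ignores the reported standard deviations; the dictionary and function names are made up for illustration.

```python
# Mean wall-clock times in milliseconds, copied from the hyperfine output.
MEANS_MS = {"noclone": 13.7, "overlayfs": 13.9, "memfd": 22.6}


def speedup(slow, fast):
    """How many times faster `fast` is than `slow`, based on mean times."""
    return MEANS_MS[slow] / MEANS_MS[fast]


def time_saved_pct(slow, fast):
    """Percentage of `slow`'s mean run time that `fast` avoids."""
    return (1 - MEANS_MS[fast] / MEANS_MS[slow]) * 100


if __name__ == "__main__":
    print(f"overlayfs vs memfd:   {speedup('memfd', 'overlayfs'):.2f}x")
    print(f"noclone vs memfd:     {speedup('memfd', 'noclone'):.2f}x")
    print(f"noclone vs overlayfs: {speedup('overlayfs', 'noclone'):.2f}x")
    print(f"time saved vs memfd:  {time_saved_pct('memfd', 'overlayfs'):.1f}%")
```

The memfd-to-overlayfs ratio works out to about 1.63, which is where the "~60%" figure comes from, while the gap between overlayfs and no cloning at all (about 1.01) is well inside the reported ± 0.9 ms spread.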