Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Commit writable WCOW layers as read-only parent layers #4415

Merged
merged 1 commit into from
Apr 19, 2021

Conversation

TBBle
Copy link
Contributor

@TBBle TBBle commented Jul 22, 2020

This builds on #4399 (itself on top of #4395) to implement func (s *snapshotter) Commit for Windows layers.

Contributes towards #1920, with the goal of resolving moby/buildkit#616

It's a bit odd, but it seems that the only way to get this to work is to restream the read-write layer back on top of itself. I suspect this means that it is consuming extra space, but that appears to be necessary. Restreaming into a new snapshot means you cannot get a diff against the parent layer, which we often need to do.

The main weirdness here is that it seems that a windows-layer has two completely distinct filesystems, and you cannot diff all the combinations of them that you'd like, so we have to keep the writable-layer file around even after converting into the read-only-but-parent-usable format.

Note that support for parentless Active layers is not part of this PR, that's currently mixed into #4419 as it's both a separate discussion, and depends on a proposed hcsshim feature to implement. This only covers committing layers that you could already create, as Active layers with a parent, generally a chain starting with an image from mcr.microsoft.com.


Everything beyond this point is details and a record of experiments for how I got to this particular approach

A Scratch layer, i.e. the result of (s *snapshotter) Prepare, contains only sandbox.vhdx. It can only be used as mount.Mount.Source, and the upper for Comparer.Compare.

A read-only layer has a Files/ directory, a Hives/ directory, and perhaps a UtilityVM/ directory and some other boot-related stuff. This is what you get after calling Applier.Apply. This sort of layer can appear in Snapshot.ParentIDs, and the parents list that Comparer.Compare for Windows sneaks through to the Windows archiver, introduced in #4399 based on the same sneaking done in Applier.Apply.

There's no problem with having both sandbox.vhdx and Files/ et. al. in the same directory. containerd already does this when creating snapshots out of the content store, because we Apply the pulled layer diff to a Prepared snapshot. However, it's important to note that the sandbox.vhdx in such a layer is an empty diff against the parent.

However, snapshots created from the content store cannot produce diffs against their parents, because the diffing done by hcsshim appears to only look at the sandbox.vhdx of the upper. They will instead produce a valid diff with only one small metadata change to the 'Documents and Settings' symlink, and a set of (probably NULL) registry hive updates.

On that note, I have observed (using snapshots diff) that diffing two snapshots for pulled layers will actually produce practically-empty diffs at the moment, because the diff works with the sandbox.vhdx in the upper, not the extracted files, and when a layer is populated by Apply, the sandbox.vhdx is not updated.

snapshots diff also changes the modification data of the upper's sandbox.vhdx, which is odd.

The upshot for me is that a read-only directory created with Applier.Apply should not have a sandbox.vhdx, but that is hard because Snapshotter.Prepare does not currently have a way to know if the snapshot is being prepared for use read-write, or to be Apply'd to and then Commit'd.

The top commit also removes the sandbox.vhdx from Snapshots populated by Apply rather than by mounting and editing them, so that attempts to diff such snapshots fail (esoterically, for now) rather than returning bad data, which may leak into exported images if you do things wrongly enough. It's a separate commit, because this affects the on-disk result of pulling images, while the other changes only affect the on-disk result of committing read-write layers.

All the crossed-out text above was fixed by #4643, which recognises snapshots which will be Apply'd to, and doens't create the sandbox.vhdx in the first place.

I feel like there's some kind of misbehaviour or bug in hcsshim, that it cannot diff two directories in read-only format. It's possible I'm using it wrong though.

I was also mostly able to reproduce all of the above using the wclayer utility from hcsshim (master), plus a little utility to call SetVolumeMount and DeleteVolumeMount so I could see what's actually going on. This also helped me understand that the "mount-in-place" workaround I used in moby/buildkit for lack of #2366 is 100% wrong, and I've just been lucky that I haven't corrupted data using it yet. -_-

I did some comparisons with Docker Engine's behaviour, and when idle, none of the directories in %PROGRAMDATA%\Docker\windowsfilter have sandbox.vhdx. During a container build, the current layer has a sandbox.vhdx, as you'd expect. However, I also noticed that its read-only directories are not hard-links, but only contain modified files, so it's possible that something different is happening there. I noted that they pass a "windowsFilter" flag through hcsshim's DriverInfo that current hcsshim builds don't appear to honour in many code-paths, and none of the containerd code appears to use.

I'm not sure if this is a V1/V2 difference, or indicative of a larger issue.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Jul 22, 2020

Build succeeded.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Jul 22, 2020

Build succeeded.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Jul 22, 2020

Build succeeded.

@TBBle
Copy link
Contributor Author

TBBle commented Jul 22, 2020

Build failures due to some GCR weirdness? I don't see how different runtime choice would affect the results of pulling images, and the Windows tests (which are the only ones that should be affected by my changes) passed.

@jterry75
Copy link
Contributor

@kevpar - PTAL

@theopenlab-ci
Copy link

theopenlab-ci bot commented Jul 23, 2020

Build succeeded.

@TBBle
Copy link
Contributor Author

TBBle commented Jul 23, 2020

Reviewing this, and trying to work out how best to deal with #2366, I've realised that the Files/ test is not the right test. Specifically, I've observed the following usage patterns:

  1. Windows base layer from content:
  • (s *snapshotter) Prepare (no parents) => Active Snapshot, empty directory
  • (s windowsDiff) Apply => Goes through baseLayerWriter in hcsshim => has Files/, Hives/, maybe UtilityVM/ and *.bcd; has some other base-specific files like layout
  • (s *snapshotter) Commit (no sandbox.vhdx) => Committed Snapshot, filesystem unchanged.
  1. Windows non-base layer from content:
  • (s *snapshotter) Prepare (parents) => Active Snapshot, has sandbox.vhdx
  • (s windowsDiff) Apply => Goes through legacyLayerWriter in hcsshim => has Files/, Hives/, maybe UtilityVM/ and *.bcd
  • (s *snapshotter) Commit (has sandbox.vhdx) => Committed Snapshot, sandbox.vhdx deleted
  1. Windows non-base layer edited by mounting:
  • (s *snapshotter) Prepare (parents) => Active Snapshot, has sandbox.vhdx
  • (m *Mount) Mount => <TODO [carry] Mount Windows layers as volumes on the host #2366/Mount snapshots on Windows #4419>
  • (content of sandbox.vhdx modified by user)
  • (s *snapshotter) Commit (has sandbox.vhdx) => Committed Snapshot, created Files/, Hives/, maybe UtilityVM/ and *.bcd, sandbox.vhdx left in-place with same content as the snapshot.
    • Leaving sandbox.vhdx in-place means that diff works, but increases the disk space usage.
  1. Random directory for content uploads by BuildKit:

The main concern I have is how to tell a random directory with a Files/ sub-directory, from an applied Base-layer. Mounting a View should always be a bind-mount (symlink, I guess?), but may (3) or may not (4) need to point at Files/. Mounting Active is easier, because if it has a sandbox.vhdx, then it's a "windows-layer", and if it does not have one, then it's a bind-mount.

That said, I would be more-comfortable if we could actually tell the two apart by something other than a commonly-named subdirectory, but I can't see a definitive flow of information from (s windowsDiff) Apply back to (s *snapshotter) Commit. So my only apparent option is to teach everything that calls Apply to also pass an option to Commit to mark it as an apply-created layer. Since the only places I can see doing this are all Prepare->Apply->Commit, then it'd just be a new snapshots.Opt.

I'm not sure if I'm going about this the right way, or if I'm overlooking something else important.

If (s *snapshotter) Prepare knows the layer coming is for applying-to, then it wouldn't need to create a sandbox.vhdx at all. I'm not sure if other versions of the snapshotter offer similar opportunities to benefit from knowing that beforehand though.

Edit: Per the note in #4419, it's possible we can leverage hcsshim to create sandbox.vhdx for the two cases (1, 4) where we don't already have one due to "no parents". That would simplify a lot of things here. This would turn case 1 into case and 2 (Apply, then Commit deletes sandbox.vhdx), and turn case 4 into case 3 (Mount, Commit does not delete sandbox.vhdx). The largest risk with this approach is that it might not work on Windows Server LTSC 2019/Windows 10 1809, or might need more work to be done in hcsshim to make it work on that version.

On the plus side, this latter approach removes the 'bind-mount' concept, and hence makes Files/ the correct test for an applied layer, because the user is no longer able to unintentionally fool that test with a Files/ directory of their own.

This could be done additionally or subsequently to this PR, and then #4419 rebased on top, simplified by removing the bind-mount support.

Edit again: Initial testing suggests the approach for creating parentless sandboxes (HcsFormatWritableLayerVhd) doesn't work, or at least I couldn't work out how to do use it correctly. See microsoft/hcsshim#853

@TBBle
Copy link
Contributor Author

TBBle commented Jul 23, 2020

Thinking about readonlyness too, perhaps Views should actually be read-write sandbox layers with tiny sandboxes. That would remove another source of bind-mounts. This is starting to feel like another PR on top of this one.

Edit: That's the implementation approach I took in #4419, as there seems to be no other way to implement read-only mounts for the local system.

@TBBle TBBle mentioned this pull request Jul 23, 2020
10 tasks
@jterry75
Copy link
Contributor

@ambarve - PTAL

@theopenlab-ci
Copy link

theopenlab-ci bot commented Jul 28, 2020

Build succeeded.

@TBBle TBBle marked this pull request as draft September 9, 2020 02:37
@TBBle
Copy link
Contributor Author

TBBle commented Sep 9, 2020

Marked as draft pending rebase on #4399 once it has been rebased on updated #4395.

Edit: Done

@TBBle TBBle marked this pull request as ready for review September 9, 2020 11:51
@theopenlab-ci
Copy link

theopenlab-ci bot commented Sep 9, 2020

Build succeeded.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Sep 9, 2020

Build succeeded.

@TBBle TBBle mentioned this pull request Sep 14, 2020
8 tasks
snapshots/windows/windows.go Outdated Show resolved Hide resolved
@TBBle TBBle marked this pull request as draft November 29, 2020 15:00
@theopenlab-ci
Copy link

theopenlab-ci bot commented Nov 29, 2020

Build succeeded.

@TBBle
Copy link
Contributor Author

TBBle commented Apr 3, 2021

tl;dr: It worked as a container root last time I tested it, and not much has changed since, AFAIK.


I was testing this by using BuildKit with the containerd backend to build multi-layer images.

I haven't touched the BuildKit side in a while (not much point until the containerd changes land and I have have BuildKit's CI preventing further breakages in the containerd backend), but moby/buildkit#616 (comment) (July 2020) shows a couple of Dockerfiles that required this change, and included multiple RUN layers, so presumably the later RUN layers were mounting container images generated by this change, as before this, the second RUN layer would fail.

Since then, testing, i.e. via #4419, has been host-mounting the layers, rather than container-mounting. #4419 enables the Snapshot test suite on Windows which exercises this change pretty thoroughly for host-mounts.

The behaviour here (exporting out a tar-layer and then importing back over the top) is also how Docker commits writeable layers, so I'd be surprised if it didn't work.

Unlike #4399 and #4419 (the PRs on either side of this) I'm not aware of any existing containerd tests that actually exercise this code-path and mounts so-created layers as a container's root (because that's not really containerd's job...), at least in the list of "disabled on Windows" tests.

I'd love to know if I've overlooked such a test, because getting it enabled would ensure that we don't suffer bit-rot over time.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Apr 5, 2021

Build succeeded.

@dmcgowan dmcgowan added the platform/windows Windows label Apr 5, 2021
@ambarve
Copy link
Contributor

ambarve commented Apr 7, 2021

That's cool. I was wondering if it would cause any problems but it doesn't so that's nice.

I think the main changes in this PR are only in windows/snapshotter.go in the Commit function and those look good to me.

@jterry75 This looks good to me. Do you have any concerns? If not, can you merge this?

@TBBle
Copy link
Contributor Author

TBBle commented Apr 7, 2021

@ambarve You're correct about the changes. To be clear, it's only the top commit (i.e. 871c80f92b3b1fa1ee6868956043382dd2704785) that is part of this PR, the rest are already under review in #4399. I'll rebase this once that lands, which I'm hoping will be really soon now. ^_^

@theopenlab-ci
Copy link

theopenlab-ci bot commented Apr 9, 2021

Build succeeded.

@TBBle
Copy link
Contributor Author

TBBle commented Apr 13, 2021

Rebased now that #4399 has been merged.

@TBBle TBBle force-pushed the wcow_commit_layers branch 2 times, most recently from c04182b to e8a0505 Compare April 13, 2021 10:49
@theopenlab-ci
Copy link

theopenlab-ci bot commented Apr 13, 2021

Build succeeded.

snapshots/windows/windows.go Outdated Show resolved Hide resolved
snapshots/windows/windows.go Outdated Show resolved Hide resolved
@TBBle
Copy link
Contributor Author

TBBle commented Apr 13, 2021

Because there are no tests (AFAIK) that exercise this code-path outside the snapshotter test suite, I've rebased #4419 onto it, because that PR does enable the snapshotter test suite, to test the slight refactoring I just pushed based on review feedback.

Edit: The test suite and integration run passed, but it's taken long enough that I suspect the second integration suite run will hit the 30-minute timeout and fail. But I'm satisfied I didn't break anything obvious when I moved the 'restream' out into a function.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Apr 13, 2021

Build succeeded.

Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

with one comment~

A Scratch layer only contains a sandbox.vhdx, but to be used as a parent
layer, it must also contain the files on-disk.

Hence, we Export the layer from the sandbox.vhdx and Import it back into
itself, so that both data formats are present.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
@theopenlab-ci
Copy link

theopenlab-ci bot commented Apr 14, 2021

Build succeeded.

@jterry75
Copy link
Contributor

SGTM - @ambarve / @kevpar - Could one of you PTAL

@AkihiroSuda AkihiroSuda merged commit bbbd851 into containerd:master Apr 19, 2021
@TBBle TBBle deleted the wcow_commit_layers branch April 19, 2021 07:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants