Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

overlay: use private merged directory for AIS #2036

Merged

Conversation

giuseppe
Copy link
Member

use a private "merged" directory when mounting from an additional store.

Operations like "Diff()" and "Changes()" cause an implicit mount when the naive differ is used.

The issue was not observed earlier because native overlay can achieve these operations without using a mount.

Since these mounts are performed read-only, and overlay supports multiple mounts using the same lowerdirs, use a private location for the "merged" directory. The location is owned by the current writeable store, that is locked for writing.

Closes: #2033

@giuseppe
Copy link
Member Author

not ready for review, I am still validating it

@giuseppe giuseppe force-pushed the fix-race-condition-naive-diff branch from 6fa7b85 to b4ad83d Compare July 17, 2024 15:56
@giuseppe giuseppe force-pushed the fix-race-condition-naive-diff branch from b4ad83d to 5c98373 Compare July 17, 2024 15:59
store.go Outdated
store.stopReading()
if exists {
return nil, fmt.Errorf("mounting read/only store images is not allowed: %w", ErrLayerUnknown)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t read this carefully, just a skim: Doesn’t this mean that it will be completely impossible to push an image which contains base layers in an additional image store and some child layers in the primary store, because the base layers will not be exportable?

Or is the assumption that the graph driver will iterate through the additional stores without the top levels knowing? In that case 1) this might not be reachable, but also 2) that completely gives up on store locking; why is that OK?

(I know I don’t understand the AIS indirections, possible combinations, and locking, and quite possibly I would save everyone time by just staying out of this.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the Diff/Changes operation can trigger a mount, we didn't notice the race before because native overlay doesn't need it (we just access the layer diff), but how can we ensure two processes do not step on each other when mount/unmount are involved?

Copy link
Collaborator

@mtrmac mtrmac Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I don’t know how to solve this.

I continue to be confused about how AIS is supposed to work, I’m just noting that the PR, as is now, means that either things that used to work will stop working, or the PR is adding unnecessary code (and perhaps locking problems).


It seems to me that the basic situation is not even really additional-store related that much. See the LOCKING BUG comments in layers.go, isn’t that the same issue? (Again, saying that doesn’t mean I know how to fix it.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I relax the condition for allow mount from other stores but keep the writing lock for the rw store?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the primary driver is consulting AIS layers anyway, I don’t think that checking the AIS layers here, again, makes things any better.

But really my first intuition is that the graph driver should not be accessing the AIS layers itself, in the first place — but then that’s clearly not reasonable for mounting a child layer depending on a parent in AIS.

I’m just too confused about the code to have an opinion.

@@ -1875,13 +1877,21 @@ func (d *Driver) get(id string, disableShifting bool, options graphdriver.MountO
return mergedDir, nil
}

func (d *Driver) getMergedDir(id, dir string, inAdditionalStore bool) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one seems worthy of a doc comment, say

/// getMergedDir returns the directory path that should be used as the mount point for the overlayfs

@@ -1875,13 +1877,21 @@ func (d *Driver) get(id string, disableShifting bool, options graphdriver.MountO
return mergedDir, nil
}

func (d *Driver) getMergedDir(id, dir string, inAdditionalStore bool) string {
if inAdditionalStore {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also

Suggested change
if inAdditionalStore {
// The merged directory may not exist in an additional store, so use a transient mount point in `/run` instead.
if inAdditionalStore {

or so?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

… and if locking is a concern, and is happening in a non-obvious way, that also should be written down somewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason for using the path under /run is that we hold only a read-only lock for the additional stores, so different processes can race when mounting/unmounting there and that is the root cause for #2033

if inAdditionalStore {
// check the base name for extra safety
if strings.HasPrefix(mountpoint, d.runhome) && filepath.Base(mountpoint) == "merged" {
err := os.RemoveAll(filepath.Dir(mountpoint))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should never actually be any content underneath the merged directory, right? We should be able to safely just use os.Remove I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can split this one in os.Remove(mountpoint); os.Remove(filepath.Dir(mountpoint)), I thought the RemoveAll version was more readable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's clearly more readable to use RemoveAll, but the two-step Remove is a much stronger "extra safety" than checking the pathname. I don't have a strong opinion though.

store.go Outdated Show resolved Hide resolved
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe giuseppe force-pushed the fix-race-condition-naive-diff branch from 5c98373 to 120642a Compare July 17, 2024 17:42
@giuseppe giuseppe force-pushed the fix-race-condition-naive-diff branch from 120642a to 05659bf Compare July 17, 2024 18:24
Copy link
Collaborator

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A higher level comment, looking at the store changes:

  • This adds exclusive locking to Diff. IIRC previously we have tried fairly hard to make that only hold a shared lock. I don’t know anything specific about the performance impact, apart from vaguely remembering this is supposed to be a big deal. Do we know that this is actually fine?

  • A blocker for formal reasons: If I understand it correctly, the overlay graph driver how requires layerStore.Diff / layerStore.Changes to hold the layer store lock for writing for the primary store. If that’s correct, I think that must be documented in the layerStore methods (// Requires startReading or startWriting.) — and due to the unusual design, including the reason; typically we only require “writing” when modifying layerStore state, so if this is the overlay driver requiring exclusivity, that reason must be documented there as well.

    • On second thought, wouldn’t it be viable for overlay to introduce its own special-purpose lock? Such a lock could only be used use it in the relevant cases, instead of this being visible at the top level.

@@ -2930,15 +2930,40 @@ func (s *store) Unmount(id string, force bool) (bool, error) {
}

func (s *store) Changes(from, to string) ([]archive.Change, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn’t find anything, apart from cmd here, that is using Changes. It’s tempting to think that it should be deprecated or somehow made private instead of expanding functionality. That’s probably not this PR.

drivers/overlay/overlay.go Outdated Show resolved Hide resolved
@mtrmac
Copy link
Collaborator

mtrmac commented Jul 17, 2024

IIRC previously we have tried fairly hard to make that only hold a shared lock.

The obvious impact is that serializing the Diff generation would also serialize pushes to basically happen one layer at a time. (I don’t know whether there are other impacts, maybe on builds.)

  • On second thought, wouldn’t it be viable for overlay to introduce its own special-purpose lock? Such a lock could only be used use it in the relevant cases, instead of this being visible at the top level.

Thinking about this a bit more, for concurrency, we would want the mounts to be reference-counted; and automatically cleaned up on crash; and at that point sharing cross-process is ~free. And we already have that kind of infrastructure in layerStore around mountsPath

… except that it does not apply to the read-only AIS layers, at all. And, of course, the LOCKING BUG situation would apply.

@mtrmac
Copy link
Collaborator

mtrmac commented Jul 17, 2024

And we already have that kind of infrastructure in layerStore around mountsPath

… and it’s actually being used in this Diff path. (Except that I don’t know how that works for the AIS layers.) Anyway, I am clearly not paying enough attention, I’ll shut up.

when NaiveDiff is used, the Diff/Changes operations can trigger the
mount of the layer.  Prevent that multiple processes step on each
other and one of them performs an unmount while the other one is still
accessing the mount.

Closes: containers#2033

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe giuseppe force-pushed the fix-race-condition-naive-diff branch from 05659bf to 1e60d87 Compare July 17, 2024 20:12
giuseppe added a commit to giuseppe/libpod that referenced this pull request Jul 17, 2024
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
giuseppe added a commit to giuseppe/buildah that referenced this pull request Jul 17, 2024
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe
Copy link
Member Author

giuseppe commented Jul 17, 2024

A higher level comment, looking at the store changes:

  • This adds exclusive locking to Diff. IIRC previously we have tried fairly hard to make that only hold a shared lock. I don’t know anything specific about the performance impact, apart from vaguely remembering this is supposed to be a big deal. Do we know that this is actually fine?

but a Diff can trigger a mount, as it is in the race condition I am trying to fix; so the locking here shouldn't be too different from what we do in store.mount. A shared lock works only under the assumption that there are no mounts involved, so it is practically broken for anything else than plain native overlay and vfs (and even vfs calls into NaiveDiff, but the mount for vfs is a nop).
. This could be improved later, exposing whether the driver needs a new mount or can do the diff without changing the state of the store.

  • A blocker for formal reasons: If I understand it correctly, the overlay graph driver how requires layerStore.Diff / layerStore.Changes to hold the layer store lock for writing for the primary store. If that’s correct, I think that must be documented in the layerStore methods (// Requires startReading or startWriting.) — and due to the unusual design, including the reason; typically we only require “writing” when modifying layerStore state, so if this is the overlay driver requiring exclusivity, that reason must be documented there as well.

it is every driver except plain native overlay to require that, and vfs since there is no "mount" operation for vfs.

  • On second thought, wouldn’t it be viable for overlay to introduce its own special-purpose lock? Such a lock could only be used use it in the relevant cases, instead of this being visible at the top level.

Every driver, except vfs and native overlay, require a "mount" operation to access the files in the layer, so IMO it is better to make it the default behavior

use a private "merged" directory when mounting from an additional
store.

Operations like "Diff()" and "Changes()" cause an implicit mount when
the naive differ is used.

The issue was not observed earlier because native overlay can achieve
these operations without using a mount.

Since these mounts are performed read-only, and overlay supports
multiple mounts using the same lowerdirs, use a private location for
the "merged" directory.  The location is owned by the current
writeable store, that is locked for writing.

Closes: containers#2033

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe giuseppe force-pushed the fix-race-condition-naive-diff branch from 1e60d87 to 683e806 Compare July 17, 2024 21:29
@giuseppe
Copy link
Member Author

@giuseppe
Copy link
Member Author

@cgwalters @rhatdan @mtrmac if there are no blockers, let's get this one and #2031 merged so we can enable the Podman CI, we can fix/improve things later but at least we have a more solid base knowing the e2e tests pass

@cgwalters cgwalters changed the title overlay: use private merged directory overlay: use private merged directory for AIS Jul 18, 2024
@cgwalters
Copy link
Contributor

I'm not going to pretend I understand all this code well, but narrowing in on this comment:

Since these mounts are performed read-only, and overlay supports multiple mounts using the same lowerdirs, use a private location for the "merged" directory. The location is owned by the current writeable store, that is locked for writing.

The model we have with AIS is basically that there's only one writable store, correct? In that case, I think I understand how this fixes the issue.

Thinking about this a bit more, for concurrency, we would want the mounts to be reference-counted; and automatically cleaned up on crash; and at that point sharing cross-process is ~free. And we already have that kind of infrastructure in layerStore around mountsPath …

And you're saying we're not using that code though here? I'm not sure if we are or not, but it looks to me like this patch is mostly just modifying the location of the mounts.

@cgwalters
Copy link
Contributor

I guess bottom line I can just say
/approve
from my PoV, not totally comfortable with a lgtm myself

Copy link
Contributor

openshift-ci bot commented Jul 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mtrmac
Copy link
Collaborator

mtrmac commented Jul 18, 2024

A higher level comment, looking at the store changes:

  • This adds exclusive locking to Diff. IIRC previously we have tried fairly hard to make that only hold a shared lock. I don’t know anything specific about the performance impact, apart from vaguely remembering this is supposed to be a big deal. Do we know that this is actually fine?

but a Diff can trigger a mount, as it is in the race condition I am trying to fix; so the locking here shouldn't be too different from what we do in store.mount.

Right, and we can’t do an exclusive lock to mount, then downgrade to a shared lock for creating the diff (that part could work), and then upgrade again to unmount (we don’t have code that could do this, at least ATM, and I’m not immediately sure that can be done without risking deadlocks). I understand, at least in general terms, the difficulty.

This could be improved later, exposing whether the driver needs a new mount or can do the diff without changing the state of the store.

Right now, making Diffs for all users of overlay slower, in order to fix composefs (which doesn’t have in-production users) seems to me to be the wrong tradeoff.

If composefs Diffs have to be serialized (or if they have to be serialized for a few months until we figure something out), fine, that’s a valid trade-off that users can make when deciding whether to deploy composefs. It seems to me that this just doesn’t justify slowing down overlay for all existing systems, including existing enterprise deployments.

So I think we need to figure out a way to keep this read-locked for overlay, and until we do, composefs just isn’t ready. But I’m fully open to being overridden on this.


  • A blocker for formal reasons: If I understand it correctly, the overlay graph driver how requires layerStore.Diff / layerStore.Changes to hold the layer store lock for writing for the primary store. If that’s correct, I think that must be documented in the layerStore methods (// Requires startReading or startWriting.) — and due to the unusual design, including the reason; typically we only require “writing” when modifying layerStore state, so if this is the overlay driver requiring exclusivity, that reason must be documented there as well.

it is every driver except plain native overlay to require that, and vfs since there is no "mount" operation for vfs.

The reason I am considering this paragraph “blocker” is that I want layerStore comments about the locking to precisely match what the rules are. This PR has changed the rules but not updated the comments. (Sure, that might be premature if we decide to instead change the approach entirely.)

(This is the only “blocker” from my side, to the extent that I’m ignorant about drivers/overlay and AIS, and staying out of that problem space. But I do insist on the locking documentation being maintained.)

@mtrmac
Copy link
Collaborator

mtrmac commented Jul 18, 2024

we would want the mounts to be reference-counted …. And we already have that kind of infrastructure in layerStore around mountsPath …

And you're saying we're not using that code though here?

I was mistaken here. We are using the mounts reference counting … but that reference counting at the primary store also, AFAICS, assumes that layer records exist on the primary store; looking at that yesterday, I didn’t see how it can work for counting additional-store layers that are not known to the primary store. But, also, I’m unfamiliar with the details, and strongly time-constrained right now; this might well be fine.

@cgwalters
Copy link
Contributor

One thing I would say here is I see an intersection with #2018 ...if what we mounted was always a single flattened composefs image, the only cross-store dependency would be refcounting of the underlying files behind the composefs (cc containers/composefs#125 ), but with a read-only additional store we would be guaranteed that the underlying objects wouldn't go away.

@giuseppe
Copy link
Member Author

If composefs Diffs have to be serialized (or if they have to be serialized for a few months until we figure something out), fine, that’s a valid trade-off that users can make when deciding whether to deploy composefs. It seems to me that this just doesn’t justify slowing down overlay for all existing systems, including existing enterprise deployments.

this is probably already broken with fuse-overlayfs, just nobody hit the race, because with fuse-overlayfs we are using the naive diff (thus causing the mount)

@mtrmac
Copy link
Collaborator

mtrmac commented Jul 18, 2024

If composefs Diffs have to be serialized (or if they have to be serialized for a few months until we figure something out), fine, that’s a valid trade-off that users can make when deciding whether to deploy composefs. It seems to me that this just doesn’t justify slowing down overlay for all existing systems, including existing enterprise deployments.

this is probably already broken with fuse-overlayfs, just nobody hit the race, because with fuse-overlayfs we are using the naive diff (thus causing the mount)

If this is pointing at the start of overlay.Driver.Diff’s call of useNaiveDiff, that is a convincing argument.

… but I’m a bit lost at how that happens. AFAICS all calls (in Podman) are Diff("", id), i.e. against the immediate parent, and overlay (unconditionally) implements DiffGetter; i.e. overlay.Driver.Diff is not typically used at all, and that useNaiveDiff doesn’t make a difference. The only common way overlay.Driver.Diff could be invoked is if the layer doesn’t have tar-split data (i.e. it is a running container??). Is that what we are talking about here? Or is there something completely different going on?

@mtrmac
Copy link
Collaborator

mtrmac commented Jul 18, 2024

we would want the mounts to be reference-counted …. And we already have that kind of infrastructure in layerStore around mountsPath …

And you're saying we're not using that code though here?

I was mistaken here.

… and mistaken again? The layerStore.DiffnewFileGetter path triggers reference-counted mounts. But we seem to be talking about layerStore.Diffoverlay.Driver.Diffdrivers.NaiveDiffDriver.Diff, and that one does not involve the reference counting.

I’m sorry. I’ve not been helping.

@rhatdan
Copy link
Member

rhatdan commented Jul 22, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jul 22, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 1bf05dd into containers:main Jul 22, 2024
18 checks passed
mtrmac added a commit that referenced this pull request Aug 30, 2024
... as modified by #2036 .

Also document a LOCKING BUG I don't now care to fix.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
@mtrmac
Copy link
Collaborator

mtrmac commented Aug 30, 2024

The reason I am considering this paragraph “blocker” is that I want layerStore comments about the locking to precisely match what the rules are.… But I do insist on the locking documentation being maintained.)

#2080 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

composefs?? random ENOENTS in CI
4 participants