"podman image rm - concurrent with shared layers" seems racy #18659

Closed
mtrmac opened this issue May 22, 2023 · 8 comments · Fixed by #18664
Labels
flakes (Flakes from Continuous Integration) · kind/bug (Categorizes issue or PR as related to a bug.) · locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments.)

Comments

@mtrmac
Collaborator

mtrmac commented May 22, 2023

Issue Description

#18631 links to a test failure.

AFAICS the test was added in #9266, with code that was supposed to make podman rm more robust against concurrent removals.

But the test does not quite test that: it tests concurrent (build + remove).

AFAICS this means that during a build, a layer can be created, and concurrently an image removal can remove that layer as dangling. (During builds, we don’t have a “reference” pointing to a created layer until an image is created pointing at a layer stack; that’s racy vs. concurrent removals.)
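
To make the window concrete, here is a minimal Go sketch of the suspected sequence; the helpers are hypothetical stand-ins, not the actual containers/storage API or Podman code:

```go
package sketch

// Hypothetical stand-ins for the containers/storage operations; each call
// takes and releases the store lock on its own, so nothing is held across
// the window between them.
func createLayer(parent string) string { return "new-layer-id" }
func createImage(layerID string)       {}
func danglingLayers() []string         { return nil }
func deleteLayer(id string)            {}

// What a build does, conceptually.
func build() {
	layerID := createLayer("base") // the layer exists, but no image references it yet
	// ... RUN instructions execute here; this window can be long ...
	createImage(layerID) // only now does an image "pin" the layer
}

// What a concurrent "podman image rm" can do inside that window: the build's
// new layer has no image pointing at it, so it looks dangling and is removed,
// and the build then fails.
func concurrentRemoval() {
	for _, id := range danglingLayers() {
		deleteLayer(id)
	}
}
```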

Arguably this build-vs-rm race is a user-visible problem we should fix, but it’s not trivially obvious how.

Meanwhile, the test is failing on something that, AFAICS, it wasn’t intended to trigger.

Steps to reproduce the issue

  1. Read https://api.cirrus-ci.com/v1/artifact/task/5379214719844352/html/int-podman-fedora-38-root-host-boltdb.log.html#t--podman-image-rm-concurrent-with-shared-layers--1
  2. Read the test code
  3. Ponder

Describe the results you received

A test fails in the “build” phase, after at least one removal has happened.

Describe the results you expected

The test not failing

podman info output

I guess https://api.cirrus-ci.com/v1/artifact/task/5379214719844352/html/int-podman-fedora-38-root-host-boltdb.log.html#t--podman-image-rm-concurrent-with-shared-layers--1 implicitly contains that data

Podman in a container

No

Privileged Or Rootless

None

Upstream Latest Release

Yes

Additional environment details

Additional information

Originally discussed in #18631 (comment) .

@mtrmac added the kind/bug label May 22, 2023
@flouthoc
Collaborator

Shouldn't storage have write-locks on layers created by a build, till the build ends? I.e., other instances should be able to read but not delete layers till the build process unlocks them.

@edsantiago
Member

So, I think this might be the same as #18449. Should I close that one, pointing here?

@mtrmac
Collaborator Author

mtrmac commented May 23, 2023

Shouldn't storage have write-locks on layers created by a build, till the build ends?

c/storage locks are only held for the duration of a c/storage operation. Creating a layer is one operation; creating an image pointing at that layer is another operation, and the layer can be removed in the meantime.

I.e., other instances should be able to read but not delete layers till the build process unlocks them.

A write lock prevents concurrent reads. (And a read lock wouldn’t work because it can’t be reliably upgraded to a write lock.)
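
To illustrate why the upgrade isn’t reliable, here is a minimal sketch using Go’s sync.RWMutex as a stand-in for the c/storage locking; this is not the actual locking code:

```go
package sketch

import "sync"

// The decision is made under the read lock, the lock has to be dropped to
// "upgrade", and the world can change in between, so the check must be redone
// under the write lock anyway; the read lock bought nothing.
func removeIfDangling(mu *sync.RWMutex, isDangling func(string) bool, remove func(string), id string) {
	mu.RLock()
	dangling := isDangling(id) // decided while holding only the read lock
	mu.RUnlock()

	// window: another process may create an image referencing the layer here

	mu.Lock()
	defer mu.Unlock()
	if dangling && isDangling(id) { // re-check; the read-lock result may be stale
		remove(id)
	}
}
```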


I think we should, long-term, solve that somehow, maybe by introducing “leases” on layers created during an image creation so that a podman rm doesn’t remove a layer that was “leased” in the past 5? 10? minutes. (And then we would need to think about builds crashing and cleanup.) But that’s not this issue; this one is just about the flaking test of concurrent removal, not concurrent builds.
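
A very rough sketch of what such a lease table could look like; nothing like this exists in c/storage today, and every name here is hypothetical:

```go
package sketch

import (
	"sync"
	"time"
)

// layerLeases is purely illustrative.
type layerLeases struct {
	mu     sync.Mutex
	leases map[string]time.Time // layer ID -> when the lease was taken
	ttl    time.Duration        // e.g. 5 or 10 minutes, as suggested above
}

// Lease would be called by a build right after it creates a layer, before any
// image references it.
func (l *layerLeases) Lease(layerID string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.leases == nil {
		l.leases = map[string]time.Time{}
	}
	l.leases[layerID] = time.Now()
}

// MayDelete would be consulted by image removal before pruning a dangling
// layer. Expired leases are the crash-cleanup story: a build that died simply
// stops renewing its lease, and the layer becomes removable again after ttl.
func (l *layerLeases) MayDelete(layerID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	leased, ok := l.leases[layerID]
	return !ok || time.Since(leased) > l.ttl
}
```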

mtrmac added a commit to mtrmac/libpod that referenced this issue May 23, 2023
This test is intended to test concurrent removals, so don't
risk a removal breaking a build.

Fixes containers#18659 .

(The situation that removals can break an in-progress build is a real
problem that should be fixed, but that's not a target of this test.)

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
@mtrmac
Collaborator Author

mtrmac commented May 23, 2023

#18664 hopefully fixes this flake.


So, I think this might be the same as #18449

It is, AFAICS.

@edsantiago added the flakes label May 23, 2023
@edsantiago
Member

Here's a slightly new symptom from the same test, a different flake, seen in f38 root:

podman image rm - concurrent with shared layers
...
time="2023-05-18T16:06:46-05:00"
level=warning
msg="Failed to determine if an image is a parent: reading image \"8eda20fb4f68dc6acb3aabb16d4c11b38db82116782ea2866c79b2b602f9e910\":
     locating image with ID \"8eda20fb4f68dc6acb3aabb16d4c11b38db82116782ea2866c79b2b602f9e910\":
     image not known, ignoring the error"

I'm assuming it's the same root cause.

And here's the full list of flakes this month:

  • fedora-38 : int podman fedora-38 root host boltdb
    • 05-22 15:54 in podman image rm - concurrent with shared layers
  • fedora-38 : int podman fedora-38 root host sqlite
    • 05-18 17:32 in podman image rm - concurrent with shared layers
    • 05-06 17:07 in podman image rm - concurrent with shared layers
  • fedora-38 : int remote fedora-38 root host boltdb [remote]
    • 05-10 23:46 in podman image rm - concurrent with shared layers
  • rawhide : int remote rawhide root host sqlite [remote]
    • 05-03 16:16 in podman image rm - concurrent with shared layers

@mtrmac
Collaborator Author

mtrmac commented May 23, 2023

Failed to determine if an image is a parent

That is actually part of the removal code, not caused by a removal breaking a concurrent build. It’s quite a bit closer to the test demonstrating a bug in the code under test (“rm concurrent with shared layers”), except that

  • it’s just a warning, and
  • the code that is failing is the more generic HasChildren. It’s quite possible that computing that should also silently ignore concurrently-removed images in all cases, but the argument for that is a bit weaker than the argument that a removal should silently ignore concurrently-removed images.

I think in that case the underlying failure is really

Error: no contents in "/tmp/podman_test1709958300/Dockerfile"

and that’s, I think, a test bug:

dockerfilePath := filepath.Join(p.TempDir, "Dockerfile")
uses a hard-coded temporary file name shared across a PodmanTestIntegration, while
podmanTest.BuildImage(containerfile, imageName, "false")
uses a single PodmanTestIntegration across 10 goroutines; i.e. they all overwrite each other’s Dockerfile, and they remove each other’s Dockerfile.

I don’t immediately understand the Podman test harness well enough to say whether BuildImage should not be using a hard-coded Dockerfile name, or whether the goroutines should each own their own PodmanTestIntegration.
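
For illustration only, one possible shape of the harness-side fix, assuming each goroutine writes its Containerfile into a private directory; this is a sketch, not the actual BuildImage code:

```go
package sketch

import (
	"os"
	"path/filepath"
	"sync"
)

// Each goroutine gets its own temporary directory, so the hard-coded
// "Dockerfile" name is no longer shared state between the parallel builds.
func buildAllConcurrently(baseTempDir string, containerfiles []string) {
	var wg sync.WaitGroup
	for _, cf := range containerfiles {
		wg.Add(1)
		go func(contents string) {
			defer wg.Done()
			dir, err := os.MkdirTemp(baseTempDir, "build-") // private per goroutine
			if err != nil {
				return
			}
			path := filepath.Join(dir, "Dockerfile")
			if err := os.WriteFile(path, []byte(contents), 0o600); err != nil {
				return
			}
			// run "podman build -f <path> ..." here; no other goroutine can
			// overwrite or remove this Dockerfile
		}(cf)
	}
	wg.Wait()
}
```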

@edsantiago
Member

Ah, thanks for that explanation. I guess if it happens again, the solution is easy: serialize the builds. There's not really any reason to parallelize those.
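
For reference, a sketch of what the serialized variant could look like (illustrative function names, not the actual Ginkgo test):

```go
package sketch

import "sync"

// Builds run one after another, so no removal can race with a half-built
// image; only the removals, which are the code actually under test, run
// concurrently against the shared layers.
func testConcurrentRemoval(buildImage, removeImage func(name string), names []string) {
	for _, n := range names {
		buildImage(n) // serialized builds
	}

	var wg sync.WaitGroup
	for _, n := range names {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			removeImage(name) // concurrent removals
		}(n)
	}
	wg.Wait()
}
```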

@mtrmac
Collaborator Author

mtrmac commented May 23, 2023

That’s a great solution.

@github-actions bot added the locked - please file new issue/PR label Aug 23, 2023
@github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023