"podman image rm - concurrent with shared layers" seems racy #18659
Shouldn't storage have write locks on layers created by a build until the build ends? I.e., other instances should be able to read but not delete layers until the build process unlocks them. |
So, I think this might be the same as #18449. Should I close that one, pointing here? |
c/storage locks are only held for the duration of a c/storage operation. Creating a layer is one operation; creating an image pointing at that layer is another operation, and the layer can be removed in the meantime.
A write lock prevents concurrent reads. (And a read lock wouldn’t work because it can’t be reliably upgraded to a write lock.) I think we should, long-term, solve that somehow, maybe by introducing “leases” on layers created during an image creation, so that a concurrent removal can’t delete them before the image referencing them exists. |
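To make that window concrete, here is a self-contained Go simulation. All names in it (store, CreateLayer, CreateImage, RemoveDangling) are hypothetical stand-ins, not the real c/storage API; the point is only that each operation takes and releases its own lock, so nothing spans the gap between creating a layer and creating the image that references it.

```go
// Simulation of the race window: each "store operation" locks and
// unlocks independently, as described in the comment above.
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu     sync.Mutex        // per-operation lock, as with the c/storage store lock
	layers map[string]bool   // layer ID -> exists
	images map[string]string // image ID -> top layer ID
}

func (s *store) CreateLayer(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.layers[id] = true
}

func (s *store) CreateImage(id, layer string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.layers[layer] {
		return fmt.Errorf("layer %q does not exist", layer) // the build-side failure
	}
	s.images[id] = layer
	return nil
}

// RemoveDangling deletes every layer that no image references.
func (s *store) RemoveDangling() {
	s.mu.Lock()
	defer s.mu.Unlock()
	referenced := map[string]bool{}
	for _, layer := range s.images {
		referenced[layer] = true
	}
	for layer := range s.layers {
		if !referenced[layer] {
			delete(s.layers, layer)
		}
	}
}

func main() {
	s := &store{layers: map[string]bool{}, images: map[string]string{}}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // the "build"
		defer wg.Done()
		s.CreateLayer("layer1")
		// Window: no lock is held here, and no image references layer1 yet.
		if err := s.CreateImage("image1", "layer1"); err != nil {
			fmt.Println("build failed:", err)
		}
	}()
	go func() { // a concurrent removal of dangling layers
		defer wg.Done()
		s.RemoveDangling() // may run inside the window and delete layer1
	}()
	wg.Wait()
}
```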
This test is intended to test concurrent removals, so don't risk a removal breaking a build. Fixes containers#18659. (The situation that removals can break an in-progress build is a real problem that should be fixed, but that's not a target of this test.) Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Here's a slightly new symptom from the same test, a different flake, seen in f38 root:
I'm assuming it's the same root cause. And here's the full list of flakes this month:
|
That is actually a part of the removal code, not caused by a removal breaking a concurrent build. It’s quite a bit closer to the test demonstrating a bug in the code under test (“rm concurrent with shared layers”), except that …
I think in that case the underlying failure is really …
and that’s, I think, a test bug: the build writes the Dockerfile into a directory owned by the PodmanTestIntegration (podman/test/e2e/common_test.go line 1178 in b635292), while the test shares a single PodmanTestIntegration across 10 goroutines (line 276 in b635292); i.e. they all overwrite each other’s Dockerfile, and they remove each other’s Dockerfile.
I don’t immediately understand the Podman test harness enough to say whether … |
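As a hypothetical illustration (heavily simplified; build, the file layout, and the harness shape here are stand-ins, not the real e2e code), the shared-Dockerfile hazard looks roughly like this:

```go
// Ten goroutines share one fixture, hence one Dockerfile path, so
// their builds overwrite and remove each other's Dockerfile.
package main

import (
	"os"
	"path/filepath"
	"sync"
)

func build(dockerfile string) error {
	if err := os.WriteFile(dockerfile, []byte("FROM scratch\n"), 0o644); err != nil {
		return err
	}
	// Another goroutine may overwrite the file here, before the build reads it.
	defer os.Remove(dockerfile) // and this removes whatever was written last
	// ... the real harness would now run "podman build -f dockerfile" ...
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "shared-fixture")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	dockerfile := filepath.Join(dir, "Dockerfile") // one path shared by all goroutines: the bug
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = build(dockerfile)
		}()
	}
	wg.Wait()
}
```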
Ah, thanks for that explanation. I guess if it happens again, the solution is easy: serialize the builds. There's not really any reason to parallelize those. |
That’s a great solution. |
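A minimal sketch of that fix, with hypothetical buildImage/removeImage stand-ins for the real test steps: run the builds one after another, and parallelize only the removals, which are what the test is actually about.

```go
// Builds run serialized; only the removals, the actual subject of
// the test, run concurrently.
package main

import "sync"

func buildImage(tag string)  { /* run the build for tag */ }
func removeImage(tag string) { /* run "podman image rm" for tag */ }

func main() {
	tags := []string{"img0", "img1", "img2"} // ... up to 10 in the real test
	for _, tag := range tags {               // serialized: no build races another build or a removal
		buildImage(tag)
	}
	var wg sync.WaitGroup
	for _, tag := range tags { // only the removals are concurrent
		wg.Add(1)
		go func(tag string) {
			defer wg.Done()
			removeImage(tag)
		}(tag)
	}
	wg.Wait()
}
```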
Issue Description
#18631 links to a test failure.
AFAICS the test was added in #9266, with code that was supposed to make `podman rm` more robust against concurrent removals. But the test does not quite test that: it tests concurrent (build + remove).
AFAICS this means that during a build, a layer can be created, and concurrently an image removal can remove that layer as dangling. (During builds, we don’t have a “reference” pointing to a created layer until an image is created pointing at a layer stack; that’s racy vs. concurrent removals.)
Arguably this build-vs-`rm` race is a user-visible problem we should fix, but it’s not trivially obvious how. Meanwhile, the test is failing on something that, AFAICS, it wasn’t intended to trigger.
Steps to reproduce the issue
Describe the results you received
A test fails in the “build” phase, after at least one removal has happened.
Describe the results you expected
The test not failing
podman info output
I guess https://api.cirrus-ci.com/v1/artifact/task/5379214719844352/html/int-podman-fedora-38-root-host-boltdb.log.html#t--podman-image-rm-concurrent-with-shared-layers--1 implicitly contains that data
Podman in a container
No
Privileged Or Rootless
None
Upstream Latest Release
Yes
Additional environment details
Additional information
Originally discussed in #18631 (comment).