"podman image rm - concurrent with shared layers" seems racy #18659

Closed
mtrmac opened this issue May 22, 2023 · 8 comments · Fixed by #18664
Labels
flakes (Flakes from Continuous Integration) · kind/bug (Categorizes issue or PR as related to a bug.) · locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments.)

Comments

@mtrmac
Collaborator

mtrmac commented May 22, 2023

Issue Description

#18631 links to a test failure.

AFAICS the test was added in #9266, with code that was supposed to make podman rm more robust against concurrent removals.

But the test does not quite test that: it tests concurrent (build + remove).

AFAICS this means that during a build, a layer can be created, and concurrently an image removal can remove that layer as dangling. (During builds, we don’t have a “reference” pointing to a created layer until an image is created pointing at a layer stack; that’s racy vs. concurrent removals.)
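
To make the window concrete, here is a minimal Go sketch of the suspected sequence; the helpers are hypothetical stand-ins, not the actual containers/storage API or Podman code:

```go
package sketch

// Hypothetical stand-ins for the containers/storage operations; each call
// takes and releases the store lock on its own, so nothing is held across
// the window between them.
func createLayer(parent string) string { return "new-layer-id" }
func createImage(layerID string)       {}
func danglingLayers() []string         { return nil }
func deleteLayer(id string)            {}

// What a build does, conceptually.
func build() {
	layerID := createLayer("base") // the layer exists, but no image references it yet
	// ... RUN instructions execute here; this window can be long ...
	createImage(layerID) // only now does an image "pin" the layer
}

// What a concurrent "podman image rm" can do inside that window: the build's
// new layer has no image pointing at it, so it looks dangling and is removed,
// and the build then fails.
func concurrentRemoval() {
	for _, id := range danglingLayers() {
		deleteLayer(id)
	}
}
```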

Arguably this build-vs-rm race is a user-visible problem we should fix, but it’s not trivially obvious how.

Meanwhile, the test is failing on something that, AFAICS, it wasn’t intended to trigger.

Steps to reproduce the issue

  1. Read https://api.cirrus-ci.com/v1/artifact/task/5379214719844352/html/int-podman-fedora-38-root-host-boltdb.log.html#t--podman-image-rm-concurrent-with-shared-layers--1
  2. Read the test code
  3. Ponder

Describe the results you received

A test fails in the “build” phase, after at least one removal has happened.

Describe the results you expected

The test not failing

podman info output

I guess https://api.cirrus-ci.com/v1/artifact/task/5379214719844352/html/int-podman-fedora-38-root-host-boltdb.log.html#t--podman-image-rm-concurrent-with-shared-layers--1 implicitly contains that data

Podman in a container

No

Privileged Or Rootless

None

Upstream Latest Release

Yes

Additional environment details

Additional information

Originally discussed in #18631 (comment) .

@mtrmac added the kind/bug label May 22, 2023
@flouthoc
Collaborator

Shouldn't storage have write-locks on layers created by a build, till the build ends? I.e., other instances should be able to read but not delete layers till the build process unlocks them.

@edsantiago
Member

So, I think this might be the same as #18449. Should I close that one, pointing here?

@mtrmac
Collaborator Author

mtrmac commented May 23, 2023

Shouldn't storage have write-locks on layers created by a build, till the build ends?

c/storage locks are only held for the duration of a c/storage operation. Creating a layer is one operation; creating an image pointing at that layer is another operation, and the layer can be removed in the meantime.

I.e., other instances should be able to read but not delete layers till the build process unlocks them.

A write lock prevents concurrent reads. (And a read lock wouldn’t work because it can’t be reliably upgraded to a write lock.)
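
To illustrate why the upgrade isn’t reliable, here is a minimal sketch using Go’s sync.RWMutex as a stand-in for the c/storage locking; this is not the actual locking code:

```go
package sketch

import "sync"

// The decision is made under the read lock, the lock has to be dropped to
// "upgrade", and the world can change in between, so the check must be redone
// under the write lock anyway; the read lock bought nothing.
func removeIfDangling(mu *sync.RWMutex, isDangling func(string) bool, remove func(string), id string) {
	mu.RLock()
	dangling := isDangling(id) // decided while holding only the read lock
	mu.RUnlock()

	// window: another process may create an image referencing the layer here

	mu.Lock()
	defer mu.Unlock()
	if dangling && isDangling(id) { // re-check; the read-lock result may be stale
		remove(id)
	}
}
```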


I think we should, long-term, solve that somehow, maybe by introducing “leases” on layers created during an image creation so that a podman rm doesn’t remove a layer that was “leased” in the past 5? 10? minutes. (And then we would need to think about builds crashing and cleanup.) But that’s not this issue; this one is just about the flaking test of concurrent removal, not concurrent builds.
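
A very rough sketch of what such a lease table could look like; nothing like this exists in c/storage today, and every name here is hypothetical:

```go
package sketch

import (
	"sync"
	"time"
)

// layerLeases is purely illustrative.
type layerLeases struct {
	mu     sync.Mutex
	leases map[string]time.Time // layer ID -> when the lease was taken
	ttl    time.Duration        // e.g. 5 or 10 minutes, as suggested above
}

// Lease would be called by a build right after it creates a layer, before any
// image references it.
func (l *layerLeases) Lease(layerID string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.leases == nil {
		l.leases = map[string]time.Time{}
	}
	l.leases[layerID] = time.Now()
}

// MayDelete would be consulted by image removal before pruning a dangling
// layer. Expired leases are the crash-cleanup story: a build that died simply
// stops renewing its lease, and the layer becomes removable again after ttl.
func (l *layerLeases) MayDelete(layerID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	leased, ok := l.leases[layerID]
	return !ok || time.Since(leased) > l.ttl
}
```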

mtrmac added a commit to mtrmac/libpod that referenced this issue May 23, 2023
This test is intended to test concurrent removals, so don't
risk a removal breaking a build.

Fixes containers#18659 .

(The situation that removals can break an in-progress build is a real
problem that should be fixed, but that's not a target of this test.)

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
@mtrmac
Collaborator Author

mtrmac commented May 23, 2023

#18664 hopefully fixes this flake.


So, I think this might be the same as #18449

It is, AFAICS.

@edsantiago added the flakes label May 23, 2023
@edsantiago
Member

Here's a slightly new symptom from the same test, a different flake, seen in f38 root:

podman image rm - concurrent with shared layers
...
time="2023-05-18T16:06:46-05:00"
level=warning
msg="Failed to determine if an image is a parent: reading image \"8eda20fb4f68dc6acb3aabb16d4c11b38db82116782ea2866c79b2b602f9e910\":
     locating image with ID \"8eda20fb4f68dc6acb3aabb16d4c11b38db82116782ea2866c79b2b602f9e910\":
     image not known, ignoring the error"

I'm assuming it's the same root cause.

And here's the full list of flakes this month:

  • fedora-38 : int podman fedora-38 root host boltdb
    • 05-22 15:54 in podman image rm - concurrent with shared layers
  • fedora-38 : int podman fedora-38 root host sqlite
    • 05-18 17:32 in podman image rm - concurrent with shared layers
    • 05-06 17:07 in podman image rm - concurrent with shared layers
  • fedora-38 : int remote fedora-38 root host boltdb [remote]
    • 05-10 23:46 in podman image rm - concurrent with shared layers
  • rawhide : int remote rawhide root host sqlite [remote]
    • 05-03 16:16 in podman image rm - concurrent with shared layers

@mtrmac
Collaborator Author

mtrmac commented May 23, 2023

Failed to determine if an image is a parent

That is actually part of the removal code, not caused by a removal breaking a concurrent build. It’s quite a bit closer to the test demonstrating a bug in the code under test (“rm concurrent with shared layers”), except that

  • it’s just a warning, and
  • the code that is failing is the more generic HasChildren. It’s quite possible that computing that should also silently ignore concurrently-removed images in all cases, but the argument for that is a bit weaker than the argument that a removal should silently ignore concurrently-removed images.

I think in that case the underlying failure is really

Error: no contents in "/tmp/podman_test1709958300/Dockerfile"

and that’s, I think, a test bug:

dockerfilePath := filepath.Join(p.TempDir, "Dockerfile")
uses a hard-coded temporary file name shared across a PodmanTestIntegration, while
podmanTest.BuildImage(containerfile, imageName, "false")
uses a single PodmanTestIntegration across 10 goroutines; i.e. they all overwrite each other’s Dockerfile, and they remove each other’s Dockerfile.

I don’t immediately understand the Podman test harness well enough to say whether BuildImage should not be using a hard-coded Dockerfile name, or whether the goroutines should each own their own PodmanTestIntegration.
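
For illustration only, one possible shape of the harness-side fix, assuming each goroutine writes its Containerfile into a private directory; this is a sketch, not the actual BuildImage code:

```go
package sketch

import (
	"os"
	"path/filepath"
	"sync"
)

// Each goroutine gets its own temporary directory, so the hard-coded
// "Dockerfile" name is no longer shared state between the parallel builds.
func buildAllConcurrently(baseTempDir string, containerfiles []string) {
	var wg sync.WaitGroup
	for _, cf := range containerfiles {
		wg.Add(1)
		go func(contents string) {
			defer wg.Done()
			dir, err := os.MkdirTemp(baseTempDir, "build-") // private per goroutine
			if err != nil {
				return
			}
			path := filepath.Join(dir, "Dockerfile")
			if err := os.WriteFile(path, []byte(contents), 0o600); err != nil {
				return
			}
			// run "podman build -f <path> ..." here; no other goroutine can
			// overwrite or remove this Dockerfile
		}(cf)
	}
	wg.Wait()
}
```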

@edsantiago
Member

Ah, thanks for that explanation. I guess if it happens again, the solution is easy: serialize the builds. There's not really any reason to parallelize those.
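
For reference, a sketch of what the serialized variant could look like (illustrative function names, not the actual Ginkgo test):

```go
package sketch

import "sync"

// Builds run one after another, so no removal can race with a half-built
// image; only the removals, which are the code actually under test, run
// concurrently against the shared layers.
func testConcurrentRemoval(buildImage, removeImage func(name string), names []string) {
	for _, n := range names {
		buildImage(n) // serialized builds
	}

	var wg sync.WaitGroup
	for _, n := range names {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			removeImage(name) // concurrent removals
		}(n)
	}
	wg.Wait()
}
```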

@mtrmac
Collaborator Author

mtrmac commented May 23, 2023

That’s a great solution.

@github-actions bot added the locked - please file new issue/PR label Aug 23, 2023
@github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023