auto-update: test getting "rolled back" instead of "true" #17607

edsantiago · 2023-02-22T20:18:31Z

not ok 346 podman auto-update - label io.containers.autoupdate=local
...
# podman auto-update --format {{.Unit}},{{.Image}},{{.Updated}},{{.Policy}}
container-c_local_SOwgTtVbcL.service,quay.io/libpod/localtest:latest,rolled back,local
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: Image is updated.
#| expected: '.*container-c_local_SOwgTtVbcL.service,quay.io/libpod/localtest:latest,true,local.*' (using expr)
#|   actual:   'container-c_local_SOwgTtVbcL.service,quay.io/libpod/localtest:latest,rolled back,local'
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                      ^^^^^^^^^^^----- this is the diff

Debian-only so far, and only one logged instance, but I saw this one twice (in non-preserved logs) while reviewing #17305. This is going to start hitting us hard now that Debian is live.

The text was updated successfully, but these errors were encountered:

vrothberg · 2023-02-23T07:24:14Z

A "rolled back" instead of "true" suggests that the update did not succeed, so Podman reverted/rolled back to the previous image. My best "guess" is that it's Debian's systemd.

I probably need access to a Debian VM to analyze it correctly. If it flakes too hard, I suggest to skip the test on Debian for now.

The symptoms in containers#17607 point to some race since it does not always flake on Debian (and Debian only). Hence, wait for the service to be ready before building the image to make sure that the service is started with the old image and that everything's in order. Fixes: containers#17607 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>

edsantiago · 2023-03-13T14:42:40Z

Looks like it's not quite fixed. Reopening.

vrothberg · 2023-03-13T15:12:08Z

Thanks for reopening! I noticed the flake today as well and will take a look.

vrothberg · 2023-03-13T16:14:30Z

not ok 343 podman auto-update - label io.containers.autoupdate=image

This one looks very strange to me. The image has been pulled successfully but something must have gone wrong when restarting the service, hence the rollback. Unfortunately, no logs indicate what went wrong.

FAIL: Timed out waiting for c_image_ydV3dzdA8a to update; old IID=f5a99120db6452661930a1db3bf7390eec9b963f5f62c068fa32dc1d550afad3

This could either be another race or the 10 seconds are not enough (unlikely).

@edsantiago do you have any suspicion on what may be wrong or whether there's another race in the tests?

edsantiago · 2023-03-13T16:25:11Z

I don't have a sense for it... but I'm seeing lots of new flakes in auto-update tests. I need some time to curate those; will check back in later.

edsantiago · 2023-03-13T20:13:00Z

OK, I found a smoking gun that confirms the "Timed out waiting" flake is the same as this one:

The error happens only on Debian
It happens on all update types: local, registry, image
Every time it happens, the status of the image is "rolled back"
There is no pull error AFAICT. All pulls seem to succeed.

Example:

# podman auto-update
Trying to pull quay.io/libpod/busybox:latest...
Getting image source signatures
Copying blob sha256:9758c28807f21c13d05c704821fdd56c0b9574912f9b916c65e1df3e6b8bc572
Copying config sha256:f0b02e9d092d905d0d87a8455a1ae3e9bb47b4aa3dc125125ca5cd10d6441c9f
Writing manifest to image destination
Storing signatures
Trying to pull quay.io/libpod/alpine:latest...
Getting image source signatures
Copying blob sha256:9d16cba9fb961d1aafec9542f2bf7cb64acfc55245f9e4eb5abecd4cdc38d749
Copying config sha256:961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4
Writing manifest to image destination
Storing signatures
            UNIT                                     CONTAINER                             IMAGE                            POLICY      UPDATED
            container-c_local_ueJQAIjyAu.service     84dbf4e6dee7 (c_local_ueJQAIjyAu)     quay.io/libpod/localtest:latest  local       true
   >>>>     container-c_registry_7uryK68Lbg.service  9c4d7ea88e3d (c_registry_7uryK68Lbg)  quay.io/libpod/busybox:latest    registry    rolled back
            container-c_image_bM3TKhA1hO.service     a178c2b40c76 (c_image_bM3TKhA1hO)     quay.io/libpod/alpine:latest     registry    true
Error: invalid auto-update policy "fake_LE9anOHuLO": valid policies are ["disabled" "image" "local" "registry"]
[ rc=125 (expected) ]
.....
#| FAIL: Timed out waiting for c_registry_7uryK68Lbg to update; old IID=f5a99120db6452661930a1db3bf7390eec9b963f5f62c068fa32dc1d550afad3

("c_registry" is the one showing "rolled back").

I've been seeing a lot of quay.io flakes today... but the logs above do not show those. And if it were a quay.io flake, I would expect non-Debian systems to hit it too. Something on Debian is causing "rolled back" even when all looks (superficially) successful.

Here are the "Timed out waiting" flakes:

[sys] podman auto-update with multiple services

debian-12 : sys podman debian-12 root host

To help debug containers#17607, turn off rollbacks for tests that do not require rollbacks. Error when restarting the systemd units are then not suppressed but returned which should give us more information about what is going on the Debian systems. Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>

Wait for all generated services to be ready to be sure we can iron out race conditions. Also disable rollbacks to make sure we can analyze the error if restarting a service fails. This information may be crucial to understand the flakes on Debian as tracked in containers#17607. Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>

vrothberg · 2023-03-14T09:40:28Z

I opened #17770 to help debug what's going on. I tried reproducing in a CI VM but did not manage to, so I am flying a bit blind.

auto update system tests: help debug #17607

edsantiago · 2023-03-14T14:21:24Z

Hmmm, just saw a similar-but-not-identical failure in f37 root:

# podman auto-update
Trying to pull quay.io/libpod/alpine:latest...
Getting image source signatures
Copying blob sha256:9d16cba9fb961d1aafec9542f2bf7cb64acfc55245f9e4eb5abecd4cdc38d749
Copying config sha256:961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4
Writing manifest to image destination
Storing signatures
            UNIT                                     CONTAINER                             IMAGE                            POLICY      UPDATED
            container-c_image_H9OYqP0TRy.service     32369315ad70 (c_image_H9OYqP0TRy)     quay.io/libpod/alpine:latest     registry    true
            container-c_local_QUAvrLRuUL.service     912348ea1b02 (c_local_QUAvrLRuUL)     quay.io/libpod/localtest:latest  local       true
            container-c_registry_K70FShh3C1.service  c25cc9196b95 (c_registry_K70FShh3C1)  quay.io/libpod/busybox:latest    registry    false
Error: invalid auto-update policy "fake_9qmKwZbQXa": valid policies are ["disabled" "image" "local" "registry"]
[ rc=125 (expected) ]
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: Number of images updated from registry.
#| expected: '2'
#|   actual: '1'
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vrothberg · 2023-03-14T14:26:36Z

What a coincidence. I think this should be fixed by #17770 as we are now waiting for all services to be ready before running auto update.

edsantiago · 2023-03-15T12:02:33Z

Well, foo. Failure seen in 17786, and it looks like that one was based on main which included your debug PR. I see nothing useful in the logs, but (fingers crossed) maybe you will.

vrothberg · 2023-03-15T12:48:49Z

Well, foo. Failure seen in 17786, and it looks like that one was based on main which included your debug PR. I see nothing useful in the logs, but (fingers crossed) maybe you will.

My bad. The error is still being eaten. Will prepare a PR.

Return the error when restarting the unit failed during an update. The task is correctly marked to have failed but we really need to return the error to the user. [NO NEW TESTS NEEDED] - The flakes in containers#17607 will reveal errors. Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>

edsantiago · 2023-03-15T17:29:29Z

Hey, this one popped up yesterday, is it helpful?

updating image for container 64ff205e2fd013dca4a81298fc8c09e4dbe497e2b837fcc475cbb3a5cdb564ef: 
    copying system image from manifest list:
     determining manifest MIME type for docker://quay.io/libpod/busybox:latest:
     reading manifest sha256:c9249fdf56138f0d929e2080ae98ee9cb2946f71498fc1484288e6a935b5e5bc
     in quay.io/libpod/busybox: received unexpected HTTP status: 502 Bad Gateway'

It would seem to suggest a quay flake, except, why only on Debian?

vrothberg · 2023-03-16T07:50:37Z

Thanks for sharing! It's probably a temporary fart on the registry. The other symptoms are after having pulled the image and when restarting the systemd units.

edsantiago · 2023-03-22T19:38:06Z

debian root

edsantiago · 2023-03-22T21:40:04Z

Another

edsantiago · 2023-03-22T21:40:58Z

OOH! We got it on f37!

edsantiago · 2023-03-27T20:15:49Z

@vrothberg here ya go. I see nothing of value, but hope you do.

Error: restarting unit container-c_image_Uz9SshEnXC.service during update: expected "done" but received "failed"
[ rc=125 ]
------------------------------------ SYSTEMCTL STATUS
● container-c_image_Uz9SshEnXC.service - Podman container-c_image_Uz9SshEnXC.service
     Loaded: loaded (/run/systemd/system/container-c_image_Uz9SshEnXC.service; disabled; preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Mon 2023-03-27 19:56:32 UTC; 26ms ago
       Docs: man:podman-generate-systemd(1)
    Process: 138935 ExecStart=/var/tmp/go/src/github.com/containers/podman/bin/podman run --cidfile=/run/container-c_image_Uz9SshEnXC.service.ctr-id --cgroups=no-conmon --rm --sdnotify=conmon -d --replace --name c_image_Uz9SshEnXC --label io.containers.autoupdate=image quay.io/libpod/alpine:latest top -d 120 (code=exited, status=125)
    Process: 138942 ExecStopPost=/var/tmp/go/src/github.com/containers/podman/bin/podman rm -f --ignore -t 10 --cidfile=/run/container-c_image_Uz9SshEnXC.service.ctr-id (code=exited, status=0/SUCCESS)
   Main PID: 138935 (code=exited, status=125)

It turns out the restart is _not_ a stop+start but keeps certain resources open and is subject to some timeouts that may differ across distributions' default settings. Fixes: containers#17607 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>

It turns out the restart is _not_ a stop+start but keeps certain resources open and is subject to some timeouts that may differ across distributions' default settings. [NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause the failure/flake/race. Fixes: containers#17607 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>

vrothberg · 2023-03-28T12:58:06Z

That was indeed somehow helpful as it encouraged me lookup what this auto-restart thing in the status means. I looked in the systemd code and then came up with #17959

vrothberg · 2023-03-28T12:59:59Z

If #17959 really fixes the flake/bug, I will remove the additional echos from the test but I prefer to keep them until it has some time to bake in CI.

It turns out the restart is _not_ a stop+start but keeps certain resources open and is subject to some timeouts that may differ across distributions' default settings. [NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause the failure/flake/race. Fixes: containers#17607 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>

It turns out the restart is _not_ a stop+start but keeps certain resources open and is subject to some timeouts that may differ across distributions' default settings. [NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause the failure/flake/race. Also ignore ENOENTS of the CID file when removing a container which has been identified of actually fixing containers#17607. Fixes: containers#17607 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>

Commit f131eaa changed restart to a stop+start motivated by comments in the systemd man pages that restart behaves different than stop+start, for instance, that it keeps certain resources open and treats timers differently. Yet, the actually fix for containers#17607 in the very same commit was dealing with an ENOENT of the CID file on container removal. As it turns out in in containers#18926, changing to stop+start regressed on restarting dependencies when auto updating a systemd unit. Hence, move back to using restart to make sure that dependent systemd units are restarted as well. An alternative could be recommending to use `BindsTo=` in Quadlet files but this seems less common than `Requires=` and hence more risky to cause issues on user sites. Fixes: containers#18926 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>

edsantiago added the flakes Flakes from Continuous Integration label Feb 22, 2023

vrothberg mentioned this issue Mar 1, 2023

auto-update test: wait for service to be ready #17673

Merged

openshift-merge-robot closed this as completed in #17673 Mar 1, 2023

edsantiago reopened this Mar 13, 2023

vrothberg mentioned this issue Mar 14, 2023

auto update system tests: help debug #17607 #17770

Merged

openshift-merge-robot added a commit that referenced this issue Mar 14, 2023

Merge pull request #17770 from vrothberg/debug-17607

c4c0793

auto update system tests: help debug #17607

vrothberg mentioned this issue Mar 15, 2023

auto update: return restart error #17796

Merged

edsantiago mentioned this issue Mar 17, 2023

Ed's pet PR with no flake retries #17831

Draft

edsantiago mentioned this issue Mar 23, 2023

Fix a race around SQLite DB config validation #17902

Merged

vrothberg mentioned this issue Mar 28, 2023

auto-update: stop+start instead of restart sytemd units #17959

Merged

edsantiago added the kind/bug Categorizes issue or PR as related to a bug. label Mar 28, 2023

openshift-merge-robot closed this as completed in #17959 Mar 29, 2023

vrothberg mentioned this issue Jun 19, 2023

auto update: restart instead of stop+start #18929

Merged

github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 29, 2023

github-actions bot locked as resolved and limited conversation to collaborators Aug 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

auto-update: test getting "rolled back" instead of "true" #17607

auto-update: test getting "rolled back" instead of "true" #17607

edsantiago commented Feb 22, 2023

vrothberg commented Feb 23, 2023

edsantiago commented Mar 13, 2023

vrothberg commented Mar 13, 2023

vrothberg commented Mar 13, 2023

edsantiago commented Mar 13, 2023

edsantiago commented Mar 13, 2023

vrothberg commented Mar 14, 2023

edsantiago commented Mar 14, 2023

vrothberg commented Mar 14, 2023

edsantiago commented Mar 15, 2023

vrothberg commented Mar 15, 2023

edsantiago commented Mar 15, 2023

vrothberg commented Mar 16, 2023

edsantiago commented Mar 22, 2023

edsantiago commented Mar 22, 2023

edsantiago commented Mar 22, 2023

edsantiago commented Mar 27, 2023

vrothberg commented Mar 28, 2023

vrothberg commented Mar 28, 2023

auto-update: test getting "rolled back" instead of "true" #17607

auto-update: test getting "rolled back" instead of "true" #17607

Comments

edsantiago commented Feb 22, 2023

vrothberg commented Feb 23, 2023

edsantiago commented Mar 13, 2023

vrothberg commented Mar 13, 2023

vrothberg commented Mar 13, 2023

edsantiago commented Mar 13, 2023

edsantiago commented Mar 13, 2023

[sys] podman auto-update with multiple services

vrothberg commented Mar 14, 2023

edsantiago commented Mar 14, 2023

vrothberg commented Mar 14, 2023

edsantiago commented Mar 15, 2023

vrothberg commented Mar 15, 2023

edsantiago commented Mar 15, 2023

vrothberg commented Mar 16, 2023

edsantiago commented Mar 22, 2023

edsantiago commented Mar 22, 2023

edsantiago commented Mar 22, 2023

edsantiago commented Mar 27, 2023

vrothberg commented Mar 28, 2023

vrothberg commented Mar 28, 2023