Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

auto-update: test getting "rolled back" instead of "true" #17607

Closed
edsantiago opened this issue Feb 22, 2023 · 19 comments · Fixed by #17673 or #17959
Closed

auto-update: test getting "rolled back" instead of "true" #17607

edsantiago opened this issue Feb 22, 2023 · 19 comments · Fixed by #17673 or #17959
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@edsantiago
Copy link
Member

not ok 346 podman auto-update - label io.containers.autoupdate=local
...
# podman auto-update --format {{.Unit}},{{.Image}},{{.Updated}},{{.Policy}}
container-c_local_SOwgTtVbcL.service,quay.io/libpod/localtest:latest,rolled back,local
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: Image is updated.
#| expected: '.*container-c_local_SOwgTtVbcL.service,quay.io/libpod/localtest:latest,true,local.*' (using expr)
#|   actual:   'container-c_local_SOwgTtVbcL.service,quay.io/libpod/localtest:latest,rolled back,local'
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                      ^^^^^^^^^^^----- this is the diff

Debian-only so far, and only one logged instance, but I saw this one twice (in non-preserved logs) while reviewing #17305. This is going to start hitting us hard now that Debian is live.

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Feb 22, 2023
@vrothberg
Copy link
Member

A "rolled back" instead of "true" suggests that the update did not succeed, so Podman reverted/rolled back to the previous image. My best "guess" is that it's Debian's systemd.

I probably need access to a Debian VM to analyze it correctly. If it flakes too hard, I suggest to skip the test on Debian for now.

vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 1, 2023
The symptoms in containers#17607 point to some race since it does not always flake
on Debian (and Debian only).  Hence, wait for the service to be ready
before building the image to make sure that the service is started with
the old image and that everything's in order.

Fixes: containers#17607
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@edsantiago
Copy link
Member Author

Looks like it's not quite fixed. Reopening.

@edsantiago edsantiago reopened this Mar 13, 2023
@vrothberg
Copy link
Member

Thanks for reopening! I noticed the flake today as well and will take a look.

@vrothberg
Copy link
Member

not ok 343 podman auto-update - label io.containers.autoupdate=image

This one looks very strange to me. The image has been pulled successfully but something must have gone wrong when restarting the service, hence the rollback. Unfortunately, no logs indicate what went wrong.

FAIL: Timed out waiting for c_image_ydV3dzdA8a to update; old IID=f5a99120db6452661930a1db3bf7390eec9b963f5f62c068fa32dc1d550afad3

This could either be another race or the 10 seconds are not enough (unlikely).

@edsantiago do you have any suspicion on what may be wrong or whether there's another race in the tests?

@edsantiago
Copy link
Member Author

I don't have a sense for it... but I'm seeing lots of new flakes in auto-update tests. I need some time to curate those; will check back in later.

@edsantiago
Copy link
Member Author

OK, I found a smoking gun that confirms the "Timed out waiting" flake is the same as this one:

  • The error happens only on Debian
  • It happens on all update types: local, registry, image
  • Every time it happens, the status of the image is "rolled back"
  • There is no pull error AFAICT. All pulls seem to succeed.

Example:

# podman auto-update
Trying to pull quay.io/libpod/busybox:latest...
Getting image source signatures
Copying blob sha256:9758c28807f21c13d05c704821fdd56c0b9574912f9b916c65e1df3e6b8bc572
Copying config sha256:f0b02e9d092d905d0d87a8455a1ae3e9bb47b4aa3dc125125ca5cd10d6441c9f
Writing manifest to image destination
Storing signatures
Trying to pull quay.io/libpod/alpine:latest...
Getting image source signatures
Copying blob sha256:9d16cba9fb961d1aafec9542f2bf7cb64acfc55245f9e4eb5abecd4cdc38d749
Copying config sha256:961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4
Writing manifest to image destination
Storing signatures
            UNIT                                     CONTAINER                             IMAGE                            POLICY      UPDATED
            container-c_local_ueJQAIjyAu.service     84dbf4e6dee7 (c_local_ueJQAIjyAu)     quay.io/libpod/localtest:latest  local       true
   >>>>     container-c_registry_7uryK68Lbg.service  9c4d7ea88e3d (c_registry_7uryK68Lbg)  quay.io/libpod/busybox:latest    registry    rolled back
            container-c_image_bM3TKhA1hO.service     a178c2b40c76 (c_image_bM3TKhA1hO)     quay.io/libpod/alpine:latest     registry    true
Error: invalid auto-update policy "fake_LE9anOHuLO": valid policies are ["disabled" "image" "local" "registry"]
[ rc=125 (expected) ]
.....
#| FAIL: Timed out waiting for c_registry_7uryK68Lbg to update; old IID=f5a99120db6452661930a1db3bf7390eec9b963f5f62c068fa32dc1d550afad3

("c_registry" is the one showing "rolled back").

I've been seeing a lot of quay.io flakes today... but the logs above do not show those. And if it were a quay.io flake, I would expect non-Debian systems to hit it too. Something on Debian is causing "rolled back" even when all looks (superficially) successful.

Here are the "Timed out waiting" flakes:

[sys] podman auto-update with multiple services

vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 14, 2023
To help debug containers#17607, turn off rollbacks for tests that do not require
rollbacks.  Error when restarting the systemd units are then not
suppressed but returned which should give us more information about what
is going on the Debian systems.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 14, 2023
Wait for all generated services to be ready to be sure we can iron out
race conditions.  Also disable rollbacks to make sure we can analyze
the error if restarting a service fails.  This information may be
crucial to understand the flakes on Debian as tracked in containers#17607.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@vrothberg
Copy link
Member

I opened #17770 to help debug what's going on. I tried reproducing in a CI VM but did not manage to, so I am flying a bit blind.

openshift-merge-robot added a commit that referenced this issue Mar 14, 2023
auto update system tests: help debug #17607
@edsantiago
Copy link
Member Author

Hmmm, just saw a similar-but-not-identical failure in f37 root:

# podman auto-update
Trying to pull quay.io/libpod/alpine:latest...
Getting image source signatures
Copying blob sha256:9d16cba9fb961d1aafec9542f2bf7cb64acfc55245f9e4eb5abecd4cdc38d749
Copying config sha256:961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4
Writing manifest to image destination
Storing signatures
            UNIT                                     CONTAINER                             IMAGE                            POLICY      UPDATED
            container-c_image_H9OYqP0TRy.service     32369315ad70 (c_image_H9OYqP0TRy)     quay.io/libpod/alpine:latest     registry    true
            container-c_local_QUAvrLRuUL.service     912348ea1b02 (c_local_QUAvrLRuUL)     quay.io/libpod/localtest:latest  local       true
            container-c_registry_K70FShh3C1.service  c25cc9196b95 (c_registry_K70FShh3C1)  quay.io/libpod/busybox:latest    registry    false
Error: invalid auto-update policy "fake_9qmKwZbQXa": valid policies are ["disabled" "image" "local" "registry"]
[ rc=125 (expected) ]
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: Number of images updated from registry.
#| expected: '2'
#|   actual: '1'
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@vrothberg
Copy link
Member

What a coincidence. I think this should be fixed by #17770 as we are now waiting for all services to be ready before running auto update.

@edsantiago
Copy link
Member Author

Well, foo. Failure seen in 17786, and it looks like that one was based on main which included your debug PR. I see nothing useful in the logs, but (fingers crossed) maybe you will.

@vrothberg
Copy link
Member

Well, foo. Failure seen in 17786, and it looks like that one was based on main which included your debug PR. I see nothing useful in the logs, but (fingers crossed) maybe you will.

My bad. The error is still being eaten. Will prepare a PR.

vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 15, 2023
Return the error when restarting the unit failed during an update.
The task is correctly marked to have failed but we really need to
return the error to the user.

[NO NEW TESTS NEEDED] - The flakes in containers#17607 will reveal errors.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 15, 2023
Return the error when restarting the unit failed during an update.
The task is correctly marked to have failed but we really need to
return the error to the user.

[NO NEW TESTS NEEDED] - The flakes in containers#17607 will reveal errors.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@edsantiago
Copy link
Member Author

Hey, this one popped up yesterday, is it helpful?

updating image for container 64ff205e2fd013dca4a81298fc8c09e4dbe497e2b837fcc475cbb3a5cdb564ef: 
    copying system image from manifest list:
     determining manifest MIME type for docker://quay.io/libpod/busybox:latest:
     reading manifest sha256:c9249fdf56138f0d929e2080ae98ee9cb2946f71498fc1484288e6a935b5e5bc
     in quay.io/libpod/busybox: received unexpected HTTP status: 502 Bad Gateway'

It would seem to suggest a quay flake, except, why only on Debian?

@vrothberg
Copy link
Member

Thanks for sharing! It's probably a temporary fart on the registry. The other symptoms are after having pulled the image and when restarting the systemd units.

@edsantiago
Copy link
Member Author

debian root

@edsantiago
Copy link
Member Author

Another

@edsantiago
Copy link
Member Author

OOH! We got it on f37!

@edsantiago
Copy link
Member Author

@vrothberg here ya go. I see nothing of value, but hope you do.

Error: restarting unit container-c_image_Uz9SshEnXC.service during update: expected "done" but received "failed"
[ rc=125 ]
------------------------------------ SYSTEMCTL STATUS
● container-c_image_Uz9SshEnXC.service - Podman container-c_image_Uz9SshEnXC.service
     Loaded: loaded (/run/systemd/system/container-c_image_Uz9SshEnXC.service; disabled; preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Mon 2023-03-27 19:56:32 UTC; 26ms ago
       Docs: man:podman-generate-systemd(1)
    Process: 138935 ExecStart=/var/tmp/go/src/github.com/containers/podman/bin/podman run --cidfile=/run/container-c_image_Uz9SshEnXC.service.ctr-id --cgroups=no-conmon --rm --sdnotify=conmon -d --replace --name c_image_Uz9SshEnXC --label io.containers.autoupdate=image quay.io/libpod/alpine:latest top -d 120 (code=exited, status=125)
    Process: 138942 ExecStopPost=/var/tmp/go/src/github.com/containers/podman/bin/podman rm -f --ignore -t 10 --cidfile=/run/container-c_image_Uz9SshEnXC.service.ctr-id (code=exited, status=0/SUCCESS)
   Main PID: 138935 (code=exited, status=125)

vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 28, 2023
It turns out the restart is _not_ a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

Fixes: containers#17607
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 28, 2023
It turns out the restart is _not_ a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

[NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause
the failure/flake/race.

Fixes: containers#17607
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@vrothberg
Copy link
Member

That was indeed somehow helpful as it encouraged me lookup what this auto-restart thing in the status means. I looked in the systemd code and then came up with #17959

@vrothberg
Copy link
Member

If #17959 really fixes the flake/bug, I will remove the additional echos from the test but I prefer to keep them until it has some time to bake in CI.

vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 28, 2023
It turns out the restart is _not_ a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

[NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause
the failure/flake/race.

Fixes: containers#17607
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
edsantiago pushed a commit to edsantiago/libpod that referenced this issue Mar 28, 2023
It turns out the restart is _not_ a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

[NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause
the failure/flake/race.

Fixes: containers#17607
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@edsantiago edsantiago added the kind/bug Categorizes issue or PR as related to a bug. label Mar 28, 2023
edsantiago pushed a commit to edsantiago/libpod that referenced this issue Mar 28, 2023
It turns out the restart is _not_ a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

[NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause
the failure/flake/race.

Fixes: containers#17607
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
edsantiago pushed a commit to edsantiago/libpod that referenced this issue Mar 29, 2023
It turns out the restart is _not_ a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

[NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause
the failure/flake/race.

Fixes: containers#17607
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 29, 2023
It turns out the restart is _not_ a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

[NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause
the failure/flake/race.

Also ignore ENOENTS of the CID file when removing a container which has
been identified of actually fixing containers#17607.

Fixes: containers#17607
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 29, 2023
It turns out the restart is _not_ a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

[NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause
the failure/flake/race.

Also ignore ENOENTS of the CID file when removing a container which has
been identified of actually fixing containers#17607.

Fixes: containers#17607
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
edsantiago pushed a commit to edsantiago/libpod that referenced this issue Mar 29, 2023
It turns out the restart is _not_ a stop+start but keeps certain
resources open and is subject to some timeouts that may differ across
distributions' default settings.

[NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause
the failure/flake/race.

Also ignore ENOENTS of the CID file when removing a container which has
been identified of actually fixing containers#17607.

Fixes: containers#17607
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Jun 19, 2023
Commit f131eaa changed restart to a stop+start motivated by
comments in the systemd man pages that restart behaves different than
stop+start, for instance, that it keeps certain resources open and
treats timers differently.  Yet, the actually fix for containers#17607 in the very
same commit was dealing with an ENOENT of the CID file on container
removal.

As it turns out in in containers#18926, changing to stop+start regressed on
restarting dependencies when auto updating a systemd unit.  Hence, move
back to using restart to make sure that dependent systemd units are
restarted as well.

An alternative could be recommending to use `BindsTo=` in Quadlet files
but this seems less common than `Requires=` and hence more risky to
cause issues on user sites.

Fixes: containers#18926
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
cgiradkar pushed a commit to cgiradkar/podman that referenced this issue Jul 12, 2023
Commit f131eaa changed restart to a stop+start motivated by
comments in the systemd man pages that restart behaves different than
stop+start, for instance, that it keeps certain resources open and
treats timers differently.  Yet, the actually fix for containers#17607 in the very
same commit was dealing with an ENOENT of the CID file on container
removal.

As it turns out in in containers#18926, changing to stop+start regressed on
restarting dependencies when auto updating a systemd unit.  Hence, move
back to using restart to make sure that dependent systemd units are
restarted as well.

An alternative could be recommending to use `BindsTo=` in Quadlet files
but this seems less common than `Requires=` and hence more risky to
cause issues on user sites.

Fixes: containers#18926
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
cgiradkar pushed a commit to cgiradkar/podman that referenced this issue Jul 13, 2023
Commit f131eaa changed restart to a stop+start motivated by
comments in the systemd man pages that restart behaves different than
stop+start, for instance, that it keeps certain resources open and
treats timers differently.  Yet, the actually fix for containers#17607 in the very
same commit was dealing with an ENOENT of the CID file on container
removal.

As it turns out in in containers#18926, changing to stop+start regressed on
restarting dependencies when auto updating a systemd unit.  Hence, move
back to using restart to make sure that dependent systemd units are
restarted as well.

An alternative could be recommending to use `BindsTo=` in Quadlet files
but this seems less common than `Requires=` and hence more risky to
cause issues on user sites.

Fixes: containers#18926
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
cgiradkar pushed a commit to cgiradkar/podman that referenced this issue Jul 17, 2023
Commit f131eaa changed restart to a stop+start motivated by
comments in the systemd man pages that restart behaves different than
stop+start, for instance, that it keeps certain resources open and
treats timers differently.  Yet, the actually fix for containers#17607 in the very
same commit was dealing with an ENOENT of the CID file on container
removal.

As it turns out in in containers#18926, changing to stop+start regressed on
restarting dependencies when auto updating a systemd unit.  Hence, move
back to using restart to make sure that dependent systemd units are
restarted as well.

An alternative could be recommending to use `BindsTo=` in Quadlet files
but this seems less common than `Requires=` and hence more risky to
cause issues on user sites.

Fixes: containers#18926
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
cgiradkar pushed a commit to cgiradkar/podman that referenced this issue Jul 17, 2023
Commit f131eaa changed restart to a stop+start motivated by
comments in the systemd man pages that restart behaves different than
stop+start, for instance, that it keeps certain resources open and
treats timers differently.  Yet, the actually fix for containers#17607 in the very
same commit was dealing with an ENOENT of the CID file on container
removal.

As it turns out in in containers#18926, changing to stop+start regressed on
restarting dependencies when auto updating a systemd unit.  Hence, move
back to using restart to make sure that dependent systemd units are
restarted as well.

An alternative could be recommending to use `BindsTo=` in Quadlet files
but this seems less common than `Requires=` and hence more risky to
cause issues on user sites.

Fixes: containers#18926
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 29, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 29, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
2 participants