
podman-remote wait: Error getting exit code from DB: no such exit code #14859

Closed
edsantiago opened this issue Jul 7, 2022 · 12 comments · Fixed by #14874 or #15090

Comments

@edsantiago
Member

Seen just now:

$ podman-remote --url unix:/tmp/podman_tmp_mnLg wait gfKskJHy37
Error: getting exit code of container ebff09951585fcda59dfeceabfaab084751a8303a7cd981420c32c1d0c38b656 from DB: no such exit code

Only one so far. Seems suspiciously related to issue #14761 and PR #14830.

[sys] 203 podman kill - concurrent stop
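For context, that system test exercises stopping the same container from several processes at once. A rough Go sketch of the scenario (not the actual BATS test; the container name flaketest is hypothetical and assumed to be already running, e.g. via podman run -d --name flaketest $IMAGE sleep 100):

package main

import (
	"fmt"
	"os/exec"
	"sync"
)

func main() {
	// Hypothetical container name, assumed to be already running.
	const ctr = "flaketest"

	// Two concurrent stops race with each other and with the container
	// cleanup process.
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = exec.Command("podman", "stop", "-t", "0", ctr).Run()
		}()
	}
	wg.Wait()

	// This is where the flake showed up: wait occasionally failed with
	// "getting exit code ... from DB: no such exit code".
	out, err := exec.Command("podman", "wait", ctr).CombinedOutput()
	fmt.Printf("podman wait: %s(err: %v)\n", out, err)
}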

@edsantiago edsantiago added flakes Flakes from Continuous Integration remote Problem is in podman-remote labels Jul 7, 2022
@github-actions github-actions bot removed the remote Problem is in podman-remote label Jul 7, 2022
@edsantiago edsantiago added the remote Problem is in podman-remote label Jul 7, 2022
@vrothberg
Member

Thanks for filing, @edsantiago.

@mheon FYI

@vrothberg
Member

Another one here #14839

@vrothberg
Member

vrothberg commented Jul 8, 2022

I just noticed that the test uses alpine instead of $IMAGE. I'll fix that as well.

vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 8, 2022
While for some call paths we may be doing this redundantly, we need to
make sure the exit code is always read at this point.

[NO NEW TESTS NEEDED] as containers#14859 is most likely caused by a code path
not writing the exit code to the DB - which this commit fixes.

Fixes: containers#14859
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 11, 2022
While for some call paths we may be doing this redundantly, we need to
make sure the exit code is always read at this point.

[NO NEW TESTS NEEDED] as I do not manage to reproduce the issue which
is very likely caused by a code path not writing the exit code when
running concurrently.

Fixes: containers#14859
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
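A minimal sketch of the idea in these commit messages, with hypothetical names (exitCodeStore, ensureExitCodeStored) standing in for libpod's real types: if the exit code is not in the DB yet, read it from conmon's exit file and persist it, even if another call path may end up doing the same work redundantly.

package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
	"sync"
)

var errNoSuchExitCode = errors.New("no such exit code")

// exitCodeStore stands in for the bolt-backed DB that podman wait reads from.
type exitCodeStore struct {
	mu    sync.Mutex
	codes map[string]int
}

func (s *exitCodeStore) get(ctrID string) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if code, ok := s.codes[ctrID]; ok {
		return code, nil
	}
	return 0, errNoSuchExitCode
}

func (s *exitCodeStore) set(ctrID string, code int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.codes[ctrID] = code
}

// ensureExitCodeStored makes sure the exit code is read and written at this
// point, no matter which call path (stop, cleanup, wait) got here first.
func ensureExitCodeStored(s *exitCodeStore, ctrID, exitFile string) (int, error) {
	if code, err := s.get(ctrID); err == nil {
		return code, nil // already recorded by another call path
	}
	raw, err := os.ReadFile(exitFile)
	if err != nil {
		return 0, fmt.Errorf("reading exit file of container %s: %w", ctrID, err)
	}
	code, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		return 0, fmt.Errorf("parsing exit file of container %s: %w", ctrID, err)
	}
	s.set(ctrID, code)
	return code, nil
}

func main() {
	store := &exitCodeStore{codes: map[string]int{}}
	exitFile, _ := os.CreateTemp("", "ctr-exit-*")
	exitFile.WriteString("137\n")
	exitFile.Close()
	defer os.Remove(exitFile.Name())

	code, err := ensureExitCodeStored(store, "ebff09951585", exitFile.Name())
	fmt.Println(code, err) // 137 <nil>
}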
@edsantiago
Member Author

Reopening, sorry. Seen again in two PRs which, as best I can tell, had a branch base that included #14874.

[sys] 203 podman kill - concurrent stop

@edsantiago edsantiago reopened this Jul 18, 2022
@vrothberg
Member

/me wipes tears

Thanks, Ed! I am sure we'll find the needle in the haystack eventually.

vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 22, 2022
Improve the error message when looking up the exit code of a container.
The state of the container may help us track down containers#14859 which flakes
rarely and is impossible to reproduce on my machine.

[NO NEW TESTS NEEDED]

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
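A hypothetical sketch of the improved error, not libpod's actual code: it simply wraps the "no such exit code" error with the container's current state, matching the format of the log output Ed posts later in this thread.

package main

import (
	"errors"
	"fmt"
)

var errNoSuchExitCode = errors.New("no such exit code")

// exitCodeError attaches the container state to the lookup failure so the
// flake logs reveal where in its lifecycle the container actually was.
func exitCodeError(ctrID, state string) error {
	return fmt.Errorf("getting exit code of container %s from DB: %w (container in state %s)",
		ctrID, errNoSuchExitCode, state)
}

func main() {
	fmt.Println(exitCodeError("fc1bf1ba67e1", "stopping"))
}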
@vrothberg
Member

I opened #15038 in the hope that it will help us get a better understanding of the issue.

@vrothberg
Member

@edsantiago, since #15038 merged, I am curious to see new cases of the error with the updated error message. My hope is that the added container state will help isolate the error source.

@edsantiago
Member Author

Here you go - hth. remote f36 rootless:

[+1187s] not ok 205 podman kill - concurrent stop
...
# $ podman-remote --url unix:/tmp/podman_tmp_wJ2p wait 9yvX8WUvps
# Error: getting exit code of container fc1bf1ba67e1366f00f76d5ed4993145573efe1f1c7450240b679d77b245f719 from DB: no such exit code (container in state stopping)

@vrothberg
Member

That is indeed extremely helpful. Thank you so much, @edsantiago!

@vrothberg
Member

Taking another look. I parsed the code yesterday but could not find anything.

vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 27, 2022
Allow the cleanup process (and others) to transition the container from
`stopping` to `exited`.  This fixes a race condition detected in containers#14859
where the cleanup process kicks in _before_ the stopping process can
read the exit file.  Prior to this fix, the cleanup process left the
container in the `stopping` state and removed the conmon files, such
that the stopping process also left the container in this state as it
could not read the exit files.  Hence, `podman wait` timed out (see the
23 seconds execution time of the test [1]) due to the unexpected/invalid
state and the test failed.

Further turn the warning during stop to a debug message since it's a
natural race due to the daemonless/concurrent architecture and nothing
to worry about.

[NO NEW TESTS NEEDED] since we can only monitor if containers#14859 continues
flaking or not.

[1] https://storage.googleapis.com/cirrus-ci-6707778565701632-fcae48/artifacts/containers/podman/6210434704343040/html/sys-remote-fedora-36-rootless-host.log.html#t--00205

Fixes: containers#14859
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
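A hedged sketch of the fix described in the commit message above, using hypothetical names (finishStop, ctrState) rather than libpod's real state machine: whichever process gets there first (stop or cleanup) records the exit code and completes the stopping to exited transition, so the loser of the race still finds the container in a consistent state instead of stuck in stopping.

package main

import (
	"fmt"
)

type ctrState string

const (
	stateStopping ctrState = "stopping"
	stateExited   ctrState = "exited"
)

type container struct {
	id       string
	state    ctrState
	exitCode int
}

// finishStop may be called by the stopping process or by the cleanup process;
// the first caller records the exit code and completes the transition, and a
// later caller simply observes the already-exited container.
func finishStop(c *container, exitCode int) error {
	switch c.state {
	case stateStopping:
		c.exitCode = exitCode
		c.state = stateExited
		return nil
	case stateExited:
		return nil // the other process already completed the transition
	default:
		return fmt.Errorf("invalid state %q for finishing a stop", c.state)
	}
}

func main() {
	c := &container{id: "fc1bf1ba67e1", state: stateStopping}
	fmt.Println(finishStop(c, 137), c.state, c.exitCode) // <nil> exited 137
}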
@joanbm

joanbm commented Sep 18, 2022

(Posting for discoverability) With Podman 4.2.1 and Docker-Compose 2.11.0, pressing CTRL+C on docker-compose up caused the container to remain in the "Stopped" state, and further CTRL+C presses just printed "no container to kill". 389a4a6 causes it and d759576 fixes it.

EDIT: Apparently I bisected the issue incorrectly; it should be fixed now.

EDIT2: Apparently the fix above doesn't work for containers that have restart: unless-stopped. I can still reproduce the "no container to kill" issue with Podman 4.2.1 + d759576, or with Podman Git as of now, using this compose file:

services:
    test:
        image: alpine:3.16
        command: tail -f /dev/null
        restart: unless-stopped

@rhatdan
Member

rhatdan commented Sep 19, 2022

@joanbm I changed podman 5.2.1 to podman 4.2.1. Unless you're from the future, I think this is what you meant :^)

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 12, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 12, 2023