
podman-remote wait: Error getting exit code from DB: no such exit code #14859

Closed
edsantiago opened this issue Jul 7, 2022 · 12 comments · Fixed by #14874 or #15090

Comments

@edsantiago
Member

Seen just now:

$ podman-remote --url unix:/tmp/podman_tmp_mnLg wait gfKskJHy37
Error: getting exit code of container ebff09951585fcda59dfeceabfaab084751a8303a7cd981420c32c1d0c38b656 from DB: no such exit code

Only one so far. Seems suspiciously related to issue #14761 and PR #14830.

[sys] 203 podman kill - concurrent stop
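For context, that system test exercises stopping the same container from several processes at once. A rough Go sketch of the scenario (not the actual BATS test; the container name flaketest is hypothetical and assumed to be already running, e.g. via podman run -d --name flaketest $IMAGE sleep 100):

package main

import (
	"fmt"
	"os/exec"
	"sync"
)

func main() {
	// Hypothetical container name, assumed to be already running.
	const ctr = "flaketest"

	// Two concurrent stops race with each other and with the container
	// cleanup process.
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = exec.Command("podman", "stop", "-t", "0", ctr).Run()
		}()
	}
	wg.Wait()

	// This is where the flake showed up: wait occasionally failed with
	// "getting exit code ... from DB: no such exit code".
	out, err := exec.Command("podman", "wait", ctr).CombinedOutput()
	fmt.Printf("podman wait: %s(err: %v)\n", out, err)
}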

@edsantiago edsantiago added flakes Flakes from Continuous Integration remote Problem is in podman-remote labels Jul 7, 2022
@github-actions github-actions bot removed the remote Problem is in podman-remote label Jul 7, 2022
@edsantiago edsantiago added the remote Problem is in podman-remote label Jul 7, 2022
@vrothberg
Member

Thanks for filing, @edsantiago.

@mheon FYI

@vrothberg
Member

Another one here #14839

@vrothberg
Member

vrothberg commented Jul 8, 2022

I just noticed that the test uses alpine instead of $IMAGE. I'll fix that as well.

vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 8, 2022
While for some call paths we may be doing this redundantly, we need to
make sure the exit code is always read at this point.

[NO NEW TESTS NEEDED] as containers#14859 is most likely caused by a code path
not writing the exit code to the DB - which this commit fixes.

Fixes: containers#14859
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 11, 2022
While for some call paths we may be doing this redundantly, we need to
make sure the exit code is always read at this point.

[NO NEW TESTS NEEDED] as I do not manage to reproduce the issue which
is very likely caused by a code path not writing the exit code when
running concurrently.

Fixes: containers#14859
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
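A minimal sketch of the idea in these commit messages, with hypothetical names (exitCodeStore, ensureExitCodeStored) standing in for libpod's real types: if the exit code is not in the DB yet, read it from conmon's exit file and persist it, even if another call path may end up doing the same work redundantly.

package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
	"sync"
)

var errNoSuchExitCode = errors.New("no such exit code")

// exitCodeStore stands in for the bolt-backed DB that podman wait reads from.
type exitCodeStore struct {
	mu    sync.Mutex
	codes map[string]int
}

func (s *exitCodeStore) get(ctrID string) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if code, ok := s.codes[ctrID]; ok {
		return code, nil
	}
	return 0, errNoSuchExitCode
}

func (s *exitCodeStore) set(ctrID string, code int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.codes[ctrID] = code
}

// ensureExitCodeStored makes sure the exit code is read and written at this
// point, no matter which call path (stop, cleanup, wait) got here first.
func ensureExitCodeStored(s *exitCodeStore, ctrID, exitFile string) (int, error) {
	if code, err := s.get(ctrID); err == nil {
		return code, nil // already recorded by another call path
	}
	raw, err := os.ReadFile(exitFile)
	if err != nil {
		return 0, fmt.Errorf("reading exit file of container %s: %w", ctrID, err)
	}
	code, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		return 0, fmt.Errorf("parsing exit file of container %s: %w", ctrID, err)
	}
	s.set(ctrID, code)
	return code, nil
}

func main() {
	store := &exitCodeStore{codes: map[string]int{}}
	exitFile, _ := os.CreateTemp("", "ctr-exit-*")
	exitFile.WriteString("137\n")
	exitFile.Close()
	defer os.Remove(exitFile.Name())

	code, err := ensureExitCodeStored(store, "ebff09951585", exitFile.Name())
	fmt.Println(code, err) // 137 <nil>
}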
@edsantiago
Member Author

Reopening, sorry. Seen again in two PRs which, as best I can tell, had a branch base that included #14874.

[sys] 203 podman kill - concurrent stop

@edsantiago edsantiago reopened this Jul 18, 2022
@vrothberg
Member

/me wipes tears

Thanks, Ed! I am sure we'll find the needle in the haystack eventually.

vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 22, 2022
Improve the error message when looking up the exit code of a container.
The state of the container may help us track down containers#14859 which flakes
rarely and is impossible to reproduce on my machine.

[NO NEW TESTS NEEDED]

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
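A hypothetical sketch of the improved error, not libpod's actual code: it simply wraps the "no such exit code" error with the container's current state, matching the format of the log output Ed posts later in this thread.

package main

import (
	"errors"
	"fmt"
)

var errNoSuchExitCode = errors.New("no such exit code")

// exitCodeError attaches the container state to the lookup failure so the
// flake logs reveal where in its lifecycle the container actually was.
func exitCodeError(ctrID, state string) error {
	return fmt.Errorf("getting exit code of container %s from DB: %w (container in state %s)",
		ctrID, errNoSuchExitCode, state)
}

func main() {
	fmt.Println(exitCodeError("fc1bf1ba67e1", "stopping"))
}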
@vrothberg
Member

I opened #15038 in the hope that it will help us get a better understanding of the issue.

@vrothberg
Member

@edsantiago, since #15038 merged, I am curious to see new cases of the error with the updated error message. My hope is that the added container state will help isolate the error source.

@edsantiago
Member Author

Here you go - hth. remote f36 rootless:

[+1187s] not ok 205 podman kill - concurrent stop
...
# $ podman-remote --url unix:/tmp/podman_tmp_wJ2p wait 9yvX8WUvps
# Error: getting exit code of container fc1bf1ba67e1366f00f76d5ed4993145573efe1f1c7450240b679d77b245f719 from DB: no such exit code (container in state stopping)

@vrothberg
Member

That is indeed extremely helpful. Thank you so much, @edsantiago!

@vrothberg
Member

Taking another look. I parsed the code yesterday but could not find anything.

vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 27, 2022
Allow the cleanup process (and others) to transition the container from
`stopping` to `exited`.  This fixes a race condition detected in containers#14859
where the cleanup process kicks in _before_ the stopping process can
read the exit file.  Prior to this fix, the cleanup process left the
container in the `stopping` state and removed the conmon files, such
that the stopping process also left the container in this state as it
could not read the exit files.  Hence, `podman wait` timed out (see the
23 seconds execution time of the test [1]) due to the unexpected/invalid
state and the test failed.

Further turn the warning during stop to a debug message since it's a
natural race due to the daemonless/concurrent architecture and nothing
to worry about.

[NO NEW TESTS NEEDED] since we can only monitor if containers#14859 continues
flaking or not.

[1] https://storage.googleapis.com/cirrus-ci-6707778565701632-fcae48/artifacts/containers/podman/6210434704343040/html/sys-remote-fedora-36-rootless-host.log.html#t--00205

Fixes: containers#14859
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
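A hedged sketch of the fix described in the commit message above, using hypothetical names (finishStop, ctrState) rather than libpod's real state machine: whichever process gets there first (stop or cleanup) records the exit code and completes the stopping to exited transition, so the loser of the race still finds the container in a consistent state instead of stuck in stopping.

package main

import (
	"fmt"
)

type ctrState string

const (
	stateStopping ctrState = "stopping"
	stateExited   ctrState = "exited"
)

type container struct {
	id       string
	state    ctrState
	exitCode int
}

// finishStop may be called by the stopping process or by the cleanup process;
// the first caller records the exit code and completes the transition, and a
// later caller simply observes the already-exited container.
func finishStop(c *container, exitCode int) error {
	switch c.state {
	case stateStopping:
		c.exitCode = exitCode
		c.state = stateExited
		return nil
	case stateExited:
		return nil // the other process already completed the transition
	default:
		return fmt.Errorf("invalid state %q for finishing a stop", c.state)
	}
}

func main() {
	c := &container{id: "fc1bf1ba67e1", state: stateStopping}
	fmt.Println(finishStop(c, 137), c.state, c.exitCode) // <nil> exited 137
}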
@joanbm

joanbm commented Sep 18, 2022

(Posting for discoverability) With Podman 4.2.1 and Docker-Compose 2.11.0, pressing CTRL+C on docker-compose up caused the container to remain in the "Stopped" state, and further CTRL+C presses just printed "no container to kill". 389a4a6 causes it and d759576 fixes it.

EDIT: Apparently I bisected the issue incorrectly; it should be fixed now.

EDIT2: Apparently the fix above doesn't work for containers that have restart: unless-stopped. I can still reproduce the "no container to kill" issue with Podman 4.2.1 + d759576, or with Podman Git as of now, using this compose file:

services:
    test:
        image: alpine:3.16
        command: tail -f /dev/null
        restart: unless-stopped

@rhatdan
Member

rhatdan commented Sep 19, 2022

@joanbm I changed podman 5.2.1 to podman 4.2.1. Unless you're from the future, I think this is what you meant :^)

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 12, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 12, 2023