podman-remote wait: Error getting exit code from DB: no such exit code #14859
Thanks for filing, @edsantiago. @mheon FYI
Another one here #14839
I just noticed that the test uses
While for some call paths we may be doing this redundantly, we need to make sure the exit code is always read at this point. [NO NEW TESTS NEEDED] as containers#14859 is most likely caused by a code path not writing the exit code to the DB - which this commit fixes. Fixes: containers#14859 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
While for some call paths we may be doing this redundantly, we need to make sure the exit code is always read at this point. [NO NEW TESTS NEEDED] as I do not manage to reproduce the issue, which is very likely caused by a code path not writing the exit code when running concurrently. Fixes: containers#14859 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
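As a concrete illustration of what those commits describe, here is a minimal Go sketch of reading conmon's exit file and persisting the code right away, so that a later `wait` always finds it. All names (`Container`, `Store`, `recordExitCode`) are hypothetical stand-ins, not Podman's actual types:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Hypothetical stand-ins for Podman's container and DB state types.
type Container struct {
	id       string
	exitFile string // conmon writes the numeric exit code here
}

type Store struct{ codes map[string]int }

func (s *Store) SaveExitCode(id string, code int) error {
	s.codes[id] = code
	return nil
}

// recordExitCode reads the exit file and persists the code to the DB.
// Doing this redundantly on some call paths is harmless; a call path
// that skips it leaves `podman wait` with no exit code to return.
func recordExitCode(c *Container, db *Store) error {
	data, err := os.ReadFile(c.exitFile)
	if err != nil {
		return fmt.Errorf("reading exit file for %s: %w", c.id, err)
	}
	code, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return fmt.Errorf("parsing exit code for %s: %w", c.id, err)
	}
	return db.SaveExitCode(c.id, code)
}

func main() {
	// Simulate conmon having written an exit code to a file.
	f, _ := os.CreateTemp("", "exit")
	defer os.Remove(f.Name())
	f.WriteString("137\n")
	f.Close()

	db := &Store{codes: map[string]int{}}
	ctr := &Container{id: "abc123", exitFile: f.Name()}
	if err := recordExitCode(ctr, db); err != nil {
		panic(err)
	}
	fmt.Println("stored exit code:", db.codes[ctr.id]) // 137
}
```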
Reopening, sorry. Seen again in two PRs which, as best I can tell, had a branch base that included #14874:
[sys] 203 podman kill - concurrent stop
/me wipes tears Thanks, Ed! I am sure we'll find the needle in the haystack eventually.
Improve the error message when looking up the exit code of a container. The state of the container may help us track down containers#14859 which flakes rarely and is impossible to reproduce on my machine. [NO NEW TESTS NEEDED] Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
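To show what that change buys, here is a rough sketch (illustrative names, not Podman's actual API) of wrapping the DB lookup error with the container's state, so a CI log reveals where the container was when the code went missing:

```go
package main

import (
	"errors"
	"fmt"
)

var errNoSuchExitCode = errors.New("no such exit code")

// lookupExitCode stands in for the DB query that flakes. Including the
// container's state in the error is the whole point: a log line like
// "in state stopping" narrows down which code path lost the race.
func lookupExitCode(codes map[string]int, ctrID, ctrState string) (int, error) {
	code, ok := codes[ctrID]
	if !ok {
		return 0, fmt.Errorf("getting exit code of container %s in state %s: %w",
			ctrID, ctrState, errNoSuchExitCode)
	}
	return code, nil
}

func main() {
	_, err := lookupExitCode(map[string]int{}, "abc123", "stopping")
	fmt.Println(err)
	// getting exit code of container abc123 in state stopping: no such exit code
}
```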
I opened #15038 in the hope that it will help us get a better understanding of the issue.
@edsantiago, since #15038 merged, I am curious to see new cases of the error with the updated error message. My hope is that the added container state will help isolate the error source.
Here you go - hth. remote f36 rootless:
That is indeed extremely helpful. Thank you so much, @edsantiago!
Taking another look. I parsed the code yesterday but could not find anything.
Allow the cleanup process (and others) to transition the container from `stopping` to `exited`. This fixes a race condition detected in containers#14859 where the cleanup process kicks in _before_ the stopping process can read the exit file. Prior to this fix, the cleanup process left the container in the `stopping` state and removed the conmon files, such that the stopping process also left the container in this state as it could not read the exit files. Hence, `podman wait` timed out (see the 23-second execution time of the test [1]) due to the unexpected/invalid state and the test failed. Further, turn the warning during stop into a debug message, since it's a natural race due to the daemonless/concurrent architecture and nothing to worry about. [NO NEW TESTS NEEDED] since we can only monitor whether containers#14859 continues flaking. [1] https://storage.googleapis.com/cirrus-ci-6707778565701632-fcae48/artifacts/containers/podman/6210434704343040/html/sys-remote-fedora-36-rootless-host.log.html#t--00205 Fixes: containers#14859 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
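A toy model of the race and the fix, under loud assumptions (this is not Podman's state machine, just its shape): before the change, only the stopping process could move a container out of `stopping`, so if cleanup consumed the conmon files first the container was stranded there. Letting any process finish the `stopping` to `exited` transition makes the loser of the race a no-op:

```go
package main

import (
	"fmt"
	"sync"
)

type ctrState int

const (
	stateStopping ctrState = iota
	stateExited
)

type container struct {
	mu       sync.Mutex
	state    ctrState
	exitCode int
}

// transitionToExited may now be called by the stopping process *and*
// the cleanup process; whichever arrives second finds the work done.
func (c *container) transitionToExited(code int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.state == stateExited {
		return // the other process already finished the transition
	}
	c.state = stateExited
	c.exitCode = code
}

func main() {
	c := &container{state: stateStopping}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // stopping process and cleanup process race
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.transitionToExited(137)
		}()
	}
	wg.Wait()
	fmt.Printf("final state: %d, exit code: %d\n", c.state, c.exitCode)
	// final state: 1, exit code: 137 (never stranded in stopping)
}
```

Making the transition idempotent is what turns the race from a failure into a harmless no-op for whichever process arrives second.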
(Posting for discoverability) With Podman 4.2.1 and Docker-Compose 2.11.0, pressing CTRL+C on …

EDIT: Apparently I bisected the issue wrong, should be fixed now.

EDIT2: Apparently the fix above doesn't work for containers that have:

```yaml
services:
  test:
    image: alpine:3.16
    command: tail -f /dev/null
    restart: unless-stopped
```
@joanbm I changed podman 5.2.1 to podman 4.2.1. Unless you're from the future, I think this is what you meant :^)
Seen just now:
Only one so far. Seems suspiciously related to issue #14761 and PR #14830.
[sys] 203 podman kill - concurrent stop