sdnotify play kube policies: podman container wait, hangs #16076

Closed
edsantiago opened this issue Oct 6, 2022 · 21 comments · Fixed by #16284 or #16709
Labels
flakes (Flakes from Continuous Integration), kube, locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments)

Comments

@edsantiago
Member

[+1368s] not ok 310 sdnotify : play kube - with policies
...
$ podman container inspect test_pod-a --format {{.Config.SdNotifySocket}}
/var/tmp/-podman-notify-proxy.sock54032352
$ podman logs test_pod-a
/run/notify/notify.sock
READY
$ podman exec test_pod-a /bin/touch /stop
$ podman container inspect 9cffc8377ee1-service --format {{.State.ConmonPid}}
98526
$ podman container wait test_pod-a
timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman’
[ rc=124 (** EXPECTED 0 **) ]
*** TIMED OUT ***

Looks like Sept 9 is the first logged instance.

So far, f36 only (both amd64 and aarch64), root and rootless.
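
For context on the mechanism this test exercises: the container announces readiness by writing `READY=1` to the unix datagram socket named in `NOTIFY_SOCKET` (the proxy socket podman sets up; the container log above shows `/run/notify/notify.sock`), and `podman container wait` is then expected to return once the container exits. A minimal Go sketch of that container-side step, with illustrative error handling, not the test's actual code:

```go
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	// NOTIFY_SOCKET points at the notify socket; inside the test container
	// this is the proxy socket podman provides.
	sock := os.Getenv("NOTIFY_SOCKET")
	if sock == "" {
		fmt.Fprintln(os.Stderr, "NOTIFY_SOCKET not set")
		os.Exit(1)
	}
	conn, err := net.DialUnix("unixgram", nil,
		&net.UnixAddr{Name: sock, Net: "unixgram"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer conn.Close()
	// The readiness message itself: a systemd-style KEY=VALUE assignment.
	if _, err := conn.Write([]byte("READY=1")); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```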

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Oct 6, 2022
@vrothberg
Member

Will take a look, thanks!

@edsantiago
Member Author

Possibly related to #16062?

@vrothberg
Member

I recently attempted to fix some issues down there. @edsantiago, do you know when the flake happened for the first time?

@vrothberg
Member

> I recently attempted to fix some issues down there. @edsantiago, do you know when the flake happened for the first time?

Just saw it at the bottom: "Looks like Sept 9 is the first logged instance."

vrothberg added a commit to vrothberg/libpod that referenced this issue Oct 11, 2022
Start listening for the READY messages on the sdnotify proxies before
starting the Pod.  Otherwise, we may miss messages.

[NO NEW TESTS NEEDED] as it's hard to test this very narrow race.

Related to but may not be fixing containers#16076.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
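
A rough illustration of the ordering this commit describes, not podman's actual code: the proxy socket is bound and drained in a goroutine before the pod is started, so an early READY=1 datagram cannot be lost. The names `waitForReady` and `startPod` are hypothetical.

```go
package sketch

import (
	"net"
	"strings"
)

// waitForReady binds the proxy socket and starts reading it *before*
// startPod runs, which is the ordering the commit above describes.
func waitForReady(socketPath string, startPod func() error) (<-chan struct{}, error) {
	conn, err := net.ListenUnixgram("unixgram",
		&net.UnixAddr{Name: socketPath, Net: "unixgram"})
	if err != nil {
		return nil, err
	}

	ready := make(chan struct{})
	go func() {
		defer conn.Close()
		buf := make([]byte, 4096)
		for {
			n, _, err := conn.ReadFromUnix(buf)
			if err != nil {
				return
			}
			if strings.Contains(string(buf[:n]), "READY=1") {
				close(ready)
				return
			}
		}
	}()

	// The listener is already draining the socket, so starting the pod now
	// cannot lose an early READY=1 datagram.
	if err := startPod(); err != nil {
		return nil, err
	}
	return ready, nil
}
```
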
@vrothberg
Member

I found one potential race that #16118 addresses. The potential race looks extreeeemely narrow to me, so I am not sure whether the PR fixes this issue.

Let's keep an eye open and please let me know if the flakes pops up again after #16118 merges.

@rhatdan rhatdan added the kube label Oct 12, 2022
@vrothberg
Member

Closing as #16118 merged. I am not 100 percent sure it fixes the flake but it's the only potential source for the flake I could spot so far. Let's reopen in case it continues.

mheon pushed a commit to mheon/libpod that referenced this issue Oct 18, 2022
Start listening for the READY messages on the sdnotify proxies before
starting the Pod.  Otherwise, we may miss messages.

[NO NEW TESTS NEEDED] as it's hard to test this very narrow race.

Related to but may not be fixing containers#16076.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@edsantiago
Member Author

Reopening: this is still happening (and yes, I confirmed that the PR in question is forked from a main that includes #16118).

@edsantiago edsantiago reopened this Oct 24, 2022
@vrothberg
Member

Can we close #16246 as a duplicate, or is there a need to track it separately?

@edsantiago
Member Author

edsantiago commented Oct 24, 2022

Seen also in fedora gating tests. [Edit: yes, dup. I was hunting for the log link before closing that one]

@vrothberg
Member

Thanks, @edsantiago! I will take a look at this one. A stubborn issue!

vrothberg added a commit to vrothberg/libpod that referenced this issue Oct 25, 2022
The notify proxy has a watcher to check whether the container has left
the running state.  In that case, Podman should stop waiting for the
ready message to prevent a deadlock.  Fix this watcher by adding a
loop.

Fixes the deadlock in containers#16076 surfacing in a timeout.  The underlying
issue persists, though.  Also use a timer in the select statement to
prevent the goroutine from running unnecessarily long.

[NO NEW TESTS NEEDED]

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
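
A rough sketch of the watcher behavior described in the commit message (hypothetical names, not podman's code): the wait is abandoned either when READY arrives or when a periodic check sees that the container is no longer running, and the ticker is the "timer in the select statement" that keeps the goroutine from running indefinitely.

```go
package sketch

import (
	"context"
	"errors"
	"time"
)

// watchContainer is a stand-in for the watcher: isRunning represents a
// libpod state lookup.  It stops waiting either when READY arrives or when
// the container has left the running state.
func watchContainer(ctx context.Context, ready <-chan struct{}, isRunning func() bool) error {
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ready:
			return nil // READY=1 was received
		case <-ticker.C:
			if !isRunning() {
				return errors.New("container exited before sending READY")
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```
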
@vrothberg
Member

Reopening as I don't expect #16284 to fix the underlying issue. It may help resolve the timeout, but the test should still fail.

@vrothberg
Member

@edsantiago have you seen this one flake recently?

@edsantiago
Member Author

f36 rootless

vrothberg added a commit to vrothberg/libpod that referenced this issue Dec 2, 2022
Does not fully fix containers#16515 as the BARRIER=1 message can, in theory,
occur in a separate subsequent message that will not be read, as
the proxies return/stop when reading the READY=1 message.

[NO NEW TESTS NEEDED] - existing tests are expected to pass and containers#16076
should (finally) stop flaking.

Fixes: containers#16076
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
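
Some background on why that matters (my summary of the sd_notify protocol, not code from this thread): each datagram carries newline-separated KEY=VALUE assignments, and a client may send BARRIER=1 (with a file descriptor) in a separate datagram after READY=1. A proxy whose read loop returns at the first READY=1, roughly as sketched below, never reads that later datagram.

```go
package sketch

import (
	"net"
	"strings"
)

// readUntilReady shows the problematic proxy behavior: it returns as soon
// as READY=1 is seen, so a BARRIER=1 sent in a later, separate datagram is
// never read.  Illustrative only.
func readUntilReady(conn *net.UnixConn) error {
	buf := make([]byte, 4096)
	for {
		n, _, err := conn.ReadFromUnix(buf)
		if err != nil {
			return err
		}
		// Each datagram carries newline-separated KEY=VALUE assignments.
		for _, line := range strings.Split(string(buf[:n]), "\n") {
			if line == "READY=1" {
				// Returning here drops anything sent afterwards,
				// including a BARRIER=1 datagram.
				return nil
			}
		}
	}
}
```
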
@vrothberg
Member

Help is on the way in #16709. This has the highest priority for me this week. Currently refining tests, so I am hopeful the flake will be buried this week.

vrothberg added a commit to vrothberg/libpod that referenced this issue Dec 6, 2022
The flake in containers#16076 is likely related to the notify message not being
delivered/read correctly.  Move sending the message into an exec session
such that flakes will reveal an error message.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Dec 6, 2022
As outlined in containers#16076, a subsequent BARRIER *may* follow the READY
message sent by a container.  To correctly imitate the behavior of
systemd's NOTIFY_SOCKET, the notify proxies spun up by `kube play` must
hence process messages for the entirety of the workload.

We know that the workload is done and that all containers and pods have
exited when the service container exits.  Hence, all proxies are closed
at that time.

The above changes imply that Podman runs for the entirety of the
workload and will henceforth act as the MAINPID when running inside of
systemd.  Prior to this change, the service container acted as the
MAINPID which is now not possible anymore; Podman would be killed
immediately on exit of the service container and could not clean up.

The kube template now correctly transitions to inactive instead of
failed in systemd.

Fixes: containers#16076
Fixes: containers#16515
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
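
A rough sketch of the proxy lifetime this change describes (hypothetical names, not podman's implementation): the read loop keeps forwarding messages for the whole workload and only stops when a context tied to the service container's exit is cancelled, which closes the socket and ends the loop.

```go
package sketch

import (
	"context"
	"net"
)

// runProxy stands in for the per-container notify proxy: it forwards every
// message for the lifetime of the workload and only stops when ctx is
// cancelled, i.e. when the service container has exited.
func runProxy(ctx context.Context, conn *net.UnixConn, forward func([]byte)) {
	go func() {
		<-ctx.Done() // service container exited: shut the proxy down
		conn.Close() // unblocks the blocking read below
	}()

	buf := make([]byte, 4096)
	for {
		n, _, err := conn.ReadFromUnix(buf)
		if err != nil {
			return // socket closed; the workload is done
		}
		// Relay READY=1, BARRIER=1, and any later messages instead of
		// returning after the first READY=1.
		forward(append([]byte(nil), buf[:n]...))
	}
}
```
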
@vrothberg
Member

@edsantiago any indications of a flake after the merge?

@edsantiago
Member Author

In the last ten minutes, no :-)

Patience, grasshopper. Data collection takes time. I'll let you know late December.

@vrothberg
Member

> In the last ten minutes, no :-)

GitHub claims it to be 13 hours :^)

> Patience, grasshopper. Data collection takes time. I'll let you know late December.

Thanks! I intended to be pro-active; it has been quite a flake/ride.

@edsantiago
Member Author

Two flakes on December 14 (after the Dec 8 merge). Both on PR 16781 which, if I'm gitting correctly, was parented on a Dec 7 commit that did not include #16709. So I think we're good. Thank you @vrothberg!

@vrothberg
Member

That was a tough cookie! Thanks for your help and patience, @edsantiago

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 5, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 5, 2023