sdnotify play kube policies: podman container wait, hangs #16076

Closed
edsantiago opened this issue Oct 6, 2022 · 21 comments · Fixed by #16284 or #16709
Labels
flakes (Flakes from Continuous Integration), kube, locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments)

Comments

@edsantiago
Member

[+1368s] not ok 310 sdnotify : play kube - with policies
...
$ podman container inspect test_pod-a --format {{.Config.SdNotifySocket}}
/var/tmp/-podman-notify-proxy.sock54032352
$ podman logs test_pod-a
/run/notify/notify.sock
READY
$ podman exec test_pod-a /bin/touch /stop
$ podman container inspect 9cffc8377ee1-service --format {{.State.ConmonPid}}
98526
$ podman container wait test_pod-a
timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman’
[ rc=124 (** EXPECTED 0 **) ]
*** TIMED OUT ***

Looks like Sept 9 is the first logged instance.

So far, f36 only (both amd64 and aarch64), root and rootless.
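
For context on the mechanism this test exercises: the container announces readiness by writing `READY=1` to the unix datagram socket named in `NOTIFY_SOCKET` (the proxy socket podman sets up; the container log above shows `/run/notify/notify.sock`), and `podman container wait` is then expected to return once the container exits. A minimal Go sketch of that container-side step, with illustrative error handling, not the test's actual code:

```go
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	// NOTIFY_SOCKET points at the notify socket; inside the test container
	// this is the proxy socket podman provides.
	sock := os.Getenv("NOTIFY_SOCKET")
	if sock == "" {
		fmt.Fprintln(os.Stderr, "NOTIFY_SOCKET not set")
		os.Exit(1)
	}
	conn, err := net.DialUnix("unixgram", nil,
		&net.UnixAddr{Name: sock, Net: "unixgram"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer conn.Close()
	// The readiness message itself: a systemd-style KEY=VALUE assignment.
	if _, err := conn.Write([]byte("READY=1")); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```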

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Oct 6, 2022
@vrothberg
Member

Will take a look, thanks!

@edsantiago
Member Author

Possibly related to #16062?

@vrothberg
Member

I recently attempted to fix some issues down there. @edsantiago, do you know when the flake happened for the first time?

@vrothberg
Member

> I recently attempted to fix some issues down there. @edsantiago, do you know when the flake happened for the first time?

Just saw it at the bottom: "Looks like Sept 9 is the first logged instance."

vrothberg added a commit to vrothberg/libpod that referenced this issue Oct 11, 2022
Start listening for the READY messages on the sdnotify proxies before
starting the Pod.  Otherwise, we may miss messages.

[NO NEW TESTS NEEDED] as it's hard to test this very narrow race.

Related to but may not be fixing containers#16076.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
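
A rough illustration of the ordering this commit describes, not podman's actual code: the proxy socket is bound and drained in a goroutine before the pod is started, so an early READY=1 datagram cannot be lost. The names `waitForReady` and `startPod` are hypothetical.

```go
package sketch

import (
	"net"
	"strings"
)

// waitForReady binds the proxy socket and starts reading it *before*
// startPod runs, which is the ordering the commit above describes.
func waitForReady(socketPath string, startPod func() error) (<-chan struct{}, error) {
	conn, err := net.ListenUnixgram("unixgram",
		&net.UnixAddr{Name: socketPath, Net: "unixgram"})
	if err != nil {
		return nil, err
	}

	ready := make(chan struct{})
	go func() {
		defer conn.Close()
		buf := make([]byte, 4096)
		for {
			n, _, err := conn.ReadFromUnix(buf)
			if err != nil {
				return
			}
			if strings.Contains(string(buf[:n]), "READY=1") {
				close(ready)
				return
			}
		}
	}()

	// The listener is already draining the socket, so starting the pod now
	// cannot lose an early READY=1 datagram.
	if err := startPod(); err != nil {
		return nil, err
	}
	return ready, nil
}
```
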
@vrothberg
Member

I found one potential race that #16118 addresses. The potential race looks extreeeemely narrow to me, so I am not sure whether the PR fixes this issue.

Let's keep an eye open and please let me know if the flakes pops up again after #16118 merges.

@rhatdan rhatdan added the kube label Oct 12, 2022
@vrothberg
Member

Closing as #16118 merged. I am not 100 percent sure it fixes the flake but it's the only potential source for the flake I could spot so far. Let's reopen in case it continues.

mheon pushed a commit to mheon/libpod that referenced this issue Oct 18, 2022
Start listening for the READY messages on the sdnotify proxies before
starting the Pod.  Otherwise, we may miss messages.

[NO NEW TESTS NEEDED] as it's hard to test this very narrow race.

Related to but may not be fixing containers#16076.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@edsantiago
Member Author

Reopening: this is still happening (and yes, I confirmed that the PR in question is forked from a main that includes #16118).

@edsantiago edsantiago reopened this Oct 24, 2022
@vrothberg
Member

Can we close #16246 as a duplicate, or is there a need to track it separately?

@edsantiago
Member Author

edsantiago commented Oct 24, 2022

Seen also in fedora gating tests. [Edit: yes, dup. I was hunting for the log link before closing that one]

@vrothberg
Member

Thanks, @edsantiago! I will take a look at this one. A stubborn issue!

vrothberg added a commit to vrothberg/libpod that referenced this issue Oct 25, 2022
The notify proxy has a watcher to check whether the container has left
the running state.  In that case, Podman should stop waiting for the
ready message to prevent a deadlock.  Fix this watcher by adding a
loop.

Fixes the deadlock in containers#16076 surfacing in a timeout.  The underlying
issue persists, though.  Also use a timer in the select statement to
prevent the goroutine from running unnecessarily long.

[NO NEW TESTS NEEDED]

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
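
A rough sketch of the watcher behavior described in the commit message (hypothetical names, not podman's code): the wait is abandoned either when READY arrives or when a periodic check sees that the container is no longer running, and the ticker is the "timer in the select statement" that keeps the goroutine from running indefinitely.

```go
package sketch

import (
	"context"
	"errors"
	"time"
)

// watchContainer is a stand-in for the watcher: isRunning represents a
// libpod state lookup.  It stops waiting either when READY arrives or when
// the container has left the running state.
func watchContainer(ctx context.Context, ready <-chan struct{}, isRunning func() bool) error {
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ready:
			return nil // READY=1 was received
		case <-ticker.C:
			if !isRunning() {
				return errors.New("container exited before sending READY")
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```
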
@vrothberg
Member

Reopening as I don't expect #16284 to fix the underlying issue. It may help resolve the timeout, but the test should still fail.

@vrothberg
Member

@edsantiago have you seen this one flake recently?

@edsantiago
Member Author

f36 rootless

vrothberg added a commit to vrothberg/libpod that referenced this issue Dec 2, 2022
Does not fully fix containers#16515 as the BARRIER=1 message can, in theory,
occur in a separate subsequent message that will not be read, as
the proxies return/stop when reading the READY=1 message.

[NO NEW TESTS NEEDED] - existing tests are expected to pass and containers#16076
should (finally) stop flaking.

Fixes: containers#16076
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
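
Some background on why that matters (my summary of the sd_notify protocol, not code from this thread): each datagram carries newline-separated KEY=VALUE assignments, and a client may send BARRIER=1 (with a file descriptor) in a separate datagram after READY=1. A proxy whose read loop returns at the first READY=1, roughly as sketched below, never reads that later datagram.

```go
package sketch

import (
	"net"
	"strings"
)

// readUntilReady shows the problematic proxy behavior: it returns as soon
// as READY=1 is seen, so a BARRIER=1 sent in a later, separate datagram is
// never read.  Illustrative only.
func readUntilReady(conn *net.UnixConn) error {
	buf := make([]byte, 4096)
	for {
		n, _, err := conn.ReadFromUnix(buf)
		if err != nil {
			return err
		}
		// Each datagram carries newline-separated KEY=VALUE assignments.
		for _, line := range strings.Split(string(buf[:n]), "\n") {
			if line == "READY=1" {
				// Returning here drops anything sent afterwards,
				// including a BARRIER=1 datagram.
				return nil
			}
		}
	}
}
```
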
@vrothberg
Member

Help is on the way in #16709. This has the highest priority for me this week. Currently refining tests, so I am hopeful the flake will be buried this week.

vrothberg added a commit to vrothberg/libpod that referenced this issue Dec 6, 2022
The flake in containers#16076 is likely related to the notify message not being
delivered/read correctly.  Move sending the message into an exec session
such that flakes will reveal an error message.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Dec 6, 2022
As outlined in containers#16076, a subsequent BARRIER *may* follow the READY
message sent by a container.  To correctly imitate the behavior of
systemd's NOTIFY_SOCKET, the notify proxies spun up by `kube play` must
hence process messages for the entirety of the workload.

We know that the workload is done and that all containers and pods have
exited when the service container exits.  Hence, all proxies are closed
at that time.

The above changes imply that Podman runs for the entirety of the
workload and will henceforth act as the MAINPID when running inside of
systemd.  Prior to this change, the service container acted as the
MAINPID which is now not possible anymore; Podman would be killed
immediately on exit of the service container and could not clean up.

The kube template now correctly transitions to inactive instead of
failed in systemd.

Fixes: containers#16076
Fixes: containers#16515
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
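
A rough sketch of the proxy lifetime this change describes (hypothetical names, not podman's implementation): the read loop keeps forwarding messages for the whole workload and only stops when a context tied to the service container's exit is cancelled, which closes the socket and ends the loop.

```go
package sketch

import (
	"context"
	"net"
)

// runProxy stands in for the per-container notify proxy: it forwards every
// message for the lifetime of the workload and only stops when ctx is
// cancelled, i.e. when the service container has exited.
func runProxy(ctx context.Context, conn *net.UnixConn, forward func([]byte)) {
	go func() {
		<-ctx.Done() // service container exited: shut the proxy down
		conn.Close() // unblocks the blocking read below
	}()

	buf := make([]byte, 4096)
	for {
		n, _, err := conn.ReadFromUnix(buf)
		if err != nil {
			return // socket closed; the workload is done
		}
		// Relay READY=1, BARRIER=1, and any later messages instead of
		// returning after the first READY=1.
		forward(append([]byte(nil), buf[:n]...))
	}
}
```
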
@vrothberg
Member

@edsantiago any indications of a flake after the merge?

@edsantiago
Member Author

In the last ten minutes, no :-)

Patience, grasshopper. Data collection takes time. I'll let you know late December.

@vrothberg
Member

> In the last ten minutes, no :-)

GitHub claims it to be 13 hours :^)

> Patience, grasshopper. Data collection takes time. I'll let you know late December.

Thanks! I intended to be pro-active; it has been quite a flake/ride.

@edsantiago
Member Author

Two flakes on December 14 (after the Dec 8 merge). Both on PR 16781 which, if I'm gitting correctly, was parented on a Dec 7 commit that did not include #16709. So I think we're good. Thank you @vrothberg!

@vrothberg
Member

That was a tough cookie! Thanks for your help and patience, @edsantiago

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 5, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 5, 2023