test: issues with DaemonSet pods not coming up after a series of reboots #9870
Status: Closed
Tracked by #9825
Commits referencing this issue:
- Dec 4, 2024 — smira added a commit to smira/talos: "Fixes siderolabs#9870" (Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>)
- Dec 9, 2024 — smira added a commit to smira/talos: "Fixes siderolabs#9870" (Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>; cherry picked from commit 77e9db4)
- Dec 16, 2024 — smira added a commit to smira/talos: "See siderolabs#9870" (Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>)
- Dec 16, 2024 — smira added a commit to smira/talos: "See siderolabs#9870" (Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>)
- Dec 17, 2024 — smira added a commit to smira/talos: "See siderolabs#9870" (Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>; cherry picked from commit 9470e84)
In Talos integration tests, the tests perform a series of (pretty frequent) reboots, specifically the tests which run with encrypted volumes, as encryption config changes require a reboot.

This was specifically triggered by the test in #9834, which adds two more reboots and, due to the test ordering, runs right after the encryption tests.

The issue is somewhat random and shows up as the test timing out on the cluster health check, with an error that the number of ready `kube-proxy` pods doesn't reach the desired value (3 out of 4).

Analysis
When the node goes into a reboot cycle, Talos instructs the `kubelet` to do a graceful shutdown, which terminates the pods, including `DaemonSet` pods. There is a bit of a race with `kube-scheduler` there, but in the end there will be a pod in the phase `Failed`, because the `kubelet` denies new pods while it is itself in the graceful shutdown phase.

As the machine comes back up after a reboot, an existing pod in the `Failed` state prevents a new pod from being scheduled on the node for some time.

The `Failed` pods are supposed to be cleaned up by the `DaemonSetsController` in the `kube-controller-manager`: https://github.com/kubernetes/kubernetes/blob/8046362e6ff74ee18776e0cdb90ead62c577d607/pkg/controller/daemon/daemon_controller.go#L804-L826

That cleanup has a backoff, introduced in kubernetes/kubernetes#65309 to fight other issues related to misconfigured pods.
But after a series of reboots, the backoff delays the cleanup of a failed pod long enough for the Talos cluster health checks to fail. The higher the reboot rate, the more often the issue pops up.
Solutions

It looks like the backoff is hardcoded and can't be removed or reconfigured via any options: https://github.com/kubernetes/kubernetes/blob/8046362e6ff74ee18776e0cdb90ead62c577d607/cmd/kube-controller-manager/app/apps.go#L51-L52

- Restart `kube-controller-manager` on reboots (e.g. on non-controlplane reboots). It should help, as the backoff is in memory.
- … `nodeName`, this would give us roughly twice the reboot rate.
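As a sketch of the `nodeName`-based cleanup idea, a client could list a node's `Failed` pods with a field selector and delete them on boot. The Go snippet below only constructs the API request path; the node name `worker-1` is a hypothetical example, and this is not Talos's actual implementation:

```go
package main

import (
	"fmt"
	"net/url"
)

// failedPodsSelector builds the core API request path for listing pods
// that are bound to the given node and stuck in the Failed phase.
// Both spec.nodeName and status.phase are supported pod field selectors.
func failedPodsSelector(nodeName string) string {
	sel := fmt.Sprintf("spec.nodeName=%s,status.phase=Failed", nodeName)
	v := url.Values{}
	v.Set("fieldSelector", sel)
	return "/api/v1/pods?" + v.Encode()
}

func main() {
	// A boot-time cleanup task could issue this list, then delete the
	// matching pods so the DaemonSetsController backoff never applies.
	fmt.Println(failedPodsSelector("worker-1"))
}
```

Deleting the stale `Failed` pods directly sidesteps the in-memory backoff entirely, since the `DaemonSetsController` no longer has anything to clean up for that node.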