Single node cluster does not recover from SIGHUP to containerd #9271
Comments
Can you please send a support bundle? The core issue, I guess, is that Talos lost track of being ready to run a control plane, but I'm not sure why. |
Here you go. Happy to provide more info if needed. |
That bundle was generated with the cluster in a good steady state, not after SIGHUP-ing containerd. I can generate another one in the bad state, though I am not sure that will work since the apiserver won't be reachable. |
Bad state is interesting, |
Here you go. |
Thank you, I'll take a look! |
The root cause of the bug is:
metadata:
    namespace: runtime
    type: Services.v1alpha1.talos.dev
    id: etcd
    version: 1
    owner: v1alpha1.ServiceController
    phase: running
    created: 2024-09-04T17:52:50Z
    updated: 2024-09-04T17:52:50Z
spec:
    running: true
    healthy: false
    unknown: true
Due to a missing internal event, Talos considers etcd to be not healthy, and doesn't run Kubernetes control plane pods. |
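To illustrate why this resource state blocks the control plane, here is a minimal sketch in Python (not the actual Talos code). The `ServiceSpec` class and `ready_for_control_plane` function are hypothetical names, and the gating rule itself is an assumption: health only counts once it has actually been observed.

```python
from dataclasses import dataclass


@dataclass
class ServiceSpec:
    """Mirrors the three spec fields of the etcd service resource shown above."""
    running: bool
    healthy: bool
    unknown: bool


def ready_for_control_plane(etcd: ServiceSpec) -> bool:
    # Assumption: an "unknown" health state is treated as not ready,
    # so a running service still blocks the control plane until a
    # health event flips healthy to True and unknown to False.
    return etcd.running and etcd.healthy and not etcd.unknown


# State from the bug report: running, but health was never re-reported.
stuck = ServiceSpec(running=True, healthy=False, unknown=True)
# State expected once a health change event has been delivered.
recovered = ServiceSpec(running=True, healthy=True, unknown=False)

print(ready_for_control_plane(stuck))      # False
print(ready_for_control_plane(recovered))  # True
```

Under this assumption, a service that is running but whose health is still unknown never passes the gate, which matches the observed behavior of the control plane pods not being started.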
I'm experiencing the same. In my case, the steps to reproduce are:
I suspect the VIP is then removed after step 10. |
Your case is not the same (if you have a problem, please open a separate issue with relevant support logs attached), and the VIP is not supposed to work for workloads, only for the Kubernetes API server. |
Otherwise the internal code might assume that the service is still running and healthy, never issuing a health change event. Fixes siderolabs#9271 Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Otherwise the internal code might assume that the service is still running and healthy, never issuing a health change event. Fixes siderolabs#9271 Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com> (cherry picked from commit 07b9179)
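The commit message describes an edge-triggered pattern: a health event is only issued when the health state changes. The following Python sketch shows, under assumptions, how such a tracker can miss an event after a restart unless its cached state is reset. `HealthTracker` and `simulate` are hypothetical names for illustration and do not reflect the actual Talos implementation.

```python
from typing import Callable, List, Optional


class HealthTracker:
    """Emits an event only when the reported health differs from the last one."""

    def __init__(self, emit: Callable[[bool], None]) -> None:
        self._emit = emit
        self._last: Optional[bool] = None  # None means "health unknown"

    def report(self, healthy: bool) -> None:
        # Edge-triggered: repeated identical reports produce no event.
        if healthy != self._last:
            self._last = healthy
            self._emit(healthy)

    def reset(self) -> None:
        # Forget the cached state so the next report always emits.
        self._last = None


def simulate(reset_on_restart: bool) -> List[bool]:
    events: List[bool] = []
    tracker = HealthTracker(events.append)
    tracker.report(True)      # first health check passes
    # The service restarts here (for example, after its runtime is signalled).
    if reset_on_restart:
        tracker.reset()
    tracker.report(True)      # health check passes again after the restart
    return events


print(simulate(reset_on_restart=False))  # [True]
print(simulate(reset_on_restart=True))   # [True, True]
```

Without the reset, the second healthy report looks like "no change", so any consumer whose own view was cleared by the restart never learns that the service is healthy again.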
Bug Report
Description
On a single node test cluster, sending SIGHUP to containerd causes the cluster to remain in a bad state until reboot.
Reproduction
Logs
The controller-runtime logs repeat from then on.
Environment
Talos version:
Client:
    Tag: v1.7.6
    SHA: ae67123
    Built:
    Go version: go1.22.5
    OS/Arch: linux/amd64
Server:
    NODE: 192.168.1.13
    Tag: v1.7.6
    SHA: ae67123-dirty
    Built:
    Go version: go1.22.5
    OS/Arch: linux/amd64
    Enabled: RBAC
Kubernetes version: v1.30.4
Platform: baremetal x86-64