🐛 Talos turns shutdown into reboot event #7854

Closed
Tracked by #7561
salkin opened this issue Oct 13, 2023 · 0 comments · Fixed by #8028
salkin commented Oct 13, 2023

Bug Report

Talos turns a shutdown event into a reboot when a misbehaving pod does not obey SIGTERM.

Description

After a shutdown is triggered through the Shutdown API, Talos starts the shutdown sequence, but the sequence turns into a reboot when one misbehaving pod fails to stop.

Logs

[ 1571.984742] [talos] stopped pod default/tpm-device-plugin-wz92s
[ 1572.284172] [talos] task stopAllPods (1/1): failed: failed stopping pod FILTERED_OUT_POD: ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:810be2f79aa5cc74e8fd6029d3f3c2d20c9061aad38a9ad20899ea44c39ecd38,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
[ 1574.050049] [talos] phase cleanup (1/9): failed
[ 1574.287490] [talos] shutdown sequence: failed
[ 1574.515824] [talos] shutdown failed: error running phase 1 in shutdown sequence: task 1/1: failed, failed stopping pod app-teamcomms-stable-912bmhxawq/mcs-1-6bcbcb69d7-v6drz: ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:810be2f79aa5cc74e8fd6029d3f3c2d20c9061aad38a9ad20899ea44c39ecd38,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
[ 1576.481870] [talos] service[apid](Stopping): Sending SIGTERM to task apid (PID 1913, container apid)
[ 1576.944240] [talos] service[etcd](Stopping): Sending SIGTERM to task etcd (PID 4124, container etcd)
[ 1577.407055] [talos] service[udevd](Stopping): Sending SIGTERM to Process(["/sbin/udevd" "--resolve-names=never"])
[ 1577.924690] [talos] service[machined](Finished): Service finished successfully
[ 1578.294614] [talos] service[trustd](Stopping): Sending SIGTERM to task trustd (PID 4047, container trustd)
[ 1578.783092] [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
[ 1579.493072] [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
[ 1580.164745] [talos] service[udevd](Finished): Service finished successfully
[ 1580.521130] [talos] service[apid](Finished): Service finished successfully
[ 1580.874347] [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
[ 1581.546399] [talos] service[etcd](Finished): Service finished successfully
[ 1581.900214] [talos] service[cri](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"])
[ 1582.719632] [talos] service[cri](Finished): Service finished successfully
[ 1583.159952] [talos] service[trustd](Finished): Service finished successfully
[ 1583.522368] [talos] service[containerd](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"])
[ 1584.538867] [talos] service[containerd](Finished): Service finished successfully
[ 1584.921394] [talos] fatal sequencer error in "shutdown" sequence: message:"sequence failed: error running phase 1 in shutdown sequence: task 1/1: failed, failed stopping pod FILTERED_OUT_POD: ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:810be2f79aa5cc74e8fd6029d3f3c2d20c9061aad38a9ad20899ea44c39ecd38,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
[ 1587.168198] [talos] rebooting in 10 seconds
[ 1588.395069] [talos] rebooting in 9 seconds
[ 1589.617832] [talos] rebooting in 8 seconds
[ 1590.838854] [talos] rebooting in 7 seconds
[ 1592.061347] [talos] rebooting in 6 seconds
[ 1593.282258] [talos] rebooting in 5 seconds
[ 1594.502522] [talos] rebooting in 4 seconds
[ 1595.723035] [talos] rebooting in 3 seconds
[ 1596.943063] [talos] rebooting in 2 seconds
[ 1597.625279] [talos] controller runtime finished
[ 1598.162033] [talos] rebooting in 1 seconds
[ 1599.379261] [talos] rebooting in 0 seconds

Environment

  • Talos version: Talos 1.4.6
  • Kubernetes version: 1.27.6
  • Platform: bare-metal
@smira smira self-assigned this Oct 16, 2023
@smira smira changed the title Talos turns shutdown into reboot event 🐛 Talos turns shutdown into reboot event Nov 28, 2023
smira added a commit to smira/talos that referenced this issue Dec 4, 2023
Fixes siderolabs#7854

Talos runs an emergency handler if the sequence experiences an
unrecoverable failure. The emergency handler was unconditionally
executing the "reboot" action if no other action was received (which
only happens if the sequence completes successfully), so a Shutdown
request could result in a reboot on an error during the shutdown
phase.

This is not a pretty fix, but it's hard to deliver the intent from one
part of the code to another right now, so instead use a global variable
which stores the default emergency intention and gets overridden early
in the Shutdown sequence.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
smira added a commit to smira/talos that referenced this issue Dec 8, 2023
Fixes siderolabs#7854
(cherry picked from commit 474fa04)
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 7, 2024