draining but never gets to reboot #27

Closed
davidkarlsen opened this issue Aug 15, 2018 · 3 comments

Comments

@davidkarlsen
Collaborator

Kured kicked in as expected and disabled scheduling for my node:

app03.lan.davidkarlsen.com   Ready,SchedulingDisabled   <none>    62d       v1.11.2

I can see a number of pods being killed:

root@app03:/var/log/containers# tail kured-vkwnk_kube-system_kured-83287b3a6ba5d8a4dfd8a22822932a1655b71cc2ca2bfbd5007f5d389992100c.log 
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"kube-system-kubernetes-dashboard-proxy-55c7756d46-dsqzq\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.066034203Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"coredns-78fcdf6894-k87gl\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.066110928Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"monitoring-grafana-788f47b84-bkggz\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.066134368Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"monitoring-prometheus-alertmanager-cbcc46d55-gwkqz\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.179296096Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"logging-cerebro-6794fc6bc6-t26v9\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.179378188Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"monocular-monocular-mongodb-5644f785b9-24tmz\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.202395762Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"monitoring-prometheus-blackbox-exporter-7775df5698-86s67\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.202444091Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"monitoring-prometheus-server-75bfb9f66-xm9vp\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.249172642Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"kube-ops-view-kube-ops-view-kube-ops-view-6db67848c4-krmx8\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.449382938Z"}
{"log":"time=\"2018-08-15T09:47:13Z\" level=info msg=\"pod \\\"logging-elasticsearch-client-5978d8f465-t9kkm\\\" evicted\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T09:47:13.649729383Z"}
root@app03:/var/log/containers# 

but then nothing more happens. I guess it fails at something, but the logs should say why.

These pods are left (mainly DaemonSet pods, except for the nginx-ingress ones):

Non-terminated Pods:         (9 in total)
  Namespace                  Name                                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                                          ------------  ----------  ---------------  -------------
  auditbeat                  auditbeat-auditbeat-q4k65                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  datadog                    datadog-datadog-agent-datadog-pr9w5                           200m (2%)     200m (2%)   256Mi (1%)       256Mi (1%)
  kube-system                calico-node-qlmcc                                             250m (3%)     0 (0%)      0 (0%)           0 (0%)
  kube-system                kube-proxy-5cf42                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                kube-system-nginx-ingress-controller-84f76b76cb-jp8dr         0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                kube-system-nginx-ingress-default-backend-6b557bb97c-vlfqc    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                kured-vkwnk                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  logging                    fluent-bit-2djgl                                              100m (1%)     0 (0%)      100Mi (0%)       100Mi (0%)
  monitoring                 monitoring-prometheus-node-exporter-6jl6k                     0 (0%)        0 (0%)      0 (0%)           0 (0%)

any hints?
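
To see what the drain is actually stuck on, the same drain can be re-run by hand; this is only a sketch, where the node name is the one above and the flags mirror what kured passes to kubectl around v1.11 (newer kubectl renames --delete-local-data to --delete-emptydir-data):

# List what is still scheduled on the node
kubectl get pods --all-namespaces --field-selector spec.nodeName=app03.lan.davidkarlsen.com

# Re-run the drain interactively; a blocked eviction prints an error such as
# "Cannot evict pod as it would violate the pod's disruption budget" and keeps retrying
kubectl drain app03.lan.davidkarlsen.com --ignore-daemonsets --delete-local-data --force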

@davidkarlsen
Collaborator Author

Update:
I deleted the nginx pods with kubectl, and the node then actually rebooted:

{"log":"time=\"2018-08-15T13:33:53Z\" level=info msg=\"node \\\"app03.lan.davidkarlsen.com\\\" drained\" cmd=/usr/bin/kubectl std=out\n","stream":"stderr","time":"2018-08-15T13:33:53.526347753Z"}
{"log":"time=\"2018-08-15T13:33:53Z\" level=info msg=\"Commanding reboot\"\n","stream":"stderr","time":"2018-08-15T13:33:53.530629662Z"}
{"log":"time=\"2018-08-15T13:34:23Z\" level=warning msg=\"Failed to set wall message, ignoring: Connection reset by peer\" cmd=/bin/systemctl std=err\n","stream":"stderr","time":"2018-08-15T13:34:23.90611995Z"}
{"log":"time=\"2018-08-15T13:34:23Z\" level=warning msg=\"Failed to reboot system via logind: Transport endpoint is not connected\" cmd=/bin/systemctl std=err\n","stream":"stderr","time":"2018-08-15T13:34:23.906204151Z"}

But shouldn't this be reported in some way?

Also, what can cause the failure messages at the end?
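
For context, the cmd=/bin/systemctl lines suggest kured triggers the reboot roughly as below, from inside its own container, relying on the host's D-Bus socket being bind-mounted into the pod. This is a sketch only; the exact invocation and mounts depend on the kured version and manifest.

# Roughly what kured runs when it logs "Commanding reboot" (sketch)
/bin/systemctl reboot
# The "Failed to set wall message" / "Transport endpoint is not connected"
# warnings are systemctl losing its D-Bus connection to logind, most likely
# because the reboot was already in progress and D-Bus went away mid-call;
# the node came back up here, so they appear to be benign.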

@brantb

brantb commented Oct 22, 2018

I was hit by this too. It turns out kubectl drain will hang if a PodDisruptionBudget prevents a pod from being evicted, and the nginx-ingress Helm chart creates a couple of these by default. A newer version of the chart fixes this: helm/charts#7127

kubectl drain would eventually have returned if the nginx-ingress pods had been removed from the node for some other reason, such as being manually deleted.
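
A quick way to confirm a PDB is what's blocking the drain; this is a sketch, and the PDB/deployment names below are guesses based on the pod names in the table above:

# PDBs with ALLOWED DISRUPTIONS = 0 will block kubectl drain indefinitely
kubectl get pdb --all-namespaces

# Inspect the suspect budget; with a single replica and minAvailable: 1,
# no disruption is ever allowed
kubectl describe pdb -n kube-system kube-system-nginx-ingress-controller

# Possible workarounds until the chart is upgraded: delete the PDB, or scale
# the controller so at least one disruption becomes allowed
kubectl -n kube-system scale deployment kube-system-nginx-ingress-controller --replicas=2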

@davidkarlsen
Collaborator Author

I think we can close this, as it was actually a PDB problem rather than a kured issue. I fixed the chart.
