
kube-aws: Drain nodes before shutting them down #465

Closed
wants to merge 13 commits

Conversation

mumoshu
Contributor

@mumoshu mumoshu commented May 5, 2016

Drain nodes before shutting them down to give running pods time to gracefully stop.

This change achieves that by running docker run IMAGE kubectl drain THE_NODE --force on the to-be-shut-down node before the kubelet receives SIGTERM during CoreOS's shutdown process.

Depends #608
Refs #340

Without this change, a kubelet receiving SIGTERM without a prior drain leaves non-functional (actually unschedulable) pods with the status Ready.

With this change, when an ASG's desired capacity is decreased, a node's status changes over time as follows:

On desired cap change:
=> STATUS=Ready

On shut-down started:
=> STATUS=Ready,SchedulingDisabled (<- Pods are stopped and status is changed by kubectl drain)

On shut-down finished:
=> STATUS=NotReady,SchedulingDisabled (<- It's NotReady, but this doesn't cause downtime because both further scheduling and the pods have already been stopped)

After a minute:
=> The node disappears from the output of kubectl get nodes

Note that:

  • This applies to manual shutdowns (for example, via running sudo systemctl shutdown) and automated shutdowns (triggered by AWS AutoScaling when nodes are rotated out of a group)
  • We currently depend on the community docker image mumoshu/kubectl because the kubectl included in the official coreos/hyperkube image doesn't work due to the Kubernetes issue "hyperkube + kubectl - unknown flag" (kubernetes/kubernetes#24088). Once the issue is fixed and the CoreOS team publishes a new hyperkube image with the updated Kubernetes, we can remove that dependency.
  • I consider this an experimental feature, so you shouldn't expect the configuration API around it to be stable; it may change in the future.
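For illustration, here is a minimal, hand-written sketch of the pattern described above: a oneshot systemd unit ordered after kubelet.service whose ExecStop drains the node, so systemd runs the drain before stopping the kubelet at shutdown. The unit name, image tag, kubeconfig path, and drain flags are taken from elsewhere in this PR; the --net=host/volume-mount wiring and the enable/start commands are assumptions, not necessarily what the PR's cloud-config actually generates.

```
# Sketch only: install a drainer unit by hand (kube-aws generates this via
# cloud-config instead). Assumes docker, kubelet.service, and the worker
# kubeconfig already exist on the node.
sudo tee /etc/systemd/system/kube-node-drainer.service <<'EOF'
[Unit]
Description=drain this k8s node to give running pods time to gracefully shut down before stopping kubelet
Requires=docker.service kubelet.service
After=docker.service kubelet.service

[Service]
Type=oneshot
RemainAfterExit=true
# Started (as a no-op) at boot so that ExecStop= runs during shutdown.
ExecStart=/bin/true
# systemd stops units in reverse start order, so this runs while the kubelet
# is still alive and the API server still sees the node.
# ($$ is systemd's escape for a literal $, so sh receives $(hostname).)
ExecStop=/bin/sh -c 'docker run --rm --net=host -v /etc/kubernetes:/etc/kubernetes mumoshu/kubectl:v1.2.2 kubectl --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml drain $$(hostname) --force --ignore-daemonsets'

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable kube-node-drainer.service
sudo systemctl start kube-node-drainer.service
```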

@colhom
Contributor

colhom commented May 5, 2016

@mumoshu this looks awesome, very exciting!

I don't know that we need the experimental bit. I would say that this behavior is "universally desirable", so we may just want to run it through conformance tests and make node-draining on shutdown non-configurable. If for some odd reason (testing, etc.) somebody wants to disable this, that should be done at a lower level (disabling the service in cloud-config, etc.).

\cc @aaronlevy @cgag

--kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml \
drain $$(hostname) \
--force'

Contributor


Does this also need --ignore-daemonsets=true? The man page implies to me that you'd need both if you had a daemonset and an unmanaged pod, even though the name "force" makes me think "force" would be enough.
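For reference, the combination being suggested would look like the following: --force only covers pods that aren't managed by any controller, while --ignore-daemonsets is what lets the drain proceed past DaemonSet-managed pods.

```
# Cordon the node and evict/delete its pods, skipping DaemonSet-managed ones
# and forcing removal of pods not managed by a controller.
kubectl drain "$(hostname)" --force --ignore-daemonsets
```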

@aaronlevy
Contributor

@mumoshu can you provide a little more detail regarding:

Without this change, a kubelet receiving SIGTERM without a prior drain leaves non-functional (actually unschedulable) pods with the status Ready

enable: true
content: |
[Unit]
Description=drain this k8s node to give running pods time to gracefully shut down before stopping kubelet
Contributor


I'm not seeing how this unit keeps the kubelet running until drained.

Contributor

@cgag cgag May 5, 2016


Looks like systemd shuts down services in the reverse of the boot order [1], so After=kubelet.service should mean this unit is shut down before the kubelet.

[1] https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Before=
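If anyone wants to double-check that ordering on a running node, the dependency properties can be inspected directly (the unit name here is the one that appears elsewhere in this PR):

```
# A unit ordered After=kubelet.service starts after the kubelet and is
# therefore stopped before it during shutdown; inspect the ordering with:
systemctl show kube-node-drainer.service -p After -p Before
```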

Contributor


ah, thanks @cgag

@colhom
Contributor

colhom commented May 6, 2016

To recap in-person discussions between @aaronlevy, @philips and myself:

Summary: we will leave this node-draining behavior gated on experimental for now, as @mumoshu has it in this PR, defaulting to false.

Node draining is certainly a good thing to do if you're taking a node out of commission forever. If you're just rebooting the machine, it's not clear whether draining it is desirable. Factors in this decision include the number of pods on that node, how long it takes to re-schedule them, and whether or not those pods have any affinity for resources on that host. (An example discussed was a local cache implemented via a host mount.)

In the future:

  • Taints and tolerations could provide a means to express which pods/nodes will "be drained on shutdown", and which ones won't.
  • Integrate node-draining with events more specific than "we're shutting down." The most salient additional detail there is "is this node coming back soon, or is it gone?"

@mumoshu: could we get a rebase on this PR please ;)

@mumoshu mumoshu force-pushed the drain-node-before-shutdown branch from 9eb9197 to eb27164 on May 8, 2016 04:21
@mumoshu
Contributor Author

mumoshu commented May 8, 2016

@colhom @aaronlevy @cgag @philips Thanks for your review 🙇
I have just rebased this.

Regarding the experimental bit, I agree with leaving it as is.

The context is that I believe my tests around this functionality don't cover edge cases I can't even imagine, hence experimental. As I'm planning to deploy CoreOS+K8s to my production environment, I wish I could end up with a battle-tested solution we could mark non-experimental with some confidence. Googling the whole internet and consulting myself hasn't revealed anything like that for me, though 😃

The example you mentioned of a local cache implemented via a host mount is very insightful for me. I personally think of it as one of those (not even edge?) cases.

Also, the integration based on taints-and-tolerations and more fine-grained events seems very interesting/promising to me 👍

@mumoshu
Contributor Author

mumoshu commented May 9, 2016

@cgag Thanks for the good catch #465 (comment)!

I have confirmed that it was actually failing when draining a node running DaemonSet-managed pods and fixed it in ae43f64

@mumoshu
Contributor Author

mumoshu commented May 9, 2016

#465 (comment)

@aaronlevy Sure. I had hoped a kubelet would stop gracefully when CoreOS tries to stop it during the shutdown process (by sending SIGTERM to it via systemd, via docker), so I tested it. My test showed that the kubelet does not stop gracefully while CoreOS is shutting down.

I confirmed it by repeatedly running kubectl get po,no and looking at the output after triggering a shutdown on a node.
After the node was terminated, it still showed the status Ready in the kubectl get po,no output, which indicates it was still considered schedulable in k8s even though it actually wasn't (the node was already terminated).
After 1 or 2 minutes, it went NotReady and finally disappeared from the output.
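A minimal way to reproduce that observation, assuming kubectl is already pointed at the cluster:

```
# Trigger a shutdown on the node, then watch how long the terminated node
# keeps reporting Ready before flipping to NotReady and disappearing.
kubectl get po,no
kubectl get nodes --watch
```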

nodeDrainer: NodeDrainer{
Enabled: false,
KubectlImage: "mumoshu/kubectl:v1.2.2",
KubectlCommand: "kubectl",
Contributor


When the default KubectlImage and KubectlCommand values change once the /hyperkube kubectl issue is fixed, we'll have to edit tests like this one to reflect the changes to the defaults.

I would like to see constants defaultHyperkubeImage and defaultKubectlCommand:

  • defined in config.go
  • applied as defaults to the Cluster struct here
  • used in the config tests like this which infer the default value

Contributor Author


Sounds nice! Just added 3171bd6 for it 👍

@colhom
Contributor

colhom commented May 10, 2016

@mumoshu I talked with @aaronlevy and we have decided to take a stab at the upstream hyperkube issue (kubernetes/kubernetes#24088) that's blocking this PR. Give us a day or two to look into this.

@philips
Contributor

philips commented May 14, 2016

There is also this issue which takes a different approach for CoreOS autoupdates. They are somewhat orthogonal but highly related: coreos/bugs#1274

@colhom
Contributor

colhom commented May 25, 2016

kubernetes/kubernetes#25112 has been merged. @aaronlevy could we cherry-pick this into the next hyperkube image?

@aaronlevy
Contributor

@mumoshu what you describe is more or less the expected behavior. The actual window for marking the node as unhealthy / pod eviction can be set via flags on the controller-manager (e.g. PodEvictionTimeout, NodeMonitorGracePeriod, NodeStartupGracePeriod, NodeMonitorPeriod)

@colhom are you sure that's the right issue?
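For reference, the settings mentioned above map to kube-controller-manager command-line flags. The values below are, as far as I know, the upstream defaults of that era and are shown purely for illustration; the master URL is a placeholder.

```
# How quickly an un-drained, dead node is noticed and its pods evicted is
# governed by these controller-manager flags (values shown are the defaults):
kube-controller-manager \
  --master=http://127.0.0.1:8080 \
  --node-monitor-period=5s \
  --node-monitor-grace-period=40s \
  --node-startup-grace-period=1m0s \
  --pod-eviction-timeout=5m0s
```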

@colhom
Contributor

colhom commented May 26, 2016

Oops, I meant kubernetes/kubernetes#25512.

@mumoshu we're going to wait until v1.3 (with that PR in it) is released before going forward with this PR.

@mumoshu
Contributor Author

mumoshu commented Jul 15, 2016

@colhom I'm happy to work on it now as you've merged the 1.3-updates branch into master.

@mumoshu
Contributor Author

mumoshu commented Jul 15, 2016

@aaronlevy Thanks for confirming!

Yes, it might be an expected behavior.
The point is that I want to avoid generating errors/delays whenever possible.

As it's been quite a long time since we talked, let me recap for clarification.

I believe a node shut down by (1) CoreOS's auto-update process, (2) an ASG scale-in, (3) manual maintenance, (4) an EC2 failure, etc. will result in some downtime/fewer functional pods until Kubernetes marks the node not-ready/unschedulable. The time until not-ready/unschedulable seems to be configurable via the flags you've mentioned.

What I wanted to do here is mark the node before Kubernetes does, to avoid errors/delays in this case entirely, so that Kubernetes can react to the decreased pod count immediately.

Also, even with the changes in this PR, in some cases I believe you'll need additional steps to actually take a node out of service without returning errors to customers, e.g. when the node is running pods or services that expose your web service through node/host ports for external load balancing. I guess the whole procedure would be as follows:

  • Remove the node from the ELB (done by k8s? or out of band, depending on your setup. If you've set up your ASG to manage the ELB and the node is going down as part of auto-scaling, it's done by the ASG implicitly. But in other cases?)
  • Then kubectl drain the node
  • Do some maintenance work, including a shut-down.
  • If the node comes back up, add it to the ELB (done by k8s if you've made k8s manage it via services of type: LoadBalancer? I haven't tried this case yet)

This is not addressed in this PR, and I don't even know how to do the whole thing cleanly enough.
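A rough sketch of that manual procedure with a classic ELB might look like the following. The load balancer name, instance ID, and node name are placeholders, and whether the ELB steps are needed at all depends on how the ELB is managed, as noted above.

```
# 1. Stop new traffic: take the instance out of the (classic) ELB.
aws elb deregister-instances-from-load-balancer \
  --load-balancer-name my-web-elb --instances i-0123456789abcdef0

# 2. Drain the node so Kubernetes reschedules its pods elsewhere.
kubectl drain ip-10-0-0-1.example.internal --force --ignore-daemonsets

# 3. Do the maintenance work here (including a shutdown/reboot).

# 4. If the node comes back, make it schedulable again and re-register it.
kubectl uncordon ip-10-0-0-1.example.internal
aws elb register-instances-with-load-balancer \
  --load-balancer-name my-web-elb --instances i-0123456789abcdef0
```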

@colhom colhom mentioned this pull request Aug 9, 2016:
Does a rolling replacement update on the controller ASG, followed by the workers. Punts on upgrading the etcd cluster; simply makes sure resource definitions don't change after create.
@colhom
Contributor

colhom commented Aug 9, 2016

@mumoshu any interest in resurrecting this from the dead?

hyperkube kubectl is fully functional and we now have cluster upgrades coming down the pipeline in #608. Wouldn't be a graceful upgrade without some node draining!

@mumoshu
Contributor Author

mumoshu commented Aug 10, 2016

@colhom Definitely yes! Can I just rebase this branch onto the latest master (with of course some manual testing) or would you like colhom:cluster-upgrades?

@colhom
Contributor

colhom commented Aug 10, 2016

@mumoshu could you rebase on colhom:cluster-upgrades and list #608 as a dependency?

@mumoshu
Contributor Author

mumoshu commented Aug 10, 2016

@colhom sure 👍

mumoshu and others added 2 commits August 11, 2016 17:38
…time to gracefully stop.

This change achieves that by running `docker run IMAGE kubectl drain THE_NODE --force` on the to-be-shut-down node before the kubelet receives SIGTERM during CoreOS's shutdown process.

Without this change, a kubelet receiving SIGTERM without a prior `drain` leaves non-functional (actually unschedulable) pods with the status `Ready`.

With this change, when an ASG's desired capacity is decreased, a node's status changes over time as follows:

On desired cap change:
  STATUS=Ready
On shut-down started:
  STATUS=Ready,SchedulingDisabled (<- Pods are stopped and status is changed by `kubectl drain`)
On shut-down finished:
  STATUS=NotReady,SchedulingDisabled (<- It's `NotReady`, but this doesn't cause downtime because both further scheduling and the pods have already been stopped)
After a minute:
  The node disappears from the output of `kubectl get nodes`

Note that:

* This applies to manual shutdowns (for example, via running `sudo systemctl shutdown`) and automated shutdowns (triggered by AWS AutoScaling when nodes are rotated out of a group)
* We currently depend on the community docker image `mumoshu/kubectl` because the `kubectl` included in the official `coreos/hyperkube` image doesn't work due to the Kubernetes issue kubernetes/kubernetes#24088. Once the issue is fixed and the CoreOS team publishes a new hyperkube image with the updated Kubernetes, we can remove that dependency.
* The author considers this an experimental feature, so you shouldn't expect the configuration API around it to be stable; it may change in the future.

Re-format code introduced in the previous commit with gofmt to conform to the rules and make the build pass
…Set-managed pods

Before this change, you could reproduce the failure by running `sudo systemctl stop kube-node-drainer`, which showed an error message indicating this issue, like:

```
May 09 05:04:08 ip-10-0-0-234.ap-northeast-1.compute.internal sh[15376]: error: DaemonSet-managed pods: cassandra-e3tdb (use --ignore-daemonsets to ignore)
May 09 05:04:08 ip-10-0-0-234.ap-northeast-1.compute.internal systemd[1]: kube-node-drainer.service: Control process exited, code=exited status=1
```

After this change, you can drain a node even if it is running DaemonSet-managed pods:

```
May 09 10:41:08 ip-10-0-0-202.ap-northeast-1.compute.internal sh[7671]: node "ip-10-0-0-202.ap-northeast-1.compute.internal" cordoned
May 09 10:41:08 ip-10-0-0-202.ap-northeast-1.compute.internal sh[7671]: WARNING: Skipping DaemonSet-managed pods: cassandra-jzfqo
May 09 10:41:08 ip-10-0-0-202.ap-northeast-1.compute.internal sh[7671]: node "ip-10-0-0-202.ap-northeast-1.compute.internal" drained
```
@mumoshu
Contributor Author

mumoshu commented Aug 11, 2016

@colhom Rebased and fixed a failing test!

@felixbuenemann
Contributor

Does this really depend on #608? I think this would be a great feature to have with spot instances, which can go away due to price fluctuations, but prices usually don't fluctuate in all AZs at once, so there are usually healthy nodes on other AZs to quickly take over the load.

@aaronlevy
Contributor

kube-aws development has moved to its own top-level repository. If this is still something you would like to be merged, please re-open this PR under the new repository at: https://github.com/coreos/kube-aws

@aaronlevy aaronlevy closed this Nov 15, 2016