
kube-aws: Drain nodes before shutting them down #465

Closed
wants to merge 13 commits

Conversation

mumoshu
Contributor

@mumoshu mumoshu commented May 5, 2016

Drain nodes before shutting them down to give running pods time to gracefully stop.

This change achieves that by running docker run IMAGE kubectl drain THE_NODE --force on the to-be-shut-down node before the kubelet receives SIGTERM during CoreOS's shutdown process.

Depends #608
Refs #340

Without this change, a kubelet receiving SIGTERM without a prior drain leaves non-functional (actually unschedulable) pods with the status Ready.

With this change, when an ASG's desired capacity is decreased, a node's status changes over time as follows:

On desired cap change:
=> STATUS=Ready

On shut-down started:
=> STATUS=Ready,SchedulingDisabled (<- Pods are stopped and status is changed by kubectl drain)

On shut-down finished:
=> STATUS=NotReady,SchedulingDisabled (<- It's NotReady, but this doesn't cause downtime because both further scheduling and the pods have already been stopped)

After a minute:
=> The node disappears from the output of kubectl get nodes

Note that:

  • This applies to manual shutdowns (for example, via running sudo systemctl shutdown) and automated shutdowns (triggered by AWS AutoScaling when nodes are rotated out of a group)
  • We currently depend on the community docker image mumoshu/kubectl because the kubectl included in the official coreos/hyperkube image doesn't work due to the Kubernetes issue "hyperkube + kubectl - unknown flag" (kubernetes/kubernetes#24088). Once the issue is fixed and the CoreOS team publishes a new hyperkube image with the updated Kubernetes, we can remove that dependency.
  • I consider this an experimental feature, so you shouldn't expect the configuration API around it to be stable; it may change in the future.
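For illustration, here is a minimal, hand-written sketch of the pattern described above: a oneshot systemd unit ordered after kubelet.service whose ExecStop drains the node, so systemd runs the drain before stopping the kubelet at shutdown. The unit name, image tag, kubeconfig path, and drain flags are taken from elsewhere in this PR; the --net=host/volume-mount wiring and the enable/start commands are assumptions, not necessarily what the PR's cloud-config actually generates.

```
# Sketch only: install a drainer unit by hand (kube-aws generates this via
# cloud-config instead). Assumes docker, kubelet.service, and the worker
# kubeconfig already exist on the node.
sudo tee /etc/systemd/system/kube-node-drainer.service <<'EOF'
[Unit]
Description=drain this k8s node to give running pods time to gracefully shut down before stopping kubelet
Requires=docker.service kubelet.service
After=docker.service kubelet.service

[Service]
Type=oneshot
RemainAfterExit=true
# Started (as a no-op) at boot so that ExecStop= runs during shutdown.
ExecStart=/bin/true
# systemd stops units in reverse start order, so this runs while the kubelet
# is still alive and the API server still sees the node.
# ($$ is systemd's escape for a literal $, so sh receives $(hostname).)
ExecStop=/bin/sh -c 'docker run --rm --net=host -v /etc/kubernetes:/etc/kubernetes mumoshu/kubectl:v1.2.2 kubectl --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml drain $$(hostname) --force --ignore-daemonsets'

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable kube-node-drainer.service
sudo systemctl start kube-node-drainer.service
```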

@colhom
Contributor

colhom commented May 5, 2016

@mumoshu this looks awesome, very exciting!

I don't know that we need the experimental bit. I would say that this behavior is "universally desirable", so we may just want to run it through conformance tests and make node-draining on shutdown non-configurable. If for some odd reason (testing, etc.) somebody wants to disable this, that should be done at a lower level (disabling the service in cloud-config, etc.).

\cc @aaronlevy @cgag

--kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml \
drain $$(hostname) \
--force'

Contributor


Does this also need --ignore-daemonsets=true? The man page implies to me that you'd need both if you had a daemonset and an unmanaged pod, even though the name "force" makes me think "force" would be enough.
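For reference, the combination being suggested would look like the following: --force only covers pods that aren't managed by any controller, while --ignore-daemonsets is what lets the drain proceed past DaemonSet-managed pods.

```
# Cordon the node and evict/delete its pods, skipping DaemonSet-managed ones
# and forcing removal of pods not managed by a controller.
kubectl drain "$(hostname)" --force --ignore-daemonsets
```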

@aaronlevy
Contributor

@mumoshu can you provide a little more detail regarding:

Without this change, a kubelet receiving SIGTERM without a prior drain leaves non-functional (actually unschedulable) pods with the status Ready

enable: true
content: |
[Unit]
Description=drain this k8s node to give running pods time to gracefully shut down before stopping kubelet
Contributor


I'm not seeing how this unit keeps the kubelet running until drained.

Contributor

@cgag cgag May 5, 2016


Looks like systemd shuts down services in the reverse of the boot order [1], so After=kubelet.service should mean this unit is shut down before the kubelet.

[1] https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Before=
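If anyone wants to double-check that ordering on a running node, the dependency properties can be inspected directly (the unit name here is the one that appears elsewhere in this PR):

```
# A unit ordered After=kubelet.service starts after the kubelet and is
# therefore stopped before it during shutdown; inspect the ordering with:
systemctl show kube-node-drainer.service -p After -p Before
```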

Contributor


ah, thanks @cgag

@colhom
Contributor

colhom commented May 6, 2016

To recap in-person discussions between @aaronlevy, @philips and myself:

Summary: we will leave this node-draining behavior gated on experimental for now, as @mumoshu has it in this PR, defaulting to false.

Node draining is certainly a good thing to do if you're taking a node out of commission forever. If you're just rebooting the machine, it's not clear whether draining it is desirable. Factors in this decision include the number of pods on that node, how long it takes to re-schedule them, and whether or not those pods have any affinity for resources on that host. (An example discussed was a local cache implemented via a host mount.)

In the future:

  • Taints and tolerations could provide a means to express which pods/nodes will "be drained on shutdown", and which ones won't.
  • Integrate node-draining with events more specific than "we're shutting down." The most salient additional detail there is "is this node coming back soon, or is it gone?"

@mumoshu: could we get a rebase on this PR please ;)

@mumoshu mumoshu force-pushed the drain-node-before-shutdown branch from 9eb9197 to eb27164 on May 8, 2016 04:21
@mumoshu
Contributor Author

mumoshu commented May 8, 2016

@colhom @aaronlevy @cgag @philips Thanks for your review 🙇
I have just rebased this.

Regarding the experimental bit, I agree with leaving it as is.

The context is that I believe my tests around this functionality don't cover edge cases I can't even imagine, hence experimental. As I'm planning to deploy CoreOS+K8s to my production environment, I wish I could end up with a battle-tested solution we could mark non-experimental with some confidence. Googling the whole internet and consulting myself hasn't revealed anything like that for me, though 😃

The example you mentioned of a local cache implemented via a host mount is very insightful for me. I personally think of it as one of those (not even edge?) cases.

Also, the integration based on taints-and-tolerations and more fine-grained events seems very interesting/promising to me 👍

@mumoshu
Contributor Author

mumoshu commented May 9, 2016

@cgag Thanks for the good catch #465 (comment)!

I have confirmed that it was actually failing when draining a node running DaemonSet-managed pods and fixed it in ae43f64

@mumoshu
Contributor Author

mumoshu commented May 9, 2016

#465 (comment)

@aaronlevy Sure. I had hoped a kubelet would stop gracefully when CoreOS tries to stop it during the shutdown process (by sending SIGTERM to it via systemd, via docker), so I tested it. My test showed that the kubelet does not stop gracefully while CoreOS is shutting down.

I confirmed it by repeatedly running kubectl get po,no and looking at the output after triggering a shutdown on a node.
After the node was terminated, it still showed the status Ready in the kubectl get po,no output, which indicates it was still considered schedulable in k8s even though it actually wasn't (the node was already terminated).
After 1 or 2 minutes, it went NotReady and finally disappeared from the output.
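A minimal way to reproduce that observation, assuming kubectl is already pointed at the cluster:

```
# Trigger a shutdown on the node, then watch how long the terminated node
# keeps reporting Ready before flipping to NotReady and disappearing.
kubectl get po,no
kubectl get nodes --watch
```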

nodeDrainer: NodeDrainer{
Enabled: false,
KubectlImage: "mumoshu/kubectl:v1.2.2",
KubectlCommand: "kubectl",
Contributor


When the default KubectlImage and KubectlCommand values change once the /hyperkube kubectl issue is fixed, we'll have to edit tests like this one to reflect the changes to the defaults.

I would like to see constants defaultHyperkubeImage and defaultKubectlCommand:

  • defined in config.go
  • applied as defaults to the Cluster struct here
  • used in the config tests like this which infer the default value

Contributor Author


Sounds nice! Just added 3171bd6 for it 👍

@colhom
Contributor

colhom commented May 10, 2016

@mumoshu I talked with @aaronlevy and we have decided to take a stab at the upstream hyperkube issue (kubernetes/kubernetes#24088) that's blocking this PR. Give us a day or two to look into this.

@philips
Contributor

philips commented May 14, 2016

There is also this issue which takes a different approach for CoreOS autoupdates. They are somewhat orthogonal but highly related: coreos/bugs#1274

@colhom
Contributor

colhom commented May 25, 2016

kubernetes/kubernetes#25112 has been merged. @aaronlevy could we cherry-pick this into the next hyperkube image?

@aaronlevy
Contributor

@mumoshu what you describe is more or less the expected behavior. The actual window for marking the node as unhealthy / pod eviction can be set via flags on the controller-manager (e.g. PodEvictionTimeout, NodeMonitorGracePeriod, NodeStartupGracePeriod, NodeMonitorPeriod)

@colhom are you sure that's the right issue?
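For reference, the settings mentioned above map to kube-controller-manager command-line flags. The values below are, as far as I know, the upstream defaults of that era and are shown purely for illustration; the master URL is a placeholder.

```
# How quickly an un-drained, dead node is noticed and its pods evicted is
# governed by these controller-manager flags (values shown are the defaults):
kube-controller-manager \
  --master=http://127.0.0.1:8080 \
  --node-monitor-period=5s \
  --node-monitor-grace-period=40s \
  --node-startup-grace-period=1m0s \
  --pod-eviction-timeout=5m0s
```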

@colhom
Contributor

colhom commented May 26, 2016

Oops, I meant kubernetes/kubernetes#25512.

@mumoshu we're going to wait until v1.3 (with that PR in it) is released before going forward with this PR.

@mumoshu
Contributor Author

mumoshu commented Jul 15, 2016

@colhom I'm happy to work on it now as you've merged the 1.3-updates branch into master.

@mumoshu
Contributor Author

mumoshu commented Jul 15, 2016

@aaronlevy Thanks for confirming!

Yes, it might be an expected behavior.
The point is that I want to avoid generating errors/delays whenever possible.

As it's been quite a long time since we talked, let me recap for clarification.

I believe a node shut down by (1) CoreOS's auto-update process, (2) an ASG scale-in, (3) manual maintenance, (4) an EC2 failure, etc. will result in some downtime/fewer functional pods until Kubernetes marks the node not-ready/unschedulable. The time until not-ready/unschedulable seems to be configurable via the flags you've mentioned.

What I wanted to do here is mark the node before Kubernetes does, to avoid errors/delays in this case entirely, so that Kubernetes can react to the decreased pod count immediately.

Also, even with the changes in this PR, in some cases I believe you'll need additional steps to actually take a node out of service without returning errors to customers, e.g. when the node is running pods or services that expose your web service through node/host ports for external load balancing. I guess the whole procedure would be as follows:

  • Remove the node from the ELB (done by k8s? or out of band, depending on your setup. If you've set up your ASG to manage the ELB and the node is going down as part of auto-scaling, it's done by the ASG implicitly. But in other cases?)
  • Then kubectl drain the node
  • Do some maintenance work, including a shut-down.
  • If the node comes back up, add it to the ELB (done by k8s if you've made k8s manage it via services of type: LoadBalancer? I haven't tried this case yet)

This is not addressed in this PR, and I don't even know how to do the whole thing cleanly enough.
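A rough sketch of that manual procedure with a classic ELB might look like the following. The load balancer name, instance ID, and node name are placeholders, and whether the ELB steps are needed at all depends on how the ELB is managed, as noted above.

```
# 1. Stop new traffic: take the instance out of the (classic) ELB.
aws elb deregister-instances-from-load-balancer \
  --load-balancer-name my-web-elb --instances i-0123456789abcdef0

# 2. Drain the node so Kubernetes reschedules its pods elsewhere.
kubectl drain ip-10-0-0-1.example.internal --force --ignore-daemonsets

# 3. Do the maintenance work here (including a shutdown/reboot).

# 4. If the node comes back, make it schedulable again and re-register it.
kubectl uncordon ip-10-0-0-1.example.internal
aws elb register-instances-with-load-balancer \
  --load-balancer-name my-web-elb --instances i-0123456789abcdef0
```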

@colhom colhom mentioned this pull request Aug 9, 2016:
Does a rolling replacement update on the controller ASG, followed by the workers. Punts on upgrading the etcd cluster; simply makes sure resource definitions don't change after create.
@colhom
Contributor

colhom commented Aug 9, 2016

@mumoshu any interest in resurrecting this from the dead?

hyperkube kubectl is fully functional and we now have cluster upgrades coming down the pipeline in #608. Wouldn't be a graceful upgrade without some node draining!

@mumoshu
Contributor Author

mumoshu commented Aug 10, 2016

@colhom Definitely yes! Can I just rebase this branch onto the latest master (with of course some manual testing) or would you like colhom:cluster-upgrades?

@colhom
Contributor

colhom commented Aug 10, 2016

@mumoshu could you rebase on colhom:cluster-upgrades and list #608 as a dependency?

@mumoshu
Contributor Author

mumoshu commented Aug 10, 2016

@colhom sure 👍

mumoshu and others added 2 commits August 11, 2016 17:38
…time to gracefully stop.

This change achieves that by running `docker run IMAGE kubectl drain THE_NODE --force` on the to-be-shut-down node before the kubelet receives SIGTERM during CoreOS's shutdown process.

Without this change, a kubelet receiving SIGTERM without a prior `drain` leaves non-functional (actually unschedulable) pods with the status `Ready`.

With this change, when an ASG's desired capacity is decreased, a node's status changes over time as follows:

On desired cap change:
  STATUS=Ready
On shut-down started:
  STATUS=Ready,SchedulingDisabled (<- Pods are stopped and status is changed by `kubectl drain`)
On shut-down finished:
  STATUS=NotReady,SchedulingDisabled (<- It's `NotReady`, but this doesn't cause downtime because both further scheduling and the pods have already been stopped)
After a minute:
  The node disappears from the output of `kubectl get nodes`

Note that:

* This applies to manual shutdowns (for example, via running `sudo systemctl shutdown`) and automated shutdowns (triggered by AWS AutoScaling when nodes are rotated out of a group)
* We currently depend on the community docker image `mumoshu/kubectl` because the `kubectl` included in the official `coreos/hyperkube` image doesn't work due to the Kubernetes issue kubernetes/kubernetes#24088. Once the issue is fixed and the CoreOS team publishes a new hyperkube image with the updated Kubernetes, we can remove that dependency.
* The author considers this an experimental feature, so you shouldn't expect the configuration API around it to be stable; it may change in the future.

Re-format code introduced in the previous commit with gofmt to conform to the rules and make the build pass
…Set-managed pods

Before this change, you could reproduce the failure by running `sudo systemctl stop kube-node-drainer`, which showed an error message indicating this issue, like:

```
May 09 05:04:08 ip-10-0-0-234.ap-northeast-1.compute.internal sh[15376]: error: DaemonSet-managed pods: cassandra-e3tdb (use --ignore-daemonsets to ignore)
May 09 05:04:08 ip-10-0-0-234.ap-northeast-1.compute.internal systemd[1]: kube-node-drainer.service: Control process exited, code=exited status=1
```

After this change, you can drain a node even if it is running DaemonSet-managed pods:

```
May 09 10:41:08 ip-10-0-0-202.ap-northeast-1.compute.internal sh[7671]: node "ip-10-0-0-202.ap-northeast-1.compute.internal" cordoned
May 09 10:41:08 ip-10-0-0-202.ap-northeast-1.compute.internal sh[7671]: WARNING: Skipping DaemonSet-managed pods: cassandra-jzfqo
May 09 10:41:08 ip-10-0-0-202.ap-northeast-1.compute.internal sh[7671]: node "ip-10-0-0-202.ap-northeast-1.compute.internal" drained
```
@mumoshu
Contributor Author

mumoshu commented Aug 11, 2016

@colhom Rebased and fixed a failing test!

@felixbuenemann
Contributor

Does this really depend on #608? I think this would be a great feature to have with spot instances, which can go away due to price fluctuations, but prices usually don't fluctuate in all AZs at once, so there are usually healthy nodes on other AZs to quickly take over the load.

@aaronlevy
Contributor

kube-aws development has moved to its own top-level repository. If this is still something you would like to be merged, please re-open this PR under the new repository at: https://github.com/coreos/kube-aws

@aaronlevy aaronlevy closed this Nov 15, 2016