diff --git a/docs/user-guide/node-selection/index.md b/docs/user-guide/node-selection/index.md
index 165ee0c9596b7..22db183ae41bb 100644
--- a/docs/user-guide/node-selection/index.md
+++ b/docs/user-guide/node-selection/index.md
@@ -198,3 +198,215 @@ must be satisfied for the pod to schedule onto a node.
 
 For more information on inter-pod affinity/anti-affinity, see the design doc
 [here](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/podaffinity.md).
+
+## Taints and tolerations (beta feature)
+
+### Specifying taints and tolerations
+
+Node affinity, described earlier, allows a pod to "be attracted to" a set of nodes
+(either as a preference or a hard requirement). Taints are the opposite -- they allow
+a *node* to *repel* a set of pods.
+
+More concretely: *taints* go on nodes, and *tolerations* go on pods. A taint on a node
+means "repel all pods that do not tolerate the taint."
+
+You add a taint to a node using [kubectl taint](https://kubernetes.io/docs/user-guide/kubectl/kubectl_taint/).
+For example,
+
+```shell
+kubectl taint nodes node1 key=value:NoSchedule
+```
+
+places a taint on node `node1`. The taint has key `key`, value `value`, and taint effect `NoSchedule`.
+This means that no pod will be able to schedule onto `node1` unless it has a matching toleration.
+The toleration is specified in the PodSpec. Both of the following tolerations "match" the
+taint created by the `kubectl taint` line above, and thus a pod with either toleration would be able
+to schedule onto `node1`:
+
+```yaml
+tolerations:
+- key: "key"
+  operator: "Equal"
+  value: "value"
+  effect: "NoSchedule"
+```
+
+```yaml
+tolerations:
+- key: "key"
+  operator: "Exists"
+  effect: "NoSchedule"
+```
+
+A toleration "matches" a taint if the `key`s are the same and the `effect`s are the same, and:
+
+* the `operator` is `Exists` (in which case no `value` should be specified), or
+* the `operator` is `Equal` and the `value`s are equal
+
+(`Operator` defaults to `Equal` if not specified.)
+As a special case, an empty `key` with operator `Exists` matches all keys and all values.
+Also as a special case, an empty `effect` matches all effects.
+
+The above example used `effect` of `NoSchedule`. Alternatively, you can use `effect` of `PreferNoSchedule`.
+This is a "preference" or "soft" version of `NoSchedule` -- the system will *try* to avoid placing a
+pod that does not tolerate the taint on the node, but it is not required. The third kind of `effect` is
+`NoExecute`, described later.
+
+You can put multiple taints on the same node and multiple tolerations on the same pod.
+The way Kubernetes processes multiple taints and tolerations is like a filter: start
+with all of a node's taints, then ignore the ones for which the pod has a matching toleration; the
+remaining un-ignored taints have the indicated effects on the pod. In particular,
+
+* if there is at least one un-ignored taint with effect `NoSchedule` then Kubernetes will not schedule
+the pod onto that node
+* if there is no un-ignored taint with effect `NoSchedule` but there is at least one un-ignored taint with
+effect `PreferNoSchedule` then Kubernetes will *try* to not schedule the pod onto the node
+* if there is at least one un-ignored taint with effect `NoExecute` then the pod will be evicted from
+the node (if it is already running on the node), and will not be
+scheduled onto the node (if it is not yet running on the node).
+
+For example, imagine you taint a node like this
+
+```shell
+kubectl taint nodes node1 key1=value1:NoSchedule
+kubectl taint nodes node1 key1=value1:NoExecute
+kubectl taint nodes node1 key2=value2:NoSchedule
+```
+
+And a pod has two tolerations:
+
+```yaml
+tolerations:
+- key: "key1"
+  operator: "Equal"
+  value: "value1"
+  effect: "NoSchedule"
+- key: "key1"
+  operator: "Equal"
+  value: "value1"
+  effect: "NoExecute"
+```
+
+In this case, the pod will not be able to schedule onto the node, because there is no
+toleration matching the third taint. But it will be able to continue running if it is
+already running on the node when the taint is added, because the third taint is the only
+one of the three that is not tolerated by the pod.
+
+Normally, if a taint with effect `NoExecute` is added to a node, then any pods that do
+not tolerate the taint will be evicted immediately, and any pods that do tolerate the
+taint will never be evicted. However, a toleration with `NoExecute` effect can specify
+an optional `tolerationSeconds` field that dictates how long the pod will stay bound
+to the node after the taint is added. For example,
+
+```yaml
+tolerations:
+- key: "key1"
+  operator: "Equal"
+  value: "value1"
+  effect: "NoExecute"
+  tolerationSeconds: 3600
+```
+
+means that if this pod is running and a matching taint is added to the node, then
+the pod will stay bound to the node for 3600 seconds, and then be evicted. If the
+taint is removed before that time, the pod will not be evicted.
+
+### Example use cases
+
+Taints and tolerations are a flexible way to steer pods away from nodes or evict
+pods that shouldn't be running. A few of the use cases are
+
+* **dedicated nodes**: If you want to dedicate a set of nodes for exclusive use by
+a particular set of users, you can add a taint to those nodes (say,
+`kubectl taint nodes nodename dedicated=groupName:NoSchedule`) and then add a corresponding
+toleration to their pods (this would be done most easily by writing a custom
+[admission controller](https://kubernetes.io/docs/admin/admission-controllers/)).
+The pods with the tolerations will then be allowed to use the tainted (dedicated) nodes as
+well as any other nodes in the cluster. If you want to dedicate the nodes to them *and*
+ensure they *only* use the dedicated nodes, then you should additionally add a label similar
+to the taint to the same set of nodes (e.g. `dedicated=groupName`), and the admission
+controller should additionally add a node affinity to require that the pods can only schedule
+onto nodes labeled with `dedicated=groupName` (see the sketch after this list).
+
+* **nodes with special hardware**: In a cluster where a small subset of nodes have specialized
+hardware (for example GPUs), it is desirable to keep pods that don't need the specialized
+hardware off of those nodes, thus leaving room for later-arriving pods that do need the
+specialized hardware. This can be done by tainting the nodes that have the specialized
+hardware (e.g. `kubectl taint nodes nodename special=true:NoSchedule` or
+`kubectl taint nodes nodename special=true:PreferNoSchedule`) and adding a corresponding
+toleration to pods that use the special hardware. As in the dedicated nodes use case,
+it is probably easiest to apply the tolerations using a custom
+[admission controller](https://kubernetes.io/docs/admin/admission-controllers/).
+For example, the admission controller could use
+some characteristic(s) of the pod to determine that the pod should be allowed to use
+the special nodes, and hence add the toleration.
+To ensure that the pods that need
+the special hardware *only* schedule onto the nodes that have the special hardware, you will need some
+additional mechanism, e.g. you could represent the special resource using
+[opaque integer resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)
+and request it as a resource in the PodSpec, or you could label the nodes that have
+the special hardware and use node affinity on the pods that need the hardware.
+
+* **per-pod-configurable eviction behavior when there are node problems (alpha feature)**,
+which is described in the next section.
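+
+For the dedicated nodes use case above, the toleration plus node affinity that the admission
+controller would inject might look roughly like the following in a pod spec. This is an illustrative
+sketch only -- the `dedicated=groupName` taint and label come from the example above, while the pod
+and container names are placeholders:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: dedicated-group-pod     # placeholder name
+spec:
+  containers:
+  - name: app                   # placeholder container
+    image: nginx
+  # Tolerate the taint that repels other pods from the dedicated nodes.
+  tolerations:
+  - key: "dedicated"
+    operator: "Equal"
+    value: "groupName"
+    effect: "NoSchedule"
+  # Require the matching label so this pod can *only* land on the dedicated nodes.
+  affinity:
+    nodeAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:
+        nodeSelectorTerms:
+        - matchExpressions:
+          - key: "dedicated"
+            operator: "In"
+            values: ["groupName"]
+```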
+
+### Per-pod-configurable eviction behavior when there are node problems (alpha feature)
+
+Earlier we mentioned the `NoExecute` taint effect, which affects pods that are already
+running on the node as follows:
+
+ * pods that do not tolerate the taint are evicted immediately
+ * pods that tolerate the taint without specifying `tolerationSeconds` in
+   their toleration specification remain bound forever
+ * pods that tolerate the taint with a specified `tolerationSeconds` remain
+   bound for the specified amount of time
+
+The above behavior is a beta feature. In addition, Kubernetes 1.6 has alpha
+support for representing node problems (currently only "node unreachable" and
+"node not ready", corresponding to the NodeCondition "Ready" being "Unknown" or
+"False" respectively) as taints. When the `TaintBasedEvictions` alpha feature
+is enabled (you can do this by including `TaintBasedEvictions=true` in `--feature-gates`, such as
+`--feature-gates=FooBar=true,TaintBasedEvictions=true`), the taints are automatically
+added by the NodeController and the normal logic for evicting pods from nodes
+based on the Ready NodeCondition is disabled.
+(Note: To maintain the existing [rate limiting](https://kubernetes.io/docs/admin/node/#node-controller)
+behavior of pod evictions due to node problems, the system actually adds the taints
+in a rate-limited way. This prevents massive pod evictions in scenarios such
+as the master becoming partitioned from the nodes.)
+This alpha feature, in combination with `tolerationSeconds`, allows a pod
+to specify how long it should stay bound to a node that has one or both of these problems.
+
+For example, an application with a lot of local state might want to stay
+bound to the node for a long time in the event of a network partition, in the hope
+that the partition will recover and thus the pod eviction can be avoided.
+The toleration the pod would use in that case would look like
+
+```yaml
+tolerations:
+- key: "node.alpha.kubernetes.io/unreachable"
+  operator: "Exists"
+  effect: "NoExecute"
+  tolerationSeconds: 6000
+```
+
+(For the node not ready case, change the key to `node.alpha.kubernetes.io/notReady`.)
+
+Note that Kubernetes automatically adds a toleration for
+`node.alpha.kubernetes.io/notReady` with `tolerationSeconds=300`
+unless the pod configuration provided
+by the user already has a toleration for `node.alpha.kubernetes.io/notReady`.
+Likewise it adds a toleration for
+`node.alpha.kubernetes.io/unreachable` with `tolerationSeconds=300`
+unless the pod configuration provided
+by the user already has a toleration for `node.alpha.kubernetes.io/unreachable`.
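+
+Taken together, these defaults mean that a pod which specifies no such tolerations behaves as if it
+carried tolerations along the lines of the following sketch (assuming the same `Exists`/`NoExecute`
+form as the example above; this is not literal generated output):
+
+```yaml
+tolerations:
+- key: "node.alpha.kubernetes.io/notReady"
+  operator: "Exists"
+  effect: "NoExecute"
+  tolerationSeconds: 300
+- key: "node.alpha.kubernetes.io/unreachable"
+  operator: "Exists"
+  effect: "NoExecute"
+  tolerationSeconds: 300
+```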
+
+These automatically-added tolerations preserve the default behavior of pods
+remaining bound for 5 minutes after one of these problems is detected.
+The two default tolerations are added by the [DefaultTolerationSeconds
+admission controller](https://github.com/kubernetes/kubernetes/tree/master/plugin/pkg/admission/defaulttolerationseconds).
+
+[DaemonSet](https://kubernetes.io/docs/admin/daemons/) pods are created with
+`NoExecute` tolerations for `node.alpha.kubernetes.io/unreachable` and `node.alpha.kubernetes.io/notReady`
+with no `tolerationSeconds`. This ensures that DaemonSet pods are never evicted due
+to these problems, which matches the behavior when this feature is disabled.
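+
+To see whether a node currently has any taints, or to remove a taint you added while experimenting,
+commands along these lines can be used (assuming the `node1` and `key=value:NoSchedule` taint from the
+first example; `kubectl describe` lists taints in its `Taints` field):
+
+```shell
+# Inspect the node's taints.
+kubectl describe node node1
+
+# Remove the taint with key "key" and effect "NoSchedule" (note the trailing "-").
+kubectl taint nodes node1 key:NoSchedule-
+```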