Although the Kubernetes scheduler(s) try to make good placement decisions for pods, conditions in the cluster change over time (e.g. jobs finish and new pods arrive, nodes are removed due to failures or planned maintenance or auto-scaling down, nodes appear due to recovery after a failure or re-joining after maintenance or auto-scaling up or adding new hardware to a bare-metal cluster), and schedulers are not omniscient (e.g. there are some interactions between pods, or between pods and nodes, that they cannot predict). As a result, the initial node selected for a pod may turn out to be a bad match, from the perspective of the pod and/or the cluster as a whole, at some point after the pod has started running.
Today (Kubernetes version 1.2) once a pod is scheduled to a node, it never moves unless it terminates on its own, is deleted by the user, or experiences some unplanned event (e.g. the node where it is running dies). Thus in a cluster with long-running pods, the assignment of pods to nodes degrades over time, no matter how good an initial scheduling decision the scheduler makes. This observation motivates "controlled rescheduling," a mechanism by which Kubernetes will "move" already-running pods over time to improve their placement. Controlled rescheduling is the subject of this proposal.
Note that the term "move" is not technically accurate -- the mechanism used is that Kubernetes will terminate a pod that is managed by a controller, and the controller will create a replacement pod that is then scheduled by the pod's scheduler. The terminated pod and replacement pod are completely separate pods, and no pod migration is implied. However, describing the process as "moving" the pod is approximately accurate and easier to understand, so we will use this terminology in the document.
We use the term "rescheduling" to describe any action the system takes to move an already-running pod. The decision may be made and executed by any component; we will introduce the concept of a "rescheduler" component later, but it is not the only component that can do rescheduling.
This proposal primarily focuses on the architecture and features/mechanisms used to achieve rescheduling, and only briefly discusses example policies. We expect that community experimentation will lead to a significantly better understanding of the range, potential, and limitations of rescheduling policies.
Example use cases for rescheduling are
- moving a running pod onto a node that better satisfies its scheduling criteria
  - moving a pod onto an under-utilized node
  - moving a pod onto a node that meets more of the pod's affinity/anti-affinity preferences
- moving a running pod off of a node in anticipation of a known or speculated future event
  - draining a node in preparation for maintenance, decommissioning, auto-scale-down, etc.
  - "preempting" a running pod to make room for a pending pod to schedule
  - proactively/speculatively making room for large and/or exclusive pods to facilitate fast scheduling in the future (often called "defragmentation")
  - (note that these last two cases are the only use cases where the first-order intent is to move a pod specifically for the benefit of another pod)
- moving a running pod off of a node from which it is receiving poor service
  - anomalous crashlooping or other mysterious incompatibility between the pod and the node
  - repeated out-of-resource killing (see #18724)
  - repeated attempts by the scheduler to schedule the pod onto some node, but it is rejected by Kubelet admission control due to incomplete scheduler knowledge
  - poor performance due to interference from other containers on the node (CPU hogs, cache thrashers, etc.) (note that in this case there is a choice of moving the victim or the aggressor)
Among the key design decisions are
- how does a pod specify its tolerance for these system-generated disruptions, and how does the system enforce such disruption limits
- for each use case, where is the decision made about when and which pods to reschedule (controllers, schedulers, an entirely new component e.g. "rescheduler", etc.)
- rescheduler design issues: how much does a rescheduler need to know about pods' schedulers' policies, how does the rescheduler specify its rescheduling requests/decisions (e.g. just as an eviction, an eviction with a hint about where to reschedule, or as an eviction paired with a specific binding), how does the system implement these requests, does the rescheduler take into account the second-order effects of decisions (e.g. whether an evicted pod will reschedule, will cause a preemption when it reschedules, etc.), does the rescheduler execute multi-step plans (e.g. evict two pods at the same time with the intent of moving one into the space vacated by the other, or even more complex plans)
Additional musings on the rescheduling design space can be found here.
The key mechanisms and components of the proposed design are priority, preemption,
disruption budgets, the /evict subresource, and the rescheduler.
Just as it is useful to overcommit nodes to increase node-level utilization, it is useful to overcommit clusters to increase cluster-level utilization. Scheduling priority (which we abbreviate as priority), in combination with disruption budgets (described in the next section), allows Kubernetes to safely overcommit clusters much as QoS levels allow it to safely overcommit nodes.
Today, cluster sharing among users, workload types, etc. is regulated via the quota mechanism. When allocating quota, a cluster administrator has two choices: (1) the sum of the quotas is less than or equal to the capacity of the cluster, or (2) the sum of the quotas is greater than the capacity of the cluster (that is, the cluster is overcommitted). (1) is likely to lead to cluster under-utilization, while (2) is unsafe in the sense that someone's pods may go pending indefinitely even though they are still within their quota. Priority makes cluster overcommitment (i.e. case (2)) safe by allowing users and/or administrators to identify which pods should be allowed to run, and which should go pending, when demand for cluster resources exceeds supply due to cluster overcommitment.
Priority is also useful in some special-case scenarios, such as ensuring that system DaemonSets can always schedule and reschedule onto every node where they want to run (assuming they are given the highest priority), e.g. see #21767.
We propose to add a required Priority field to PodSpec. Its value type is string, and
the cluster administrator defines a total ordering on these strings (for example
Critical, Normal, Preemptible). We choose string instead of integer so that it is
easy for an administrator to add new priority levels in between existing levels, to
encourage thinking about priority in terms of user intent and avoid magic numbers, and to
make the internal implementation more flexible.
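As a minimal sketch of how a string-valued priority with an administrator-defined total ordering might be compared, consider the following. The type PriorityOrder and its methods are hypothetical names introduced here for illustration; the proposal does not define them.

```go
package main

import "fmt"

// PriorityOrder is a hypothetical representation of the administrator-defined
// total ordering on priority names. It is not part of the proposal's API.
type PriorityOrder struct {
	rank map[string]int // lower rank means higher priority
}

// NewPriorityOrder builds the ordering from a list of names, highest first.
func NewPriorityOrder(highestFirst []string) PriorityOrder {
	rank := make(map[string]int, len(highestFirst))
	for i, name := range highestFirst {
		rank[name] = i
	}
	return PriorityOrder{rank: rank}
}

// AtLeast reports whether priority a is the same as or higher than priority b.
// Unknown names produce an error, similar to the validation an API server
// could perform if the names were constants in the API.
func (o PriorityOrder) AtLeast(a, b string) (bool, error) {
	ra, okA := o.rank[a]
	rb, okB := o.rank[b]
	if !okA || !okB {
		return false, fmt.Errorf("unknown priority in comparison: %q vs %q", a, b)
	}
	return ra <= rb, nil
}

func main() {
	order := NewPriorityOrder([]string{"Critical", "Normal", "Preemptible"})
	ok, _ := order.AtLeast("Critical", "Normal")
	fmt.Println(ok) // true

	// Inserting a level between existing ones only changes the list;
	// pods that already say "Normal" do not need to be renumbered.
	order = NewPriorityOrder([]string{"Critical", "High", "Normal", "Preemptible"})
	ok, _ = order.AtLeast("Normal", "High")
	fmt.Println(ok) // false
}
```

With integers, adding a level between two adjacent values would require renumbering; with names, only the ordering list changes.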
When a scheduler is scheduling a new pod P and cannot find any node that meets all of P's scheduling predicates, it is allowed to evict ("preempt") one or more pods that are at the same or lower priority than P (subject to disruption budgets, see next section) from a node in order to make room for P, i.e. in order to make the scheduling predicates satisfied for P on that node. (Note that when we add cluster-level resources (#19080), it might be necessary to preempt from multiple nodes, but that scenario is outside the scope of this document.) The preempted pod(s) may or may not be able to reschedule. The net effect of this process is that when demand for cluster resources exceeds supply, the higher-priority pods will be able to run while the lower-priority pods will be forced to wait. The detailed mechanics of preemption are described in a later section.
In addition to taking disruption budget into account, for equal-priority preemptions the scheduler will try to enforce fairness (across victim controllers, services, etc.).
Priorities could be specified directly by users in the podTemplate, or assigned by an admission controller using properties of the pod. Either way, all schedulers must be configured to understand the same priorities (names and ordering). This could be done by making them constants in the API, or using ConfigMap to configure the schedulers with the information. The advantage of the former (at least making the names, if not the ordering, constants in the API) is that it allows the API server to do validation (e.g. to catch mis-spelling).
In the future, which priorities are usable for a given namespace and pods with certain attributes may be configurable, similar to ResourceQuota, LimitRange, or security policy.
Priority and resource QoS are independent.
The priority we have described here might be used to prioritize the scheduling queue (i.e. the order in which a scheduler examines pods in its scheduling loop), but the two priority concepts do not have to be connected. It is somewhat logical to tie them together, since a higher priority generally indicates that a pod is more urgent to get running. Also, scheduling low-priority pods before high-priority pods might lead to avoidable preemptions if the high-priority pods end up preempting the low-priority pods that were just scheduled.
TODO: Priority and preemption are global or namespace-relative? See this discussion thread.
Of course, if the decision of what priority to give a pod is solely up to the user, then users have no incentive to ever request any priority less than the maximum. Thus priority is intimately related to quota, in the sense that resource quotas must be allocated on a per-priority-level basis (X amount of RAM at priority A, Y amount of RAM at priority B, etc.). The "guarantee" that highest-priority pods will always be able to schedule can only be achieved if the sum of the quotas at the top priority level is less than or equal to the cluster capacity. This is analogous to QoS, where safety can only be achieved if the sum of the limits of the top QoS level ("Guaranteed") is less than or equal to the node capacity. In terms of incentives, an organization could "charge" an amount proportional to the priority of the resources.
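The per-priority quota condition can be illustrated with a short sketch. The Quotas layout (namespace to priority name to GB of RAM) and the function names are assumptions made for this example only.

```go
package main

import "fmt"

// Quotas maps namespace -> priority name -> RAM quota in GB.
// This layout is hypothetical; the proposal does not specify one.
type Quotas map[string]map[string]int

// totalAt sums the quota granted at one priority level across all namespaces.
func totalAt(q Quotas, priority string) int {
	total := 0
	for _, perPriority := range q {
		total += perPriority[priority]
	}
	return total
}

// guaranteeHolds reports whether the sum of the quotas at the given priority
// level is less than or equal to the cluster capacity.
func guaranteeHolds(q Quotas, priority string, capacityGB int) bool {
	return totalAt(q, priority) <= capacityGB
}

func main() {
	q := Quotas{
		"team-a": {"Critical": 40, "Preemptible": 100},
		"team-b": {"Critical": 50, "Preemptible": 200},
	}
	const capacityGB = 100
	fmt.Println(guaranteeHolds(q, "Critical", capacityGB))    // true: 90 <= 100
	fmt.Println(guaranteeHolds(q, "Preemptible", capacityGB)) // false: 300 > 100
}
```

Here the top level is not overcommitted while the lowest level is, which is the arrangement that lets the cluster be overcommitted overall without endangering the top-priority pods.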
The topic of how to allocate quota at different priority levels to achieve a desired balance between utilization and probability of schedulability is an extremely complex topic that is outside the scope of this document. For example, resource fragmentation and RequiredDuringScheduling node and pod affinity and anti-affinity mean that even if the sum of the quotas at the top priority level is less than or equal to the total aggregate capacity of the cluster, some pods at the top priority level might still go pending. In general, priority provides probabilistic guarantees of pod schedulability in the face of overcommitment, by allowing prioritization of which pods should be allowed to run when demand for cluster resources exceeds supply.
While priority can protect pods from one source of disruption (preemption by a
lower-priority pod), disruption budgets limit disruptions from all Kubernetes-initiated
causes, including preemption by an equal or higher-priority pod, or being evicted to
achieve other rescheduling goals. In particular, each pod is optionally associated with a
"disruption budget," a new API resource that limits Kubernetes-initiated terminations
across a set of pods (e.g. the pods of a particular Service might all point to the same
disruption budget object), regardless of cause. Initially we expect disruption budget
(e.g. DisruptionBudgetSpec) to consist of
- a rate limit on disruptions (preemption and other evictions) across the corresponding set of pods, e.g. no more than one disruption per hour across the pods of a particular Service
- a minimum number of pods that must be up simultaneously (sometimes called "shard strength") (of course this can also be expressed as the inverse, i.e. the number of pods of the collection that can be down simultaneously)
The second item merits a bit more explanation. One use case is to specify a quorum size, e.g. to ensure that at least 3 replicas of a quorum-based service with 5 replicas are up at the same time. In practice, a service should ideally create enough replicas to survive at least one planned and one unplanned outage. So in our quorum example, we would specify that at least 4 replicas must be up at the same time; this allows for one intentional disruption (bringing the number of live replicas down from 5 to 4 and consuming one unit of shard strength budget) and one unplanned disruption (bringing the number of live replicas down from 4 to 3) while still maintaining a quorum. Shard strength is also useful for simpler replicated services; for example, you might not want more than 10% of your front-ends to be down at the same time, so as to avoid overloading the remaining replicas.
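The shard strength arithmetic in the quorum example can be written down as a small sketch. The function name allowedVoluntaryDisruptions is hypothetical and not part of the proposed API.

```go
package main

import "fmt"

// allowedVoluntaryDisruptions returns how many more pods could be voluntarily
// disrupted right now without dropping below minAvailable ("shard strength").
func allowedVoluntaryDisruptions(currentlyUp, minAvailable int) int {
	if currentlyUp <= minAvailable {
		return 0
	}
	return currentlyUp - minAvailable
}

func main() {
	// Quorum example from the text: 5 replicas, quorum of 3, and a budget
	// that requires at least 4 replicas to be up.
	fmt.Println(allowedVoluntaryDisruptions(5, 4)) // 1: one planned disruption is allowed
	fmt.Println(allowedVoluntaryDisruptions(4, 4)) // 0: the budget is exhausted
	// An unplanned outage can still take the service from 4 to 3 replicas,
	// which preserves the quorum of 3.
}
```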
Initially, disruption budgets will be specified by the user. Thus as with priority, disruption budgets need to be tied into quota, to prevent users from saying none of their pods can ever be disrupted. The exact way of expressing and enforcing this quota is TBD, though a simple starting point would be to have an admission controller assign a default disruption budget based on priority level (more liberal with decreasing priority). We also likely need a quota that applies to Kubernetes components, to limit the rate at which any one component is allowed to consume disruption budget.
Of course there should also be a DisruptionBudgetStatus that indicates the current
disruption rate that the collection of pods is experiencing, and the number of pods that
are up.
For the purposes of disruption budget, a pod is considered to be disrupted as soon as its graceful termination period starts.
A pod that is not covered by a disruption budget but is managed by a controller gets an implicit infinite disruption budget (though the system should try to not unduly victimize such pods). How a pod that is not managed by a controller is handled is TBD.
TBD: In addition to PodSpec, where do we store pointer to disruption budget
(podTemplate in controller that managed the pod?)? Do we auto-generate a disruption
budget (e.g. when instantiating a Service), or require the user to create it manually
before they create a controller? Which objects should return the disruption budget object
as part of the output on kubectl get other than (obviously) kubectl get for the
disruption budget itself?
TODO: Clean up distinction between "down due to voluntary action taken by Kubernetes" and "down due to unplanned outage" in spec and status.
For now, there is nothing to prevent clients from circumventing the disruption budget protections. Of course, clients that do this are not being "good citizens." In the next section we describe a mechanism that at least makes it easy for well-behaved clients to obey the disruption budgets.
See #12611 for additional discussion of disruption budgets.
Although we could put the responsibility for checking and updating disruption budgets
solely on the client, it is safer and more convenient if we implement that functionality
in the API server. Thus we will introduce a new /evict subresource on pod. It is similar to
today's "delete" on pod except
- It will be rejected if the deletion would violate disruption budget. (See how Deployment handles failure of /rollback for ideas on how clients could handle failure of /evict.) There are two possible ways to implement this:
  - For the initial implementation, this will be accomplished by the API server just looking at the DisruptionBudgetStatus and seeing if the disruption would violate the DisruptionBudgetSpec. In this approach, we assume a disruption budget controller keeps the DisruptionBudgetStatus up-to-date by observing all pod deletions and creations in the cluster, so that an approved disruption is quickly reflected in the DisruptionBudgetStatus. Of course this approach does allow a race in which one or more additional disruptions could be approved before the first one is reflected in the DisruptionBudgetStatus.
  - Thus a subsequent implementation will have the API server explicitly debit the DisruptionBudgetStatus when it accepts an /evict. (There still needs to be a controller, to keep the shard strength status up-to-date when replacement pods are created after an eviction; the controller may also be necessary for the rate status depending on how rate is represented, e.g. adding tokens to a bucket at a fixed rate.) Once etcd supports multi-object transactions (etcd v3), the debit and pod deletion will be placed in the same transaction.
  - Note: For the purposes of disruption budget, a pod is considered to be disrupted as soon as its graceful termination period starts (so when we say "delete" here we do not mean "deleted from etcd" but rather "graceful termination period has started").
- It will allow clients to communicate additional parameters when they wish to delete a pod. (In the absence of the /evict subresource, we would have to create a pod-specific type analogous to api.DeleteOptions.)
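The "explicit debit" variant could look roughly like the sketch below, in which the check against the spec and the update of the status happen under one lock so that two concurrent evictions cannot both be approved against the same remaining budget. The struct fields, the token-based rate representation, and the TryEvict method are all assumptions for illustration; the proposal leaves these details open.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrBudgetExceeded is a hypothetical error returned when an eviction is rejected.
var ErrBudgetExceeded = errors.New("eviction would violate disruption budget")

// DisruptionBudget combines hypothetical spec and status fields in one object.
type DisruptionBudget struct {
	mu           sync.Mutex
	MinAvailable int // spec: minimum number of pods that must be up
	CurrentlyUp  int // status: pods currently up
	Tokens       int // status: disruptions still permitted by the rate limit
}

// TryEvict checks the budget and debits it atomically. It stands in for what
// the API server might do when it accepts an /evict request.
func (b *DisruptionBudget) TryEvict() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.Tokens < 1 || b.CurrentlyUp-1 < b.MinAvailable {
		return ErrBudgetExceeded
	}
	b.Tokens--
	b.CurrentlyUp--
	return nil
}

func main() {
	budget := &DisruptionBudget{MinAvailable: 4, CurrentlyUp: 5, Tokens: 2}
	fmt.Println(budget.TryEvict() == nil) // true: 5 -> 4 pods up, still within budget
	fmt.Println(budget.TryEvict() == nil) // false: would drop below MinAvailable
}
```

A separate controller would still be needed to raise CurrentlyUp when replacement pods start and to refill Tokens over time.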
We will make kubectl delete pod use /evict by default, and require a command-line
flag to delete the pod directly.
We will add to NodeStatus a bounded-sized list of signatures of pods that should avoid
that node (provisionally called PreferAvoidPods). One of the pieces of information
specified in the /evict subresource is whether the eviction should add the evicted
pod's signature to the corresponding node's PreferAvoidPods. Initially the pod
signature will be a controllerRef, i.e. a reference to the pod's controller.
Controllers are responsible for garbage collecting, after some period of time,
PreferAvoidPods entries that point to them, but the API server will also enforce a
bounded size on the list. All schedulers will have a highest-weighted priority function
that gives a node the worst priority if the pod it is scheduling appears in that node's
PreferAvoidPods list. Thus appearing in PreferAvoidPods is similar to
RequiredDuringScheduling node anti-affinity but it takes precedence over all other
priority criteria and is not explicitly listed in the NodeAffinity of the pod.
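A rough sketch of the bounded list and the corresponding priority function follows. The bound of 3, the drop-oldest policy, the 0 to 10 score range, and all identifiers other than PreferAvoidPods are assumptions for this example; the proposal only states that the list is bounded and that a listed pod gives the node the worst priority.

```go
package main

import "fmt"

// maxAvoidEntries is a hypothetical bound on the list size.
const maxAvoidEntries = 3

// Node carries only the fields needed for this sketch.
type Node struct {
	Name            string
	PreferAvoidPods []string // controller references, oldest first
}

// AddAvoid records a controller reference. If the bound would be exceeded,
// the oldest entry is dropped (one possible policy, assumed here).
func (n *Node) AddAvoid(controllerRef string) {
	for _, ref := range n.PreferAvoidPods {
		if ref == controllerRef {
			return // already listed
		}
	}
	n.PreferAvoidPods = append(n.PreferAvoidPods, controllerRef)
	if len(n.PreferAvoidPods) > maxAvoidEntries {
		n.PreferAvoidPods = n.PreferAvoidPods[1:]
	}
}

// avoidScore returns the worst score (0) if the pod's controller appears in
// the node's PreferAvoidPods list, and the best score (10) otherwise.
func avoidScore(n Node, controllerRef string) int {
	for _, ref := range n.PreferAvoidPods {
		if ref == controllerRef {
			return 0
		}
	}
	return 10
}

func main() {
	node := Node{Name: "node-1"}
	node.AddAvoid("rc/frontend")
	fmt.Println(avoidScore(node, "rc/frontend")) // 0: scheduler should steer away
	fmt.Println(avoidScore(node, "rc/backend"))  // 10: no avoidance recorded
}
```

Because this function would carry the highest weight, a node with score 0 is chosen only when no other feasible node exists.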
PreferAvoidPods is useful for the "moving a running pod off of a node from which it is
receiving poor service" use case, as it reduces the chance that the replacement pod will
end up on the same node (keep in mind that most of those cases are situations that the
scheduler does not have explicit priority functions for, for example it cannot know in
advance that a pod will be starved). Also, though we do not intend to implement any such
policies in the first version of the rescheduler, it is useful whenever the rescheduler evicts
two pods A and B with the intention of moving A into the space vacated by B (it prevents
B from rescheduling back into the space it vacated before A's scheduler has a chance to
reschedule A there). Note that these two uses are subtly different; in the first
case we want the avoidance to last a relatively long time, whereas in the second case we
may only need it to last until A schedules.
See #20699 for more discussion.
NOTE: We expect a fuller design doc to be written on preemption before it is implemented. However, a sketch of some ideas is presented here, since preemption is closely related to the concepts discussed in this doc.
Pod schedulers will decide and enact preemptions, subject to the priority and disruption
budget rules described earlier. (Though note that we currently do not have any mechanism
to prevent schedulers from bypassing either the priority or disruption budget rules.)
The scheduler does not concern itself with whether the evicted pod(s) can reschedule. The
eviction(s) use(s) the /evict subresource so that it is subject to the disruption
budget(s) of the victim(s), but it does not request to add the victim pod(s) to the
nodes' PreferAvoidPods.
Evicting the victim(s) and binding the pending pod that the evictions are meant to make schedulable are not transactional. We expect the scheduler to issue the operations in sequence, but it is still possible that another scheduler could schedule its pod in between the eviction(s) and the binding, or that the set of pods running on the node in question changed between the time the scheduler made its decision and the time it sent the operations to the API server, thereby making the eviction(s) insufficient for the pending pod to schedule. In general there are a number of race conditions that cannot be avoided without (1) making the evictions and binding part of a single transaction, and (2) making the binding preconditioned on a version number that is associated with the node and is incremented on every binding. We may or may not implement those mechanisms in the future.
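Mechanism (2), a binding preconditioned on a per-node version number, is a form of optimistic concurrency control. The sketch below illustrates the idea in isolation; the Node type, its fields, and the Bind signature are hypothetical and do not correspond to any existing API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStaleVersion is a hypothetical error for a binding based on outdated state.
var ErrStaleVersion = errors.New("node changed since the scheduler observed it")

// Node holds a version counter that is incremented on every binding.
type Node struct {
	mu      sync.Mutex
	version int
	pods    []string
}

// Version returns the node's current version number.
func (n *Node) Version() int {
	n.mu.Lock()
	defer n.mu.Unlock()
	return n.version
}

// Bind succeeds only if the caller observed the node's current version.
func (n *Node) Bind(pod string, observedVersion int) error {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.version != observedVersion {
		return ErrStaleVersion
	}
	n.pods = append(n.pods, pod)
	n.version++
	return nil
}

func main() {
	node := &Node{}
	seen := node.Version() // both schedulers observe the same state

	// Another scheduler binds first, so the node's version advances.
	fmt.Println(node.Bind("other-pod", seen) == nil) // true

	// The preempting scheduler's binding is now rejected and must be retried
	// against fresh state rather than landing on a node that no longer fits.
	fmt.Println(node.Bind("pending-pod", seen) == nil) // false
}
```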
Given a choice between a node where scheduling a pod requires preemption and one where it does not, all other things being equal, a scheduler should choose the one where preemption is not required. (TBD: Also, if the selected node does require preemption, the scheduler should preempt lower-priority pods before higher-priority pods (e.g. if the scheduler needs to free up 4 GB of RAM, and the node has two 2 GB low-priority pods and one 4 GB high-priority pod, all of which have sufficient disruption budget, it should preempt the two low-priority pods). This is debatable, since all have sufficient disruption budget. But still better to err on the side of giving better disruption SLO to higher-priority pods when possible?)
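The "lower-priority victims first" idea from the 4 GB example can be sketched as a greedy selection. This is only one possible policy under stated assumptions: priority is modeled as an integer rank (0 is highest), only memory is considered, and the Pod fields and pickVictims function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod carries only the fields needed for this sketch.
type Pod struct {
	Name      string
	Rank      int  // 0 is the highest priority; larger means lower priority
	MemoryGB  int  // memory that would be freed by evicting the pod
	HasBudget bool // whether the pod's disruption budget permits an eviction
}

// pickVictims greedily selects pods to preempt on one node, lowest priority
// first, until needGB is freed. Only pods at the same or lower priority than
// the preemptor and with sufficient budget are eligible. It returns nil if
// the eligible pods cannot free enough memory.
func pickVictims(pods []Pod, preemptorRank, needGB int) []string {
	candidates := make([]Pod, 0, len(pods))
	for _, p := range pods {
		if p.Rank >= preemptorRank && p.HasBudget {
			candidates = append(candidates, p)
		}
	}
	// Lowest priority (largest rank) first; stable to keep input order on ties.
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].Rank > candidates[j].Rank
	})
	var victims []string
	freed := 0
	for _, p := range candidates {
		if freed >= needGB {
			break
		}
		victims = append(victims, p.Name)
		freed += p.MemoryGB
	}
	if freed < needGB {
		return nil
	}
	return victims
}

func main() {
	pods := []Pod{
		{Name: "high", Rank: 0, MemoryGB: 4, HasBudget: true},
		{Name: "low-a", Rank: 2, MemoryGB: 2, HasBudget: true},
		{Name: "low-b", Rank: 2, MemoryGB: 2, HasBudget: true},
	}
	// Free 4 GB for a preemptor at the highest priority.
	fmt.Println(pickVictims(pods, 0, 4)) // [low-a low-b]
}
```

A greedy pass like this may evict more pods than strictly necessary in other configurations; it is meant only to show the ordering preference, not an optimal selection.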
Preemption victims must be given their termination grace period. One possible sequence of events is
1. The API server binds the preemptor to the node (i.e. sets nodeName on the preempting pod) and sets deletionTimestamp on the victims.
2. Kubelet sees that deletionTimestamp has been set on the victims; they enter their graceful termination period.
3. Kubelet sees the preempting pod. It runs the admission checks on the new pod assuming all pods that are in their graceful termination period are gone and that all pods that are in the waiting state (see (4)) are running.
4. If (3) fails, then the new pod is rejected. If (3) passes, then Kubelet holds the new pod in a waiting state, and does not run it until the pod passes the admission checks using the set of actually running pods.
Note that there are a lot of details to be figured out here; above is just a very hand-wavy sketch of one general approach that might work.
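To make the two Kubelet-side checks in that sequence concrete, here is a simplified sketch that considers only memory. It assumes that a pod in its graceful termination period still holds its resources until it exits, and that waiting pods hold none yet; the types and function names are hypothetical and do not reflect actual Kubelet code.

```go
package main

import "fmt"

// PodState is a simplified pod lifecycle state for this sketch.
type PodState int

const (
	Running     PodState = iota
	Terminating          // in its graceful termination period
	Waiting              // admitted optimistically, not yet started
)

// Pod carries only the fields needed for this sketch.
type Pod struct {
	Name     string
	MemoryGB int
	State    PodState
}

// usedMemory sums the memory of the pods selected by counts.
func usedMemory(pods []Pod, counts func(Pod) bool) int {
	total := 0
	for _, p := range pods {
		if counts(p) {
			total += p.MemoryGB
		}
	}
	return total
}

// optimisticAdmit is the check that treats terminating pods as already gone
// and waiting pods as already running.
func optimisticAdmit(capacityGB int, pods []Pod, newPod Pod) bool {
	used := usedMemory(pods, func(p Pod) bool { return p.State != Terminating })
	return used+newPod.MemoryGB <= capacityGB
}

// canStart is the check against the pods that actually hold resources now:
// running pods and pods still in their termination grace period.
func canStart(capacityGB int, pods []Pod, newPod Pod) bool {
	used := usedMemory(pods, func(p Pod) bool { return p.State != Waiting })
	return used+newPod.MemoryGB <= capacityGB
}

func main() {
	const capacityGB = 8
	pods := []Pod{
		{Name: "victim", MemoryGB: 4, State: Terminating},
		{Name: "bystander", MemoryGB: 4, State: Running},
	}
	preemptor := Pod{Name: "preemptor", MemoryGB: 4, State: Waiting}

	fmt.Println(optimisticAdmit(capacityGB, pods, preemptor)) // true: hold in waiting state
	fmt.Println(canStart(capacityGB, pods, preemptor))        // false: victim has not exited yet

	pods = pods[1:] // the victim finishes terminating
	fmt.Println(canStart(capacityGB, pods, preemptor)) // true: the preemptor can now run
}
```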
See #22212 for additional discussion.
Node drain will be handled by one or more components not described in this document. They
will respect disruption budgets. Initially, we will just make kubectl drain
respect disruption budgets. See #17393 for other discussion.
All rescheduling other than preemption and node drain will be decided and enacted by a
new component called the rescheduler. It runs continuously in the background, looking
for opportunities to move pods to better locations. It acts when the degree of
improvement meets some threshold and is allowed by the pod's disruption budget. The
action is eviction of a pod using the /evict subresource, with the pod's signature
enqueued in the node's PreferAvoidPods. It does not force the pod to reschedule to any
particular node. Thus it is really an "unscheduler"; only in combination with the evicted
pod's scheduler, which schedules the replacement pod, do we get true "rescheduling." See
the "Example use cases" section earlier for some example use cases.
The rescheduler is a best-effort service that makes no guarantees about how quickly (or whether) it will resolve a suboptimal pod placement.
The first version of the rescheduler will not take into consideration where or whether an
evicted pod will reschedule. The evicted pod may go pending, consuming one unit of the
corresponding shard strength disruption budget indefinitely. By using the /evict
subresource, the rescheduler ensures that there is sufficient budget for the
evicted pod to go and stay pending. We expect future versions of the rescheduler may be
linked with the "mandatory" predicate functions (currently, the ones that constitute the
Kubelet admission criteria), and will only evict if the rescheduler determines that the
pod can reschedule somewhere according to those criteria. (Note that this still does not
guarantee that the pod actually will be able to reschedule, for at least two reasons: (1)
the state of the cluster may change between the time the rescheduler evaluates it and
when the evicted pod's scheduler tries to schedule the replacement pod, and (2) the
evicted pod's scheduler may have predicate functions beyond the
mandatory ones).
(Note: see this comment).
The first version of the rescheduler will only implement two objectives: moving a pod onto an under-utilized node, and moving a pod onto a node that meets more of the pod's affinity/anti-affinity preferences than wherever it is currently running. (We assume that nodes that are intentionally under-utilized, e.g. because they are being drained, are marked unschedulable, thus the first objective will not cause the rescheduler to "fight" a system that is draining nodes.) We assume that all schedulers sufficiently weight the priority functions for affinity/anti-affinity and avoiding very packed nodes, otherwise evicted pods may not actually move onto a node that is better according to the criteria that caused it to be evicted. (But note that in all cases it will move to a node that is better according to the totality of its scheduler's priority functions, except in the case where the node where it was already running was the only node where it can run.) As a general rule, the rescheduler should only act when it sees particularly bad situations, since (1) an eviction for a marginal improvement is likely not worth the disruption--just because there is sufficient budget for an eviction doesn't mean an eviction is painless to the application, and (2) rescheduling the pod might not actually mitigate the identified problem if it is minor enough that other scheduling factors dominate the decision of where the replacement pod is scheduled.
We assume schedulers' priority functions are at least vaguely aligned with the rescheduler's policies; otherwise the rescheduler will never accomplish anything useful, given that it relies on the schedulers to actually reschedule the evicted pods. (Even if the rescheduler acted as a scheduler, explicitly rebinding evicted pods, we'd still want this to be true, to prevent the schedulers and rescheduler from "fighting" one another.)
The rescheduler will be configured using ConfigMap; the cluster administrator can enable or disable policies and can tune the rescheduler's aggressiveness (aggressive means it will use a relatively low threshold for triggering an eviction and may consume a lot of disruption budget, while non-aggressive means it will use a relatively high threshold for triggering an eviction and will try to leave plenty of buffer in disruption budgets). The first version of the rescheduler will not be extensible or pluggable, since we want to keep the code simple while we gain experience with the overall concept. In the future, we anticipate a version that will be extensible and pluggable.
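The threshold-based decision and the notion of aggressiveness can be sketched as follows. The scoring scale, the Placement fields, and the specific threshold values are assumptions for illustration; the proposal does not define how improvement is measured.

```go
package main

import "fmt"

// Placement describes one running pod from the rescheduler's point of view.
// All fields are hypothetical.
type Placement struct {
	Pod          string
	CurrentScore int  // how good the current node is for this pod
	BestScore    int  // the best score among other candidate nodes
	BudgetAllows bool // whether the pod's disruption budget permits an eviction
}

// shouldEvict reports whether the expected improvement meets the configured
// threshold and the eviction is allowed by the pod's disruption budget.
func shouldEvict(p Placement, threshold int) bool {
	return p.BudgetAllows && p.BestScore-p.CurrentScore >= threshold
}

func main() {
	p := Placement{Pod: "web-1", CurrentScore: 4, BestScore: 7, BudgetAllows: true}

	const aggressiveThreshold = 2   // low threshold: evicts for modest gains
	const conservativeThreshold = 6 // high threshold: evicts only for large gains

	fmt.Println(shouldEvict(p, aggressiveThreshold))   // true: improvement of 3 >= 2
	fmt.Println(shouldEvict(p, conservativeThreshold)) // false: improvement of 3 < 6
}
```

In this model, tuning aggressiveness amounts to changing a single threshold value that an administrator could supply through configuration.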
We might want some way to force the evicted pod to the front of the scheduler queue, independently of its priority.
See #12140 for additional discussion.
In general, the design space for this topic is huge. This document describes some of the
design considerations and proposes one particular initial implementation. We expect
certain aspects of the design to be "permanent" (e.g. the notion and use of priorities,
preemption, disruption budgets, and the /evict subresource) while others may change over time
(e.g. the partitioning of functionality between schedulers, controllers, rescheduler,
horizontal pod autoscaler, and cluster autoscaler; the policies the rescheduler implements;
the factors the rescheduler takes into account when making decisions (e.g. knowledge of
schedulers' predicate and priority functions, second-order effects like whether and where
an evicted pod will be able to reschedule, etc.); the way the rescheduler enacts its
decisions; and the complexity of the plans the rescheduler attempts to implement).
The highest-priority feature to implement is the rescheduler with the two use cases
highlighted earlier: moving a pod onto an under-utilized node, and moving a pod onto a
node that meets more of the pod's affinity/anti-affinity preferences. The former is
useful to rebalance pods after cluster auto-scale-up, and the latter is useful for
Ubernetes. This requires implementing disruption budgets and the /evict subresource,
but not priority or preemption.
Because the general topic of rescheduling is very speculative, we have intentionally proposed that the first version of the rescheduler be very simple -- only uses eviction (no attempt to guide replacement pod to any particular node), doesn't know schedulers' predicate or priority functions, doesn't try to move two pods at the same time, and only implements two use cases. As alluded to in the previous subsection, we expect the design and implementation to evolve over time, and we encourage members of the community to experiment with more sophisticated policies and to report their results from using them on real workloads.
TODO.
TODO.
TODO: Add reference to this doc from docs/proposals/rescheduler.md