Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 1 changed file with 246 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,246 @@ | ||
# Cluster Autoscaler parallel drain | ||
|
||
Author: x13n | ||
|
||
## Background | ||
|
||
Scale down of non-empty nodes is slow. We are only draining one node at a time. This is particularly problematic in large clusters. While empty nodes can be removed at the same time, non-empty nodes are removed sequentially, which can take hours in thousand-node clusters. We want to speed it up by allowing parallel scale down of multiple non-empty nodes. Ideally, we'd like to drain as many nodes as possible simultaneously, without leaving workloads hanging. | ||
|
||
## High level proposal | ||
|
||
### Algorithm triggering | ||
|
||
With the old algorithm, scale down was disabled if the deletion of a non-empty node from a previous iteration had not yet completed. For the new algorithm we plan to relax this. | ||
|
||
### Algorithm operation | ||
|
||
The algorithm will internally keep track of cluster state, which will be updated every loop iteration. The state will allow answering the following questions: | ||
|
||
|
||
* What is the set of nodes which can be removed from the cluster right now in parallel? The pods from the nodes in the set must be schedulable on other nodes in the cluster. The set will be called **candidate\_set** later in the document. | ||
* For how long has a given node been in the candidate\_set? | ||
|
||
The candidate\_set will be built in a greedy manner by iterating over all nodes. To verify if a given node can be put in the candidate\_set we will use scheduler simulation to binpack the pods on the nodes starting from the nodes on the other end of the list (details below). Since the simulation can be time-consuming, it will be time bound. This may lead to a slower scale down, but will prevent scale up from starving. | ||
|
||
In each iteration the contents of candidate\_set will be updated. Nodes can be added, removed or stay in candidate\_set. For each node in the candidate\_set we will keep track of how long it has been there. If a node is removed and then re-added to candidate\_set, the timer is reset. | ||
|
||
To trigger node deletion we will use the already present ScaleDownUnneededTime and ScaleDownUnreadyTime parameters. If in a given CA loop iteration there are nodes which have been in candidate\_set for more than ScaleDownUnneededTime (or which are not ready and have been in candidate\_set for more than ScaleDownUnreadyTime), the actuation of scale down is triggered. The number of nodes which are actively scaled down will be limited by the MaxDrainParallelism and MaxScaleDownParallelism configuration parameters. We will configure separate limits for scaling down empty and non-empty nodes to allow quick scale-down of empty nodes even if lengthy drains of non-empty nodes are in progress. | ||
|
||
The actuation will be done in a similar fashion as in the old algorithm. For a given set of nodes to be scaled down we will synchronously taint the nodes. Then separate goroutines will be started to perform draining and node deletion. | ||
|
||
The current scale-down algorithm uses SoftTainting to limit the chance that new pods will be scheduled onto scale-down candidates. The new algorithm will use the same mechanism for nodes in the candidate\_set. | ||
|
||
The state in the scale-down algorithm will be updated incrementally based on the changes to the set of nodes and pods in the cluster snapshot. | ||
|
||
## Detailed design | ||
|
||
### Existing code refactoring | ||
|
||
The existing scale down code lives mostly in the ScaleDown object, spanning the scale\_down.go file with 1.5k lines of code. As a part of this effort, the ScaleDown object will undergo refactoring, extracting utils to a separate file. ScaleDown itself will become an interface with two implementations (both relying on common utils): the existing version and the new algorithm described below. This will allow easy switching between algorithms with a flag: different flag values will pick different ScaleDown interface implementations. | ||
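
A rough sketch of the interface-plus-flag idea is shown below. The method names follow the entry points named in this document, but the signatures, the `legacyScaleDown` and `parallelScaleDown` types, and the `newScaleDown` factory are simplified assumptions. In particular, the unneeded nodes are passed in directly here rather than computed by a drain simulation.

```go
package main

// ScaleDown lists the entry points named in this document. The signatures
// are simplified assumptions, not the real autoscaler API.
type ScaleDown interface {
	UpdateUnneededNodes(nodes []string)
	GetUnneededNodes() []string
	TryToScaleDown() []string // names of nodes whose deletion was started
}

// legacyScaleDown stands in for the existing algorithm: one node per call.
type legacyScaleDown struct{ unneeded []string }

func (s *legacyScaleDown) UpdateUnneededNodes(nodes []string) {
	s.unneeded = append([]string(nil), nodes...)
}
func (s *legacyScaleDown) GetUnneededNodes() []string { return s.unneeded }
func (s *legacyScaleDown) TryToScaleDown() []string   { return takeFirst(&s.unneeded, 1) }

// parallelScaleDown stands in for the new algorithm: up to maxParallelDrain
// nodes per call.
type parallelScaleDown struct {
	unneeded         []string
	maxParallelDrain int
}

func (s *parallelScaleDown) UpdateUnneededNodes(nodes []string) {
	s.unneeded = append([]string(nil), nodes...)
}
func (s *parallelScaleDown) GetUnneededNodes() []string { return s.unneeded }
func (s *parallelScaleDown) TryToScaleDown() []string {
	return takeFirst(&s.unneeded, s.maxParallelDrain)
}

// takeFirst is the kind of shared utility both implementations rely on: it
// removes and returns up to n names from the front of the list.
func takeFirst(list *[]string, n int) []string {
	if n > len(*list) {
		n = len(*list)
	}
	taken := (*list)[:n]
	*list = (*list)[n:]
	return taken
}

// newScaleDown picks an implementation from a flag value, here a positive
// parallel drain limit.
func newScaleDown(maxParallelDrain int) ScaleDown {
	if maxParallelDrain > 0 {
		return &parallelScaleDown{maxParallelDrain: maxParallelDrain}
	}
	return &legacyScaleDown{}
}
```

Callers only hold the `ScaleDown` interface, so switching algorithms is a matter of changing the value passed to the factory.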
|
||
As a part of the refactoring, we will combine `FastGetPodsToMove` and `DetailedGetPodsForMove` into a single `PodsToMove` function. The `checkReferences` flag will be dropped and we will always do a detailed check. We are now using listers so doing a detailed check does not add extra API calls and should not add too much to execution time. | ||
|
||
### Algorithm State | ||
|
||
The algorithm is stateful. On each call to UpdateNodeInfos (happening every CA loop iteration) the state will be updated to match the most recent cluster state snapshot. The most important fields to be held in algorithm state are: | ||
|
||
* deleted\_set: set of names of nodes which are being deleted (implementation-wise we will probably use NodeDeletionTracker) | ||
* candidate\_set: set of names of nodes which can be deleted | ||
* non\_candidate\_set: set of names of nodes for which we tried to simulate drain and it failed. For each node we keep the time when the simulation was done. | ||
* pod\_destination\_hints: | ||
    * stores destination nodes computed during draining simulation. For each pod from nodes in candidate\_set and deleted\_set it keeps the destination node assigned during simulation. | ||
    * map: source node names -> pod UID -> destination node names | ||
* previous\_node\_infos | ||
    * List of NodeInfo computed at the beginning of previous algorithm iteration. | ||
* pdbs\_remaining\_disruptions - bookkeeping for graceful handling of PDBs | ||
([details](#pdbs-checking)). | ||
* recent\_evictions: set of pods that were recently successfully evicted from a node | ||
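
One possible in-memory shape for these fields is sketched below. The Go types are assumptions chosen for illustration (for example, node infos are reduced to plain names), and the `setHint` and `hint` helpers are hypothetical.

```go
package main

import "time"

// scaleDownState is an illustrative layout of the fields listed above; it is
// not the autoscaler's actual data structure.
type scaleDownState struct {
	deletedSet               map[string]struct{}          // nodes being deleted
	candidateSet             map[string]struct{}          // nodes that can be deleted
	nonCandidateSet          map[string]time.Time         // node -> time of the failed simulation
	podDestinationHints      map[string]map[string]string // source node -> pod UID -> destination node
	previousNodeInfos        []string                     // placeholder for the previous NodeInfo list
	pdbsRemainingDisruptions map[string]int               // PDB name -> disruptions still allowed
	recentEvictions          map[string]time.Time         // pod UID -> eviction time
}

func newScaleDownState() *scaleDownState {
	return &scaleDownState{
		deletedSet:               map[string]struct{}{},
		candidateSet:             map[string]struct{}{},
		nonCandidateSet:          map[string]time.Time{},
		podDestinationHints:      map[string]map[string]string{},
		pdbsRemainingDisruptions: map[string]int{},
		recentEvictions:          map[string]time.Time{},
	}
}

// setHint records the destination chosen for a pod during drain simulation.
func (s *scaleDownState) setHint(sourceNode, podUID, destNode string) {
	if s.podDestinationHints[sourceNode] == nil {
		s.podDestinationHints[sourceNode] = map[string]string{}
	}
	s.podDestinationHints[sourceNode][podUID] = destNode
}

// hint looks up the destination recorded for a pod, if any.
func (s *scaleDownState) hint(sourceNode, podUID string) (string, bool) {
	dest, ok := s.podDestinationHints[sourceNode][podUID]
	return dest, ok
}
```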
|
||
### UpdateUnneededNodes | ||
|
||
UpdateUnneededNodes will be called from RunOnce every CA loop as it is now. On every call the internal state is updated to match the recent snapshot of the cluster state. | ||
|
||
A single loop iteration of the algorithm: | ||
|
||
* Build node\_infos list | ||
* Build new algorithm state (steps below) | ||
|
||
#### State update | ||
|
||
* Verify the pods from nodes currently being scaled down can still be rescheduled | ||
    * Iterate over all the nodes in deleted\_set | ||
        * Validate that we can still reschedule remaining pods on other nodes in the cluster | ||
            * Use pod\_destination\_hints and fall back to linear searching | ||
    * The general algorithm here follows the same rules as "Scaledown simulation loop" described in more detail below. | ||
    * Implementation-wise verification for nodes being scaled down and looking for new candidates will be shared code. | ||
    * Important differences: | ||
        * We will update the [pdbs\_remaining\_disruptions](#pdbs-checking) structure but skip PDB checking for simulation for nodes being scaled down | ||
        * We are not updating candidate\_set | ||
* Verify [recently evicted pods can be scheduled](#race-conditions-with-other-controllers) | ||
    * Iterate over all the pods in recent\_evictions | ||
        * If a pod was added to the list N or more seconds ago, remove it from recent\_evictions | ||
        * If a pod has a known owner reference, check if parent object has created all the replicas | ||
        * If there was no known owner reference or the parent object doesn't have all the replicas, try to schedule the pod | ||
    * Verification fails if any pod failed to be scheduled | ||
* If either of the above verifications failed, break the algorithm here; candidate\_set is empty | ||
* Use ScaleDownNodeProcessor.GetScaleDownCandidates to get a list of scale down eligible nodes: ones that may end up in candidate\_set. | ||
* Clone the cluster snapshot to be used for the simulation, so simulation results don't leak through it. | ||
* Set target\_node\_pointer to the last node in node\_infos | ||
* Scaledown simulation loop: | ||
    * Fetch next node eligible[^1] for scale down | ||
    * If we run out of time for simulation break the loop; do not continue with next node | ||
    * Simulate the node scaledown | ||
        * Fork() cluster snapshot. | ||
        * List the pods that need to be drained from the node. Logic already implemented in GetPodsForDeletionOnNodeDrain. | ||
            * If the function returns an error add node to non\_candidate\_set and continue with the next node. | ||
        * For each pod on the node run a predicate checker to test if it fits one of the nodes in the cluster. Pods should be sorted in some way so we limit the chance of making different decisions each loop iteration. | ||
            * First try to use pod\_destination\_hints: | ||
                * Do predicate checking on the node pointed by pod\_destination\_hints. | ||
                * If scheduling is possible simulate it (sourceNodeInfo.RemovePod(), targetNodeInfo.AddPod(), update new pod\_destination\_hints, update pdb\_counters) | ||
            * If scheduling to node pointed by pod\_destination\_hints is not possible try to find other node: | ||
                * Keep track of the [target\_node\_pointer](#motivation-for-using-target\_node\_pointer-in-draining-simulation) through all iterations of the algorithm. It will point to a node which is currently considered as a destination for simulated pod drains. | ||
                * If currently considered pod cannot be scheduled to node pointed by target\_node\_pointer, move the pointer to the left and try again | ||
                    * Skip nodes which are part of currently built candidate\_set or deleted\_set | ||
                    * Skip nodes which are not in GetPodDestinationCandidates() | ||
                    * Wrap around if first node was reached | ||
                * If currently considered pod can be scheduled to node pointed by target\_node\_pointer simulate scheduling (sourceNodeInfo.RemovePod(), targetNodeInfo.AddPod(), update pod\_destination\_hints, pdb\_counters) and restart loop for the next pod | ||
        * If the currently considered pod cannot be scheduled to any node, Revert() the cluster snapshot and start simulation for the next node. Also revert the target\_node\_pointer to point in time when the current node started being worked on. Add the current node to the non\_candidate\_set. | ||
        * If all the pods from the considered node find new homes, mark the source node as candidate for scaledown (add to new candidate\_set, remove from non\_candidate\_set) and Commit() the cluster snapshot. | ||
* Update previous\_node\_infos and candidate\_set in algorithm state | ||
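
The core of the simulation loop can be illustrated with a toy model. The sketch below is a heavy simplification made for this example: capacity is a single integer, the predicate check is "does the pod fit", and the snapshot Fork/Revert/Commit is replaced by copying a slice. It omits pod\_destination\_hints, PDB accounting, the time bound and the deleted\_set. The `node` type and `simulateCandidates` are hypothetical names.

```go
package main

// node is a toy stand-in for NodeInfo.
type node struct {
	Name string
	Free int   // unused capacity
	Pods []int // sizes of the pods running on the node
}

// simulateCandidates greedily builds a candidate set. Sources are taken from
// the front of the list, while a target pointer starts at the last node and
// moves left (wrapping around) to find destinations for simulated drains.
func simulateCandidates(nodes []node) []string {
	free := make([]int, len(nodes))
	for i, n := range nodes {
		free[i] = n.Free
	}
	isCandidate := make([]bool, len(nodes))
	isDestination := make([]bool, len(nodes))
	target := len(nodes) - 1
	var candidates []string

	for src := range nodes {
		if isDestination[src] {
			continue // a node already used as a destination is not eligible
		}
		// "Fork": remember the state so a failed drain can be reverted.
		savedFree := append([]int(nil), free...)
		savedTarget := target
		var used []int
		drained := true
		for _, pod := range nodes[src].Pods {
			placed := false
			// Visit each node at most once, starting at the current pointer.
			for tries := 0; tries < len(nodes); tries++ {
				if target != src && !isCandidate[target] && free[target] >= pod {
					free[target] -= pod
					used = append(used, target)
					placed = true
					break
				}
				target--
				if target < 0 {
					target = len(nodes) - 1 // wrap around
				}
			}
			if !placed {
				drained = false
				break
			}
		}
		if !drained {
			free, target = savedFree, savedTarget // "Revert"
			continue
		}
		// "Commit": the node joins the candidate set.
		isCandidate[src] = true
		candidates = append(candidates, nodes[src].Name)
		for _, t := range used {
			isDestination[t] = true
		}
	}
	return candidates
}
```

Because the pointer is only advanced when a pod does not fit, consecutive pods tend to land on the same destination, which is what keeps the number of predicate checks low.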
|
||
|
||
#### Caveats | ||
|
||
* The number of iterations will be time-bounded, yet we may require that at least a fixed number of nodes is evaluated each run. | ||
|
||
### GetUnneededNodes() | ||
|
||
Returns node list built based on candidate\_set and previous\_node\_infos. | ||
|
||
### TryToScaleDown | ||
|
||
The responsibility of TryToScaleDown is to trigger actual scaledown of nodes in candidate\_set (or a subset of those). The method will keep track of empty and non-empty nodes being scaled down (separately). The number of nodes being scaled down will be bounded by MaxDrainParallelism and MaxScaleDownParallelism options passed in as CA Flags. | ||
|
||
Method steps in pseudocode: | ||
|
||
* Delete N empty nodes, up to MaxScaleDownParallelism. | ||
* Delete min(MaxScaleDownParallelism - N, MaxDrainParallelism) non-empty nodes. | ||
* synchronously taint the nodes to be scaled down as we currently do | ||
* schedule draining and node deletion as a separate goroutine for each node | ||
* move nodes from candidate\_set to deleted\_set | ||
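
The budget arithmetic and the per-node goroutines from the steps above can be sketched like this. `deletionBudget` and `deleteInParallel` are hypothetical helpers. The sketch ignores deletions already in progress from earlier iterations, and it waits for all goroutines only so that results can be collected in the example.

```go
package main

import "sync"

// deletionBudget returns how many empty and non-empty nodes may be deleted:
// N empty nodes up to maxScaleDown, then
// min(maxScaleDown - N, maxDrain) non-empty nodes.
func deletionBudget(emptyCandidates, drainCandidates, maxScaleDown, maxDrain int) (empty, drain int) {
	empty = minInt(emptyCandidates, maxScaleDown)
	drain = minInt(drainCandidates, minInt(maxScaleDown-empty, maxDrain))
	if drain < 0 {
		drain = 0
	}
	return empty, drain
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// deleteInParallel runs deleteNode in a separate goroutine for each node and
// collects the per-node results once all of them have finished.
func deleteInParallel(names []string, deleteNode func(name string) error) map[string]error {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	results := make(map[string]error, len(names))
	for _, name := range names {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			err := deleteNode(n)
			mu.Lock()
			results[n] = err
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	return results
}
```

For example, with 3 empty candidates, 10 non-empty candidates, MaxScaleDownParallelism of 10 and MaxDrainParallelism of 5, the budget is 3 empty nodes and min(10 - 3, 5) = 5 non-empty nodes.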
|
||
### Motivation for using target\_node\_pointer in draining simulation | ||
|
||
At first sight it may seem more natural to run the draining simulation so that,
for each pod, we search for a destination node starting from the tail of the
sorted nodes list. That approach would probably result in tighter packing
and a larger candidate set. Yet it comes at a huge scalability cost: | ||
|
||
* Many more simulation steps would be needed, as the nodes at the end of
the list are already more tightly packed. Therefore there is a high chance that
we would need to traverse a large part of the list to find a destination
candidate. The situation would get worse as the algorithm progresses and
more rescheduling simulations are made.
* With that approach more nodes would be used as destinations for pods coming
from a single source node. As a result it would be less probable that we
could use the pod\_destination and pod\_sources maps effectively in the
"quick path". | ||
|
||
Using the target\_node\_pointer approach mitigates the above issues at the cost of a
(hopefully) not much worse simulation outcome. | ||
|
||
Implementation-wise, we'd like to reuse the existing SchedulerBasedPredicateChecker.
In order to do this, the algorithm for picking nodes for scheduling will be | ||
modified: it will be passed to SchedulerBasedPredicateChecker as a strategy. By | ||
default, it will be the existing round robin across all nodes, but ScaleDown | ||
simulation will inject a more sophisticated one, relying on the | ||
target\_node\_pointer. | ||
|
||
In the initial version, vanilla SchedulerBasedPredicateChecker will be used. | ||
Implementing support for target\_node\_pointer is essentially an optimization | ||
on top of it. | ||
|
||
### PDBs checking | ||
|
||
Throughout the loop we will keep the quotas remaining for PDBs in the
pdbs\_remaining\_disruptions structure. The structure will be computed at the
beginning of UpdateUnneededNodes and for each PDB it will hold how many Pods
matching this PDB can still be disrupted. Then we decrease the counters as we go
over the nodes and simulate pod rescheduling. The initial computation and drain
simulation take into account the state of the pod. Specifically, if a Pod is not
Healthy it will be subtracted from the remaining quota on the initial
computation and then not subtracted again on drain simulation. | ||
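
The counter handling can be sketched as below. This is an illustration under stated assumptions: each pod matches at most one PDB, the per-PDB limit is given as a hypothetical `maxUnavailable` count, and an unhealthy pod is always allowed to move in the simulation because it was already charged during the initial computation.

```go
package main

// pod is a simplified stand-in: PDB is the name of the matching budget, or
// empty when no PDB selects the pod.
type pod struct {
	Name    string
	PDB     string
	Healthy bool
}

// remainingDisruptions computes the initial quotas: the per-PDB limit minus
// the pods that are already unhealthy.
func remainingDisruptions(maxUnavailable map[string]int, pods []pod) map[string]int {
	remaining := make(map[string]int, len(maxUnavailable))
	for name, limit := range maxUnavailable {
		remaining[name] = limit
	}
	for _, p := range pods {
		if p.PDB != "" && !p.Healthy {
			remaining[p.PDB]--
		}
	}
	return remaining
}

// tryDisrupt is called when the drain simulation wants to move a pod. It
// returns false when the matching PDB has no quota left. Unhealthy pods were
// charged up front, so they are not subtracted a second time.
func tryDisrupt(remaining map[string]int, p pod) bool {
	if p.PDB == "" || !p.Healthy {
		return true
	}
	if remaining[p.PDB] <= 0 {
		return false
	}
	remaining[p.PDB]--
	return true
}
```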
|
||
|
||
### PreFilteringScaleDownNodeProcessor changes | ||
|
||
We need to repeat the checks from PreFilteringScaleDownNodeProcessor in
UpdateUnneededNodes anyway, as the latter simulates multi-node deletion. It
therefore seems we can drop PreFilteringScaleDownNodeProcessor when the new
scale down algorithm is enabled. This is not crucial, as the checks are not costly. | ||
|
||
|
||
### Changes to ScaleDownStatus | ||
|
||
ScaleDownStatus does not play very well with the concept of multiple nodes being
scaled down. The existing ScaleDownInProgress value in ScaleDownResult enum
represents a state in which CA decides to skip scale down logic due to ongoing | ||
deletion of a single non-empty node. In the new algorithm, this status will only | ||
be emitted when max parallelism for both empty and non-empty nodes was reached | ||
and hence no new deletion can happen. | ||
|
||
### Changes to clusterstate | ||
|
||
Cluster state holds a map listing unneeded candidates for each node group. We
will keep the map. The semantics of unneeded nodes will change though. With the
old algorithm there was no guarantee that all the "unneeded" nodes could be
removed together, because the drain simulation is done independently for each
candidate node. The parallel scaledown algorithm validates that all unneeded
nodes can be dropped together (modulo differences in behavior between the
scheduler simulation and the actual scheduler). | ||
|
||
### Race conditions with other controllers | ||
|
||
Whenever Cluster Autoscaler evicts a pod, it is expected that some external | ||
controller will create a similar pod elsewhere. However, it takes a non-zero | ||
time for controllers to react and hence we will keep track of all evicted pods | ||
on a dedicated list (recent\_evictions). After each eviction, the pod object | ||
along with the eviction timestamp will be added to the list and kept for a | ||
preconfigured amount of time. The pods from that list will be injected back into | ||
the cluster before scale down simulation as a safety buffer, except when CA is | ||
certain the replacement pods were either already scheduled or don't need | ||
replacing. This can be verified by examining the parent object for the pods | ||
(e.g. ReplicaSet). | ||
|
||
In the initial implementation, for the sake of simplicity, parallel drain will | ||
not be triggered as long as there are already any nodes in the deleted\_set. | ||
|
||
## Monitoring | ||
|
||
The existing set of metrics will suffice for the sake of monitoring performance | ||
of parallel scale down (i.e. scaled\_down\_nodes\_total, | ||
scaled\_down\_gpu\_nodes\_total, function\_duration\_seconds). We may extend the | ||
set of function metric label values for function\_duration\_seconds to get
better visibility into empty vs. non-empty scale down duration. | ||
|
||
## Rollout | ||
|
||
Flag controlled: when --max-parallel-drain > 0, the new logic is enabled. Old
logic will stay for a while to make rollback possible. The flag will be
introduced in version 1.X (probably 1.24) and defaulted to non-0 in 1.(X+1); the
old logic will eventually be dropped in 1.(X+2) or later. | ||
|
||
## Notes | ||
|
||
[^1]: | ||
|
||
The reasons for a node not to be eligible for scale down include: | ||
|
||
* node is currently being scaled down | ||
* node is not in scale down candidates | ||
* removing the node would move node pool size below min | ||
* removing the node would move cluster resources below min | ||
* node has no-scaledown annotation | ||
* node utilization is too high | ||
* node is already marked as destination | ||
|