- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Consideration to race condition
- Pod2's scheduling is successful (pod2 is equal or lower priority than pod1)
- Pod2's scheduling is successful (pod2 is higher priority than pod1)
- Pod2's scheduling fails and starts the preemption (pod2 is equal or lower priority than pod1)
- Pod2's scheduling fails and starts the preemption (pod2 is higher priority than pod1)
- Test Plan
- Graduation Criteria
- Upgrade / Downgrade Strategy
- Version Skew Strategy
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This KEP proposes decoupling the API calls for preemption from the scheduling cycle, to improve scheduling throughput in scenarios where scheduling failures trigger preemption.
There is basically only one scheduler in a cluster, and hence scheduling throughput is a crucial metric for the scheduler.
The scheduler schedules Pods one by one in the scheduling cycle, and we try to reduce API calls there as much as possible to improve scheduling throughput.
The binding cycle is an example of this principle:
- The scheduling cycle decides which Node the Pod should go to.
- At the end of the scheduling cycle, the scheduler reserves the Node in its cache so that the next scheduling cycle takes the current Pod into consideration.
- The scheduling cycle ends and the binding cycle starts; the binding cycle runs asynchronously, and the scheduler starts the next scheduling cycle.
This flow decouples the API call that binds the Pod to the Node from the scheduling cycle, so that the API call doesn't block scheduling throughput.
But we have a similar problem with preemption: preemption runs at the PostFilter extension point, which is part of the scheduling cycle, and it has to make API calls to update Pods' conditions and to delete Pods, which can block scheduling throughput.
scheduler-perf actually shows that the preemption scenario currently takes much longer than the others.
- Improve scheduling throughput when pods require issuing preemptions by making API calls asynchronous
- Making the same enhancement for DRA is not a goal of this KEP because DRA is still an under-construction feature.
- If the DRA maintainers want, they can technically follow the same approach as this KEP. But, at least in this KEP, we don't discuss how.
The preemption plugin makes the API calls for the preemption asynchronously after the PostFilter extension point, so that the scheduler can continue scheduling other Pods while the API calls for the preemption are being made.
After the preemption goroutine is done, the scheduling for the Pod that triggered the preemption will be retried.
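As a rough illustration of the proposed flow, the sketch below shows a PostFilter-like step that records the nomination, launches the API calls in a goroutine, and returns immediately. All types and helpers here (asyncPreemptor, nominate, preemptAPI, requeue) are hypothetical stand-ins, not the actual scheduling framework API:

```go
// Package asyncpreemption is a hypothetical, self-contained sketch; it is not
// the actual kube-scheduler plugin code.
package asyncpreemption

import (
	"context"
	"log"
)

// asyncPreemptor is a hypothetical stand-in for the preemption plugin.
type asyncPreemptor struct {
	nominate   func(pod, node string)                            // record the nomination in memory
	preemptAPI func(ctx context.Context, victims []string) error // API calls to kube-apiserver
	requeue    func(pod string)                                  // let the pod be retried by the queue
}

// postFilter mimics the proposed flow: decide the preemption target, start the
// API calls asynchronously, and return right away so the scheduler can start
// the next Pod's scheduling cycle.
func (p *asyncPreemptor) postFilter(ctx context.Context, pod, node string, victims []string) {
	p.nominate(pod, node) // later scheduling cycles now take this pod into consideration

	go func() {
		if err := p.preemptAPI(ctx, victims); err != nil {
			log.Printf("preemption for %s failed: %v", pod, err)
			return // in the real flow, the nomination would be reverted here
		}
		// Once the API calls succeed, the pod is retried and (hopefully)
		// lands on the nominated node.
		p.requeue(pod)
	}()
	// No waiting here: the scheduling cycle for this pod is already over.
}
```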
When kube-apiserver is unstable and the API calls in the preemption goroutines fail frequently, the scheduler could make non-optimal scheduling decisions, because the scheduler nominates Pods at PostFilter even though those Pods won't actually be scheduled on the nominated Nodes when the preemption API calls fail.
Let's say many mid-priority Pods are triggering the preemption API calls.
With the scheduler after this proposal, while the preemption goroutines for them are running,
the scheduler assumes they'll eventually be scheduled on the Nodes that the preemptions are targeting, via .Status.NominatedNodeName.
So, the scheduling of other mid-priority or lower priority Pods takes those preemptor Pods into consideration,
which is correct if the preemption goroutines actually finish successfully, but results in non-optimal scheduling results otherwise.
(Higher priority Pods won't be affected; they can take the places reserved for lower priority Pods via .Status.NominatedNodeName.)
But, in the first place, when kube-apiserver is unstable, the scheduler doesn't behave well anyway because it relies on a lot of communication with kube-apiserver. Even if the scheduler made the best scheduling decision, the binding API call might fail after all.
So, we don't have to pay special attention to this issue.
To achieve asynchronous preemption, we will change the preemption plugin's implementation as follows:
- The preemption PostFilter plugin calculates the preemption target and nominates the Pod for the Node. (We'll use the AddNominatedPod API exposed from the scheduling framework to plugins.)
- The preemption PostFilter plugin starts a goroutine that makes the API calls, and returns a success status (i.e., it does not wait for the goroutine to finish).
- The preemption plugin blocks the Pod while the preemption goroutine is in progress, using the PreEnqueue extension point, so that the Pod won't be retried during this time.
Then, the preemption goroutine makes the actual API calls to delete the victim Pods and set Pod.Status.NominatedNodeName.
If the preemption goroutine fails at some point, it reverts the nomination via AddNominatedPod with clearNominatedNode.
If the preemption goroutine completes, the preemption plugin ungates the Pod; the Pod is queued back to the queue with the Pod/delete event, and is (hopefully) scheduled on the nominated Node in the next scheduling cycle.
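For illustration, here is a minimal, self-contained sketch of the gating and completion flow described above. It is not the real framework code: gate, preEnqueue, and runPreemption are hypothetical stand-ins for the PreEnqueue extension point and the preemption goroutine.

```go
package asyncpreemption

import (
	"context"
	"sync"
)

// gate tracks Pods whose preemption goroutine is still in flight, mimicking
// the PreEnqueue-based gating described above. All names are hypothetical.
type gate struct {
	mu       sync.Mutex
	inFlight map[string]bool // keyed by pod name (namespace omitted for brevity)
}

func newGate() *gate {
	return &gate{inFlight: map[string]bool{}}
}

// preEnqueue mimics the PreEnqueue extension point: it returns false (i.e.,
// the Pod stays gated) while the Pod's preemption API calls are still running.
func (g *gate) preEnqueue(pod string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return !g.inFlight[pod]
}

// runPreemption is the body of the preemption goroutine: delete the victims
// via the API, then ungate the Pod so it is queued back and retried. On
// failure, the in-memory nomination is reverted instead.
func (g *gate) runPreemption(ctx context.Context, pod string,
	deleteVictims func(context.Context) error, revertNomination func()) {

	g.mu.Lock()
	// In the real flow the Pod would already be gated before the goroutine
	// starts; setting the flag here keeps the sketch short.
	g.inFlight[pod] = true
	g.mu.Unlock()

	defer func() {
		g.mu.Lock()
		delete(g.inFlight, pod) // ungate: the Pod can be retried by the next scheduling cycle
		g.mu.Unlock()
	}()

	if err := deleteVictims(ctx); err != nil {
		revertNomination() // the Pod will be retried without the stale nomination
	}
}
```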
Thanks to the nomination at PostFilter, this new asynchronous preemption shouldn't create any race condition between scheduling cycles.
Here, I'll discuss what happens in each scenario and make sure there's nothing to worry about.
Let's say pod1's preemption (targeting node1) is in progress in the preemption goroutine, and the next scheduling cycle is scheduling pod2.
Pod2's scheduling is successful (pod2 is equal or lower priority than pod1)
As described above, pod1's PostFilter nominates pod1 for node1.
In the scheduling cycle, the scheduler takes nominated Pods with equal or higher priority than pod2 into consideration; this means pod2 won't rob pod1 of its place on node1.
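Conceptually, the nominated-Pod handling in the scheduling cycle behaves like the rough sketch below. This is not the framework's actual code; podInfo and nominatedPodsToAssume are hypothetical:

```go
package asyncpreemption

// podInfo is a minimal, hypothetical representation of a Pod for this sketch.
type podInfo struct {
	Name          string
	Priority      int32
	NominatedNode string
}

// nominatedPodsToAssume returns the nominated Pods that the scheduling cycle
// for `pod` must treat as if they were already running on `node`: only Pods
// nominated for that node with equal or higher priority than `pod`.
func nominatedPodsToAssume(node string, pod podInfo, nominated []podInfo) []podInfo {
	var assumed []podInfo
	for _, p := range nominated {
		if p.NominatedNode == node && p.Priority >= pod.Priority {
			assumed = append(assumed, p)
		}
	}
	return assumed
}
```

With pod1 nominated for node1, an equal or lower priority pod2 gets pod1 back from this check and cannot take its place, while a higher priority pod2 does not, which is the next scenario.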
Pod2's scheduling is successful (pod2 is higher priority than pod1)
Even though pod1 is nominated for node1, the scheduler allows pod2 to take node1, where pod1's preemption has made space.
Then, when pod1 comes back to the scheduling cycle, it may not be able to land on node1 because pod2 is scheduled there now. This happens with both the current scheduler and this KEP's scheduler, so there is no issue here.
Pod2's scheduling fails and starts the preemption (pod2 is equal or lower priority than pod1)
The preemption also takes nominated Pods into consideration when calculating the preemption targets.
Therefore, if, coincidentally, the preemptions for pod1 and pod2 select the same Node, the preemption for pod2 should decide to make space for both pod1 and pod2.
So, we don't have to worry about two preemptions targeting the same Node causing any issue.
Pod2's scheduling fails and starts the preemption (pod2 is higher priority than pod1)
Pod2's preemption ignores pod1's nomination for node1.
If, coincidentally, the preemptions for pod1 and pod2 select the same Node, the preemption for pod2 may just select the same victims as pod1's, and when pod1 comes back to the scheduling cycle, it (probably) cannot be scheduled on node1 because of pod2.
But this isn't an issue because the final result is exactly the same as with the current scheduler: pod1 preempts some Pods on node1, then pod2's scheduling starts and pod2 takes node1, and when pod1 comes back to the scheduling cycle, it (probably) cannot be scheduled on node1 because of pod2.
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
- /pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go: 2024-09-07 - 85.4%
- /pkg/scheduler/framework/preemption/preemption.go: 2024-09-07 - 27.2%
Because the coverage for preemption.go is pretty low, we have to improve the testing there before the change for this KEP.
We have to add integration tests to make sure the asynchronous preemption is performed appropriately, especially in the scenarios listed in Consideration to race condition.
We'll add test cases in which multiple Pods trigger preemptions.
- Feature implemented behind a feature flag
- All tests mentioned in Test Plan are implemented.
- Gather feedback from users and fix reported bugs.
- Change the feature flag to be enabled by default.
- Gather feedback from users and fix reported bugs.
Upgrade
During the alpha period, users have to enable the feature gate SchedulerAsyncPreemption to opt in to this feature.
This is a purely in-memory feature for kube-scheduler, so no other special actions are required outside the scheduler.
Downgrade
Users need to disable the feature gate.
This is a purely in-memory feature of kube-scheduler, and hence there is no version skew strategy.
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: SchedulerAsyncPreemption
  - Components depending on the feature gate: kube-scheduler
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node?
No. The feature is a performance optimization that affects every Pod that needs preemption, but there are no functional changes: the result of the preemption is the same. But, as mentioned in the risks section, when kube-apiserver is unstable, the scheduling results could be different.
Yes. The feature can be disabled in Alpha and Beta versions by restarting kube-scheduler with the feature-gate off.
The scheduler starts to run the preemption API calls in PostFilter asynchronously again.
Given it's a purely in-memory feature and enablement/disablement requires restarting the component (to change the value of the feature flag), having the feature tests is enough.
A partial failure of the rollout cannot happen because the scheduler is the only component rolling out this feature. But if upgrading the scheduler itself fails somehow, new Pods won't be scheduled anymore. If there's a bug in the preemption caused by this enhancement, and downgrading the scheduler also fails somehow, running Pods could be affected, for example, by being deleted by mistake (depending on the bug).
Something may be going wrong with the preemption if goroutines_duration_seconds{operation=preemption} takes too long.
Also, if preemption_attempts_total increases too much, that might also imply some bugs around the preemption.
No. This feature is an in-memory feature of the scheduler, so a simple upgrade and an upgrade->downgrade->upgrade path behave the same.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
This feature is used during all Pods' preemption if the feature gate is enabled.
You can see whether the scheduler triggers any preemptions via the preemption_attempts_total metric.
You can find Pods that have triggered the preemption by referring to .Status.NominatedNodeName, and Pods that have been preempted by referring to their condition with type: DisruptionTarget and reason: PreemptionByScheduler.
- API .status
  - Other field: If .Status.NominatedNodeName of Pods is non-empty, they have experienced the preemption running asynchronously.
- The failure rate of the preemption goroutines (goroutines_execution_total{result=error, operation=preemption} / goroutines_execution_total{operation=preemption}) should be < 0.01.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: goroutines_execution_total{result=error, operation=preemption}
  - Components exposing the metric: kube-scheduler
Are there any missing metrics that would be useful to have to improve observability of this feature?
- goroutines_duration_seconds (w/ label: operation): to observe how long each preemption goroutine takes to complete.
- goroutines_execution_total (w/ labels: operation, result): to observe how many preemption goroutines have failed.
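For illustration only, these two metrics could be defined and recorded with prometheus/client_golang roughly as in the sketch below. The metric names and labels follow this KEP, but the actual definitions in kube-scheduler's metrics package may differ:

```go
package asyncpreemption

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical metric definitions following the names in this KEP.
var (
	goroutinesDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "goroutines_duration_seconds",
			Help: "How long each scheduler goroutine (e.g. preemption) takes to complete.",
		},
		[]string{"operation"},
	)
	goroutinesExecution = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "goroutines_execution_total",
			Help: "Number of scheduler goroutines, partitioned by operation and result.",
		},
		[]string{"operation", "result"},
	)
)

// observePreemptionGoroutine wraps a preemption goroutine body and records both metrics.
func observePreemptionGoroutine(run func() error) {
	start := time.Now()
	result := "success"
	if err := run(); err != nil {
		result = "error"
	}
	goroutinesDuration.WithLabelValues("preemption").Observe(time.Since(start).Seconds())
	goroutinesExecution.WithLabelValues("preemption", result).Inc()
}
```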
No.
No. Just move the existing API calls from PostFilter into goroutines.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
The scheduler starts to run more goroutines in the preemption plugin, so CPU usage may go up.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
In such a case, the API calls for the preemption fail in the preemption goroutines. But then the scheduler essentially cannot perform anything, not just the preemption, because it cannot get objects, bind Pods to Nodes, etc.
Nothing.
- Sep 07, 2024: The initial KEP is submitted.
- Nov 08, 2024: The implementation PR is merged.
- Feb 03, 2025: The PR to promote it to beta is submitted.
To make this kind of scenario easier to implement for other plugins, we could implement a new extension point, AsyncPostFilter.
We would calculate the preemption target and nominate the Pod for the Node at PostFilter, and then AsyncPostFilter would run asynchronously, in which the preemption plugin makes the API calls for the preemption.
The Pod wouldn't be queued back to the queue until AsyncPostFilter is done.
We don't go with this idea because we can implement the async preemption without introducing a new extension point. Adding a new extension point unnecessarily might be something we regret in the future, and we can still introduce it later if it turns out to be really necessary.