- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This KEP aims to add a `.spec.schedulingGates` field to Pod's API, to mark a Pod's scheduling readiness. Integrators can mutate this field to signal to the scheduler when a Pod is ready for scheduling.
Pods are currently considered ready for scheduling as soon as they are created. The Kubernetes scheduler does its due diligence to find nodes to place all pending Pods. However, in real-world cases, some Pods may stay in a "miss-essential-resources" state for a long period. These Pods churn the scheduler (and downstream integrators like Cluster Autoscaler) unnecessarily.

Lacking a knob to flag Pods as scheduling-paused/ready wastes scheduling cycles on retrying Pods that are determined to be unschedulable. As a result, it delays the scheduling of other Pods and lowers the overall scheduling throughput. Moreover, it imposes restrictions on vendors developing in-house features (such as hierarchical quota) natively.
On the other hand, a condition `{type:PodScheduled, reason:Unschedulable}` works as canonical info exposed by the scheduler to guide downstream integrators (e.g., Cluster Autoscaler) to supplement cluster resources. Our solution should not break this contract.
This proposal describes APIs and mechanics to allow users/controllers to control when a pod is ready to be considered for scheduling.
- Define an API to mark Pods as scheduling-paused/ready.
- Design a new Enqueue extension point to customize Pod's queueing behavior.
- Not mark scheduling-paused Pods as `Unschedulable` by updating their `PodScheduled` condition.
- A default enqueue plugin to honor the new API semantics.
- Enforce updating the Pod's conditions to expose more context for scheduling-paused Pods.
- Focus on in-house use-cases of the Enqueue extension point.
- Permission control on individual `schedulingGate` entries (via fine-grained permissions)
We propose a new field `.spec.schedulingGates` in the Pod API. The field defaults to nil (equivalent to an empty list).

Pods carrying a non-nil `.spec.schedulingGates` will be "parked" in the scheduler's internal unschedulablePods pool, and only get tried when the field is mutated to nil.

Practically, this field can be initialized by a single client and/or multiple mutating webhooks, and afterwards each gate entry can be removed by external integrators when certain criteria are met.
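As an illustration, a gated Pod would look like the following manifest (the gate names and the container image here are purely illustrative, not part of the proposal):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  schedulingGates:
  - name: example.com/quota-check      # hypothetical gate owned by a quota controller
  - name: example.com/placement-ready  # hypothetical gate owned by an orchestrator
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6
```

Once each integrator removes the gate it owns (e.g., via an update or patch request), the scheduler starts considering the Pod.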
As an orchestrator developer, such as a dynamic quota manager, I have the full picture to know when Pods are scheduling-ready; therefore, I want an API to signal to kube-scheduler when to consider a Pod for scheduling. The pattern for this story would be to use a mutating webhook to force creating pods in a "not-ready to schedule" state that the custom orchestrator changes to ready at a later time based on its own evaluations.
This story is an extension of the previous one in that more than one custom orchestrator could be deployed on the same cluster; therefore, they want an API that enables establishing an agreement on when a Pod is considered ready for scheduling.
As an advanced scheduler developer, I want to compose a series of scheduler PreEnqueue plugins to guard the schedulability of my Pods. This enables splitting custom enqueue admission logic into several building blocks, and thus offers the most flexibility. Meanwhile, some plugins need to visit in-memory state (like waiting Pods) that is only accessible via the scheduler framework.
A custom workload orchestrator may wish to modify the Pod prior to consideration for scheduling, without having to fork or alter the workload controller. The orchestrator may wish to make time-varying (post-creation) decisions on Pod scheduling, perhaps to preserve scheduling constraints, avoid disruption, or prevent co-existence.
- Restricted state transition: The `schedulingGates` field can be initialized only when a Pod is created (either by the client, or mutated during admission). After creation, each `schedulingGate` can be removed in arbitrary order, but addition of a new scheduling gate is disallowed. To ensure consistency, a scheduled Pod must always have empty `schedulingGates`. This means that a client (an administrator or custom controller) cannot create a Pod that has `schedulingGates` populated in any way if that Pod also specifies a `spec.nodeName`. In this case, the API Server will return a validation error.

  |  | non-nil `schedulingGates` ⏸️ | nil `schedulingGates` ▶️ |
  | --- | --- | --- |
  | unscheduled Pod (nil `nodeName`) | ✅ create<br>❌ update | ✅ create<br>✅ update |
  | scheduled Pod (non-nil `nodeName`) | ❌ create<br>❌ update | ✅ create<br>✅ update |
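The state-transition rules above can be sketched as validation logic. This is a minimal illustration only, not the real API-server code; `PodSpec`, `validateCreate`, and `validateUpdate` below are simplified stand-ins:

```go
package main

import "fmt"

// Simplified stand-ins for the Pod API types; not the real k8s.io/api definitions.
type PodSchedulingGate struct{ Name string }

type PodSpec struct {
	NodeName        string
	SchedulingGates []PodSchedulingGate
}

// validateCreate enforces the create-time rule: a Pod bound to a node
// (non-empty nodeName) must not carry scheduling gates.
func validateCreate(spec PodSpec) error {
	if spec.NodeName != "" && len(spec.SchedulingGates) > 0 {
		return fmt.Errorf("spec.schedulingGates must be empty when spec.nodeName is set")
	}
	return nil
}

// validateUpdate enforces the update-time rule: gates may only be removed,
// never added, after creation.
func validateUpdate(oldSpec, newSpec PodSpec) error {
	old := map[string]bool{}
	for _, g := range oldSpec.SchedulingGates {
		old[g.Name] = true
	}
	for _, g := range newSpec.SchedulingGates {
		if !old[g.Name] {
			return fmt.Errorf("cannot add scheduling gate %q after creation", g.Name)
		}
	}
	return nil
}

func main() {
	bound := PodSpec{NodeName: "node-1", SchedulingGates: []PodSchedulingGate{{Name: "example.com/foo"}}}
	fmt.Println(validateCreate(bound) != nil) // gated + bound Pod is rejected at create

	oldSpec := PodSpec{SchedulingGates: []PodSchedulingGate{{Name: "example.com/foo"}}}
	newSpec := PodSpec{SchedulingGates: []PodSchedulingGate{{Name: "example.com/foo"}, {Name: "example.com/bar"}}}
	fmt.Println(validateUpdate(oldSpec, newSpec) != nil) // adding a gate is rejected
	fmt.Println(validateUpdate(oldSpec, PodSpec{}) == nil) // removing gates is allowed
}
```

The real validation lives in `pkg/apis/core/validation`; this sketch only captures the two rules stated above.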
- New field disabled in Alpha but not the scheduler extension: At a high level, this feature contains the following parts:
  - a. the Pod's spec, feature gate, and other misc bits describing the field `schedulingGates`
  - b. the `SchedulingGates` scheduler plugin
  - c. the underlying scheduler framework, i.e., the new PreEnqueue extension point

  If the feature is disabled, parts `a` and `b` are disabled, but `c` is enabled anyway. This implies a scheduler plugin developer can leverage part `c` alone to craft their own `a'` and `b'` to accomplish their own feature set.
- New phase literal in kubectl: To provide better UX, we're going to add a new phase literal `SchedulingPaused` to the "phase" column of `kubectl get pod`. This new literal indicates that the Pod is scheduling-paused.
- The scheduler doesn't actively clear a Pod's `schedulingGates` field. This means if some controller is ill-implemented, some Pods may stay in the Pending state incorrectly. If you notice a Pod staying in the Pending state for a long time and it carries non-nil `schedulingGates`, you can find out which component owns the gate(s) (via `.spec.managedFields`) and report the symptom to the component owner.
- Faulty controllers may forget to remove a Pod's `schedulingGates`, which results in a large number of unschedulable Pods. In Alpha, we don't limit the number of unschedulable Pods caused by potentially faulty controllers. We will evaluate necessary options in the future to mitigate potential abuse.
- End-users may be confused by the absence of scheduling events in the output of `kubectl describe pod xyz`. We will provide detailed documentation, along with metrics, tooling (kubectl), and/or events.
A new API field `SchedulingGates` will be added to Pod's spec:
```go
type PodSpec struct {
	// Each scheduling gate represents a particular scenario the scheduling is blocked upon.
	// Scheduling is triggered only when SchedulingGates is empty.
	// In the future, we may impose permission mechanics to restrict which controller can mutate
	// which scheduling gate. This depends on a yet-to-be-implemented fine-grained
	// permission mechanism (https://docs.google.com/document/d/11g9nnoRFcOoeNJDUGAWjlKthowEVM3YGrJA3gLzhpf4)
	// and needs to be consistent with how finalizers/liens work today.
	SchedulingGates []PodSchedulingGate
}
```
```go
type PodSchedulingGate struct {
	// Name of the scheduling gate.
	// Each scheduling gate must have a unique name field.
	Name string
}
```
In the scheduler's ComponentConfig API, we'll add a new type of extension point called `Enqueue`:
```go
type Plugins struct {
	// ...
	Enqueue PluginSet
}
```
Inside the scheduler, an internal queue called "activeQ" is designed to store all ready-to-schedule Pods. In this KEP, we're going to wire the aforementioned `Enqueue` extension into the logic that decides whether or not to add Pods to "activeQ". Specifically, prior to adding a Pod to "activeQ", the scheduler iterates over the registered Enqueue plugins. An Enqueue plugin must implement the `EnqueuePlugin` interface to return a `Status` telling the scheduler whether this Pod can be admitted or not:
```go
// EnqueuePlugin is an interface that must be implemented by "EnqueuePlugin" plugins.
// These plugins are called prior to adding Pods to activeQ.
// Note: an enqueue plugin is expected to be lightweight and efficient, so it's not expected to
// involve expensive calls like accessing external endpoints; otherwise it'd block other
// Pods' enqueuing in event handlers.
type EnqueuePlugin interface {
	Plugin
	PreEnqueue(ctx context.Context, state *CycleState, p *v1.Pod) *Status
}
```
A Pod can be moved to activeQ only when all Enqueue plugins return `Success`. Otherwise, it's moved to and parked in the internal unschedulable Pods pool. The pseudo-code is roughly like this:
```go
// pseudo-code
func RunEnqueuePlugins() *Status {
	for _, pl := range enqueuePlugins {
		if status := pl.PreEnqueue(); !status.IsSuccess() {
			// Logic: move Pod to the unschedulable pod pool.
			return status
		}
	}
	// Logic: move Pod to activeQ.
}
```
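The gate check itself can be sketched with simplified stand-in types. These are not the real `framework` interfaces; `preEnqueue` here is a toy model of the decision, under the assumption (from this KEP) that a Pod is admitted only when it carries no scheduling gates:

```go
package main

import "fmt"

// Simplified stand-ins for scheduler framework types; the real interfaces
// live in k8s.io/kubernetes/pkg/scheduler/framework.
type Pod struct {
	SchedulingGates []string
}

type Status struct{ code string }

// IsSuccess treats a nil *Status as Success, mirroring the framework convention.
func (s *Status) IsSuccess() bool { return s == nil || s.code == "Success" }

// preEnqueue models the gate check: a Pod is admitted to activeQ only when
// it carries no scheduling gates; otherwise it is marked Unschedulable.
func preEnqueue(p *Pod) *Status {
	if len(p.SchedulingGates) == 0 {
		return nil // Success
	}
	return &Status{code: "Unschedulable"}
}

func main() {
	gated := &Pod{SchedulingGates: []string{"example.com/quota-check"}}
	ready := &Pod{}
	fmt.Println(preEnqueue(gated).IsSuccess()) // false: parked in unschedulablePods
	fmt.Println(preEnqueue(ready).IsSuccess()) // true: moved to activeQ
}
```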
To honor the semantics of the new `.spec.schedulingGates` API, a default Enqueue plugin will be introduced. It simply returns `Success` or `Unschedulable` depending on the incoming Pod's `schedulingGates` field.
This `DefaultEnqueue` plugin will also implement the `EventsToRegister` function to claim it's an `EnqueueExtension` object, so that the scheduler can move the Pod back to activeQ properly:
```go
func (pl *DefaultEnqueue) EventsToRegister() []framework.ClusterEvent {
	return []framework.ClusterEvent{
		{Resource: framework.Pod, ActionType: framework.Update},
	}
}
```
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
None.
All core changes must be covered by unit tests, on both the API and scheduler sides:

- API: API validation and strategy tests (`pkg/registry/core/pod`) to verify disabled fields when the feature gate is on/off.
- Scheduler: Core scheduling changes, which include the Enqueue config API's validation, defaulting, and integration, and its implementation.

In particular, update existing UTs or add new UTs in the following packages:

- `pkg/api/pod`: 10/3/2022 - 70.1%
- `pkg/apis/core/validation`: 10/3/2022 - 82.3%
- `pkg/registry/core/pod`: 10/3/2022 - 60.4%
- `cmd/kube-scheduler/app`: 10/3/2022 - 32.9%
- `pkg/scheduler`: 10/3/2022 - 75.9%
- `pkg/scheduler/framework/runtime`: 10/3/2022 - 81.9%
The following scenarios need to be covered in integration tests:
- Feature gate's enabling/disabling
- Configure an Enqueue plugin via MultiPoint and the Enqueue extension point
- Pod carrying nil `.spec.schedulingGates` functions as before
- Pod carrying non-nil `.spec.schedulingGates` will be moved to the unschedulablePods pool
- Disable `flushUnschedulablePodsLeftover()`, then verify a Pod with non-nil `.spec.schedulingGates` can be moved back to activeQ when `.spec.schedulingGates` is all cleared
- Ensure no significant performance degradation

- `test/integration/scheduler/queue_test.go#TestSchedulingGates`: k8s-triage
- `test/integration/scheduler/plugins/plugins_test.go#TestPreEnqueuePlugin`: k8s-triage
- `test/integration/scheduler_perf/scheduler_perf_test.go`: will add in Beta. (https://storage.googleapis.com/k8s-triage/index.html?test=BenchmarkPerfScheduling)
An e2e test was created in Alpha with the following sequences:
- Create a Pod with non-nil `.spec.schedulingGates`.
- Wait for 15 seconds to ensure (and then verify) it did not get scheduled.
- Clear the Pod's `.spec.schedulingGates` field.
- Wait for 5 seconds for the Pod to be scheduled; otherwise fail the e2e test.

In Beta, it will be enabled by default in the k8s-triage dashboard.
- Feature disabled by default.
- Unit and integration tests completed and passed.
- API strategy test (`pkg/registry/*`) to verify disabled fields when the feature gate is on/off.
- Additional tests are in Testgrid and linked in the KEP.
- Determine whether any additional state is required per gate.
- Feature enabled by default.
- Gather feedback from developers and out-of-tree plugins.
- Benchmark tests passed, and there is no performance degradation.
- Update documents to reflect the changes.
- Identify whether gates can be added post-creation.
- Fix all reported bugs.
- Feature enabled and cannot be disabled. All feature gate guarded logic are removed.
- Update documents to reflect the changes.
- Upgrade
  - Enable the feature gate in both the API Server and the Scheduler, and gate a Pod's scheduling readiness by setting non-nil `.spec.schedulingGates`. Next, remove each scheduling gate when its readiness criterion is met.
- Downgrade
  - Disable the feature gate in both the API Server and the Scheduler, so that a previously configured `.spec.schedulingGates` value will be ignored.
  - However, the `.spec.schedulingGates` value of a Pod is preserved if it was previously configured; otherwise it gets silently dropped.
The skew between the kubelet and control-plane components is not impacted. If the API Server is at vCurrent with the feature enabled, while the scheduler is at vCurrent-n where the feature is not supported, controllers manipulating the new API field won't get their Pods scheduling gated. Pod scheduling will behave as if this feature were not introduced.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `PodSchedulingReadiness`
  - Components depending on the feature gate: kube-scheduler, kube-apiserver
No. It's a new API field, so no default behavior will be impacted.
Yes. If disabled, the kube-apiserver will start rejecting mutations of a Pod's `.spec.schedulingGates`.

Mutations of a Pod's `.spec.schedulingGates` will be respected again.
Appropriate tests have been added in the integration tests. See Integration tests for more details.
It shouldn't impact already running workloads. It's an opt-in feature, and users need to set the `.spec.schedulingGates` field to use this feature.
When this feature is disabled by the feature flag, an already created Pod's `.spec.schedulingGates` field is preserved; however, a newly created Pod's `.spec.schedulingGates` field is silently dropped.
A rollback might be considered if the metric `scheduler_pending_pods{queue="gated"}` stays at a high watermark for a long time, since this may indicate that some controllers are not properly removing the scheduling gates, which causes Pods to stay in the Pending state.

Another indicator for rollback is the 90th-percentile value of the metric `scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"}` steadily exceeding 100ms.
It will be tested manually prior to beta launch.
<> Add detailed scenarios and result here, and cc @wojtek-t. <>
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
A new "queue" type gated
is added to the existing metric scheduler_pending_pods
to distinguish
unschedulable Pods:
scheduler_pending_pods{queue="unschedulable"}
(existing): scheduler tried but cannot find any Node to host the Podscheduler_pending_pods{queue="gated"}
(new): scheduler respect the Pod's presentschedulingGates
and hence not schedule it
The metric scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"}
gives a histogram
to show the Nth percentile value how SchedulingGates plugin is executed.
Moreover, to explicitly indicate a Pod's scheduling-unready state, a condition `{type:PodScheduled, reason:SchedulingGated}` is introduced.
- observe a non-zero value for the metric `scheduler_pending_pods{queue="gated"}`
- observe entries for the metric `scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"}`
- observe a non-empty value in a Pod's `.spec.schedulingGates` field
- API `.spec`
  - Other field: `schedulingGates`
- Events
  - Event Type: PodScheduled
  - Event Status: False
  - Event Reason: SchedulingGated
  - Event Message: Scheduling is blocked due to non-empty scheduling gates
N/A.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: `scheduler_pending_pods{queue="gated"}`
  - Metric name: `scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"}`
  - Components exposing the metric: kube-scheduler
Are there any missing metrics that would be useful to have to improve observability of this feature?
N/A.
N/A.
N/A.
The feature itself doesn't generate API calls. However, it's expected that the API Server will receive additional update/patch requests from external controllers to mutate the scheduling gates.
No.
No.
- No for existing API objects that don't use this feature.
- For API objects that use this feature:
  - API type: Pod
  - Estimated increase in size: the new field `.spec.schedulingGates` adds about ~64 bytes (in the case of 2 scheduling gates)
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
During the downtime of API server and/or etcd:
- Running workloads that don't need to remove their scheduling gates function well.
- Running workloads that need to update their scheduling gates will stay in scheduling gated state as API requests will be rejected.
In a highly-available cluster, if there are skewed API Servers, some update requests may get accepted and some may get rejected.
N/A.
- 2022-09-16: Initial KEP
- 2023-01-14: Graduate the feature to Beta
Define a boolean field `.spec.schedulingPaused`. Its value is optionally initialized to `True` to indicate the Pod is not scheduling-ready, and flipped to `False` (by a controller) afterwards to trigger this Pod's scheduling cycle.

This approach is not chosen because it cannot support multiple independent controllers each controlling a specific aspect of a Pod's scheduling readiness.