- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This KEP aims to add a `.spec.schedulingGates` field to Pod's API, to mark a Pod's scheduling readiness. Integrators can mutate this field to signal to the scheduler when a Pod is ready for scheduling.
Pods are currently considered ready for scheduling as soon as they are created. The Kubernetes scheduler does its due diligence to find nodes to place all pending Pods. However, in real-world cases, some Pods may stay in a "miss-essential-resources" state for a long period. These Pods churn the scheduler (and downstream integrators like Cluster Autoscaler) unnecessarily.

Lacking a knob to flag Pods as scheduling-paused/ready wastes scheduling cycles on retrying Pods that are determined to be unschedulable. As a result, it delays the scheduling of other Pods and lowers the overall scheduling throughput. Moreover, it imposes restrictions on vendors developing in-house features (such as hierarchical quota) natively.
On the other hand, a condition `{type:PodScheduled, reason:Unschedulable}` works as canonical info exposed by the scheduler to guide downstream integrators (e.g., Cluster Autoscaler) to supplement cluster resources. Our solution should not break this contract.
This proposal describes APIs and mechanics to allow users/controllers to control when a pod is ready to be considered for scheduling.
- Define an API to mark Pods as scheduling-paused/ready.
- Design a new Enqueue extension point to customize Pod's queueing behavior.
- Not mark scheduling-paused Pods as `Unschedulable` by updating their `PodScheduled` condition.
- A default enqueue plugin to honor the new API semantics.
- Enforce updating the Pod's conditions to expose more context for scheduling-paused Pods.
- Focus on in-house use-cases of the Enqueue extension point.
- Permission control on individual `schedulingGate` entries (via fine-grained permissions)
We propose a new field `.spec.schedulingGates` in the Pod API. The field defaults to nil (equivalent to an empty list).

Pods carrying a non-nil `.spec.schedulingGates` will be "parked" in the scheduler's internal unschedulablePods pool, and only get tried when the field is mutated to nil.

Practically, this field can be initialized by a single client and/or multiple mutating webhooks, and afterwards each gate entry can be removed by external integrators when certain criteria are met.
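As an illustration, a gated Pod would look like the following manifest (the gate names and the container image here are purely illustrative, not part of the proposal):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  schedulingGates:
  - name: example.com/quota-check      # hypothetical gate owned by a quota controller
  - name: example.com/placement-ready  # hypothetical gate owned by an orchestrator
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6
```

Once each integrator removes the gate it owns (e.g., via an update or patch request), the scheduler starts considering the Pod.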
As an orchestrator developer, such as a dynamic quota manager, I have the full picture to know when Pods are scheduling-ready; therefore, I want an API to signal to kube-scheduler when to consider a Pod for scheduling. The pattern for this story would be to use a mutating webhook to force creating pods in a "not-ready to schedule" state that the custom orchestrator changes to ready at a later time based on its own evaluations.
This story is an extension of the previous one in that more than one custom orchestrator could be deployed on the same cluster; therefore, they want an API that enables establishing an agreement on when a Pod is considered ready for scheduling.
As an advanced scheduler developer, I want to compose a series of scheduler PreEnqueue plugins to guard the schedulability of my Pods. This enables splitting custom enqueue admission logic into several building blocks, and thus offers the most flexibility. Meanwhile, some plugins need to visit in-memory state (like waiting Pods) that is only accessible via the scheduler framework.
A custom workload orchestrator may wish to modify the Pod prior to consideration for scheduling, without having to fork or alter the workload controller. The orchestrator may wish to make time-varying (post-creation) decisions on Pod scheduling, perhaps to preserve scheduling constraints, avoid disruption, or prevent co-existence.
- Restricted state transition: The `schedulingGates` field can be initialized only when a Pod is created (either by the client, or mutated during admission). After creation, each `schedulingGate` can be removed in arbitrary order, but addition of a new scheduling gate is disallowed. To ensure consistency, a scheduled Pod must always have empty `schedulingGates`. This means that a client (an administrator or custom controller) cannot create a Pod that has `schedulingGates` populated in any way if that Pod also specifies a `spec.nodeName`. In this case, the API Server will return a validation error.

  |  | non-nil `schedulingGates` ⏸️ | nil `schedulingGates` ▶️ |
  | --- | --- | --- |
  | unscheduled Pod (nil `nodeName`) | ✅ create<br>❌ update | ✅ create<br>✅ update |
  | scheduled Pod (non-nil `nodeName`) | ❌ create<br>❌ update | ✅ create<br>✅ update |
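The state-transition rules above can be sketched as validation logic. This is a minimal illustration only, not the real API-server code; `PodSpec`, `validateCreate`, and `validateUpdate` below are simplified stand-ins:

```go
package main

import "fmt"

// Simplified stand-ins for the Pod API types; not the real k8s.io/api definitions.
type PodSchedulingGate struct{ Name string }

type PodSpec struct {
	NodeName        string
	SchedulingGates []PodSchedulingGate
}

// validateCreate enforces the create-time rule: a Pod bound to a node
// (non-empty nodeName) must not carry scheduling gates.
func validateCreate(spec PodSpec) error {
	if spec.NodeName != "" && len(spec.SchedulingGates) > 0 {
		return fmt.Errorf("spec.schedulingGates must be empty when spec.nodeName is set")
	}
	return nil
}

// validateUpdate enforces the update-time rule: gates may only be removed,
// never added, after creation.
func validateUpdate(oldSpec, newSpec PodSpec) error {
	old := map[string]bool{}
	for _, g := range oldSpec.SchedulingGates {
		old[g.Name] = true
	}
	for _, g := range newSpec.SchedulingGates {
		if !old[g.Name] {
			return fmt.Errorf("cannot add scheduling gate %q after creation", g.Name)
		}
	}
	return nil
}

func main() {
	bound := PodSpec{NodeName: "node-1", SchedulingGates: []PodSchedulingGate{{Name: "example.com/foo"}}}
	fmt.Println(validateCreate(bound) != nil) // gated + bound Pod is rejected at create

	oldSpec := PodSpec{SchedulingGates: []PodSchedulingGate{{Name: "example.com/foo"}}}
	newSpec := PodSpec{SchedulingGates: []PodSchedulingGate{{Name: "example.com/foo"}, {Name: "example.com/bar"}}}
	fmt.Println(validateUpdate(oldSpec, newSpec) != nil) // adding a gate is rejected
	fmt.Println(validateUpdate(oldSpec, PodSpec{}) == nil) // removing gates is allowed
}
```

The real validation lives in `pkg/apis/core/validation`; this sketch only captures the two rules stated above.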
- New field disabled in Alpha but not the scheduler extension: At a high level, this feature contains the following parts:
  - a. the Pod's spec, feature gate, and other misc bits describing the field `schedulingGates`
  - b. the `SchedulingGates` scheduler plugin
  - c. the underlying scheduler framework, i.e., the new PreEnqueue extension point

  If the feature is disabled, parts `a` and `b` are disabled, but `c` is enabled anyway. This implies a scheduler plugin developer can leverage part `c` alone to craft their own `a'` and `b'` to accomplish their own feature set.
- New phase literal in kubectl: To provide better UX, we're going to add a new phase literal `SchedulingPaused` to the "phase" column of `kubectl get pod`. This new literal indicates that the Pod is scheduling-paused.
- The scheduler doesn't actively clear a Pod's `schedulingGates` field. This means if some controller is ill-implemented, some Pods may stay in the Pending state incorrectly. If you notice a Pod staying in the Pending state for a long time and it carries non-nil `schedulingGates`, you can find out which component owns the gate(s) (via `.spec.managedFields`) and report the symptom to the component owner.
- Faulty controllers may forget to remove a Pod's `schedulingGates`, which results in a large number of unschedulable Pods. In Alpha, we don't limit the number of unschedulable Pods caused by potentially faulty controllers. We will evaluate necessary options in the future to mitigate potential abuse.
- End-users may be confused by the absence of scheduling events in the output of `kubectl describe pod xyz`. We will provide detailed documentation, along with metrics, tooling (kubectl), and/or events.
A new API field `SchedulingGates` will be added to Pod's spec:
```go
type PodSpec struct {
	// Each scheduling gate represents a particular scenario the scheduling is blocked upon.
	// Scheduling is triggered only when SchedulingGates is empty.
	// In the future, we may impose permission mechanics to restrict which controller can mutate
	// which scheduling gate. This depends on a yet-to-be-implemented fine-grained
	// permission mechanism (https://docs.google.com/document/d/11g9nnoRFcOoeNJDUGAWjlKthowEVM3YGrJA3gLzhpf4)
	// and needs to be consistent with how finalizers/liens work today.
	SchedulingGates []PodSchedulingGate
}
```
```go
type PodSchedulingGate struct {
	// Name of the scheduling gate.
	// Each scheduling gate must have a unique name field.
	Name string
}
```
In the scheduler's ComponentConfig API, we'll add a new type of extension point called `Enqueue`:
```go
type Plugins struct {
	// ...
	Enqueue PluginSet
}
```
Inside the scheduler, an internal queue called "activeQ" is designed to store all ready-to-schedule Pods. In this KEP, we're going to wire the aforementioned `Enqueue` extension into the logic that decides whether or not to add Pods to "activeQ". Specifically, prior to adding a Pod to "activeQ", the scheduler iterates over the registered Enqueue plugins. An Enqueue plugin must implement the `EnqueuePlugin` interface to return a `Status` telling the scheduler whether this Pod can be admitted or not:
```go
// EnqueuePlugin is an interface that must be implemented by "EnqueuePlugin" plugins.
// These plugins are called prior to adding Pods to activeQ.
// Note: an enqueue plugin is expected to be lightweight and efficient, so it's not expected to
// involve expensive calls like accessing external endpoints; otherwise it'd block other
// Pods' enqueuing in event handlers.
type EnqueuePlugin interface {
	Plugin
	PreEnqueue(ctx context.Context, state *CycleState, p *v1.Pod) *Status
}
```
A Pod can be moved to activeQ only when all Enqueue plugins return `Success`. Otherwise, it's moved to and parked in the internal unschedulable Pods pool. The pseudo-code is roughly like this:
```go
// pseudo-code
func RunEnqueuePlugins() *Status {
	for _, pl := range enqueuePlugins {
		if status := pl.PreEnqueue(); !status.IsSuccess() {
			// Logic: move Pod to the unschedulable pod pool.
			return status
		}
	}
	// Logic: move Pod to activeQ.
}
```
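The gate check itself can be sketched with simplified stand-in types. These are not the real `framework` interfaces; `preEnqueue` here is a toy model of the decision, under the assumption (from this KEP) that a Pod is admitted only when it carries no scheduling gates:

```go
package main

import "fmt"

// Simplified stand-ins for scheduler framework types; the real interfaces
// live in k8s.io/kubernetes/pkg/scheduler/framework.
type Pod struct {
	SchedulingGates []string
}

type Status struct{ code string }

// IsSuccess treats a nil *Status as Success, mirroring the framework convention.
func (s *Status) IsSuccess() bool { return s == nil || s.code == "Success" }

// preEnqueue models the gate check: a Pod is admitted to activeQ only when
// it carries no scheduling gates; otherwise it is marked Unschedulable.
func preEnqueue(p *Pod) *Status {
	if len(p.SchedulingGates) == 0 {
		return nil // Success
	}
	return &Status{code: "Unschedulable"}
}

func main() {
	gated := &Pod{SchedulingGates: []string{"example.com/quota-check"}}
	ready := &Pod{}
	fmt.Println(preEnqueue(gated).IsSuccess()) // false: parked in unschedulablePods
	fmt.Println(preEnqueue(ready).IsSuccess()) // true: moved to activeQ
}
```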
To honor the semantics of the new `.spec.schedulingGates` API, a default Enqueue plugin will be introduced. It simply returns `Success` or `Unschedulable` depending on the incoming Pod's `schedulingGates` field.
This `DefaultEnqueue` plugin will also implement the `EventsToRegister` function to claim it's an `EnqueueExtension` object, so that the scheduler can move the Pod back to activeQ properly:
```go
func (pl *DefaultEnqueue) EventsToRegister() []framework.ClusterEvent {
	return []framework.ClusterEvent{
		{Resource: framework.Pod, ActionType: framework.Update},
	}
}
```
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
None.
All core changes must be covered by unit tests, on both the API and scheduler sides:

- API: API validation and strategy tests (`pkg/registry/core/pod`) to verify disabled fields when the feature gate is on/off.
- Scheduler: Core scheduling changes, which include the Enqueue config API's validation, defaulting, and integration, and its implementation.

In particular, update existing UTs or add new UTs in the following packages:

- `pkg/api/pod`: 10/3/2022 - 70.1%
- `pkg/apis/core/validation`: 10/3/2022 - 82.3%
- `pkg/registry/core/pod`: 10/3/2022 - 60.4%
- `cmd/kube-scheduler/app`: 10/3/2022 - 32.9%
- `pkg/scheduler`: 10/3/2022 - 75.9%
- `pkg/scheduler/framework/runtime`: 10/3/2022 - 81.9%
The following scenarios need to be covered in integration tests:
- Feature gate's enabling/disabling
- Configure an Enqueue plugin via MultiPoint and the Enqueue extension point
- Pod carrying nil `.spec.schedulingGates` functions as before
- Pod carrying non-nil `.spec.schedulingGates` will be moved to the unschedulablePods pool
- Disable `flushUnschedulablePodsLeftover()`, then verify a Pod with non-nil `.spec.schedulingGates` can be moved back to activeQ when `.spec.schedulingGates` is all cleared
- Ensure no significant performance degradation

- `test/integration/scheduler/queue_test.go#TestSchedulingGates`: k8s-triage
- `test/integration/scheduler/plugins/plugins_test.go#TestPreEnqueuePlugin`: k8s-triage
- `test/integration/scheduler_perf/scheduler_perf_test.go`: will add in Beta. (https://storage.googleapis.com/k8s-triage/index.html?test=BenchmarkPerfScheduling)
An e2e test was created in Alpha with the following sequences:
- Create a Pod with non-nil `.spec.schedulingGates`.
- Wait for 15 seconds to ensure (and then verify) it did not get scheduled.
- Clear the Pod's `.spec.schedulingGates` field.
- Wait for 5 seconds for the Pod to be scheduled; otherwise fail the e2e test.

In Beta, it will be enabled by default in the k8s-triage dashboard.
- Feature disabled by default.
- Unit and integration tests completed and passed.
- API strategy test (`pkg/registry/*`) to verify disabled fields when the feature gate is on/off.
- Additional tests are in Testgrid and linked in the KEP.
- Determine whether any additional state is required per gate.
- Feature enabled by default.
- Gather feedback from developers and out-of-tree plugins.
- Benchmark tests passed, and there is no performance degradation.
- Update documents to reflect the changes.
- Identify whether gates can be added post-creation.
- Fix all reported bugs.
- Feature enabled and cannot be disabled. All feature gate guarded logic are removed.
- Update documents to reflect the changes.
- Upgrade
  - Enable the feature gate in both the API Server and the Scheduler, and gate a Pod's scheduling readiness by setting non-nil `.spec.schedulingGates`. Next, remove each scheduling gate when its readiness criterion is met.
- Downgrade
  - Disable the feature gate in both the API Server and the Scheduler, so that a previously configured `.spec.schedulingGates` value will be ignored.
  - However, the `.spec.schedulingGates` value of a Pod is preserved if it was previously configured; otherwise it gets silently dropped.
The skew between the kubelet and control-plane components is not impacted. If the API Server is at vCurrent with the feature enabled, while the scheduler is at vCurrent-n where the feature is not supported, controllers manipulating the new API field won't get their Pods scheduling gated. Pod scheduling will behave as if this feature were not introduced.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `PodSchedulingReadiness`
  - Components depending on the feature gate: kube-scheduler, kube-apiserver
No. It's a new API field, so no default behavior will be impacted.
Yes. If disabled, the kube-apiserver will start rejecting mutations of a Pod's `.spec.schedulingGates`.

Mutations of a Pod's `.spec.schedulingGates` will be respected again.
Appropriate tests have been added in the integration tests. See Integration tests for more details.
It shouldn't impact already running workloads. It's an opt-in feature, and users need to set the `.spec.schedulingGates` field to use this feature.
When this feature is disabled by the feature flag, an already created Pod's `.spec.schedulingGates` field is preserved; however, a newly created Pod's `.spec.schedulingGates` field is silently dropped.
A rollback might be considered if the metric `scheduler_pending_pods{queue="gated"}` stays at a high watermark for a long time, since this may indicate that some controllers are not properly removing the scheduling gates, which causes Pods to stay in the Pending state.

Another indicator for rollback is the 90th-percentile value of the metric `scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"}` steadily exceeding 100ms.
It will be tested manually prior to beta launch.
<> Add detailed scenarios and result here, and cc @wojtek-t. <>
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
A new "queue" type gated
is added to the existing metric scheduler_pending_pods
to distinguish
unschedulable Pods:
scheduler_pending_pods{queue="unschedulable"}
(existing): scheduler tried but cannot find any Node to host the Podscheduler_pending_pods{queue="gated"}
(new): scheduler respect the Pod's presentschedulingGates
and hence not schedule it
The metric scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"}
gives a histogram
to show the Nth percentile value how SchedulingGates plugin is executed.
Moreover, to explicitly indicate a Pod's scheduling-unready state, a condition `{type:PodScheduled, reason:SchedulingGated}` is introduced.
- observe a non-zero value for the metric `scheduler_pending_pods{queue="gated"}`
- observe entries for the metric `scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"}`
- observe a non-empty value in a Pod's `.spec.schedulingGates` field
- API `.spec`
  - Other field: `schedulingGates`
- Events
  - Event Type: PodScheduled
  - Event Status: False
  - Event Reason: SchedulingGated
  - Event Message: Scheduling is blocked due to non-empty scheduling gates
N/A.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: `scheduler_pending_pods{queue="gated"}`
  - Metric name: `scheduler_plugin_execution_duration_seconds{plugin="SchedulingGates"}`
  - Components exposing the metric: kube-scheduler
Are there any missing metrics that would be useful to have to improve observability of this feature?
N/A.
N/A.
N/A.
The feature itself doesn't generate API calls. However, it's expected that the API Server will receive additional update/patch requests from external controllers to mutate the scheduling gates.
No.
No.
- No for existing API objects that don't use this feature.
- For API objects that use this feature:
  - API type: Pod
  - Estimated increase in size: the new field `.spec.schedulingGates` adds about ~64 bytes (in the case of 2 scheduling gates)
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
During the downtime of API server and/or etcd:
- Running workloads that don't need to remove their scheduling gates function well.
- Running workloads that need to update their scheduling gates will stay in scheduling gated state as API requests will be rejected.
In a highly-available cluster, if there are skewed API Servers, some update requests may get accepted and some may get rejected.
N/A.
- 2022-09-16: Initial KEP
- 2023-01-14: Graduate the feature to Beta
Define a boolean field `.spec.schedulingPaused`. Its value is optionally initialized to `True` to indicate the Pod is not scheduling-ready, and flipped to `False` (by a controller) afterwards to trigger this Pod's scheduling cycle.

This approach is not chosen because it cannot support multiple independent controllers each controlling a specific aspect of a Pod's scheduling readiness.