
KEP-3838: Pod Mutable Scheduling Directives

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

In #3521 we introduced PodSchedulingReadiness. The main use case for PodSchedulingReadiness was to enable building external resource controllers (like extended schedulers, dynamic quota managers) to make a decision on when the pod should be eligible for scheduling by kube-scheduler.

This enhancement proposes to make the pod scheduling directives, namely node selector and node affinity, mutable as long as the pod is gated and the update to node selector/affinity further constrains the existing selection.

This allows external resource controllers to not only decide when the pod should be eligible for scheduling, but also where it should land. This is similar to what we did for suspended Job in #2926.

Motivation

Adding the ability to mutate a Pod's scheduling directives before it is allowed to schedule gives an external resource controller (like a network topology-aware scheduler or a dynamic quota manager) the ability to influence pod placement while offloading the actual pod-to-node assignment to kube-scheduler.

In general, this opens the door for a new pattern of adding scheduling features to Kubernetes. Specifically, building lightweight schedulers that implement features not supported by kube-scheduler, while relying on the existing kube-scheduler to support all upstream features and handle the pod-to-node binding. This pattern should be the preferred one if the custom feature doesn't require implementing a scheduler plugin, which entails re-building and maintaining a custom kube-scheduler binary.

Goals

  • Allow mutating a pod that is blocked on a scheduling readiness gate with a more constrained node affinity/selector.

Non-Goals

  • Allow mutating node affinity/selector of pods unconditionally.
  • Restrict adding tolerations to pods with scheduling gates. We already allow adding tolerations unconditionally, so restricting that to gated pods would not be a backward-compatible change.
  • Allow mutating pod (anti)affinity constraints. It is not clear how this can be done without potentially violating a policy check that was previously done at pod CREATE.

Proposal

Node affinity and selector are currently immutable for Pods. The proposal is to relax this validation for Pods that are blocked on a scheduling readiness gate (i.e., not yet considered for scheduling by kube-scheduler). Note that adding tolerations and mutating labels and annotations are already allowed.

This has no impact on application controllers (like the ReplicaSet controller). Application controllers generally have no dependency on the scheduling directives expressed in the application's pod template, and they don't reconcile existing pods when the template changes.

User Stories (Optional)

Story 1

I want to build a controller that implements workload queueing and influences when and where a workload should run. The workload controller creates the pods of a workload instance. To control when and where the workload should run, I have a webhook that injects a scheduling readiness gate, and a controller that removes the gate when capacity becomes available.

At the time of creating the workload, it may not be known on which subset of the nodes (e.g., a zone or a VM type) a workload should run. To control where a workload should run, the controller updates the node selector/affinity of the workload's pods; by doing so, the queue controller is able to influence the scheduler and autoscaler decisions.

Notes/Constraints/Caveats (Optional)

Note that constraining the node affinity/selector is a one-way operation: once done, it can't be reversed, and a subsequent update can only further constrain the selection.

Risks and Mitigations

  • New calls from a queue controller to the API server to update node affinity/selector. The mitigation is for such controllers to make a single API call for both updating affinity/selector and removing the scheduling gate.

  • App controllers may find it surprising that the pod they created differs from their template. Typically apps don't continuously reconcile existing pods to match the template; they react only to changes to the pod template in the app's spec. Moreover, the side effects of this proposal are no different from a mutating webhook that injects affinity into the pod.

Design Details

The pod update validation logic in the API server already allows updating labels and annotations, and it allows adding new tolerations. What we need to do is relax the update validation on node affinity/selector. There are two main problems we need to address:

  1. Race conditions between updating node affinity/selector constraints and the kube-scheduler when scheduling the pod. As mentioned before, the solution is to restrict the updates to pods with scheduling gates (it is still valid to send a single update that removes the last gate and update node affinity/selector).

  2. Invalidating a decision made by a policy admission webhook. The concern here is that some of those policy webhooks validate node affinity/selector on CREATE only and don't trigger that validation on UPDATE because they assumed the fields are immutable. This can be addressed by allowing updates to node selector/affinity that only further constrain the selection.
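The race mitigation in (1), folding the affinity/selector update and the removal of the last gate into a single UPDATE, can be sketched as a controller-side helper. This is a hypothetical illustration (the gate name `example.com/quota` and the helper name are made up), not an API the project provides:

```python
import copy

def release_pod(pod: dict, extra_selector: dict) -> dict:
    # Build the object for a single UPDATE that both narrows placement and
    # removes the last scheduling gate, so kube-scheduler never considers
    # the pod schedulable under the old, broader constraints.
    updated = copy.deepcopy(pod)
    updated["spec"].setdefault("nodeSelector", {}).update(extra_selector)
    updated["spec"]["schedulingGates"] = []
    return updated

# A pod held back by a (hypothetical) quota-manager gate:
gated = {"spec": {"schedulingGates": [{"name": "example.com/quota"}],
                  "containers": []}}
released = release_pod(gated, {"topology.kubernetes.io/zone": "us-east-1a"})
```

A real controller would send `released` in one Pods UPDATE (or a single patch) rather than issuing separate calls for the selector and the gate.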

For spec.nodeSelector, only additions are allowed. If the field is absent, setting it is allowed.
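The additive-only rule for spec.nodeSelector can be sketched as a small check. This is a hypothetical illustration; the real validation lives in k8s.io/kubernetes/pkg/apis/core/validation:

```python
def node_selector_update_allowed(old: dict, new: dict) -> bool:
    # Additive only: every existing key must survive with the same value;
    # new keys may be introduced, including setting the field from empty.
    return all(new.get(k) == v for k, v in old.items())

assert node_selector_update_allowed({}, {"disktype": "ssd"})        # setting when absent: OK
assert node_selector_update_allowed(                                # pure addition: OK
    {"disktype": "ssd"},
    {"disktype": "ssd", "topology.kubernetes.io/zone": "us-east-1a"})
assert not node_selector_update_allowed(                            # changing a value: rejected
    {"disktype": "ssd"}, {"disktype": "hdd"})
assert not node_selector_update_allowed({"disktype": "ssd"}, {})    # removal: rejected
```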

For spec.affinity.nodeAffinity, if nil, then setting it to any value is allowed.

For .requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms, the terms are ORed, while the requirements within each term's matchExpressions and matchFields lists are ANDed. If nodeSelectorTerms was empty, setting it is allowed. If not empty, then only additions of NodeSelectorRequirements to matchExpressions or matchFields are allowed, and no changes to existing requirements are permitted.
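These term rules can be sketched the same way (a hypothetical illustration using the NodeSelectorTerm fields matchExpressions and matchFields; the real validation may compare requirements differently than this simple prefix check):

```python
def node_selector_terms_update_allowed(old_terms: list, new_terms: list) -> bool:
    # Setting terms for the first time is allowed.
    if not old_terms:
        return True
    # Terms are ORed, so adding or dropping a whole term would relax
    # the selection; the number of terms must stay the same.
    if len(new_terms) != len(old_terms):
        return False
    for old, new in zip(old_terms, new_terms):
        for key in ("matchExpressions", "matchFields"):
            old_reqs, new_reqs = old.get(key, []), new.get(key, [])
            # Requirements within a term are ANDed: existing ones must be
            # kept unchanged, and only appending new ones is allowed.
            if new_reqs[:len(old_reqs)] != old_reqs:
                return False
    return True

ssd = {"key": "disktype", "operator": "In", "values": ["ssd"]}
zone = {"key": "topology.kubernetes.io/zone", "operator": "In",
        "values": ["us-east-1a"]}

# Appending a requirement to an existing term (further constrains): OK.
assert node_selector_terms_update_allowed(
    [{"matchExpressions": [ssd]}],
    [{"matchExpressions": [ssd, zone]}])
# Adding a new ORed term (relaxes the selection): rejected.
assert not node_selector_terms_update_allowed(
    [{"matchExpressions": [ssd]}],
    [{"matchExpressions": [ssd]}, {"matchExpressions": [zone]}])
```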

For .preferredDuringSchedulingIgnoredDuringExecution, all updates are allowed. This is because preferred terms are not authoritative, and so policy controllers don't validate those terms.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • k8s.io/kubernetes/pkg/apis/core/validation: 02/03/2023 - 83%
Integration tests
  • Test that adding node affinity to a pod with no scheduling gates should be rejected.
  • Test that adding node affinity/selector that relaxes the existing constraints should always be rejected, even for pods with scheduling gates.
  • Test that adding node affinity/selector that constrains the node selection of a pod with scheduling gates should be allowed. Also, test that the scheduler observes the update and acts on it when the scheduling gate is lifted.
e2e tests

Integration tests offer enough coverage.

Graduation Criteria

We will release the feature directly in Beta state. Because the feature is opt-in and doesn't add a new field, there is no benefit in having an alpha release.

Beta

  • Feature implemented behind the same feature flag we used for Pod Scheduling Readiness feature.
  • Unit and integration tests passing

GA

  • Fix any potentially reported bugs

Upgrade / Downgrade Strategy

No changes are required to an existing cluster to use this feature.

Version Skew Strategy

N/A. This feature doesn't impact nodes.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: PodSchedulingReadiness; note that this is a shared flag with the PodSchedulingReadiness feature.
    • Components depending on the feature gate: kube-apiserver, kube-scheduler
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane?
    • Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?

Yes, it relaxes validation of updates to pods. Specifically, it will allow mutating the node selector and affinity.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. If disabled, kube-apiserver will start rejecting mutating node affinity/selector of pods.

What happens if we reenable the feature if it was previously rolled back?

kube-apiserver will accept node affinity/selector updates for pods.

Are there any tests for feature enablement/disablement?

N/A. There is no explicit enablement/disablement test, as this is a purely in-memory feature.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The change is opt-in and doesn't impact already running workloads, but problems with the updated validation logic may cause crashes in the apiserver.

What specific metrics should inform a rollback?

Crashes in the apiserver because of potential problems with the updated validation logic.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

No, will be tested manually when implementing the validation changes.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

N/A. This is not a feature that workloads use directly.

How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details: Create a pod with a scheduling gate, then update the node selector of the pod.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?

N/A

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name: apiserver_request_total[resource=pod, group=, verb=UPDATE, code=400]
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

The feature itself doesn't generate API calls. But it allows the apiserver to accept update requests that mutate part of the pod spec, which will encourage implementing controllers that do this.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

Update requests will be rejected.

What are other known failure modes?

In a multi-master setup, when the cluster has skewed apiservers, some update requests may get accepted and some may get rejected.

What steps should be taken if SLOs are not being met to determine the problem?

N/A.

Implementation History

  • 2023-02-03: Proposed KEP starting in beta status.

Alternatives

Use webhooks to inject affinities

There are cases where the decision of where to place the pod is not available at pod creation time. For example, in the case of the dynamic quota manager, the decision to offer spot vs. on-demand quota will only be known later, when the quota potentially becomes available.

Infrastructure Needed (Optional)