-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatically add system critical priority classes at cluster boostrapping #60519
Conversation
c9b7564
to
d1eff44
Compare
fe3f75a
to
b7359ae
Compare
@@ -168,7 +168,7 @@ func IsCriticalPodBasedOnPriority(ns string, priority int32) bool { | |||
if ns != kubeapi.NamespaceSystem { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated to this PR, but critical pods based on priority should not be limited to the kube-system namespace... that namespace should not have any special significance as far as the API is concerned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you said, that's unrelated this to PR, but K8s docs clearly state that critical pods can only exist in system namespace only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
K8s docs clearly state that critical pods can only exist in system namespace only
that is left over from critical pods being determined from the uncontrollable pod annotation. that limitation should not apply to the priority field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will file a separate issue for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filed: #60596
return admission.NewForbidden(a, fmt.Errorf("the name of the priority class is a reserved name for system use only: %v", pc.Name)) | ||
// API server adds system critical priority classes at bootstrapping. We should | ||
// not enforce restrictions on adding system level priority classes for API server. | ||
if userInfo := a.GetUserInfo(); userInfo == nil || userInfo.GetName() != user.APIServerUser { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't an appropriate use of this username. if you want to require additional authorization to set certain priority classes, pass in an authorizer and perform the authorization check (see PodSecurityPolicyPlugin for an example of obtaining and using the authorizer). The API server always has superuser permissions when making API calls to itself, and will pass any authorization check, but cluster-admins should be able to modify these objects as well.
I also find it odd to do this enforcement in admission... why not in the rest handler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at PodSecurityPolicyPlugin. I am not sure if the authorization done there is enough though. We don't want to allow any user-defined priority classes to start with "system-" or have a value in the system range (>1 billion).
Performing the check in the rest handler can be done too if there is a way to obtain userInfo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you let the rest handler have the map of allowed system priorities and only allow those to be created. On upgrade, any newly allowed system priority classes would be bootstrapped by a new apiserver using the loopback connection, which would always speak to a rest handler that agreed on what system priority classes were allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then you don't need to worry about userinfo at all. Anyone authorized to create priority classes can create the system priority classes the apiserver expects to be bootstrapped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's actually a very good idea. I changed the PR accordingly. The new changes are under a new commit.
} | ||
if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) { | ||
return admission.NewForbidden(a, fmt.Errorf("priority class names with '"+scheduling.SystemPriorityClassPrefix+"' prefix are reserved for system use only")) | ||
} | ||
} | ||
// If the new PriorityClass tries to be the default priority, make sure that no other priority class is marked as default. | ||
if pc.GlobalDefault { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated, but it still appears we're trying to enforce a singleton GlobalDefault... I thought we dropped that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We never dropped it. We just added code to handle the case of having more than one global default when they are added due to race in HA clusters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, that means admission can reject updates that swap which one is the default
- apply a file containing two priority classes,
a
andb
(default) - apply an update of the file with
a
(default) andb
(non-default)
admission rejects if the update to a
happens first. it would also reject if the update to b
happened first but the watch event didn't make it into the informer-fed cache the admission plugin consulted.
this guard seems more problematic than helpful to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that there are races causing one to add more than one global default, but are you suggesting that it is better to allow people to add as many global default priority classes as they like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we cannot guarantee a single default (which we can't), and we have a deterministic and documented behavior when there's more than one (which we do), then breaking legitimate updates in order to half-accomplish correctness doesn't seem worthwhile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. We can remove it and document the behavior. I will do that in a separate PR to ensure we are not mixing the two.
@@ -49,6 +63,64 @@ func (p RESTStorageProvider) v1alpha1Storage(apiResourceConfigSource serverstora | |||
return storage | |||
} | |||
|
|||
func (p RESTStorageProvider) PostStartHook() (string, genericapiserver.PostStartHookFunc, error) { | |||
return PostStartHookName, AddSystemPriorityClasses(), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can i delete them? what happens if i do? will they come back only on restart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they are deleted, no more new Pods with these priority classes can be created. These PriorityClasses are added back on API server restart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it also means no new critical pods could be created. Can we avoid that? Say something like Delete would return error, if priorityclass name is one of systemclustercritical or systemnodecritical? Not sure if that takes flexibility away from user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we don't have a validation mechanism in our REST layer for delete operations. So, I am not sure if it is possible to prevent deletion of these priority classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the information. I did a quicksearch in the codebase and it seems there are few things like immortalNamespaces
(basically 'kube-system' and 'defaut' etc), where at the admission plugin level we are prohibiting user to delete them.(https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L77). Wondering if we could do the same thing but I think it can be a different admission controller from priority AdmissionController.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inside the impl of Delete
(which the priorityclassstorage would need to override, put the guard in, then delegate, like:
func (r *REST) Delete(ctx genericapirequest.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
// if name is in set of expected system priority classes, forbid and return
return r.store.Delete(ctx, name, options)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. Thanks @liggitt. Unrelated to this change, when do we use this mechanism vs one I mentioned earlier for kube-system(is this mechanism only for top level objects?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @liggitt and @ravisantoshgudimetla. I added the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for updating.
b5a18e4
to
2192ec8
Compare
allErrs = append(allErrs, apivalidation.ValidateObjectMeta(&pc.ObjectMeta, false, apivalidation.NameIsDNSSubdomain, field.NewPath("metadata"))...) | ||
// If the priorityClass starts with a system prefix, it must be one of the | ||
// predefined system priority classes. | ||
if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) && !scheduling.IsKnownSystemPriorityClass(pc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since both the name and value are immutable, comparison to the bootstrap priority classes should only be done on create.
update should not concern itself with these checks, since during rolling upgrades or in downgraded apiserver cases, you might need to update/delete a system priority class bootstrapped by a newer apiserver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is already the case. ValidatePriorityClassUpdate
is called to validate updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strategy.ValidateUpdate calls this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. I will remove the call.
pkg/apis/scheduling/helpers.go
Outdated
for _, spc := range systemPriorityClasses { | ||
spcCopy := *spc | ||
spcCopy.Description = "" | ||
if reflect.DeepEqual(spcCopy, pcCopy) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only compare name and value. descriptions, annotations, labels, etc, we don't care about.
also, if the value mismatches, surfacing a better error message is important ("system priority class %s may only have value %d", etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason that I used DeepEqual was to make it resilient to changes of fields in PriorityClass. Anyway, I changed it to compare the three fields only, but I didn't add more logic to report errors for each priority class. Reporting better errors would make the logic hard to maintain. These prioirty classes are supposed to be created by the API server automatically and not by humans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, adding more detailed error does not make the function much different. Let me add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. PTAL.
0291abb
to
494dabf
Compare
pkg/apis/scheduling/helpers.go
Outdated
if spc.Value != pc.Value { | ||
return false, fmt.Errorf("value of %v PrioityClass must be %v", spc.Name, spc.Value) | ||
} | ||
if spc.GlobalDefault != pc.GlobalDefault { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we care about this either... just the name and value. If they want to make the default for pods in their cluster "system-node-critical", that shouldn't matter to us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting such a high priority as default could cause eviction of system critical components of a cluster. We should prevent catastrophic human errors as much as possible.
} else if pc.Value > scheduling.HighestUserDefinablePriority { | ||
// Similarly, if the Value is in the system range, it must be one of the | ||
// predefined system priority classes. | ||
if is, err := scheduling.IsKnownSystemPriorityClass(pc); !is { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since it doesn't start with the SystemPriorityClassPrefix, there's no need to check this here
// predefined system priority classes. | ||
if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) { | ||
if is, err := scheduling.IsKnownSystemPriorityClass(pc); !is { | ||
allErrs = append(allErrs, field.Forbidden(field.NewPath("metadata", "Name"), "priority class names with '"+scheduling.SystemPriorityClassPrefix+"' prefix are reserved for system use only. error: "+err.Error())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"name"
should match the json field
5a21b7d
to
8f82acd
Compare
8f82acd
to
f66b952
Compare
@@ -56,3 +61,14 @@ var _ rest.ShortNamesProvider = &REST{} | |||
func (r *REST) ShortNames() []string { | |||
return []string{"pc"} | |||
} | |||
|
|||
// Delete ensures that system priority classes are not deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have guard for update as well or else we will able to update system priority class to lower value and then create pods that have higher priority than system priority?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update validation already prevents changing value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh I missed that, it seems it is part of https://github.com/kubernetes/kubernetes/pull/60519/files#diff-35601e9483eab62205294e45c6df7712R71. Thanks.
@liggitt @aveshagarwal @ravisantoshgudimetla @derekwaynecarr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
/lgtm |
/lgtm |
(needs squash) |
f66b952
to
9592a9e
Compare
squashed commits. |
@liggitt Could you please re-lgtm this after squash? |
ping |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bsalamat, derekwaynecarr, liggitt The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Review the full test history for this PR. Silence the bot with an |
1 similar comment
/retest Review the full test history for this PR. Silence the bot with an |
/test pull-kubernetes-integration |
/retest |
/retest Review the full test history for this PR. Silence the bot with an |
Automatic merge from submit-queue (batch tested with PRs 60519, 61099, 61218, 61166, 61714). If you want to cherry-pick this change to another branch, please follow the instructions here. |
What this PR does / why we need it:
We had two PriorityClasses that were hardcoded and special cased in our code base. These two priority classes never existed in API server. Priority admission controller had code to resolve these two names. This PR removes the hardcoded PriorityClasses and adds code to create these PriorityClasses automatically when API server starts.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #60178
ref/ #57471
Special notes for your reviewer:
Release note:
/sig scheduling