-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-4742: Expose Node Labels via Downward API #4747
base: master
Are you sure you want to change the base?
Conversation
docandrew
commented
Jul 1, 2024
- Adding new KEP 4742: Expose Node Labels via Downward API
- Issue link: Expose Node labels via downward API #4742
- Other comments: In response to discussion here: Consider exposing node labels via downward api kubernetes#40610
Welcome @docandrew! |
Hi @docandrew. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: docandrew The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/cc @kerthcet |
As discussed at kubernetes/kubernetes#40610 |
I think we can do this in a safe way. Let me know if the reply makes sense. Happy to work through this and give examples. |
After reading the entire conversation on the KEP issue and related links, I am ok with this now. As NFD maintainer I am eager to help here, as NFD creates labels for the underlying infra |
/ok-to-test |
/sig auth need opinion on security and potential abuse |
some concerns to address: kubernetes/kubernetes#40610 (comment) |
|
||
The initial design includes: | ||
|
||
* Being able to pass node labels to volumes with fieldPath of `node.metadata.labels` where only labels containing `topology.k8s.io/` are passed through |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this overloading the namespace of topology.k8s.io? Are we encouraging users to overload it with their own labels (toplogy.k8s.io/hostuefipassword=foo
)? Will users accidentally see side effects because they use the same fields inside the topology.k8s.io namespace for another purpose?
Maybe it would be better to either use a config-controlled allow-list within the kubelet(?) config or a limited allow-list of well-known labels plus a new namespace like alpha.exposednodelabels.k8s.io/*
(feel free to come up with a better name).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could follow a similar logic to https://kubernetes-sigs.github.io/node-feature-discovery/v0.16/usage/features.html#built-in-labels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we think "topology.k8s.io" in an extensibility mechanism, we need to spec that very clearly.
e.g. x.topology.k8s.io/<anything>
is for extensions, while topology.k8s.io/<anything>
is reserved for k8s use.
And even then we need to define "local". Can vendors define topology extensions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking to update this a KEP a bit, but i think I am a little lost. So we need to strictly document and define through code what the <anything>
flag can be? Is this the ask?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flag as in part of the label that would be applied to the pod
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking for the KEP to say SOMETHING LIKE the below. Note that this is a suggested approach, we should consider alternatives:
"""
In KEP 1659, the following labels are defined:
* topology.kubernetes.io/region
* topology.kubernetes.io/zone
In addition to those, KEP 1659 declares the entire topology.kubernetes.io
prefix space as reserved for use by the Kubernetes project.
This KEP expands upon KEP1659 in the following ways:
- The
x.topology.kubernetes.io
prefix is allocated for use by end users. The kubernetes project itself will not define any standard labels with that prefix. - The
<domain>.x.topology.kubernetes.io
prefix is likewise allocated for use by end users or third-parties. The<domain>
portion is treated the same as a "normal" label prefix. For example,example.com.x.topology.k8s.io/label-name
. - All labels using the
topology.kubernetes.io
or*.topology.kubernetes.io
prefix spaces are considered "safe" for workloads. A workload may be exposed to the values of these labels which directly apply to the workload. For example, a pod may learn the topology of the node on which it is running.
"""
Then this KEP can describe how it will expose those labels from nodes to pods (is it a literal copy or a virtual one, who wins in case of conflict, etc).
If we think there is a need for ANOTHER prefix which is "safe" but isn't "topology", we can discuss doing a similar carve-out for another prefix. But I'd argue it stops at 2.
|
||
Workarounds today typically involve using an initContainer to query the | ||
Kubernetes API and then pass data via shared volume to other containers within | ||
the same pod. This adds additional demand on the API server and is burdensome |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to be really clear about my position on this:
- DownwardAPI for anything inside the Pod object seems OK (we opened that door a long time ago).
- DownwardAPI for things outside of the Pod object is not OK unless it follows authz rules (pod's SA).
- The fact that workarounds exist means we should NOT consider this an extraoridnary need and should NOT be bending rules.
- Arguments based on reducing apiserver load need to show proof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the argument for doing this in spite of those workarounds is that we don't have a fine-grained enough authorisation model that would permit a Pod to only be able to read these 'topology labels', therefore we end out granting far broader permissions (which in some Kubernetes deployments is an unacceptable trade-off).
I think the scale argument is valid, albeit less of a primary driving factor than us being able to work around gaps in authZ/RBAC today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK the first point (sadly).
WRT scale, I think we are not disagreeing :)
|
||
The initial design includes: | ||
|
||
* Being able to pass node labels to volumes with fieldPath of `node.metadata.labels` where only labels containing `topology.k8s.io/` are passed through |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we think "topology.k8s.io" in an extensibility mechanism, we need to spec that very clearly.
e.g. x.topology.k8s.io/<anything>
is for extensions, while topology.k8s.io/<anything>
is reserved for k8s use.
And even then we need to define "local". Can vendors define topology extensions?
attained through a configmap or secret mounted to a pod, being passed on | ||
creation but not guaranteed to be immutable and thus should be treated as so. | ||
|
||
* Can potentially be featuregated to make this feature strictly opt-in, if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feature gates are not long-term flags -- if you need a mechanism for admisn to disable this entirely, that's not a feature gate (and I would argue it indicates a deeper problem)
## Motivation | ||
|
||
We’d like to change the runtime behavior of containers based on node labels. | ||
In our case, we’re using a CNI with DaemonSets to perform network setup, and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For very specific cases, we have workarounds like sidecars. If these DaemonSets need more info, that's their escape path.
Notes I got from the meeting so they can be fixed and address later when there is time:
updated since we are not considering using kubelet flags |
I am -2 on kubelet flags. In most managed providers, those flags are not accessible to users or admins, and differeing opinions between providers will make portability WORSE, not better. We'd do better to define a small set of extension mechanisms and leave it fixed at that. |
--> | ||
|
||
## Alternatives | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading @thockin comments about kubelet-sidecar. Maybe you can comment here about some alternatives that have been purusued or discussed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other alternative I can think of is a mutating webhook that intercepts pods/binding
requests and sets the topology values as annotations on the Binding. Due to how the Pod binding registry works, these should then be copied across to the Pod object (as seen here: https://github.com/kubernetes/kubernetes/blob/0ce7f7e25e98ab503e317163aee986395d8f8272/pkg/registry/core/pod/storage/storage.go#L238-L243).
The downside of this being that we can't do the same for pod labels, although this is a non-issue when using the downward API as it also supports annotations as a source.
|
||
Workarounds today typically involve using an initContainer to query the | ||
Kubernetes API and then pass data via shared volume to other containers within | ||
the same pod. This adds additional demand on the API server and is burdensome |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the argument for doing this in spite of those workarounds is that we don't have a fine-grained enough authorisation model that would permit a Pod to only be able to read these 'topology labels', therefore we end out granting far broader permissions (which in some Kubernetes deployments is an unacceptable trade-off).
I think the scale argument is valid, albeit less of a primary driving factor than us being able to work around gaps in authZ/RBAC today.
the same pod. This adds additional demand on the API server and is burdensome | ||
compared to the ease of using downwardAPI for pod labels and metadata. | ||
|
||
### Goals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also call out that this would become a property that load balancers/Service objects could select upon too, if it is a label.
This may create a bit of confusion wrt topology aware routing, and its usage as a selector label should likely be discouraged/its caveats noted in documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@munnerz you are implying a property which is not stated here, which your POC provides but I think the original idea of this KEP does not.
This KEP says that the ONLY goals are for a pod to be able to access the node's labels via volumens and env. It does NOT say that it is a goal to make those labels actually visible in the API for use by outsider observers (which would include LBs).
IOW - do we think it is a goal for users, LBs, etc to be able to do
kubectl get pods -l topology.k8s.io/zone=central
?
- The `<domain>.x.topology.kubernetes.io` prefix is likewise allocated for use by end users or third-parties. The `<domain>` portion is treated the same as a "normal" label prefix. For example, `example.com.x.topology.k8s.io/label-name`. | ||
- All labels using the `topology.kubernetes.io` or `*.topology.kubernetes.io` prefix spaces are considered "safe" for workloads. A workload may be exposed to the values of these labels which directly apply to the workload. For example, a pod may learn the topology of the node on which it is running. | ||
|
||
The idea is that we will expose those labels from nodes to pods via a literal copy from the Node, for instance using the method `GetNode` from Kubelet in the `podFieldSelectorRuntimeValue` function and `volume.VolumeHost` `GetNodeLabels` function in the `CollectData` function in the downward API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of plumbing this via the kubelet, making the actual node itself responsible for setting the value, I think we should consider setting this within the 'PodBinding' registry implementation here: https://github.com/kubernetes/kubernetes/blob/0ce7f7e25e98ab503e317163aee986395d8f8272/pkg/registry/core/pod/storage/storage.go#L202-L252
This allows us to ensure we atomically set the topology label values at the same time the pod is bound to the Node, and also happens within the apiserver at the time the Pod is persisted.
I am a little uncertain on who the field manager should be for these labels however, and if we set the label values here I believe they won't have a properly defined field manager either (though would need to confirm). As an alternative, we could extend this area of code to copy across labels from Binding objects (similar to how annotations are already copied today in this function), which would allow us to set the label values during admission which I believe happens prior to the managed fields entries being re-computed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an important difference and the tradeoffs need to be documented
--> | ||
|
||
## Alternatives | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other alternative I can think of is a mutating webhook that intercepts pods/binding
requests and sets the topology values as annotations on the Binding. Due to how the Pod binding registry works, these should then be copied across to the Pod object (as seen here: https://github.com/kubernetes/kubernetes/blob/0ce7f7e25e98ab503e317163aee986395d8f8272/pkg/registry/core/pod/storage/storage.go#L238-L243).
The downside of this being that we can't do the same for pod labels, although this is a non-issue when using the downward API as it also supports annotations as a source.
I have put together an example implementation (NOT setting via the kubelet however) to gather feedback on approaches here: kubernetes/kubernetes#127092 - hopefully this can help facilitate some discussion! |
topology labels would provide users a straightforward, maintainable way to | ||
optimize their workloads given topology constraints. | ||
|
||
Workarounds today typically involve using an initContainer to query the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe mention the NRI workaround as well: kubernetes/kubernetes#40610 (comment)
### Non-Goals | ||
|
||
* Not to expose additional node info outside of labels | ||
* Not to pass any additional node labels other than `topology.k8s.io/*` to pods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this may/will likely cause an abuse of topology labels to pass other information beyond topology. Please add a section expanding on this way the feature will be abused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually hope that this feature can be used (abuse is a strong word 😄) as an extensibility mechanism, and added the caveat that topology.k8s.io/*
is strictly reserved for the Kubernetes project, but *.topology.k8s.io/*
is free for user-defined node fields that pods should have access to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Sergey's point is that it will likely get used for things that are not topology related. I agree it's likely and I would be OK (in a different KEP) to discuss reserving another prefix space for other things.
|
||
* Not to expose additional node info outside of labels | ||
* Not to pass any additional node labels other than `topology.k8s.io/*` to pods | ||
* Not to guarantee the label value assigned at pod creation is the most recent node label value because it is assigned at pod creation time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we guarantee consistency across containers? Across container restarts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should follow the current behavior with pod label and pod resource limit/request(after pod resource inplace update support, it is tunable) of Downward API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to be written down explicitly
but to use existing logic to extend the current fieldRef as it is used today | ||
to pass pod metadata to pods using the downwardAPI. | ||
|
||
## Implementation Details |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will we need to somehow allow pod owner to find out which values were used? How the troubleshooting experience will look like for the cases when some labels were changed
but to use existing logic to extend the current fieldRef as it is used today | ||
to pass pod metadata to pods using the downwardAPI. | ||
|
||
## Implementation Details |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any race conditions between pod binding and some controller assigning labels? Anything we should do about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, if it is implemented as an init container with permissions, there may be implementation that waits for those labels. If it's natively implemented by kubelet, there is no synchronization possible and people switching from one model to another may get into the race condition situation
/assign @nilekhc |
|
||
Workarounds today typically involve using an initContainer to query the | ||
Kubernetes API and then pass data via shared volume to other containers within | ||
the same pod. This adds additional demand on the API server and is burdensome |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK the first point (sadly).
WRT scale, I think we are not disagreeing :)
- The `<domain>.x.topology.kubernetes.io` prefix is likewise allocated for use by end users or third-parties. The `<domain>` portion is treated the same as a "normal" label prefix. For example, `example.com.x.topology.k8s.io/label-name`. | ||
- All labels using the `topology.kubernetes.io` or `*.topology.kubernetes.io` prefix spaces are considered "safe" for workloads. A workload may be exposed to the values of these labels which directly apply to the workload. For example, a pod may learn the topology of the node on which it is running. | ||
|
||
The idea is that we will expose those labels from nodes to pods via a literal copy from the Node, for instance using the method `GetNode` from Kubelet in the `podFieldSelectorRuntimeValue` function and `volume.VolumeHost` `GetNodeLabels` function in the `CollectData` function in the downward API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an important difference and the tradeoffs need to be documented
the same pod. This adds additional demand on the API server and is burdensome | ||
compared to the ease of using downwardAPI for pod labels and metadata. | ||
|
||
### Goals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@munnerz you are implying a property which is not stated here, which your POC provides but I think the original idea of this KEP does not.
This KEP says that the ONLY goals are for a pod to be able to access the node's labels via volumens and env. It does NOT say that it is a goal to make those labels actually visible in the API for use by outsider observers (which would include LBs).
IOW - do we think it is a goal for users, LBs, etc to be able to do
kubectl get pods -l topology.k8s.io/zone=central
?
|
||
# The milestone at which this feature was, or is targeted to be, at each stage. | ||
milestone: | ||
alpha: "v1.31" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shooting for alpha in 32? Clock is ticking...
I think KubeRay would really benefit from this proposal in order to propoagate node topology into the Raylet. Are we planning to push this forward in v1.32? |
The KEP window is almost done - is this shooting to make 1.32? |
I'd love that but will likely need some hand-holding to help shepherd this through the process. |
@munnerz has a POC of some parts of this - the question to be answered is what are the tradeoffs we are making between different approaches, and which do you think are "best". |
I'd be happy to help - I'll ping you on slack 😊 Edit: or ping me @munnerz as I'm not 100% sure which slack account is yours 👀 |
I see some pushes but I don't know if that means "please go read it again" or if it is still in-progress. I will assume the latter (WIP) until I hear otherwise. |
Not quite ready yet - I think @munnerz was going to add some additional details and hopefully correct me on some things, but getting close! |
are useful for pods as well to be able to make application decisions based on | ||
the region or zone the pod is running in. This KEP proposes to make these labels | ||
available to pods while also expanding upon KEP 1659 to allow for user-defined | ||
labels in the `*.topology.kubernetes.io` namespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that kubernetes.io/hostname
is also needed
Can we prioritize this for v1.33? Seems like all the pieces are there but we just need to make some final changes to the KEP. I'm happy to drive this if there are no owners. |