Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-4742: Expose Node Labels via Downward API #4747

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

docandrew
Copy link

  • Adding new KEP 4742: Expose Node Labels via Downward API

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 1, 2024
@k8s-ci-robot
Copy link
Contributor

Welcome @docandrew!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jul 1, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @docandrew. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: docandrew
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jul 1, 2024
@pacoxu
Copy link
Member

pacoxu commented Jul 5, 2024

/cc @kerthcet

@k8s-ci-robot k8s-ci-robot requested a review from kerthcet July 5, 2024 07:25
@ArangoGutierrez
Copy link
Contributor

As discussed at kubernetes/kubernetes#40610
this is open a lot of concerns around security

@cmwylie19
Copy link

As discussed at kubernetes/kubernetes#40610 this is open a lot of concerns around security

I think we can do this in a safe way. Let me know if the reply makes sense. Happy to work through this and give examples.

@ArangoGutierrez
Copy link
Contributor

As discussed at kubernetes/kubernetes#40610 this is open a lot of concerns around security

I think we can do this in a safe way. Let me know if the reply makes sense. Happy to work through this and give examples.

After reading the entire conversation on the KEP issue and related links, I am ok with this now. As NFD maintainer I am eager to help here, as NFD creates labels for the underlying infra

@SergeyKanzhelev
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 9, 2024
@SergeyKanzhelev
Copy link
Member

SergeyKanzhelev commented Jul 9, 2024

/sig auth

need opinion on security and potential abuse

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Jul 9, 2024
@SergeyKanzhelev
Copy link
Member

some concerns to address: kubernetes/kubernetes#40610 (comment)

keps/sig-node/4742-node-labels-downward/README.md Outdated Show resolved Hide resolved

The initial design includes:

* Being able to pass node labels to volumes with fieldPath of `node.metadata.labels` where only labels containing `topology.k8s.io/` are passed through
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this overloading the namespace of topology.k8s.io? Are we encouraging users to overload it with their own labels (toplogy.k8s.io/hostuefipassword=foo)? Will users accidentally see side effects because they use the same fields inside the topology.k8s.io namespace for another purpose?

Maybe it would be better to either use a config-controlled allow-list within the kubelet(?) config or a limited allow-list of well-known labels plus a new namespace like alpha.exposednodelabels.k8s.io/* (feel free to come up with a better name).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we think "topology.k8s.io" in an extensibility mechanism, we need to spec that very clearly.

e.g. x.topology.k8s.io/<anything> is for extensions, while topology.k8s.io/<anything> is reserved for k8s use.

And even then we need to define "local". Can vendors define topology extensions?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking to update this a KEP a bit, but i think I am a little lost. So we need to strictly document and define through code what the <anything> flag can be? Is this the ask?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flag as in part of the label that would be applied to the pod

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking for the KEP to say SOMETHING LIKE the below. Note that this is a suggested approach, we should consider alternatives:

"""
In KEP 1659, the following labels are defined:
* topology.kubernetes.io/region
* topology.kubernetes.io/zone

In addition to those, KEP 1659 declares the entire topology.kubernetes.io prefix space as reserved for use by the Kubernetes project.

This KEP expands upon KEP1659 in the following ways:

  1. The x.topology.kubernetes.io prefix is allocated for use by end users. The kubernetes project itself will not define any standard labels with that prefix.
  2. The <domain>.x.topology.kubernetes.io prefix is likewise allocated for use by end users or third-parties. The <domain> portion is treated the same as a "normal" label prefix. For example, example.com.x.topology.k8s.io/label-name.
  3. All labels using the topology.kubernetes.io or *.topology.kubernetes.io prefix spaces are considered "safe" for workloads. A workload may be exposed to the values of these labels which directly apply to the workload. For example, a pod may learn the topology of the node on which it is running.
    """

Then this KEP can describe how it will expose those labels from nodes to pods (is it a literal copy or a virtual one, who wins in case of conflict, etc).

If we think there is a need for ANOTHER prefix which is "safe" but isn't "topology", we can discuss doing a similar carve-out for another prefix. But I'd argue it stops at 2.

@thockin thockin self-assigned this Jul 9, 2024
keps/sig-node/4742-node-labels-downward/README.md Outdated Show resolved Hide resolved

Workarounds today typically involve using an initContainer to query the
Kubernetes API and then pass data via shared volume to other containers within
the same pod. This adds additional demand on the API server and is burdensome
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to be really clear about my position on this:

  • DownwardAPI for anything inside the Pod object seems OK (we opened that door a long time ago).
  • DownwardAPI for things outside of the Pod object is not OK unless it follows authz rules (pod's SA).
  • The fact that workarounds exist means we should NOT consider this an extraoridnary need and should NOT be bending rules.
  • Arguments based on reducing apiserver load need to show proof.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the argument for doing this in spite of those workarounds is that we don't have a fine-grained enough authorisation model that would permit a Pod to only be able to read these 'topology labels', therefore we end out granting far broader permissions (which in some Kubernetes deployments is an unacceptable trade-off).

I think the scale argument is valid, albeit less of a primary driving factor than us being able to work around gaps in authZ/RBAC today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK the first point (sadly).

WRT scale, I think we are not disagreeing :)


The initial design includes:

* Being able to pass node labels to volumes with fieldPath of `node.metadata.labels` where only labels containing `topology.k8s.io/` are passed through
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we think "topology.k8s.io" in an extensibility mechanism, we need to spec that very clearly.

e.g. x.topology.k8s.io/<anything> is for extensions, while topology.k8s.io/<anything> is reserved for k8s use.

And even then we need to define "local". Can vendors define topology extensions?

keps/sig-node/4742-node-labels-downward/README.md Outdated Show resolved Hide resolved
attained through a configmap or secret mounted to a pod, being passed on
creation but not guaranteed to be immutable and thus should be treated as so.

* Can potentially be featuregated to make this feature strictly opt-in, if
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feature gates are not long-term flags -- if you need a mechanism for admisn to disable this entirely, that's not a feature gate (and I would argue it indicates a deeper problem)

## Motivation

We’d like to change the runtime behavior of containers based on node labels.
In our case, we’re using a CNI with DaemonSets to perform network setup, and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For very specific cases, we have workarounds like sidecars. If these DaemonSets need more info, that's their escape path.

@cmwylie19
Copy link

cmwylie19 commented Jul 9, 2024

Notes I got from the meeting so they can be fixed and address later when there is time:

  • Document behavior around restarts as it relates to fileMounts and env (values may change)
  • Elaborate/Fix use-cases (can't use vpn for example based on fixed values)

updated since we are not considering using kubelet flags

@thockin
Copy link
Member

thockin commented Jul 9, 2024

I am -2 on kubelet flags. In most managed providers, those flags are not accessible to users or admins, and differeing opinions between providers will make portability WORSE, not better.

We'd do better to define a small set of extension mechanisms and leave it fixed at that.

-->

## Alternatives

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading @thockin comments about kubelet-sidecar. Maybe you can comment here about some alternatives that have been purusued or discussed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other alternative I can think of is a mutating webhook that intercepts pods/binding requests and sets the topology values as annotations on the Binding. Due to how the Pod binding registry works, these should then be copied across to the Pod object (as seen here: https://github.com/kubernetes/kubernetes/blob/0ce7f7e25e98ab503e317163aee986395d8f8272/pkg/registry/core/pod/storage/storage.go#L238-L243).

The downside of this being that we can't do the same for pod labels, although this is a non-issue when using the downward API as it also supports annotations as a source.


Workarounds today typically involve using an initContainer to query the
Kubernetes API and then pass data via shared volume to other containers within
the same pod. This adds additional demand on the API server and is burdensome
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the argument for doing this in spite of those workarounds is that we don't have a fine-grained enough authorisation model that would permit a Pod to only be able to read these 'topology labels', therefore we end out granting far broader permissions (which in some Kubernetes deployments is an unacceptable trade-off).

I think the scale argument is valid, albeit less of a primary driving factor than us being able to work around gaps in authZ/RBAC today.

the same pod. This adds additional demand on the API server and is burdensome
compared to the ease of using downwardAPI for pod labels and metadata.

### Goals
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also call out that this would become a property that load balancers/Service objects could select upon too, if it is a label.

This may create a bit of confusion wrt topology aware routing, and its usage as a selector label should likely be discouraged/its caveats noted in documentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@munnerz you are implying a property which is not stated here, which your POC provides but I think the original idea of this KEP does not.

This KEP says that the ONLY goals are for a pod to be able to access the node's labels via volumens and env. It does NOT say that it is a goal to make those labels actually visible in the API for use by outsider observers (which would include LBs).

IOW - do we think it is a goal for users, LBs, etc to be able to do
kubectl get pods -l topology.k8s.io/zone=central
?

keps/sig-node/4742-node-labels-downward/README.md Outdated Show resolved Hide resolved
- The `<domain>.x.topology.kubernetes.io` prefix is likewise allocated for use by end users or third-parties. The `<domain>` portion is treated the same as a "normal" label prefix. For example, `example.com.x.topology.k8s.io/label-name`.
- All labels using the `topology.kubernetes.io` or `*.topology.kubernetes.io` prefix spaces are considered "safe" for workloads. A workload may be exposed to the values of these labels which directly apply to the workload. For example, a pod may learn the topology of the node on which it is running.

The idea is that we will expose those labels from nodes to pods via a literal copy from the Node, for instance using the method `GetNode` from Kubelet in the `podFieldSelectorRuntimeValue` function and `volume.VolumeHost` `GetNodeLabels` function in the `CollectData` function in the downward API.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of plumbing this via the kubelet, making the actual node itself responsible for setting the value, I think we should consider setting this within the 'PodBinding' registry implementation here: https://github.com/kubernetes/kubernetes/blob/0ce7f7e25e98ab503e317163aee986395d8f8272/pkg/registry/core/pod/storage/storage.go#L202-L252

This allows us to ensure we atomically set the topology label values at the same time the pod is bound to the Node, and also happens within the apiserver at the time the Pod is persisted.

I am a little uncertain on who the field manager should be for these labels however, and if we set the label values here I believe they won't have a properly defined field manager either (though would need to confirm). As an alternative, we could extend this area of code to copy across labels from Binding objects (similar to how annotations are already copied today in this function), which would allow us to set the label values during admission which I believe happens prior to the managed fields entries being re-computed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an important difference and the tradeoffs need to be documented

-->

## Alternatives

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other alternative I can think of is a mutating webhook that intercepts pods/binding requests and sets the topology values as annotations on the Binding. Due to how the Pod binding registry works, these should then be copied across to the Pod object (as seen here: https://github.com/kubernetes/kubernetes/blob/0ce7f7e25e98ab503e317163aee986395d8f8272/pkg/registry/core/pod/storage/storage.go#L238-L243).

The downside of this being that we can't do the same for pod labels, although this is a non-issue when using the downward API as it also supports annotations as a source.

@munnerz
Copy link
Member

munnerz commented Sep 3, 2024

I have put together an example implementation (NOT setting via the kubelet however) to gather feedback on approaches here: kubernetes/kubernetes#127092 - hopefully this can help facilitate some discussion!

topology labels would provide users a straightforward, maintainable way to
optimize their workloads given topology constraints.

Workarounds today typically involve using an initContainer to query the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe mention the NRI workaround as well: kubernetes/kubernetes#40610 (comment)

### Non-Goals

* Not to expose additional node info outside of labels
* Not to pass any additional node labels other than `topology.k8s.io/*` to pods
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may/will likely cause an abuse of topology labels to pass other information beyond topology. Please add a section expanding on this way the feature will be abused

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually hope that this feature can be used (abuse is a strong word 😄) as an extensibility mechanism, and added the caveat that topology.k8s.io/* is strictly reserved for the Kubernetes project, but *.topology.k8s.io/* is free for user-defined node fields that pods should have access to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Sergey's point is that it will likely get used for things that are not topology related. I agree it's likely and I would be OK (in a different KEP) to discuss reserving another prefix space for other things.


* Not to expose additional node info outside of labels
* Not to pass any additional node labels other than `topology.k8s.io/*` to pods
* Not to guarantee the label value assigned at pod creation is the most recent node label value because it is assigned at pod creation time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we guarantee consistency across containers? Across container restarts?

Copy link
Member

@pacoxu pacoxu Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should follow the current behavior with pod label and pod resource limit/request(after pod resource inplace update support, it is tunable) of Downward API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be written down explicitly

but to use existing logic to extend the current fieldRef as it is used today
to pass pod metadata to pods using the downwardAPI.

## Implementation Details
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will we need to somehow allow pod owner to find out which values were used? How the troubleshooting experience will look like for the cases when some labels were changed

but to use existing logic to extend the current fieldRef as it is used today
to pass pod metadata to pods using the downwardAPI.

## Implementation Details
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any race conditions between pod binding and some controller assigning labels? Anything we should do about it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, if it is implemented as an init container with permissions, there may be implementation that waits for those labels. If it's natively implemented by kubelet, there is no synchronization possible and people switching from one model to another may get into the race condition situation

@ibihim
Copy link

ibihim commented Sep 9, 2024

/assign @nilekhc


Workarounds today typically involve using an initContainer to query the
Kubernetes API and then pass data via shared volume to other containers within
the same pod. This adds additional demand on the API server and is burdensome
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK the first point (sadly).

WRT scale, I think we are not disagreeing :)

- The `<domain>.x.topology.kubernetes.io` prefix is likewise allocated for use by end users or third-parties. The `<domain>` portion is treated the same as a "normal" label prefix. For example, `example.com.x.topology.k8s.io/label-name`.
- All labels using the `topology.kubernetes.io` or `*.topology.kubernetes.io` prefix spaces are considered "safe" for workloads. A workload may be exposed to the values of these labels which directly apply to the workload. For example, a pod may learn the topology of the node on which it is running.

The idea is that we will expose those labels from nodes to pods via a literal copy from the Node, for instance using the method `GetNode` from Kubelet in the `podFieldSelectorRuntimeValue` function and `volume.VolumeHost` `GetNodeLabels` function in the `CollectData` function in the downward API.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an important difference and the tradeoffs need to be documented

the same pod. This adds additional demand on the API server and is burdensome
compared to the ease of using downwardAPI for pod labels and metadata.

### Goals
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@munnerz you are implying a property which is not stated here, which your POC provides but I think the original idea of this KEP does not.

This KEP says that the ONLY goals are for a pod to be able to access the node's labels via volumens and env. It does NOT say that it is a goal to make those labels actually visible in the API for use by outsider observers (which would include LBs).

IOW - do we think it is a goal for users, LBs, etc to be able to do
kubectl get pods -l topology.k8s.io/zone=central
?


# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.31"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shooting for alpha in 32? Clock is ticking...

@andrewsykim
Copy link
Member

I think KubeRay would really benefit from this proposal in order to propoagate node topology into the Raylet. Are we planning to push this forward in v1.32?

@thockin
Copy link
Member

thockin commented Sep 29, 2024

The KEP window is almost done - is this shooting to make 1.32?

@docandrew
Copy link
Author

I'd love that but will likely need some hand-holding to help shepherd this through the process.

@thockin
Copy link
Member

thockin commented Sep 30, 2024

@munnerz has a POC of some parts of this - the question to be answered is what are the tradeoffs we are making between different approaches, and which do you think are "best".

@munnerz
Copy link
Member

munnerz commented Sep 30, 2024

I'd love that but will likely need some hand-holding to help shepherd this through the process.

I'd be happy to help - I'll ping you on slack 😊

Edit: or ping me @munnerz as I'm not 100% sure which slack account is yours 👀

@thockin
Copy link
Member

thockin commented Oct 4, 2024

I see some pushes but I don't know if that means "please go read it again" or if it is still in-progress. I will assume the latter (WIP) until I hear otherwise.

@docandrew
Copy link
Author

Not quite ready yet - I think @munnerz was going to add some additional details and hopefully correct me on some things, but getting close!

are useful for pods as well to be able to make application decisions based on
the region or zone the pod is running in. This KEP proposes to make these labels
available to pods while also expanding upon KEP 1659 to allow for user-defined
labels in the `*.topology.kubernetes.io` namespace.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that kubernetes.io/hostname is also needed

@andrewsykim
Copy link
Member

Can we prioritize this for v1.33? Seems like all the pieces are there but we just need to make some final changes to the KEP. I'm happy to drive this if there are no owners.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.