
Pass down resources to CRI #4113

Open

wants to merge 27 commits into master
Conversation

@marquiz (Contributor) commented Jun 28, 2023

  • One-line PR description: KEP for extending the CRI API to pass down unmodified resource information from the kubelet to the CRI runtime.
  • Other comments:

Co-authored-by: Antti Kervinen <antti.kervinen@intel.com>
@k8s-ci-robot added labels cncf-cla: yes and kind/kep (Jun 28, 2023)
@k8s-ci-robot added labels sig/node and size/XXL (Jun 28, 2023)
@marquiz (Contributor Author) commented Jun 28, 2023

/cc @haircommander @mikebrow @zvonkok @fidencio @kad

@k8s-ci-robot (Contributor):
@marquiz: GitHub didn't allow me to request PR reviews from the following users: fidencio.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @haircommander @mikebrow @zvonkok @fidencio @kad

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment on lines 289 to 290
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 2;
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 3;
@haircommander (Contributor):

should the keys here be a special type instead of unstructured?

@marquiz (Contributor Author):

I don't think it's possible to have something like `type ResourceName string` in protobuf. Please correct me if I'm wrong.
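For context, protobuf map keys are limited to integral and string types, so a typed key alias is indeed not expressible. A minimal sketch of the shape under discussion (field numbers from the diff above; the message name is an assumption, not necessarily the KEP's final definition):

```proto
// Sketch only: protobuf map keys must be plain strings, so the
// Kubernetes ResourceName type degrades to string here.
message KubernetesResources {
  // requests and limits keyed by resource name,
  // e.g. "cpu", "memory", "vendor.com/xpu".
  map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 2;
  map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 3;
}
```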

@marquiz (Contributor Author):

ping @haircommander, are you satisfied with the reply (close as resolved)?

@marquiz mentioned this pull request (Jul 18, 2023)
@marquiz (Contributor Author) commented Jul 18, 2023

/retitle Pass down resources to CRI

@k8s-ci-robot k8s-ci-robot changed the title KEP: Initial version of the Pass down resources to CRI Pass down resources to CRI Jul 18, 2023
@zvonkok commented Aug 2, 2023

@marquiz We need to check how this will work with DRA and CDI devices, i.e. whether we have enough information to know which devices need to be added to the sandbox just from the resource claim name.

@zvonkok commented Aug 2, 2023

@marquiz There is already some code for sandbox sizing (accumulation of CPU and memory resources) for reference: kubernetes/kubernetes#104886, which we leverage in Kata. What are the plans for this interface: deprecate or keep it?

@zvonkok commented Aug 2, 2023

@bergwolf @egernst FYI

@elezar (Contributor) left a comment:

Thanks @marquiz.

It would be good to get more concrete details on the use cases that this would enable.
There is also the question of complex devices managed by device plugins, where there isn't a clear mapping from the resources entry (e.g. vendor.com/xpu: 1) to the resources added to the container, or of DRA, where the associated resources.requests.claims entry is not mentioned.


#### Story 3

As a cluster administrator, I want to install an OCI hook/runc wrapper/NRI
@elezar (Contributor):

Could you expand on this use case? How does extending the CRI translate to modifications in the OCI runtime specification which is interpreted by runc (or wrappers)?

@marquiz (Contributor Author):

The CRI changes (in this KEP) would not directly translate to anything in the OCI config. The data is just "informational": a hook/wrapper/plugin can use it to tweak the OCI config. Say you want to do customized CPU pinning in your plugin. I'll put some more flesh on this section...
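As a rough sketch of that pinning idea (not the KEP's design; the helper below and its names are hypothetical, and a real NRI plugin would go through the NRI stub API):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"

	"k8s.io/apimachinery/pkg/api/resource"
)

// pinnedCPUSet picks an exclusive cpuset for a container that requests a
// whole number of CPUs, using the unmodified requests passed down in the
// CRI message. Returns ok=false when pinning does not apply.
func pinnedCPUSet(requests map[string]resource.Quantity, freeCPUs []int) (string, bool) {
	q, ok := requests["cpu"]
	if !ok || q.MilliValue()%1000 != 0 {
		return "", false // fractional request: leave it to the shared pool
	}
	n := int(q.Value())
	if n == 0 || n > len(freeCPUs) {
		return "", false
	}
	ids := make([]string, n)
	for i := 0; i < n; i++ {
		ids[i] = strconv.Itoa(freeCPUs[i])
	}
	// The result would go into the OCI spec, e.g. linux.resources.cpu.cpus.
	return strings.Join(ids, ","), true
}

func main() {
	requests := map[string]resource.Quantity{"cpu": resource.MustParse("2")}
	if set, ok := pinnedCPUSet(requests, []int{2, 3, 4, 5}); ok {
		fmt.Println("cpuset.cpus:", set) // prints "2,3"
	}
}
```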

@marquiz (Contributor Author):

@elezar I updated Story 3, PTAL

@elezar (Contributor):

Thanks.

@marquiz (Contributor Author):

Resolved?

@marquiz (Contributor Author):

ping @elezar may I close this as resolved?

requests:
  cpu: 100m
  memory: 100M
  vendor.com/xpu: 1
@elezar (Contributor):

For clarification: This does not indicate the properties of the resource that was actually allocated for a container requesting one of these devices?

@marquiz (Contributor Author):

That's very much true. I think I'll add a note about this in the KEP somewhere

@marquiz (Contributor Author):

@elezar I added a note about device plugin resources after this example. WDYT?

@marquiz (Contributor Author):

ping @elezar may I close this as resolved?
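A hedged illustration of the point above (names and logic are my own, not from the KEP): the passed-down map carries the pod's declared quantities only, so a consumer can recognize an extended resource by its domain prefix but learns nothing about which device was actually allocated:

```go
package main

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Declared requests as they would arrive in the CRI message; these
	// are NOT the devices the device plugin actually handed out.
	requests := map[string]resource.Quantity{
		"cpu":            resource.MustParse("100m"),
		"memory":         resource.MustParse("100M"),
		"vendor.com/xpu": resource.MustParse("1"),
	}
	for name, qty := range requests {
		extended := strings.Contains(name, "/") // vendor resources carry a domain prefix
		fmt.Printf("%s = %s (extended resource: %v)\n", name, qty.String(), extended)
	}
}
```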

keps/sig-node/4112-passdown-resources-to-cri/README.md (outdated, resolved)
WindowsPodSandboxConfig windows = 9;
+
+ // Kubernetes resource spec of the containers in the pod.
+ PodResourceConfig pod_resources = 10;
Member:

@MikeZappa87 since you shared recently something along these lines for the networking capabilities, this KEP also means to interface with NRI
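For orientation, a sketch of how the new field could hang together; the PodResourceConfig and ContainerResourceConfig shapes below are assumptions pieced together from this conversation, not the KEP's final proto:

```proto
// Sketch only.
message PodSandboxConfig {
  // ...existing fields 1..9...

  // Kubernetes resource spec of the containers in the pod.
  PodResourceConfig pod_resources = 10;
}

message PodResourceConfig {
  // One entry per container, in kubelet creation order
  // (init containers first, then regular containers).
  repeated ContainerResourceConfig containers = 1;
}
```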

@zvonkok commented Aug 3, 2023

Another point to consider is how we're going to integrate (or not) these enhancements with the new containerd Sandbox API.

@marquiz (Contributor Author) commented Aug 3, 2023

There is already some code for sandbox sizing, accumulation of resources CPU and Memory for reference: kubernetes/kubernetes#104886 that we leverage in Kata, what are the plans for this interface, deprecate or keep it?

@zvonkok that one is just the native resources and gives the resources in the "obfuscated" form, i.e. not telling the actual requests/limits (plus it's for Linux resources only). I think we wouldn't, or even couldn't, touch this, i.e. we'd keep it.

@zvonkok commented Sep 1, 2023

Since the DevicePlugin API supports CDI devices with this KEP: #4011, we should try to add more restrictions and requirements on how we want to design this passthrough interface. @marquiz FYI

@johnbelamaric (Member) commented:

OK, PRR looks good - I will approve after the SIG gives approval.

Comment on lines +403 to +404
+ INIT_CONTAINER = 0;
+ SIDECAR_CONTAINER = 1;
Contributor:

is this distinction one we want to make? does the kubelet make that distinction? cc @SergeyKanzhelev

Member:

Inside the kubelet, it is known whether an "init" container will be "one shot" or will be running during the whole pod lifetime.
When passing down to CRI, we need to have that info for sizing VMs (i.e., to know how many containers are expected to run during the whole life of the pod).

Member:

It is a slippery slope to start introducing new types of containers. With sidecar containers you need to know two things:

  • that it is a sidecar
  • the position of the container in the list of init containers.

Position is very important as it defines how the total resources will be calculated.

Keeping the structure as close to the Pod spec as possible would be helpful if we introduce any other container types in the future where ordering or other attributes are also important.

Member:

/hold

to think about whether we can change the structure to make it closer to how the pod spec looks

Member:

and thus that structure version number thing I was thinking we might need...

Member:

not necessarily to version it the same as pod. I think once we get to Pod v2, we may have a new CRI API as well

@marquiz (Contributor Author):

The proposal had separate lists for init containers and containers. Based on review feedback (e.g. #4113 (comment)) it was changed in b3703c6 to a single list, with a "container type" field.

The order IS important. The kubelet creates containers sequentially, and the order of the single list is supposed to be that order (effectively append(initContainers, containers)).
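A sketch of why both the type field and the list order matter for right-sizing; this is my own illustration assuming the append(initContainers, containers) ordering described above, not kubelet code:

```go
package main

import "fmt"

// ctr mirrors the per-container info a runtime would get: its type and
// (for brevity) a single CPU request in millicores.
type ctr struct {
	typ      string // "init", "sidecar", or "container"
	cpuMilli int64
}

// peakCPURequest walks the single ordered list. One-shot init containers
// run sequentially, each alongside the sidecars started before it; the
// steady state is all sidecars plus all regular containers.
func peakCPURequest(ordered []ctr) int64 {
	var peak, sidecars, steady int64
	for _, c := range ordered {
		switch c.typ {
		case "init":
			if v := sidecars + c.cpuMilli; v > peak {
				peak = v
			}
		case "sidecar":
			sidecars += c.cpuMilli
		case "container":
			steady += c.cpuMilli
		}
	}
	if v := sidecars + steady; v > peak {
		peak = v
	}
	return peak
}

func main() {
	ordered := []ctr{
		{"init", 500},      // one-shot init
		{"sidecar", 200},   // restartable init (sidecar)
		{"init", 800},      // runs alongside the sidecar above
		{"container", 300},
		{"container", 400},
	}
	fmt.Println(peakCPURequest(ordered)) // 1000 = 200 + 800
}
```

Moving the sidecar after the second init container in the list changes the result (900m instead of 1000m in this example), which is exactly why position matters.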

Member:

Was the suggestion to mark init containers as sidecars? I am thinking: what if we introduce a new property of a container that can be exhibited by a container of any type? Will it require us to introduce 6 more container types, or will we just add a new field? If a new field, why not have a field "restartable" on init containers in this implementation?

@marquiz (Contributor Author):

The motivation for the proposed way is to keep it simple here: have the minimum to enable right-sizing of the Pod. Encode the expected lifecycle of each container in the Pod, enabling the calculation of "how many containers will/can be running at the same time".

FWIW, the only new type of container I came up with (but didn't include here) would be a "teardown container", running after all regular containers have finished.

+enum ContainerType {
+ INIT_CONTAINER = 0;
+ SIDECAR_CONTAINER = 1;
+ REGULAR_CONTAINER = 2;
Contributor:

IMO I'd just call this CONTAINER

Member:

Agree, regular is extraneous/default. Typing the container could prove useful. What happens when a container is not enumerated? I wonder if any container not defined at the time of running the pod should be considered ephemeral?

@marquiz (Contributor Author):

K, changed as suggested to plain CONTAINER
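After that change the enum reads (reconstructed from the diff context above plus the rename; alignment mine):

```proto
enum ContainerType {
  INIT_CONTAINER    = 0;
  SIDECAR_CONTAINER = 1;
  CONTAINER         = 2;
}
```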

@marquiz (Contributor Author):

I wonder if any container not defined at the time of running the pod be considered ephemeral?

Yes, I think that's the case

@haircommander (Contributor) commented:

A couple of questions about the exact API, but no need to address those here; we can come to consensus on implementation.

/approve

from my side, @mikebrow any additional thoughts?

Comment on lines 256 to 257
- add UpdatePodSandboxResources CRI rpc (this is covered by [KEP-1287][kep-1287], [PR][kep-1287-beta-pr])
- add pod-level resource requirements (this is covered by [KEP-2837][kep-2837], [PR][kep-2837-alpha-pr])
@SergeyKanzhelev (Member):

Can you please clarify what those two statements mean? Do you mean that those KEPs will need to contribute to this API?

KEP-1287 is targeting beta, while this one is targeting alpha. So this KEP will need to clarify how it will work with in-place updates.

cc: @tallclair @ndixita

@marquiz (Contributor Author):

I could also remove these from the non-goals. I added them here because we talk about the UpdatePodSandboxResources message and the pod-level resource requirements later on in the proposal; I wanted to clarify that these additions are not part of this proposal (KEP-4112).

UpdatePodSandboxResources is not yet in the CRI API. KEP-1287 proposes to add it in Beta, so we'd want to sync with that.

WDYT @SergeyKanzhelev ?

@SergeyKanzhelev (Member):

I mostly want to make sure we are not making changes that are not compatible with these two KEPs. So listing them as non-goals sounds like a bad idea. Maybe not a goal for alpha, but definitely a goal for beta.

@marquiz (Contributor Author):

That's certainly what I want, too: to be in sync between all the KEPs. I tried to state that this KEP (4112)

  • is NOT proposing to add the new UpdatePodSandboxResources CRI rpc
  • is NOT proposing to add pod-level resource requirements
  • is proposing to align with those features when they land

The design details section contains details of what changes (to align) would be made.

I have now dropped these two non-goals as they caused more confusion than clarity. Maybe, instead, we should add a Goal to align with them?


With this information, the runtime can for example do detailed resource
allocation so that CPU, memory and other resources for each container are
optimally aligned.
Member:

How does this work with the CPU manager? Are there any conflicts, or any information we need to pass earlier than we do today w.r.t. decisions the CPU manager made?


@marquiz (Contributor Author):

CPU manager is not affected. It does its work before we pass the info down to the runtime. This particular sentence is referring to scenarios where the cpu manager none policy is used in Kubernetes. Do you think that this should be mentioned in the proposal?

Member:

should be mentioned.. with (story 1/2) and without cpu manager (story 3)

Member:

+1 to mention. Do any hints that the kubelet has need to be passed into the sandbox creation now?

@marquiz (Contributor Author):

+1 to mention.

I added the following sentence to this paragraph:
This applies to scenarios where the kubelet CPU manager is disabled (by using the none CPU manager policy).

Do any hints that the kubelet has need to be passed into the sandbox creation now?

No. I don't think we want/need to spill these details over. Also, the runtime (or NRI plugin) might have a better understanding of the HW topology.

Also the CreateContainer request is extended to include the unmodified resource
requirements. This makes it possible for the CRI runtime to detect any changes
in the pod resources that happen between the Pod creation and container
creation, e.g. in scenarios where in-place pod updates are involved.
Member:

Is it a possibility? Would there be a sandbox adjustment call in between? Or is it not expected?

@tallclair are we allowing the container resource changes between the sandbox creation and container creation? Maybe while an init container is executing?

@marquiz (Contributor Author):

TBH, I'm not entirely sure about all the dirty details here. But that was exactly the thinking: a long-running init container, and the pod gets updated before it completes. And, because there are a lot of delicate details, this would also mean small adjustments to the container lifecycle handling in the future.

@mikebrow (Member):

and for ephemeral?

@marquiz (Contributor Author):

Currently, ephemeral containers cannot have any resource requirements (at the Kubernetes API/PodSpec level). But your comment @mikebrow is good in that it might change in the future(?) Nothing in CRI or OCI prevents ephemeral containers from having resources.

@mikebrow (Member):

would have to check but I think the extra mounts may be different to include debug tools missing from a copied container in the pod..

@marquiz (Contributor Author):

True, at least the mount for the logfile is probably different

@mikebrow (Member):

nod.. so a certain exception for that container type.. also not clear we should explicitly fail if the mounts have changed, which they might on a container restart due to modified volume drivers..?

@mikebrow (Member):

detect sure.. what to do with that detection is probably a long running set of discussions and use case definitions. Stretch goals.

@marquiz (Contributor Author):

No no, there are no changes wrt mounts (or devices) in the CreateContainer request. Everything continues to work as it does now. Not related to this KEP, but with ephemeral containers there might be scenarios where they cannot be supported, e.g. CoCo.

Member:

As @mikebrow said, if we have this information on sandbox creation and for some reason anything takes a dependency on it, it will open the whole can of worms of how to handle all sorts of edge cases.

in the pod resources that happen between the Pod creation and container
creation in e.g. scenarios where in-place pod updates are involved.

The UpdatePodSandboxResources CRI message is also updated when/if that is
Member:

Updated in the sense that there will be an additional call in between?

@marquiz (Contributor Author):

No, the message itself will be updated. I re-phrased this paragraph to:

[KEP-1287][kep-1287] Beta ([PR][kep-1287-beta-pr]) proposes to add a new
UpdatePodSandboxResources rpc to the CRI API. If/when KEP-1287 is implemented
as proposed, the UpdatePodSandboxResources CRI message is updated to include
the resource information of all containers (aligning with
UpdateContainerResourcesRequest).
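Purely as an illustration of the intended alignment (KEP-1287 owns the rpc; the field names and numbers below are assumptions, not either KEP's final proto):

```proto
// Sketch only.
rpc UpdatePodSandboxResources(UpdatePodSandboxResourcesRequest)
    returns (UpdatePodSandboxResourcesResponse) {}

message UpdatePodSandboxResourcesRequest {
  string pod_sandbox_id = 1;
  // Updated resource information for all containers, aligning with
  // UpdateContainerResourcesRequest.
  PodResourceConfig pod_resources = 2;
}
```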

The UpdatePodSandboxResources CRI message is also updated when/if that is
introduced by the [KEP-1287][kep-1287] Beta ([PR][kep-1287-beta-pr]).

Information about the Pod-level resources are added when/if the Pod-level
Member:

@ndixita can you please confirm it is tracked in your KEP?

@marquiz (Contributor Author):

Re-phrased this paragraph to

[KEP-2837][kep-2837] Alpha ([PR][kep-2837-alpha-pr]) proposes to add a new
Pod-level resource requirements field to the PodSpec. This information will
be added to the PodResourceConfig message, similar to the container resource
information, if/when KEP-2837 is implemented as proposed.

The intention is to sync with the other KEPs

Contributor:

This is not tracked in the Pod Level Resources KEP. While I am trying to understand this KEP, I have a question: will the VM-based runtimes benefit from knowing the overall requirements of all the containers in a pod rather than the requirements of individual containers? Again, this question is just for my understanding...

@k8s-ci-robot added the do-not-merge/hold label (Oct 9, 2024)
request.

```diff
message PodSandboxConfig {
```
@SergeyKanzhelev (Member) commented Oct 9, 2024:

How big will this structure be? What would be the max number of containers we can report in this structure? (What is the message size limit today?)

@marquiz (Contributor Author):

From a quick search, the gRPC message size limit is 4MB. This is a lot more than the maximum size of a Pod object in Kubernetes. I believe we should not hit the limit in any practical scenario. WDYT?

@SergeyKanzhelev (Member):

4MB may be OK.
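For reference, grpc-go's default per-message receive limit is 4 MiB and can be raised per connection if it ever became a concern; a sketch using standard grpc-go options (the endpoint path is a placeholder):

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func dialCRI(endpoint string) (*grpc.ClientConn, error) {
	// The default receive limit is 4 MiB; raise it explicitly if pod
	// specs were ever to approach that bound.
	return grpc.Dial(endpoint,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(16*1024*1024)),
	)
}

func main() {
	conn, err := dialCRI("unix:///run/containerd/containerd.sock") // path assumed
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```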

- rename enum REGULAR_CONTAINER -> CONTAINER
- update references to other related keps under Design Details
WindowsPodSandboxConfig windows = 9;
+
+ // Kubernetes resource spec of the containers in the pod.
+ PodResourceConfig pod_resources = 10;
Member:

What is the runtime supposed to do when this does not match the values provided in resources of LinuxPodSandboxConfig?

@marquiz (Contributor Author):

There should be no overlapping information between these two structures, thus a conflict should not be possible.

Member:

I am not sure I understand. Maybe I am missing something here, but you should be able to calculate what is in resources based on the pod spec, is that correct? So some implementations may take a dependency on the specific way one is calculated. Thus the question: what if they stop matching at some point? Also, if they are inter-dependent, perhaps some of it may be deprecated?

@marquiz (Contributor Author):

The runtime is not calculating the resources of LinuxPodSandboxConfig; they are set by the kubelet.


The resources (mounts, devices, CDI devices, Kubernetes resources) in the
CreateContainer request should be identical to what was (pre-)informed in the
RunPodSandbox request. If they are different, the CRI runtime may fail the
Member:

How would we separate cases where the VM was restarted and we no longer really know what was originally passed to the sandbox creation? I do not understand this consistency check argument.

@marquiz (Contributor Author):

Yes, this paragraph seems outdated, especially in light of the in-place pod updates. The paragraph was added as a response to some review feedback (I just couldn't find that comment now).

How about re-phrasing it, for example:

In some usage scenarios the container creation may fail if the resources
(mounts, devices, CDI devices, Kubernetes resources) in the CreateContainer
request do not match what was (pre-)informed in the RunPodSandbox request.
This may be the case e.g. when changes are not allowed after a VM-based Pod
has been created.

WDYT @SergeyKanzhelev ?

Member:

I believe he means the original pod metadata plus any pod updates must be applied to the pod before the container is created.

### kubelet

Kubelet code is refactored/modified so that all container resources are known
before sandbox creation. This mainly consists of preparing all mounts (of all
@SergeyKanzhelev (Member):

Is there a possibility that today some mounts can ONLY be created when the container is being created (something that an init container needs to initialize)? Will this break that behavior?

@marquiz (Contributor Author):

This was discussed/investigated in #4113 (comment)

I think there should be no breakages because of this.

WDYT @SergeyKanzhelev ?

@mikebrow (Member) commented Oct 9, 2024:

When appropriate, image mounts included in the container config are added at create-container time when we build the rootfs. Additionally, OCI-level plugins, NRI plugins, CDI, and CRI proxy tools may add mounts. But these are not the "extra" mounts passed in from the host-level CRI client (kubelet); additionally, the OCI image volume mount is added by the container runtime during create container. In the OCI volume case this is why I'm asking for the OCI volume image reference..

@marquiz (Contributor Author):

The Mount structure is exactly the same (with the same information) as what will be passed as part of the CreateContainerRequest/ContainerConfig message.

@mikebrow (Member):

that works..

Member:

it actually goes back to this comment: #4113 (comment)

Would it be expected that runtimes will need to mount everything at pod sandbox creation time now? Or is the behavior not defined? Will we require that the old way (without passing all this extra context in sandbox creation) continue working?

@marquiz (Contributor Author):

The runtimes do not need to change the way they operate. They don't need to mount everything at sandbox creation.

All of the data is something that the runtime CAN use if it needs it. E.g. this paragraph of the proposal tries to state that. Maybe we need to state it more clearly in more places(?) And certainly in comments in the CRI proto file.

Member:

The CAN part worries me. If something CAN be used, the only assumption we can make is that it IS BEING USED. This is why maybe we should have two modes of operation.

Member:

@SergeyKanzhelev as was mentioned above, this information is really needed in some scenarios, and OCI runtimes that depend on that information will use it to fix the current problem. Other runtimes that don't have problems now are expected to ignore it.

@SergeyKanzhelev (Member) commented Oct 9, 2024

Besides the comments above, I think it is important to clarify, and even test as part of critest, the requirements for the Container Runtime. It goes to this question: #4113 (comment)

Some critest ideas might be:

  1. We must add a test in critest that the old way (without all those specs) must continue to work (unless I am missing something and we are deprecating the current way).
  2. We should add tests on all sorts of mismatches and the runtime behavior in those cases.
  3. Test scenarios (whatever we may think of) where some resources (paths?) passed in the sandbox cannot be mounted yet (permissions? remounting?) and it will be OK to mount them later when the init container has completed (unless again we want to explicitly break this scenario).

I understand the challenge with the VM-based runtimes, but I am also worried that we are going against the direction of making the Pod more dynamic. All the duplication and mismatch between what we thought Pods and containers were before and what they will be is scary to me.

Another alternative would be to have two modes of runtime: one mode with all information available, where every dynamic change recreates the Pod, and another mode, what we have today. The runtime would need to tell on startup which mode it wants to work in.

@zvonkok commented Oct 10, 2024

I understand the challenge with the VM-based runtimes, but I am also worried that we are going against the direction of making the Pod more dynamic.

Can you elaborate on the direction of "making Pod more dynamic"?

All the duplication and mismatch between what we thought Pods and containers were before and what they will be is scary to me.

What are you scared of, or what are you imagining where this is going?

I understand the challenge with the VM-based runtimes

We will have more and more sandboxed environments that will need such information. The use case with Confidential Containers and GPUs (or any other accelerator) is not realizable with the current architecture.

Additionally, there is HW that you cannot hot-plug during runtime, and you need a static configuration at sandbox creation time.

@kad (Member) commented Oct 10, 2024

  1. We must add a test in critest that the old way (without all those specs) must continue to work (unless I am missing something and we are deprecating the current way).

This KEP is targeted at solving the problem that CoCo VMs have (which is impossible to fix without additional info passed down); it is not trying to deprecate or change the meaning of any other Pod lifecycle messages or events.

  2. We should add tests on all sorts of mismatches and the runtime behavior in those cases.

If there are no bugs in the kubelet code, there should be no mismatches between the raw data derived from the pod spec and the fields calculated into LinuxContainerResources. The kubelet owns it and passes/calculates it from a single in-memory object inside the kubelet.
Or which mismatches are you talking about?

  3. Test scenarios (whatever we may think of) where some resources (paths?) passed in the sandbox cannot be mounted yet (permissions? remounting?) and it will be OK to mount them later when the init container has completed (unless again we want to explicitly break this scenario).

Device and mount paths are provided at pod sandbox creation time as additional information. At the time RunPodSandbox is issued to the CRI, the kubelet has already resolved and prepared all the paths and devices. And even in situations where exact paths might not yet be available, that is still fine; they will be valid at container creation time. However, getting all mounts/devices at VM creation time is really critical for fixing the currently broken scenario in CoCo.

I understand the challenge with the VM-based runtimes, but I am also worried that we are going against the direction of making the Pod more dynamic. All the duplication and mismatch between what we thought Pods and containers were before and what they will be is scary to me.

In fact, we are not going against the direction of pods and containers being dynamic. It is the opposite: in every piece of additional data passed, we are reflecting the current state of the object, e.g. "at pod creation we know this much info about the workload", "on container creation, something already got changed, so here is the current state of that container". This allows us to do rightsizing of the VM during creation, including all possible heuristics based on information known at that time, as well as failing creation of a container on (re-)start if something has changed since RunPodSandbox and it doesn't fit anymore.

Another alternative would be to have two modes of runtime: one mode with all information available, where every dynamic change recreates the Pod, and another mode, what we have today. The runtime would need to tell on startup which mode it wants to work in.

We don't really need two modes. We just need enough information to make decisions at existing pod lifecycle events, as well as proper error handling for scenarios where dynamic resize of the pod or container is not feasible for some reason.

@mikebrow (Member) left a comment:

LGTM for moving forward with the KEP under the proviso that a follow-up KEP update is required to sync with 1287 and 2837, to cover message/API details not yet discovered, and other dangling issues still in open discussion. I also expect us to find needed additions when we have a CC prototype.

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: haircommander, marquiz, mikebrow
Once this PR has been reviewed and has the lgtm label, please assign dchen1107, wojtek-t for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mikebrow (Member) commented:

[quoting @SergeyKanzhelev's Oct 9 comment above on critest ideas and the two runtime modes]

Interesting points. Mode works to a certain degree, and we already have handler as just such a mode selection. I think the KEP will definitely need to be updated to handle the dynamic pod resource changes (new networks, different resource claims/devices) and, when a handler is found to not be dynamic, we'll just have to cycle the pod by returning an error that the handler does not support this update type and/or tainting the handler type ahead of time.

@zvonkok commented Oct 31, 2024

@SergeyKanzhelev @mrunalp How to move this forward?

@Apokleos commented Dec 6, 2024

Hi folks, any updates on this KEP? Any ideas to help move it forward?

- update kep.yaml to target v1.33
- update references to kep-1287 and kep-2837
@marquiz (Contributor Author) commented Dec 10, 2024

Updated:

The related changes in the other referenced KEPs (#1287, #2837) are now merged, which hopefully makes this a slightly easier read. Those changes/KEPs (for Kubernetes v1.32) are still mentioned in the corresponding sections.
