ODH-ADR-Operator-0005: Whitelist component fields for user customizations #32

VaishnaviHire · 2024-03-07T22:17:39Z

Description

This ADR is a subset of feature requests defined in #23 .

How Has This Been Tested?

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work

VaishnaviHire · 2024-03-07T22:25:24Z

/cc @israel-hdez @danielezonca @etirelli @LaVLaS

zdtsw · 2024-03-08T07:24:36Z

operator/ODH-ADR-Operator-0005-configure-resources.md

+      customization:
+        - name: kserve-controller-manager
+           resources:
+             requests:
+              cpu: 100m
+              memory: 128Mi
+             limits:
+              cpu: 200m
+              memory: 256Mi
+        - name: odh-model-controller
+          resources:
+             requests:
+              cpu: 100m


do we want this to be in the "default" config, e.g when create default-dsc from a clean installation or the template "Yaml Form" should set the default request values.
biggest concern is, user will need to know exactly the name of the deployment, i.e kserve-controller-manager in order to make this logic work.

No, this will not be in default config. Default resources will be set to the one mentioned in manifests. This is an optional config to be added by users when required.

bartoszmajsak

After reading this ADR, the main question I have is:

Why do we reconcile everything back to the default state?

Shouldn't we treat the components' manifests as a sort of "reasonable" default and allow users to adjust them according to their needs? KfDef version (v1) might have been too liberal in this regard, as it merely checked if a given resource already existed and skipped applying if that was the case. I understand that we want to have control over certain aspects, but this proposal suggests we might have gone too far with this approach.

Perhaps we should start considering what changes users can make (or define what is not changeable) to the resources we create and allow them to do so.

bartoszmajsak · 2024-03-08T11:28:39Z

operator/ODH-ADR-Operator-0005-configure-resources.md

+        - name: kserve-controller-manager
+           resources:
+             requests:
+              cpu: 100m
+              memory: 128Mi
+             limits:
+              cpu: 200m
+              memory: 256Mi


Similarly to the previous #23 ADR, I am against this type of API change. By doing that we are leaking implementation details of RHOAI components to the rather abstract concept of DSC and Components. Moreover, unless it's a free-form field (e.g. raw extension), we are locking API to the current state of component implementation, making it tricky with changes moving forward.

I do agree with the deployment name being exposed. Alternatively we can just get resource configuration and apply it to all deployments for a component. However that may cause issues with components having multiple deployments/controllers

bartoszmajsak · 2024-03-08T08:54:12Z

operator/ODH-ADR-Operator-0005-configure-resources.md

+ODH component deployments lack a mechanism for customizing resource limits and requests. Deployment resources are hard-coded in the manifests with no means to adjust them according to user requirements.
+We need a mechanism for users to configure resources when available resources are limited.


Can we add concrete use cases that need this functionality so that we can understand the rationale better?

I am aware of too many replicas of dashboard deployment (which in general I am wondering why we have so high in the first place). What are the others in e.g. KServe?

Yeah @israel-hdez Can you elaborate on the Kserve usecase for resource configuration?

Right now we have this issue in upstream KServe: kserve/kserve#3467.
We have applied a workaround by adding more resources in ODH manifests here to unblock some people on using KServe: opendatahub-io/kserve#261.

Despite a fix is already in progress, these kind of workarounds would not require users to wait for a release if we could configure resource limits/requests. And, well... I think users should be able to customize the resource limits, based on their expected number of resources.

I also barely remember there were other similar instances in other model serving components -- so, this is not KServe specific.

bartoszmajsak · 2024-03-08T11:28:39Z

operator/ODH-ADR-Operator-0005-configure-resources.md

+        - name: kserve-controller-manager
+           resources:
+             requests:
+              cpu: 100m
+              memory: 128Mi
+             limits:
+              cpu: 200m
+              memory: 256Mi


Similarly to the previous #23 ADR, I am against this type of API change. By doing that we are leaking implementation details of RHOAI components to the rather abstract concept of DSC and Components. Moreover, unless it's a free-form field (e.g. raw extension), we are locking API to the current state of component implementation, making it tricky with changes moving forward.

VaishnaviHire · 2024-03-08T15:21:46Z

After reading this ADR, the main question I have is:

Why do we reconcile everything back to the default state?

Shouldn't we treat the components' manifests as a sort of "reasonable" default and allow users to adjust them according to their needs? KfDef version (v1) might have been too liberal in this regard, as it merely checked if a given resource already existed and skipped applying if that was the case. I understand that we want to have control over certain aspects, but this proposal suggests we might have gone too far with this approach.

Perhaps we should start considering what changes users can make (or define what is not changeable) to the resources we create and allow them to do so.

I do agree with letting users update the resources, I am just not sure how we can limit the scope of the updates. So if we do not reconcile to the desired state, there can be usecase when users do multiple config updates and result in an error state. Maybe we can add a way for user to return to defaults if they are in such error state?

Kfdef actually provided no way to customize, it used to reconcile everything back to the default state.

bartoszmajsak · 2024-03-08T22:00:41Z

I do agree with letting users update the resources, I am just not sure how we can limit the scope of the updates

Maybe we could find a simple way to define paths/fields of a particular resource that we will skip during reconciliation. We can for example remove such a field from the patch operation when applying the original, "desired" state. Or flipping this around - reconciliation will apply a patch with only the fields that we really want to make sure are not changed.

So if we do not reconcile to the desired state, there can be usecase when users do multiple config updates and result in an error state. Maybe we can add a way for user to return to defaults if they are in such error state?

I am not sure about this one. My line of thinking is more of what kind of changes would result in "unsupported configuration". For example, we might want to make sure our images are always reconciled. But things like resources are very dependent on the target env and usage patterns so this could be skipped from reconciliation.

Kfdef actually provided no way to customize, it used to reconcile everything back to the default state.

I have only worked on the plugin part of KfDef and in there when you were creating so-called "raw resources" it was skipping apply during reconciliation when the resource already existed. So I stand corrected :)

OTOH it lets you provide your own overlays to customize the kustomize :) I am not saying this is the way to go, but it was possible without blowing up the KfDef API.

israel-hdez · 2024-03-13T01:09:47Z

Why do we reconcile everything back to the default state?

Shouldn't we treat the components' manifests as a sort of "reasonable" default and allow users to adjust them according to their needs?

I know we can go this way to be quite flexible. However, I'm more on the side that it is the operator CRDs that should expose the needed settings, rather than instructing the users to do it themselves.

IMHO, the more we allow for such flexibility, the more the operator becomes less valuable -- i.e. why not just provide the manifests to the user?

Personally, I'm quite sold to the explanation in operator framework website:

With Operators you can stop treating an application as a collection of primitives like Pods, Deployments, Services or ConfigMaps, but instead as a single object that only exposes the knobs that make sense for the application.

So, the general idea that this ADR is proposing is good to me.

israel-hdez · 2024-03-13T01:13:09Z

operator/ODH-ADR-Operator-0005-configure-resources.md

+              cpu: 200m
+              memory: 256Mi
+        - name: odh-model-controller
+          resources:


The odh-model-controller is weird, as this is shared between ModelMesh and KServe.
It starts to become more problematic.

bartoszmajsak · 2024-03-13T07:01:17Z

Why do we reconcile everything back to the default state?
Shouldn't we treat the components' manifests as a sort of "reasonable" default and allow users to adjust them according to their needs?

I know we can go this way to be quite flexible. However, I'm more on the side that it is the operator CRDs that should expose the needed settings, rather than instructing the users to do it themselves.

IMHO, the more we allow for such flexibility, the more the operator becomes less valuable -- i.e. why not just provide the manifests to the user?
Personally, I'm quite sold to the explanation in operator framework website:

With Operators you can stop treating an application as a collection of primitives like Pods, Deployments, Services or ConfigMaps, but instead as a single object that only exposes the knobs that make sense for the application.

So, the general idea that this ADR is proposing is good to me.

Treating a component as a whole sounds like the right thing to do, I am not opposing that, nor do I suggest going back to "using primitives". I fear that CRD can become bloated with different levels of abstraction. That's why I started thinking that perhaps we can open up certain aspects of our building blocks for the users to fine-tune instead of exposing those in our CRDs.

Is the suggested approach based on the assumption that the "user" is someone not knowledgable in the k8s domain, so instead of letting them fine-tune certain "low-level" settings, we are going to expose those low-level settings in some other place? I wonder if by doing so aren't we exactly exposing knobs?

israel-hdez · 2024-03-13T18:28:13Z

perhaps we can open up certain aspects of our building blocks for the users to fine-tune

But, then, aren't these building blocks the primitives? I think this confuses me.

so instead of letting them fine-tune certain "low-level" settings, we are going to expose those low-level settings in some other place? I wonder if by doing so aren't we exactly exposing knobs?

On this point I agree with you that the different high/low level settings look weird. However, I do think resource management is still within the scope of an operator. IMO, it depends on how much an operator is planned to evolve on its capability level. A level-1 operator would require users to tune these low level settings, while a level-5 operator would relief the user from resource tuning.

AFAIK, we don't have the data to automate the resource tuning, so IMO we need the user to do it for now. I think this is quite clear, and the discussion seems to be around on how the user is supposed to do it.

Maybe, a middle ground is to expose in the CRD a high level setting like kserve.resources.expectedDeployedModels and similarly modelmesh.resources.expectedDeployedModels so that the operator does some math and adjusts the low-level settings of all impacted components (kserve, modelmesh, odh-model-controller, ¿etc?). But I don't think we have data to do this.

VaishnaviHire · 2024-03-18T15:33:59Z

Maybe, a middle ground is to expose in the CRD a high level setting like kserve.resources.expectedDeployedModels and similarly modelmesh.resources.expectedDeployedModels so that the operator does some math and adjusts the low-level settings of all impacted components (kserve, modelmesh, odh-model-controller, ¿etc?). But I don't think we have data to do this.

Will this still require us to expose component specific details?
Should we consider components to have a minimal and default config(overlays). This will not be limited to only resources.

spec.components.Kserve.config: minimal
OR
spec.components.Kserve.config: default

@israel-hdez However I think this will require us to have data for minimal resources required by a component.

israel-hdez · 2024-03-19T18:49:31Z

@VaishnaviHire

Will this still require us to expose component specific details? Should we consider components to have a minimal and default config(overlays). This will not be limited to only resources.

My rationale tells me yes.

Should we consider components to have a minimal and default config(overlays). This will not be limited to only resources.
spec.components.Kserve.config: minimal
OR
spec.components.Kserve.config: default

I'm not sure overlays are suitable, because resources are something can need fine-tuning, and we already had got this feedback from the field.

This ADR is proposing to use the DataScienceCluster for that, and I think it is fine. Then, my understanding is that @bartoszmajsak is suggesting that it is better to whitelist fields of the resources so that the user can manually change those fields... this is also fine! My personal preference would be to do it in the DSC just because my understanding is that centralized configs is one of the added benefits of using operators (and this is what this ADR is proposing, so this is looking fine to me!).

In model serving we just need a mechanism to configure the resources. Whether this ADR is refined, or dropped in favor of whitelisting fields, or something else... that's more up to you :-)

VaishnaviHire · 2024-04-11T02:21:24Z

I will update the ADR to reflect the discussion to provide list of fields to whitelist across all components. See PR for Kserve specific changes.

VaishnaviHire · 2024-05-01T17:57:56Z

I have updated the adr to reflect implementation to whitelist fields across components
cc: @bartoszmajsak

etirelli · 2024-05-28T12:53:05Z

@VaishnaviHire is this ready for merging?

VaishnaviHire · 2024-06-05T12:43:55Z

Hi Yes, Just need lgtm from component owners

VaishnaviHire mentioned this pull request Mar 7, 2024

ODH-ADR-Operator-0004-replica-image-support #23

Closed

VaishnaviHire changed the title ~~Configure component resources through DSC~~ ODH-ADR-Operator-0005: Configure Component resources through DSC Mar 7, 2024

zdtsw reviewed Mar 8, 2024

View reviewed changes

bartoszmajsak reviewed Mar 8, 2024

View reviewed changes

israel-hdez reviewed Mar 13, 2024

View reviewed changes

VaishnaviHire mentioned this pull request Apr 11, 2024

Add Ability to configure resources in Kserve opendatahub-io/opendatahub-operator#965

Merged

3 tasks

Whitelist some component fields for user customizations

66a8f97

VaishnaviHire force-pushed the config_resources branch from 46bb171 to 66a8f97 Compare May 1, 2024 17:01

VaishnaviHire changed the title ~~ODH-ADR-Operator-0005: Configure Component resources through DSC~~ ODH-ADR-Operator-0005: Whitelist component fields for user customizations May 1, 2024

Add stakeholders

998c74b

etirelli approved these changes Jun 14, 2024

View reviewed changes

etirelli merged commit 54f60aa into opendatahub-io:main Jun 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ODH-ADR-Operator-0005: Whitelist component fields for user customizations #32

ODH-ADR-Operator-0005: Whitelist component fields for user customizations #32

VaishnaviHire commented Mar 7, 2024 •

edited

Loading

VaishnaviHire commented Mar 7, 2024

zdtsw Mar 8, 2024

VaishnaviHire Mar 8, 2024

bartoszmajsak left a comment •

edited

Loading

bartoszmajsak Mar 8, 2024

VaishnaviHire Mar 8, 2024

bartoszmajsak Mar 8, 2024

VaishnaviHire Mar 8, 2024

israel-hdez Mar 13, 2024

bartoszmajsak Mar 8, 2024

VaishnaviHire commented Mar 8, 2024

bartoszmajsak commented Mar 8, 2024 •

edited

Loading

israel-hdez commented Mar 13, 2024

israel-hdez Mar 13, 2024

bartoszmajsak commented Mar 13, 2024 •

edited

Loading

israel-hdez commented Mar 13, 2024 •

edited

Loading

VaishnaviHire commented Mar 18, 2024

israel-hdez commented Mar 19, 2024

VaishnaviHire commented Apr 11, 2024

VaishnaviHire commented May 1, 2024

etirelli commented May 28, 2024

VaishnaviHire commented Jun 5, 2024 •

edited

Loading

		ODH component deployments lack a mechanism for customizing resource limits and requests. Deployment resources are hard-coded in the manifests with no means to adjust them according to user requirements.
		We need a mechanism for users to configure resources when available resources are limited.

ODH-ADR-Operator-0005: Whitelist component fields for user customizations #32

ODH-ADR-Operator-0005: Whitelist component fields for user customizations #32

Conversation

VaishnaviHire commented Mar 7, 2024 • edited Loading

Description

How Has This Been Tested?

Merge criteria:

VaishnaviHire commented Mar 7, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bartoszmajsak left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VaishnaviHire commented Mar 8, 2024

bartoszmajsak commented Mar 8, 2024 • edited Loading

israel-hdez commented Mar 13, 2024

Choose a reason for hiding this comment

bartoszmajsak commented Mar 13, 2024 • edited Loading

israel-hdez commented Mar 13, 2024 • edited Loading

VaishnaviHire commented Mar 18, 2024

israel-hdez commented Mar 19, 2024

VaishnaviHire commented Apr 11, 2024

VaishnaviHire commented May 1, 2024

etirelli commented May 28, 2024

VaishnaviHire commented Jun 5, 2024 • edited Loading

VaishnaviHire commented Mar 7, 2024 •

edited

Loading

bartoszmajsak left a comment •

edited

Loading

bartoszmajsak commented Mar 8, 2024 •

edited

Loading

bartoszmajsak commented Mar 13, 2024 •

edited

Loading

israel-hdez commented Mar 13, 2024 •

edited

Loading

VaishnaviHire commented Jun 5, 2024 •

edited

Loading