Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ODH-ADR-Operator-0005: Whitelist component fields for user customizations #32

Merged
merged 2 commits into from
Jun 14, 2024

Conversation

VaishnaviHire
Copy link
Member

@VaishnaviHire VaishnaviHire commented Mar 7, 2024

Description

This ADR is a subset of feature requests defined in #23 .

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

@VaishnaviHire
Copy link
Member Author

@VaishnaviHire VaishnaviHire changed the title Configure component resources through DSC ODH-ADR-Operator-0005: Configure Component resources through DSC Mar 7, 2024
Comment on lines 31 to 43
customization:
- name: kserve-controller-manager
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
- name: odh-model-controller
resources:
requests:
cpu: 100m
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want this to be in the "default" config, e.g when create default-dsc from a clean installation or the template "Yaml Form" should set the default request values.
biggest concern is, user will need to know exactly the name of the deployment, i.e kserve-controller-manager in order to make this logic work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this will not be in default config. Default resources will be set to the one mentioned in manifests. This is an optional config to be added by users when required.

Copy link
Contributor

@bartoszmajsak bartoszmajsak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading this ADR, the main question I have is:

Why do we reconcile everything back to the default state?

Shouldn't we treat the components' manifests as a sort of "reasonable" default and allow users to adjust them according to their needs? KfDef version (v1) might have been too liberal in this regard, as it merely checked if a given resource already existed and skipped applying if that was the case. I understand that we want to have control over certain aspects, but this proposal suggests we might have gone too far with this approach.

Perhaps we should start considering what changes users can make (or define what is not changeable) to the resources we create and allow them to do so.

Comment on lines 32 to 39
- name: kserve-controller-manager
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to the previous #23 ADR, I am against this type of API change. By doing that we are leaking implementation details of RHOAI components to the rather abstract concept of DSC and Components. Moreover, unless it's a free-form field (e.g. raw extension), we are locking API to the current state of component implementation, making it tricky with changes moving forward.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree with the deployment name being exposed. Alternatively we can just get resource configuration and apply it to all deployments for a component. However that may cause issues with components having multiple deployments/controllers

Comment on lines 20 to 21
ODH component deployments lack a mechanism for customizing resource limits and requests. Deployment resources are hard-coded in the manifests with no means to adjust them according to user requirements.
We need a mechanism for users to configure resources when available resources are limited.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add concrete use cases that need this functionality so that we can understand the rationale better?

I am aware of too many replicas of dashboard deployment (which in general I am wondering why we have so high in the first place). What are the others in e.g. KServe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah @israel-hdez Can you elaborate on the Kserve usecase for resource configuration?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we have this issue in upstream KServe: kserve/kserve#3467.
We have applied a workaround by adding more resources in ODH manifests here to unblock some people on using KServe: opendatahub-io/kserve#261.

Despite a fix is already in progress, these kind of workarounds would not require users to wait for a release if we could configure resource limits/requests. And, well... I think users should be able to customize the resource limits, based on their expected number of resources.

I also barely remember there were other similar instances in other model serving components -- so, this is not KServe specific.

Comment on lines 32 to 39
- name: kserve-controller-manager
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to the previous #23 ADR, I am against this type of API change. By doing that we are leaking implementation details of RHOAI components to the rather abstract concept of DSC and Components. Moreover, unless it's a free-form field (e.g. raw extension), we are locking API to the current state of component implementation, making it tricky with changes moving forward.

@VaishnaviHire
Copy link
Member Author

After reading this ADR, the main question I have is:

Why do we reconcile everything back to the default state?

Shouldn't we treat the components' manifests as a sort of "reasonable" default and allow users to adjust them according to their needs? KfDef version (v1) might have been too liberal in this regard, as it merely checked if a given resource already existed and skipped applying if that was the case. I understand that we want to have control over certain aspects, but this proposal suggests we might have gone too far with this approach.

Perhaps we should start considering what changes users can make (or define what is not changeable) to the resources we create and allow them to do so.

I do agree with letting users update the resources, I am just not sure how we can limit the scope of the updates. So if we do not reconcile to the desired state, there can be usecase when users do multiple config updates and result in an error state. Maybe we can add a way for user to return to defaults if they are in such error state?

Kfdef actually provided no way to customize, it used to reconcile everything back to the default state.

@bartoszmajsak
Copy link
Contributor

bartoszmajsak commented Mar 8, 2024

I do agree with letting users update the resources, I am just not sure how we can limit the scope of the updates

Maybe we could find a simple way to define paths/fields of a particular resource that we will skip during reconciliation. We can for example remove such a field from the patch operation when applying the original, "desired" state. Or flipping this around - reconciliation will apply a patch with only the fields that we really want to make sure are not changed.

So if we do not reconcile to the desired state, there can be usecase when users do multiple config updates and result in an error state. Maybe we can add a way for user to return to defaults if they are in such error state?

I am not sure about this one. My line of thinking is more of what kind of changes would result in "unsupported configuration". For example, we might want to make sure our images are always reconciled. But things like resources are very dependent on the target env and usage patterns so this could be skipped from reconciliation.

Kfdef actually provided no way to customize, it used to reconcile everything back to the default state.

I have only worked on the plugin part of KfDef and in there when you were creating so-called "raw resources" it was skipping apply during reconciliation when the resource already existed. So I stand corrected :)

OTOH it lets you provide your own overlays to customize the kustomize :) I am not saying this is the way to go, but it was possible without blowing up the KfDef API.

@israel-hdez
Copy link

Why do we reconcile everything back to the default state?

Shouldn't we treat the components' manifests as a sort of "reasonable" default and allow users to adjust them according to their needs?

I know we can go this way to be quite flexible. However, I'm more on the side that it is the operator CRDs that should expose the needed settings, rather than instructing the users to do it themselves.

IMHO, the more we allow for such flexibility, the more the operator becomes less valuable -- i.e. why not just provide the manifests to the user?

Personally, I'm quite sold to the explanation in operator framework website:

With Operators you can stop treating an application as a collection of primitives like Pods, Deployments, Services or ConfigMaps, but instead as a single object that only exposes the knobs that make sense for the application.

So, the general idea that this ADR is proposing is good to me.

cpu: 200m
memory: 256Mi
- name: odh-model-controller
resources:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The odh-model-controller is weird, as this is shared between ModelMesh and KServe.
It starts to become more problematic.

@bartoszmajsak
Copy link
Contributor

bartoszmajsak commented Mar 13, 2024

Why do we reconcile everything back to the default state?
Shouldn't we treat the components' manifests as a sort of "reasonable" default and allow users to adjust them according to their needs?

I know we can go this way to be quite flexible. However, I'm more on the side that it is the operator CRDs that should expose the needed settings, rather than instructing the users to do it themselves.

IMHO, the more we allow for such flexibility, the more the operator becomes less valuable -- i.e. why not just provide the manifests to the user?
Personally, I'm quite sold to the explanation in operator framework website:

With Operators you can stop treating an application as a collection of primitives like Pods, Deployments, Services or ConfigMaps, but instead as a single object that only exposes the knobs that make sense for the application.

So, the general idea that this ADR is proposing is good to me.

Treating a component as a whole sounds like the right thing to do, I am not opposing that, nor do I suggest going back to "using primitives". I fear that CRD can become bloated with different levels of abstraction. That's why I started thinking that perhaps we can open up certain aspects of our building blocks for the users to fine-tune instead of exposing those in our CRDs.

Is the suggested approach based on the assumption that the "user" is someone not knowledgable in the k8s domain, so instead of letting them fine-tune certain "low-level" settings, we are going to expose those low-level settings in some other place? I wonder if by doing so aren't we exactly exposing knobs?

@israel-hdez
Copy link

israel-hdez commented Mar 13, 2024

perhaps we can open up certain aspects of our building blocks for the users to fine-tune

But, then, aren't these building blocks the primitives? I think this confuses me.

so instead of letting them fine-tune certain "low-level" settings, we are going to expose those low-level settings in some other place? I wonder if by doing so aren't we exactly exposing knobs?

On this point I agree with you that the different high/low level settings look weird. However, I do think resource management is still within the scope of an operator. IMO, it depends on how much an operator is planned to evolve on its capability level. A level-1 operator would require users to tune these low level settings, while a level-5 operator would relief the user from resource tuning.

AFAIK, we don't have the data to automate the resource tuning, so IMO we need the user to do it for now. I think this is quite clear, and the discussion seems to be around on how the user is supposed to do it.

Maybe, a middle ground is to expose in the CRD a high level setting like kserve.resources.expectedDeployedModels and similarly modelmesh.resources.expectedDeployedModels so that the operator does some math and adjusts the low-level settings of all impacted components (kserve, modelmesh, odh-model-controller, ¿etc?). But I don't think we have data to do this.

@VaishnaviHire
Copy link
Member Author

Maybe, a middle ground is to expose in the CRD a high level setting like kserve.resources.expectedDeployedModels and similarly modelmesh.resources.expectedDeployedModels so that the operator does some math and adjusts the low-level settings of all impacted components (kserve, modelmesh, odh-model-controller, ¿etc?). But I don't think we have data to do this.

Will this still require us to expose component specific details?
Should we consider components to have a minimal and default config(overlays). This will not be limited to only resources.

spec.components.Kserve.config: minimal
OR
spec.components.Kserve.config: default

@israel-hdez However I think this will require us to have data for minimal resources required by a component.

@israel-hdez
Copy link

@VaishnaviHire

Will this still require us to expose component specific details? Should we consider components to have a minimal and default config(overlays). This will not be limited to only resources.

My rationale tells me yes.

Should we consider components to have a minimal and default config(overlays). This will not be limited to only resources.

spec.components.Kserve.config: minimal
OR
spec.components.Kserve.config: default

I'm not sure overlays are suitable, because resources are something can need fine-tuning, and we already had got this feedback from the field.

This ADR is proposing to use the DataScienceCluster for that, and I think it is fine. Then, my understanding is that @bartoszmajsak is suggesting that it is better to whitelist fields of the resources so that the user can manually change those fields... this is also fine! My personal preference would be to do it in the DSC just because my understanding is that centralized configs is one of the added benefits of using operators (and this is what this ADR is proposing, so this is looking fine to me!).

In model serving we just need a mechanism to configure the resources. Whether this ADR is refined, or dropped in favor of whitelisting fields, or something else... that's more up to you :-)

@VaishnaviHire
Copy link
Member Author

I will update the ADR to reflect the discussion to provide list of fields to whitelist across all components. See PR for Kserve specific changes.

@VaishnaviHire VaishnaviHire changed the title ODH-ADR-Operator-0005: Configure Component resources through DSC ODH-ADR-Operator-0005: Whitelist component fields for user customizations May 1, 2024
@VaishnaviHire
Copy link
Member Author

I have updated the adr to reflect implementation to whitelist fields across components
cc: @bartoszmajsak

@etirelli
Copy link
Contributor

@VaishnaviHire is this ready for merging?

@VaishnaviHire
Copy link
Member Author

VaishnaviHire commented Jun 5, 2024

Hi Yes, Just need lgtm from component owners

@etirelli etirelli merged commit 54f60aa into opendatahub-io:main Jun 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants