
add model-mesh and odh-model-controller manifests from component repos #580

Closed
wants to merge 1 commit into from
Conversation

@VedantMahabaleshwarkar (Contributor) commented Sep 28, 2023

Description

  • Added model-mesh and odh-model-controller manifests to get_all_manifests.sh
  • Added logic to allow kserve and model-mesh to share a single odh-model-controller deployment

Testing

The testing for this PR should be done along with the testing for this and this PR, since they target the same overall set of changes.
The ODH operator image has been built with the following changes:

How Has This Been Tested?

ModelMesh testing

  • cluster login
  • git clone git@github.com:VedantMahabaleshwarkar/opendatahub-operator.git
  • cd opendatahub-operator
  • switch to branch manifests_transition_mlserving
  • make deploy -e IMG=quay.io/vedantm/opendatahub-operator:latest
  • This will install the ODH operator in NS opendatahub-operator-system with the image quay.io/vedantm/opendatahub-operator:latest
  • create the following DSC in NS opendatahub
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: example
spec:
  components:
    modelmeshserving:
      managementState: Managed
  • deploy and test a model with modelmesh

Kserve testing

Note: Clean up the kserve CRDs before testing modelmesh.
  • cluster login
  • install ODH Kserve + dependencies stack by following instructions
Note: Only install the dependencies; the ODH operator will be a custom install per the next steps.
  • git clone git@github.com:VedantMahabaleshwarkar/opendatahub-operator.git
  • cd opendatahub-operator
  • switch to branch manifests_transition_mlserving
  • make deploy -e IMG=quay.io/vedantm/opendatahub-operator:latest
  • This will install the ODH operator in NS opendatahub-operator-system with the image quay.io/vedantm/opendatahub-operator:latest
  • create the following DSC in NS opendatahub
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default
spec:
  components:
    kserve:
      managementState: Managed

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

openshift-ci bot commented Sep 28, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot commented Sep 28, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign etirelli for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>
@VedantMahabaleshwarkar VedantMahabaleshwarkar changed the title Manifests transition mlserving add model-mesh and odh-model-controller manifests from component repos Sep 28, 2023
@VedantMahabaleshwarkar VedantMahabaleshwarkar marked this pull request as ready for review September 28, 2023 21:57
@openshift-ci openshift-ci bot requested review from etirelli and LaVLaS September 28, 2023 21:57
openshift-ci bot commented Sep 28, 2023

@VedantMahabaleshwarkar: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/opendatahub-operator-e2e 3978b7c link true /test opendatahub-operator-e2e
ci/prow/opendatahub-operator-pr-image-mirror 3978b7c link true /test opendatahub-operator-pr-image-mirror
ci/prow/images 3978b7c link true /test images
ci/prow/ci-index 3978b7c link true /test ci-index

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@VedantMahabaleshwarkar (Contributor, Author) commented:

Ready to be tested, but because the branches specified in get_all_manifests.sh assume the dependent PRs are merged, this PR needs to be merged after opendatahub-io/odh-model-controller#93 and opendatahub-io/modelmesh-serving#222.

@@ -11,6 +11,8 @@ REPO_LIST=(
"kubeflow:v1.7-branch:components/notebook-controller/config:odh-notebook-controller/kf-notebook-controller"
"kubeflow:v1.7-branch:components/odh-notebook-controller/config:odh-notebook-controller/odh-notebook-controller"
"notebooks:main:manifests:notebooks"
"odh-model-controller:release-0.11:config:odh-model-controller"
"modelmesh-serving:release-0.11:opendatahub/manifests:model-mesh"
Member:
OK, so we cannot merge this PR until opendatahub-io/odh-model-controller#92 and opendatahub-io/modelmesh-serving#221 are in.

Member:

I think you can also update line 5.

Contributor (Author):

opendatahub-io/odh-model-controller#92 and opendatahub-io/modelmesh-serving#221 are the PRs for the main branches of those repos;
the PRs that are a prerequisite to this one are opendatahub-io/odh-model-controller#93 and opendatahub-io/modelmesh-serving#222, which target the release branches.

Before:
	if err := deploy.DeployManifestsFromPath(cli, owner, DependentPath, dscispec.ApplicationsNamespace, k.GetComponentName(), enabled); err != nil {
		return err

After:
	if err := deploy.DeployManifestsFromPath(cli, owner, DependentPath, dscispec.ApplicationsNamespace, k.GetComponentName(), enabled); err != nil {
		if strings.Contains(err.Error(), "spec.selector") && strings.Contains(err.Error(), "field is immutable") {
Member:

In this case, kserve and modelmesh explicitly ignore these errors from the odh-model-controller deployment?
Then there won't be any reconcile by the operator, even if the "newer" deployment is not working because of the immutable field.
I guess the issue won't be resolved by these 2 PRs.

Contributor (Author):

This error, if I understand it correctly, was because when both kserve and modelmesh are enabled together, they both try to update the selector label with their own component names, which is not possible.
With this change,

  • kserve enabled first, modelmesh enabled second
    -- here, kserve will create the odh-model-controller deployment and pass k.ComponentName
    -- modelmesh will later try to update the selector with m.ComponentName and fail, but the error is safe to ignore because the odh-model-controller deployment already exists.
    -- every time modelmesh reconciles, this failure will keep happening but will continue to be ignored as long as kserve + odh-model-controller already exists.

  • modelmesh enabled first, kserve enabled second
    -- same as the above scenario, but the odh-model-controller deployment will have m.ComponentName

  • When kserve and modelmesh are both enabled, and kserve is removed (assuming kserve was enabled first)
    -- when kserve is removed, that should also remove the odh-model-controller deployment that exists with k.ComponentName
    -- when this happens, on the next reconcile loop for modelmesh, the odh-model-controller deployment will be re-created with m.ComponentName

The same scenario would apply (only reversed) if modelmesh was installed first and then deleted later while kserve was also installed.

This issue was because, when kserve was enabled first, the odh-model-controller deployment contained the kserve-enabled=true flag, which meant that odh-model-controller was in kserve mode. Since two odh-model-controller deployments could not co-exist, there was no odh-model-controller with the modelmesh functionality enabled on the cluster in this scenario. Hence the storage-secret related error, as that is a modelmesh function performed by odh-model-controller.
Since we no longer have the kserve-enabled flag at all, this issue should not occur, IMO.
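
For clarity, here is a minimal sketch of the error-filtering behaviour described above, matching the strings.Contains check shown in the hunk earlier in this thread. The helper name isIgnorableSelectorConflict is hypothetical; the PR itself inlines this check in the component's reconcile path.

package kserve

import "strings"

// isIgnorableSelectorConflict reports whether err is the "spec.selector ...
// field is immutable" conflict that happens when the second serving component
// (kserve or modelmesh) tries to relabel an odh-model-controller Deployment
// that the other component already created. The caller treats this as
// non-fatal and leaves the existing Deployment in place.
func isIgnorableSelectorConflict(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "spec.selector") &&
		strings.Contains(msg, "field is immutable")
}

In the reconcile path shown in the hunk, DeployManifestsFromPath errors that satisfy this check are skipped instead of returned, which is what allows both components to tolerate the single shared odh-model-controller deployment.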

Contributor:

IMHO, ignoring errors is not fair in this case.

  -- when this happens, on the next reconcile loop for modelmesh, the odh-model-controller deployment will be re-created with `m.ComponentName`

Although this is right, the "next reconciliation" could be almost immediate, or it could (worst case) happen only on some manual event.

It is going to be perceived as a bug if the operator does not actively pursue the desired state in the DSC.

Contributor (Author):

@israel-hdez Do you have a better idea to ensure a single odh-model-controller deployment when both kserve and modelmesh are enabled together?
Without the concept of "shared components" defined in the operator logic, it's difficult to handle the ownership label of odh-model-controller: currently it has to be part of either the kserve or the modelmesh label. Ignoring the label-mismatch error (while not ideal, and not a long-term solution) was the only approach I could come up with to achieve what we want without going into the specifics of defining a "shared component".

@israel-hdez (Contributor) commented Oct 13, 2023

Extract odh-model-controller as an independent component under /components.

Then, in the root of the controller (around here), instantiate it programmatically as Managed/Removed (depending on the KServe and ModelMesh state) and invoke reconciliation.

This way, odh-model-controller will have its own component name, and its managementState will be controlled programmatically.
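
A minimal sketch of that suggestion, assuming the components keep using the operatorv1 ManagementState type already used by the DSC API; the helper name derivedModelControllerState is hypothetical and only illustrates deriving the state from KServe/ModelMesh instead of sharing an ownership label:

package components

import operatorv1 "github.com/openshift/api/operator/v1"

// derivedModelControllerState illustrates the suggestion: odh-model-controller
// becomes its own component whose managementState is computed from the two
// serving components rather than being deployed under the kserve or modelmesh
// ownership label.
func derivedModelControllerState(kserve, modelmesh operatorv1.ManagementState) operatorv1.ManagementState {
	if kserve == operatorv1.Managed || modelmesh == operatorv1.Managed {
		return operatorv1.Managed
	}
	return operatorv1.Removed
}

With this approach, removing one serving component would not require deleting and re-creating the odh-model-controller deployment, since its owner never changes.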

Contributor (Author):

I think I already discussed that with @VaishnaviHire; there's ambiguity around having a component that is not directly referenced in the DSC spec, etc.

@israel-hdez (Contributor) commented Oct 13, 2023

Technically, IMHO, it is better than the current approach. You would no longer need to ignore (or whitelist) errors, and it would be perceived as working better -- e.g. with the current approach, users may find it confusing that the operator needs to uninstall and re-install odh-model-controller just because either MM or KServe was uninstalled while the other one remained installed.

I can partially understand that there may be ambiguity. I may need a detailed explanation to fully understand why it is an issue. Without that full understanding, my POV is that it is better to have the ambiguity (which I understand as simple user perception) than to rely on the expectation that future reconciliations will fix the installation of odh-model-controller.

Those are my two cents. In any case, please talk with the operator team to check what is more acceptable. If they give you the green light, you will be all set.

@openshift-merge-robot (Contributor):

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@VedantMahabaleshwarkar (Contributor, Author) commented:

Replaced by #639 due to a big rebase and the need for different destination and source branches.
