Model Mesh cannot work together in the cluster where Kserve was first enabled #508

Closed
bdattoma opened this issue Sep 4, 2023 · 4 comments
@bdattoma

bdattoma commented Sep 4, 2023

Describe the bug
If KServe is enabled in the cluster before ModelMesh (MM), ModelMesh cannot successfully deploy models. The MM pod in the DS Project gets stuck with the following error in its events: storage-config secret not found

To Reproduce
Steps to reproduce the behavior:

  1. Install RHODS v2
  2. Enable KServe in the DSC object
  3. Deploy a model with KServe
  4. Go back to the DSC and enable ModelMesh too
  5. Deploy a model with ModelMesh - the pod remains at 0/5 containers Running

Expected behavior
Both the KServe model and the ModelMesh model are deployed successfully.

Additional context
It seems to work fine if we invert the order of enablement: MM first and KServe later.

@bdattoma bdattoma added the bug Something isn't working label Sep 4, 2023
@zdtsw zdtsw self-assigned this Sep 14, 2023
@zdtsw zdtsw moved this from Todo to In Progress in ODH Platform Planning Sep 15, 2023
@bdattoma bdattoma changed the title Model Mesh cannot work together in the cluster where Kserve was fist enabled Model Mesh cannot work together in the cluster where Kserve was first enabled Sep 15, 2023
@zdtsw
Member

zdtsw commented Sep 15, 2023

To wrap up some findings:

When KServe is enabled first and MM is disabled, the odh-model-controller deployment sets:

metadata:
  labels:
    app.kubernetes.io/part-of: kserve
    app.opendatahub.io/kserve: 'true'
    control-plane: odh-model-controller
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: kserve
      app.opendatahub.io/kserve: 'true'
      control-plane: odh-model-controller

Later, when MM is enabled, the odh-model-controller deployment tries to set
"app.kubernetes.io/part-of":"model-mesh", which changes an immutable field of the Deployment (the selector).

This can happen in the other order as well: first enable MM, then KServe.
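To illustrate the selector clash, here is a minimal sketch that compares two Deployment manifests held as plain dicts. It does not call the cluster. The KServe labels are copied from the snippet above; the MM manifest is an assumption, since only its "part-of" label is mentioned in this thread, and the helper names are hypothetical.

```python
def selector_of(deployment: dict) -> dict:
    """Return spec.selector.matchLabels of a Deployment manifest (empty if unset)."""
    return deployment.get("spec", {}).get("selector", {}).get("matchLabels", {})


def can_update_in_place(existing: dict, desired: dict) -> bool:
    """The selector of a Deployment is immutable, so an in-place update
    can only succeed when both manifests carry the same matchLabels."""
    return selector_of(existing) == selector_of(desired)


# Labels taken from the KServe-first deployment shown in this comment.
kserve_deploy = {
    "metadata": {"name": "odh-model-controller"},
    "spec": {"selector": {"matchLabels": {
        "app.kubernetes.io/part-of": "kserve",
        "app.opendatahub.io/kserve": "true",
        "control-plane": "odh-model-controller",
    }}},
}

# Assumed MM variant: only the part-of value comes from the thread.
mm_deploy = {
    "metadata": {"name": "odh-model-controller"},
    "spec": {"selector": {"matchLabels": {
        "app.kubernetes.io/part-of": "model-mesh",
        "control-plane": "odh-model-controller",
    }}},
}

print(can_update_in_place(kserve_deploy, mm_deploy))  # False
```

Because the comparison is symmetric, it gives the same answer for either enablement order, which matches the observation that the clash is not specific to KServe-first.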

Workaround: rename MM's deployment to something different, e.g. mm-odh-model-controller, so we do not need to change the logic in the operator.

For the inferenceservices CRD: the current one in odh-manifests from MM has old content. Apart from this, it does not set ".spec.conversion", whereas the inferenceservices CRD in odh-manifests from KServe sets ".spec.conversion" to use a webhook.
Depending on the order in which these two components are enabled:

  • Enable KServe first: the CRD gets conversion. Then enable MM: its CRD cannot pass the k8s API check because it does not have conversion => the operator keeps reconciling, unless this CRD is deleted manually.
  • Enable MM first: the CRD has no conversion. Then enable KServe: it works, because the webhook converts from v1beta1 to v1beta1.
  • (Not sure why upstream KServe likes to keep v1beta1 even when the schema changes...)
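The two orderings above can be sketched with a small check on CRD manifests held as dicts. This is only an illustration of the pattern reported here, not a reproduction of the API server's actual validation; the function names and the trimmed manifests are hypothetical. An unset ".spec.conversion" is treated as strategy "None", which is the Kubernetes default.

```python
def conversion_strategy(crd: dict) -> str:
    """Return spec.conversion.strategy, treating an unset conversion as "None"."""
    return crd.get("spec", {}).get("conversion", {}).get("strategy", "None")


def drops_webhook_conversion(existing: dict, desired: dict) -> bool:
    """Flag the ordering reported as failing: the CRD on the cluster uses
    webhook conversion and the incoming CRD would take it away."""
    return (conversion_strategy(existing) == "Webhook"
            and conversion_strategy(desired) != "Webhook")


# Trimmed, assumed manifests: only the conversion stanza matters here.
kserve_crd = {"spec": {"conversion": {"strategy": "Webhook"}}}
mm_crd = {"spec": {}}

# KServe first, then MM: flagged as the problematic order.
print(drops_webhook_conversion(kserve_crd, mm_crd))  # True
# MM first, then KServe: not flagged.
print(drops_webhook_conversion(mm_crd, kserve_crd))  # False
```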

For the specific error in this issue's description, I tried to reproduce it.

I am not sure whether KServe is a dependency for "Deploy model" under "Models and model servers"?
I only have MM enabled, and the inferenceservices CRD has "conversion.strategy: None",
but from the error

Error creating model server
Internal error occurred: failed calling webhook "inferenceservice.kserve-webhook-server.defaulter": failed to call webhook: Post "https://kserve-webhook-server-service.opendatahub.svc:443/mutate-serving-kserve-io-v1beta1-inferenceservice?timeout=10s": service "kserve-webhook-server-service" not found

it is actually using the CRD from KServe (which should have a webhook service running there).
Screenshot from 2023-09-15 13-36-13
Screenshot from 2023-09-15 13-36-34

The original error should be caused by https://github.com/opendatahub-io/odh-model-controller/blob/e24d77a9efa8a2a2483ef8656940b9de6553ecfa/controllers/storageconfig_controller.go#L35
which should be done by the odh-model-controller reconcile().
Continuing from my previous step: even though I got an error on "deploy model", the secret aws-connection-wen-mock-dataconnection was actually created, along with the "storage-config" one, on the fly.
=> "storage-config" is only created when the user creates a "data connection". All "data connections" write their secret data into "storage-config", and when all "data connections" are deleted, "storage-config" is deleted as well.
=> I could not find the KServe CRD in my cluster, only the one matching MM, so my guess is that the dashboard might "cache" the KServe CRD and not respect the schema of the current MM CRD (I enabled KServe + dashboard, then enabled MM, then disabled KServe, and manually removed the inferenceservice CRD of KServe to stop the operator reconcile).
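The "storage-config" lifecycle described above can be modelled with a minimal sketch. It is not the odh-model-controller code: the function name, the dict layout, and the example field values are assumptions made for illustration. Only the behaviour (aggregate all data connections, drop the secret when none remain) comes from the observation in this comment.

```python
from typing import Optional


def build_storage_config(data_connections: dict) -> Optional[dict]:
    """Aggregate every data connection secret into one storage-config payload.

    Returns None when no data connection exists, standing in for the
    "storage-config" secret being deleted.
    """
    if not data_connections:
        return None
    # One entry per data connection, keyed by the connection secret's name.
    return {name: dict(fields) for name, fields in data_connections.items()}


# Secret name taken from this thread; the field values are made up.
connections = {
    "aws-connection-wen-mock-dataconnection": {"type": "s3", "bucket": "example"},
}

print(build_storage_config(connections) is not None)  # True: secret exists
print(build_storage_config({}))                       # None: secret removed
```

Under this model, a project with no data connection never gets a "storage-config" secret, which would produce the "storage-config secret not found" event from the issue description.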

@zdtsw zdtsw removed their assignment Sep 19, 2023
@lugi0

lugi0 commented Sep 20, 2023

An additional consideration:
Once KServe has been enabled in a cluster, disabling it and then enabling ModelMesh still results in the same error as in @zdtsw's screenshot. We've tried manually cleaning up the cluster but were unsuccessful in restoring ModelMesh functionality.

@zdtsw
Member

zdtsw commented Sep 25, 2023

For the "deployment" part, we (operator) will find a way to fix the "immutable" error.

@zdtsw
Member

zdtsw commented Dec 1, 2023
