add model-mesh and odh-model-controller manifests from component repos #580
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (none). The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Signed-off-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>
@VedantMahabaleshwarkar: The following tests failed:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Ready to be tested, but because of the branches specified in the job config, testing is skipped.
@@ -11,6 +11,8 @@ REPO_LIST=(
"kubeflow:v1.7-branch:components/notebook-controller/config:odh-notebook-controller/kf-notebook-controller"
"kubeflow:v1.7-branch:components/odh-notebook-controller/config:odh-notebook-controller/odh-notebook-controller"
"notebooks:main:manifests:notebooks"
"odh-model-controller:release-0.11:config:odh-model-controller"
"modelmesh-serving:release-0.11:opendatahub/manifests:model-mesh"
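Each entry appears to follow a repo:git-ref:source-path:target-dir layout. A standalone sketch of how one entry splits into those fields (this is illustrative only, not the actual parsing code from get_all_manifests.sh):

```shell
#!/usr/bin/env bash
set -euo pipefail

# One of the entries added by this PR.
entry="odh-model-controller:release-0.11:config:odh-model-controller"

# Split the colon-separated fields: repository name, branch/tag,
# path inside the repo, and destination directory for the manifests.
IFS=':' read -r repo ref src_path dest_dir <<< "$entry"

echo "repo=$repo ref=$ref src=$src_path dest=$dest_dir"
```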
ok, so we cannot merge this PR till opendatahub-io/odh-model-controller#92 and opendatahub-io/modelmesh-serving#221 PRs are in
I think you can also update line 5.
opendatahub-io/odh-model-controller#92 and opendatahub-io/modelmesh-serving#221 are the PRs for the main branches of those repos; the PRs that are a prerequisite to this one are opendatahub-io/odh-model-controller#93 and opendatahub-io/modelmesh-serving#222, which target the release branches.
if err := deploy.DeployManifestsFromPath(cli, owner, DependentPath, dscispec.ApplicationsNamespace, k.GetComponentName(), enabled); err != nil {
	return err
if err := deploy.DeployManifestsFromPath(cli, owner, DependentPath, dscispec.ApplicationsNamespace, k.GetComponentName(), enabled); err != nil {
	if strings.Contains(err.Error(), "spec.selector") && strings.Contains(err.Error(), "field is immutable") {
In this case, kserve and modelmesh explicitly ignore these deployment errors on odh-model-controller? Then there won't be any reconcile on the operator, even though the "newer" deployment is not working because of the immutable field. I guess the issue won't be resolved by these 2 PRs.
This error, if I understand it correctly, occurred because when both kserve and modelmesh are enabled together, they both try to update the selector label with their own component names, which is not possible.
With this change:
- kserve enabled first, modelmesh enabled second
-- here, kserve will create the odh-model-controller deployment and pass `k.ComponentName`
-- modelmesh will later try to update the selector with `m.ComponentName` and it will fail, but the error is safe to ignore as the odh-model-controller deployment already exists.
-- every time modelmesh reconciles, this failure will keep happening but will continue to be ignored as long as kserve+odh-model-controller already exists.
- modelmesh enabled first, kserve enabled second
-- same as the above scenario, but the odh-model-controller deployment will have `m.ComponentName`
- kserve and modelmesh both enabled, then kserve removed (assuming kserve was enabled first)
-- when kserve is removed, that should also remove the odh-model-controller deployment that exists with `k.ComponentName`
-- when this happens, on the next reconcile loop for modelmesh, the odh-model-controller deployment will be re-created with `m.ComponentName`
-- the same scenario applies (only reversed) if modelmesh was installed first and then deleted later while kserve was also installed.
This issue occurred because when kserve was enabled first, the odh-model-controller deployment contained the `kserve-enabled=true` flag, which meant that odh-model-controller was in kserve mode. Since two odh-model-controller deployments could not co-exist, there was no odh-model-controller with the modelmesh functionality enabled on the cluster in this scenario. Hence the storage-secret related error, as that is a modelmesh function performed by odh-model-controller.
Since we no longer have the `kserve-enabled` flag at all, this issue should not occur IMO.
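The ownership sequences described above can be expressed as a small decision function. This is a pure-logic sketch with hypothetical names, not the operator's actual code; it assumes a tie-break argument (`firstEnabled`) stands in for "whichever component reconciles first":

```go
package main

import "fmt"

// ownerAfterReconcile returns which component name ends up on the shared
// odh-model-controller Deployment after a reconcile pass. currentOwner is
// "" when no deployment exists yet; firstEnabled breaks the tie when both
// components are enabled and the deployment must be created.
func ownerAfterReconcile(currentOwner, firstEnabled string, kserveEnabled, modelmeshEnabled bool) string {
	// An existing deployment whose owner is still enabled is kept; the
	// other component's selector update fails and that error is ignored.
	if currentOwner == "kserve" && kserveEnabled ||
		currentOwner == "model-mesh" && modelmeshEnabled {
		return currentOwner
	}
	// Otherwise the deployment is (re)created by whichever enabled
	// component reconciles first.
	switch {
	case kserveEnabled && modelmeshEnabled:
		return firstEnabled
	case kserveEnabled:
		return "kserve"
	case modelmeshEnabled:
		return "model-mesh"
	}
	return ""
}

func main() {
	// kserve enabled first, then modelmesh: kserve's label sticks.
	fmt.Println(ownerAfterReconcile("kserve", "kserve", true, true)) // kserve
	// kserve removed while modelmesh stays: modelmesh re-creates it.
	fmt.Println(ownerAfterReconcile("", "model-mesh", false, true)) // model-mesh
}
```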
IMHO, ignoring errors is not fair in this case.
-- when this happens, on the next reconcile loop for modelmesh, the odh-model-controller deployment will be re-created with `m.ComponentName`
Although this is right, the "next reconciliation" could be almost immediate, or could be (worst case) triggered only by some manual event. It is going to be perceived as a bug if the operator does not actively pursue the desired state in the DSC.
@israel-hdez Do you have a better idea to ensure 1 deployment for odh-model-controller in a case where both kserve and modelmesh are enabled together?
Without having the concept of "shared components" defined in the operator logic, it's difficult to handle the ownership label of odh-model-controller. Currently it has to carry either the kserve or the modelmesh label. This way of "ignoring" the label mismatch error (while not ideal, and not a long-term solution) was the only idea I could come up with to achieve what we want without going into the specifics of defining a "shared component".
Extract odh-model-controller as an independent component under /components.
Then, in the root of the controller (around here), instantiate it programmatically as Managed/Removed (depending on KServe and ModelMesh state) and invoke reconciliation.
This way, odh-model-controller will have its own component name, and its managementState will be controlled programmatically.
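A sketch of that suggestion, with hypothetical names (`ManagementState`, `odhModelControllerState`); the real operator would presumably use the `operatorv1.ManagementState` type from the OpenShift API, but the derivation logic would look roughly like:

```go
package main

import "fmt"

// ManagementState mirrors the Managed/Removed states the operator uses.
type ManagementState string

const (
	Managed ManagementState = "Managed"
	Removed ManagementState = "Removed"
)

// odhModelControllerState derives the shared component's state from the
// two components that depend on it, instead of from a DSC field.
func odhModelControllerState(kserveEnabled, modelmeshEnabled bool) ManagementState {
	if kserveEnabled || modelmeshEnabled {
		return Managed
	}
	return Removed
}

func main() {
	fmt.Println(odhModelControllerState(true, false))  // Managed
	fmt.Println(odhModelControllerState(false, false)) // Removed
}
```

With this shape, removing either KServe or ModelMesh never forces an uninstall/reinstall of odh-model-controller while the other component remains enabled.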
Discussed that already with @VaishnaviHire previously, I think; there's ambiguity around having a component that is not directly referenced in the DSC spec, etc.
Technically, IMHO, it is better than the current approach. You will no longer need to ignore (or whitelist) errors, and it will be perceived as working better -- e.g. with the current approach, users may find it confusing that the operator needs to uninstall and re-install odh-model-controller just because either MM or KServe was uninstalled while the other one remained installed.
I can partially understand that there may be ambiguity. I may need a detailed explanation to fully understand why it is an issue. Without the full understanding, my POV is that it is better to have the ambiguity (which I understand as simple user perception) than to rely on the expectation that future reconciliations will cover for fixing the installation of odh-model-controller.
Those are my 2 cents. Please, anyway, talk with the operator team to check what is more acceptable. If they give you the green light, you will be all set.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Replaced by #639 due to a big rebase and the need for different destination and source branches.
Description
Updates get_all_manifests.sh to pull the model-mesh and odh-model-controller manifests from their component repos.
Testing
The testing for this PR is to be done along with the testing for this and this PR, since they target the same set of overall changes.
The ODH operator image has been built with the following changes: quay.io/vedantm/odh-model-controller:remove-kserve-flag, which contains the changes for this PR.
How Has This Been Tested?
ModelMesh testing
- git clone git@github.com:VedantMahabaleshwarkar/opendatahub-operator.git
- cd opendatahub-operator
- check out the manifests_transition_mlserving branch
- make deploy -e IMG=quay.io/vedantm/opendatahub-operator:latest
- verify the operator is deployed in the opendatahub-operator-system namespace with the image quay.io/vedantm/opendatahub-operator:latest
- components are deployed in the opendatahub namespace
Kserve testing
- git clone git@github.com:VedantMahabaleshwarkar/opendatahub-operator.git
- cd opendatahub-operator
- check out the manifests_transition_mlserving branch
- make deploy -e IMG=quay.io/vedantm/opendatahub-operator:latest
- verify the operator is deployed in the opendatahub-operator-system namespace with the image quay.io/vedantm/opendatahub-operator:latest
- components are deployed in the opendatahub namespace
Merge criteria: