add temp Job to delete old monitoring stack #146

Conversation

@VedantMahabaleshwarkar (Contributor) commented Jan 19, 2024

Fixes:

https://issues.redhat.com/browse/RHOAIENG-293

Context:

The model-monitoring stack was deprecated in RHODS 2.5. This means that starting from 2.5, the RHODS Operator does not create the resources for model-monitoring during new installs. However, for upgrades from 2.4, the existing resources are not deleted. This PR adds a Job for RHODS 2.6 that deletes the resources from the model-monitoring stack that are no longer needed.
Resources being deleted:

  • Deployment rhods-prometheus-operator
  • StatefulSet (containing 3 pods) prometheus-rhods-model-monitoring
  • ServiceMonitor modelmesh-federated-metrics

Note: This Job is a temporary addition for RHODS 2.6 ONLY, and will be reverted in 2.7 as it will not be needed anymore.

PR changes:

  • Added a Job (originally proposed as an initContainer on the odh-model-controller deployment) to delete the resources listed above (a sketch of the deletion it performs follows below)
  • Added permissions to odh-model-controller-role granting it the privileges needed to perform the deletion
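
For illustration only (the authoritative manifest is the remove-deprecated-monitoring.yaml file in this PR), here is a minimal sketch of the deletion the Job effectively performs, assuming an oc client and a service account bound to the updated role:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the cleanup done by the Job; the real logic lives in
# the remove-deprecated-monitoring.yaml manifest added by this PR.
set -euo pipefail

NS=redhat-ods-monitoring

# Deprecated model-monitoring resources left behind by upgrades from RHODS 2.4.
oc delete deployment rhods-prometheus-operator -n "$NS" --ignore-not-found
oc delete statefulset prometheus-rhods-model-monitoring -n "$NS" --ignore-not-found
oc delete servicemonitor modelmesh-federated-metrics -n "$NS" --ignore-not-found
```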

Testing Instructions:

Prerequisites

  • Install RHODS 2.4
  • We install 2.4 because we want to test whether the cleanup Job correctly deletes the resources created by the old monitoring stack.
  • I used the RC3 build for 2.4 because I could not find a way to install the 2.4 version of the product from OperatorHub.
Catalog Source Images for 2.4 RC3 
Index image v4.11: registry-proxy.engineering.redhat.com/rh-osbs/iib:625499
Index image v4.12: registry-proxy.engineering.redhat.com/rh-osbs/iib:625502
Index image v4.13: registry-proxy.engineering.redhat.com/rh-osbs/iib:625517
Index image v4.14: registry-proxy.engineering.redhat.com/rh-osbs/iib:625527
Index image v4.15: registry-proxy.engineering.redhat.com/rh-osbs/iib:625532

Instructions to install RC builds on clusters (doesn't work on ROSA):
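
As a rough sketch only (not the linked instructions), one way to make an RC index image selectable in OperatorHub is to create a CatalogSource pointing at it. This assumes the cluster can pull from the internal registry proxy; the catalog name and display name below are made up for the example.

```bash
# Hypothetical example: register the v4.12 RC3 index image as a catalog.
# Adjust the image tag to match your OpenShift version from the list above.
oc apply -f - <<'EOF'
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: rhods-24-rc3                 # hypothetical name
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry-proxy.engineering.redhat.com/rh-osbs/iib:625502
  displayName: RHODS 2.4 RC3
  publisher: Red Hat (RC build)
EOF
```

Once the catalog's connection state reports READY, the 2.4 operator should be installable from it through OperatorHub as usual.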

Testing Steps

  • Spin down the RHODS Operator deployment to 0 in NS redhat-ods-operator (a command-level sketch of the whole list follows below)

    • We do this because we are testing the deletion of resources that are no longer watched by the operator starting from 2.5.0
    • For more context, the operator PR is here
  • Modify role odh-model-controller-role in NS redhat-ods-applications as per the RBAC changes in this PR

    • Add get, list, delete permissions in apiGroup apps for resources deployments, statefulsets
    • Add get permissions for apiGroup project.openshift.io for resources projects
  • Manually create a Job from the remove-deprecated-monitoring.yaml file in the PR

  • Manually verify the following resources no longer exist in the cluster in the NS redhat-ods-monitoring

    • Deployment rhods-prometheus-operator
    • StatefulSet prometheus-rhods-model-monitoring
    • ServiceMonitor modelmesh-federated-metrics
  • Success! :)
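
A command-level sketch of the steps above, assuming the operator deployment is named rhods-operator and the Job manifest from this PR has been saved locally as remove-deprecated-monitoring.yaml:

```bash
# 1. Stop the operator so nothing reconciles during the test
#    (deployment name assumed to be rhods-operator).
oc scale deployment rhods-operator -n redhat-ods-operator --replicas=0

# 2. Add the rules from this PR to role odh-model-controller-role
#    (opens an editor; add the following rules):
#      - apiGroups: ["apps"]
#        resources: ["deployments", "statefulsets"]
#        verbs: ["get", "list", "delete"]
#      - apiGroups: ["project.openshift.io"]
#        resources: ["projects"]
#        verbs: ["get"]
oc edit role odh-model-controller-role -n redhat-ods-applications

# 3. Create the cleanup Job from the manifest in this PR.
oc create -f remove-deprecated-monitoring.yaml

# 4. Verify the deprecated resources are gone; each command should
#    report "NotFound" once the Job has completed.
oc get deployment rhods-prometheus-operator -n redhat-ods-monitoring
oc get statefulset prometheus-rhods-model-monitoring -n redhat-ods-monitoring
oc get servicemonitor modelmesh-federated-metrics -n redhat-ods-monitoring
```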

  • The commits are squashed in a cohesive manner and have meaningful messages.

  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).

  • The developer has manually tested the changes and verified that the changes work

@israel-hdez (Contributor) left a comment

If I understand opendatahub-io/opendatahub-operator#703 correctly, ownership of the no-longer-existent components belongs to the opendatahub-operator. So, I suggest moving this logic (either in the same way or in another equivalent form) to the opendatahub-operator.

The odh-model-controller should not be responsible for fixing install or upgrade issues.

@VedantMahabaleshwarkar changed the title from "add temp initContainer to delete old monitoring stack" to "add temp Job to delete old monitoring stack" on Jan 19, 2024
@VedantMahabaleshwarkar force-pushed the delete-monitoring branch 6 times, most recently from 347a4e9 to 71e2aa0, on January 19, 2024 at 18:07
@VedantMahabaleshwarkar (Contributor, Author) commented:

@israel-hdez

If I understand opendatahub-io/opendatahub-operator#703 correctly, ownership of the no-longer-existent components belongs to the opendatahub-operator. So, I suggest moving this logic (either in the same way or in another equivalent form) to the opendatahub-operator.
The odh-model-controller should not be responsible for fixing install or upgrade issues.

I'd say that since the model-monitoring stack was a component specific to model serving, the model-serving-related repos should own the logic to delete it during upgrade as well.
opendatahub-io/opendatahub-operator#703 takes care of no longer deploying those components, as that logic has existed since we had "odh-manifests". However, model serving is the logical owner of the deprecated monitoring stack and hence should take care of deleting it.

@israel-hdez (Contributor) commented:

However, model serving is the logical owner of the deprecated monitoring stack and hence should take care of deleting it.

Looking at this as working groups, I agree with this statement. However, I lean more toward the thinking that the component that created the resources is the one that should do the cleanup. The Model Serving WG is still the owner, but the implementation should be in the opendatahub-operator.

Furthermore, we should guarantee clean-up even if the serving stack is fully disabled in any given cluster. In such a case, if resources went stale for any reason, moving/running this code in the operator is what makes the most sense.

Well, I'm going to be strict in that fixing installation issues is not a concern of this repository/component. I'll tag other people to review, and if they agree that this is the right repo to host the fix, I'll switch my review to an approval: @terrytangyuan @Jooho @danielezonca.

@VedantMahabaleshwarkar (Contributor, Author) commented:

Well, I'm going to be strict in that fixing installation issues is not a concern of this repository/component. I'll tag other people to review, and if they agree that this is the right repo to host the fix, I'll switch my review to an approval: @terrytangyuan @Jooho @danielezonca.

Discussed offline with @Jooho, @terrytangyuan, and @israel-hdez; for now we're OK with having this PR in this repo.

@heyselbi (Contributor) commented:

@israel-hdez @VedantMahabaleshwarkar @Jooho I manually tested on an OSD cluster (v4.12) with the RC build, and all tests passed.

@heyselbi (Contributor) commented:

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 19, 2024
@heyselbi heyselbi removed their assignment Jan 19, 2024

openshift-ci bot commented Jan 19, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: heyselbi, VedantMahabaleshwarkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [VedantMahabaleshwarkar,heyselbi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@VedantMahabaleshwarkar VedantMahabaleshwarkar dismissed israel-hdez’s stale review January 19, 2024 21:54

Discussed Edgar's comment offline and Selbi has reviewed instead.

@VedantMahabaleshwarkar VedantMahabaleshwarkar merged commit 50c596a into opendatahub-io:main Jan 19, 2024
3 of 5 checks passed
israel-hdez added a commit to israel-hdez/odh-model-controller that referenced this pull request Jan 25, 2024
…)"

This reverts commit 50c596a.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>