Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(api): Support custom InstaScale container image #120

Merged
merged 1 commit into from
Jun 6, 2023
Merged

Conversation

astefanutti
Copy link
Contributor

Fixes #115.

@astefanutti
Copy link
Contributor Author

/assign @anishasthana @jbusche

@KPostOffice
Copy link
Collaborator

This could cause issues in cases where the requested controller version isn't compatible with the CRD versions that were installed with the CodeFlare operator.

@astefanutti
Copy link
Contributor Author

This could cause issues in cases where the requested controller version isn't compatible with the CRD versions that were installed with the CodeFlare operator.

Right, that occurred to me as I implemented it. We may want to put a strategy in place to mitigate that risk.

Maybe @jbusche and @anishasthana can provide more details about #115 before moving forward.

@anishasthana
Copy link
Member

Right, that occurred to me as I implemented it. We may want to put a strategy in place to mitigate that risk.
Maybe @jbusche and @anishasthana can provide more details about #115 before moving forward.

I'm curious -- if we only update the image with bundle updates, shouldn't OLM take care of this for us? (The bundle will include both the latest CRDs and the latest controller, right?)

@astefanutti
Copy link
Contributor Author

Right, that occurred to me as I implemented it. We may want to put a strategy in place to mitigate that risk.
Maybe @jbusche and @anishasthana can provide more details about #115 before moving forward.

I'm curious -- if we only update the image with bundle updates, shouldn't OLM take care of this for us? (The bundle will include both the latest CRDs and the latest controller, right?)

In the upgrade case, my understanding is the CRDs will be updated by OLM, but the controller will still run a custom image, if it's been specified in the InstaScale resource.

Theoretically, even the deployment resources, owned by the InstaScale resource, and generated by the operator, may become incompatible with the custom / past version controller image. If such case would arise in the lineage of the InstaScale controller, the CodeFlare operator could adapt its logic, provided with the controller version.

Similarly, incompatibilities between the specified custom image version and the installed CRD versions could be detected, and possibly reported into the InstaScale resource conditions.

@KPostOffice
Copy link
Collaborator

Right, that occurred to me as I implemented it. We may want to put a strategy in place to mitigate that risk.
Maybe @jbusche and @anishasthana can provide more details about #115 before moving forward.

I'm curious -- if we only update the image with bundle updates, shouldn't OLM take care of this for us? (The bundle will include both the latest CRDs and the latest controller, right?)

If you're testing a new version of InstaScale that relies on some new changes to the AppWrapper CRD that weren't included in the Codeflare Operator build that you're running, specifying a newer InstaScale image won't update the CRDs present on the cluster because they depend on the CodeFlare Operator version running.

@anishasthana
Copy link
Member

Ah yeah, I was thinking of images in a different context.. Perhaps for now we update the field documentation to make it very clear that using images that are not the default could result in broken deployments?

Similarly, incompatibilities between the specified custom image version and the installed CRD versions could be detected, and possibly reported into the InstaScale resource conditions.
I think that's a great follow-on item -- could you create an issue for it that we could look into later?

@astefanutti
Copy link
Contributor Author

Ah yeah, I was thinking of images in a different context.. Perhaps for now we update the field documentation to make it very clear that using images that are not the default could result in broken deployments?

+1, documenting the risk is the most straightforward way to mitigate it in the short term.

Similarly, incompatibilities between the specified custom image version and the installed CRD versions could be detected, and possibly reported into the InstaScale resource conditions.

I think that's a great follow-on item -- could you create an issue for it that we could look into later?

Sure, it'll be more difficult to address the forward compatibility case, as described by @KPostOffice, with this approach, but I think it'll be less frequent than the backward compatibility case, that can be convenient to revert back to a previous version, while OLM makes it very hard to do.

@jbusche
Copy link
Collaborator

jbusche commented Jun 5, 2023

@astefanutti This is great - I'm trying to think how I would test it today.

As for specifying a custom image, I understand there are times when a different image might break things, for example; specifying too old of an image, or jumping ahead to something too new. I wouldn't suggest we encourage people to mess with the default images unless directed by CodeFlare support. But I've been on support for a previous operator for two + years, and there were many times when customers (and dev/support) quickly needed an updated image:

  • It provides a hotfix that allowed them to continue working, versus waiting for another release cycle
  • It provides urgent security patches. (This was super important during the log4j security incident over the Holidays a few years ago)

@KPostOffice
Copy link
Collaborator

@astefanutti This is great - I'm trying to think how I would test it today.

As for specifying a custom image, I understand there are times when a different image might break things, for example; specifying too old of an image, or jumping ahead to something too new. I wouldn't suggest we encourage people to mess with the default images unless directed by CodeFlare support. But I've been on support for a previous operator for two + years, and there were many times when customers (and dev/support) quickly needed an updated image:

* It provides a hotfix that allowed them to continue working, versus waiting for another release cycle

* It provides urgent security patches. (This was super important during the log4j security incident over the Holidays a few years ago)

++ that makes sense, I think the change is good. @astefanutti Has clearly noted the potential for undefined behavior as well. this /lgtm but I haven't tested it.

Copy link
Member

@tedhtchang tedhtchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@astefanutti We probably need a testcase similar to these

@jbusche
Copy link
Collaborator

jbusche commented Jun 6, 2023

Hi @astefanutti, OK with @tedhtchang 's help, I tried it out. Here's what I did:

  1. oc login from my laptop
  2. I cleaned up the old codeflare deploy
oc delete -f https://raw.githubusercontent.com/opendatahub-io/distributed-workloads/main/codeflare-stack-kfdef.yaml -n opendatahub
oc delete sub codeflare-operator -n openshift-operators ; oc delete csv codeflare-operator.v0.0.3 -n openshift-operators
oc delete crd instascales.codeflare.codeflare.dev  mcads.codeflare.codeflare.dev
  1. Downloaded your code/switched to your branch:
git clone https://github.com/astefanutti/codeflare-operator.git
cd codeflare-operator
git checkout pr-04
  1. Install the PR's code:
make manifests # to compile new api(crd)
make install   # will install crd in the cluster
make run       # runs the operator locally, you'll see the operator logs in your terminal…
  1. Now in another window, let's deploy the regular kfdef
oc apply -f https://raw.githubusercontent.com/opendatahub-io/distributed-workloads/main/codeflare-stack-kfdef.yaml -n opendatahub
  1. and then edit the instascale cr to add a controllerimage line like this:
oc edit instascale instascale

and under spec: add something like this:

spec:   
  controllerImage: quay.io/project-codeflare/instascale-controller:v0.0.2
  1. Now the instascale deploy and pod should revert to image v0.0.2
oc describe deploy instascale-instascale |grep Image:
    Image:      quay.io/project-codeflare/instascale-controller:v0.0.2

and
oc describe pod instascale-instascale-67568d99db-gqmjt |grep Image:
    Image:         quay.io/project-codeflare/instascale-controller:v0.0.2

Looks great!

Copy link
Collaborator

@jbusche jbusche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
tested fine on my cluster

@openshift-ci openshift-ci bot added the lgtm label Jun 6, 2023
@openshift-ci
Copy link

openshift-ci bot commented Jun 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jbusche

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jun 6, 2023
@jbusche
Copy link
Collaborator

jbusche commented Jun 6, 2023

/lgtm

@openshift-merge-robot openshift-merge-robot merged commit 15df44b into project-codeflare:main Jun 6, 2023
@astefanutti astefanutti deleted the pr-04 branch June 7, 2023 07:38
@astefanutti
Copy link
Contributor Author

@jbusche great, thanks for the feedback and the detailed instructions. I'll have them adapted and included as part of project-codeflare/codeflare-sdk#137, so it's documented how to test local changes manually.

@astefanutti
Copy link
Contributor Author

@astefanutti We probably need a testcase similar to these

@tedhtchang I've just created #124.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Custom Image Support for InstaScale in the CodeFlare Operator
6 participants