Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor, chore: refactor charm to use Deployment for workload, also bumps training-operator 1.7->1.8 #167

Merged
merged 7 commits into from
Jul 9, 2024

Conversation

DnPlas
Copy link
Contributor

@DnPlas DnPlas commented Jun 19, 2024

This PR merges the changes in the KF-5692-1.8-dev-branch into main

Fixes #159

@DnPlas DnPlas requested a review from a team as a code owner June 19, 2024 17:03
@DnPlas DnPlas force-pushed the KF-5692-training-1.8-dev-branch branch 3 times, most recently from 5bb0371 to 32ecee6 Compare June 19, 2024 17:07
DnPlas and others added 6 commits July 4, 2024 15:54
* pin integration test dependencies, refactor constants in tests (#155)

Pins dependencies in the integration tests to their corresponding channels for this release.

Ref: canonical/bundle-kubeflow#866

Co-authored-by: Andrew Scribner <ca.scribner+1@gmail.com>
* refactor: deploy the training-operator with kubernetes resources

This commit refactors the way the training-operator is deployed, as instead of using
a sidecar container that runs the workload, we are now applying the Deployment and all
the Kubernetes resources required by the training-operator controller to be able to mange
training resources.
We are introducing this change in preparation for the upcoming 1.8 version, as it introduces the
hard dependency on a Kubernetes Secret mounted in a volume for the training-operator workload
to start. For more details please refer to #159.
Build charmed-kubeflow-chisme for requirements-integration.txt.

Part of charmed-kubeflow-chisme#104
* tests: skip test_upgrade due to #170

#170 is affecting the execution of this test, but since the fix
is on juju, there is not much we can do at the moment other than
skipping the test.

Part of #170
…173)

* refactor: apply a workload Service instead of using juju created one

To avoid inconsistent behaviours, it is preferrable to apply and use a Service
owned by the charm so it can be rendered as needed by the controller.
This commit introduces the following changes:

* The charm now renders and applies a ValidatingWebhookConfiguration resource for 
  training-operator CRDs.
* The charm will render the Service to also serve on port 9443 for the webhook service.
* The oci-image is updated to v1.8 of the training-operator
* The training-operator Deployment now has a volume mount for mounting
  the secret that is used by the cert-controller to generate and rotate
  certificates for the ValidatingWebhookConfiguration
* The training-operator Deployment will now take an argument so the webhook service can use the training-operator workload's Service instead of the default
* Updates the examples directory with examples from kubeflow/training-operator v1.8-branch

Fixes #159
@DnPlas DnPlas force-pushed the KF-5692-training-1.8-dev-branch branch from 350b34e to 8381b07 Compare July 4, 2024 13:55
@DnPlas DnPlas enabled auto-merge (squash) July 4, 2024 19:51
@DnPlas DnPlas disabled auto-merge July 4, 2024 21:26
@DnPlas DnPlas enabled auto-merge (squash) July 4, 2024 21:27
* feat: relate to dashboard and add documentation link CKF 1.9
Copy link
Contributor

@NohaIhab NohaIhab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
I have tested the upgrade path we'll have to go with described in #170 by:

  1. deploy training-operator 1.7/stable
  2. wait until active then remove the charm
  3. redeploy with the channel from this pr
  4. run a training job from the /examples

@DnPlas DnPlas merged commit d9197c5 into main Jul 9, 2024
7 checks passed
@DnPlas DnPlas deleted the KF-5692-training-1.8-dev-branch branch July 9, 2024 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update training-operator manifests
2 participants