Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(kserve): automates installation and configuration of KServe prerequisites #691

Merged

Conversation

israel-hdez
Copy link
Contributor

@israel-hdez israel-hdez commented Nov 2, 2023

Fixes #666

Description

KServe pre-requisites are:

  • Service Mesh (Istio)
    • A minimal Control Plane is configured for KServe with only Pilot and default gateways.
    • An additional knative: ingressgateway is set for the Istio Ingress gateway workload.
    • Some ports are excluded from envoy to allow for metrics collection and KNative hooks.
    • Metrics collection is configured for Pilot and the gateways.
  • Serverless (KNative)
    • Only serving components are needed from KNative.
    • For the most part, a typical Serving deployment is configured, with Istio as networking layer.
    • By default, a self-signed certificate is generated using the OpenShift Ingress domain. Users can provide their own secret with a production ready TLS certificate.

Fixes opendatahub-io/kserve#105

How Has This Been Tested?

Manual testing, for the most part.

Some instructions (targeted for the serving team): opendatahub-io/kserve#105 (comment)

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

KServe pre-requisites are:
* Service Mesh (Istio)
  * A minimal Control Plane is configured for KServe with only Pilot and default gateways.
  * An additional knative: ingressgateway is set for the Istio Ingress gateway workload.
  * Some ports are excluded from envoy to allow for metrics collection and KNative hooks.
  * Metrics collection is configured for Pilot and the gateways.
* Serverless (KNative)
  * Only serving components are needed from KNative.
  * For the most part, a typical Serving deployment is configured, with Istio as networking layer.
  * By default, a self-signed certificate is generated using the OpenShift Ingress domain. Users can provide their own secret with a production ready TLS certificate.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
@zdtsw zdtsw added the rhods-2.5 label Nov 2, 2023
@zdtsw
Copy link
Member

zdtsw commented Nov 3, 2023

ref #666

pkg/feature/serverless/serverless_feature.go Outdated Show resolved Hide resolved
pkg/feature/servicemesh/conditions.go Outdated Show resolved Hide resolved
pkg/feature/servicemesh/conditions.go Outdated Show resolved Hide resolved
pkg/feature/servicemesh/servicemesh_feature.go Outdated Show resolved Hide resolved
pkg/feature/serverless/serverless_feature.go Outdated Show resolved Hide resolved
Copy link
Contributor

@bartoszmajsak bartoszmajsak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the great work! I have shared a few things worth considering before we merge it.

FWIW: There are some tests you might want to include as part of this PR (and perhaps write some more) - have a look here https://github.com/maistra/opendatahub-operator/blob/service-mesh-integration/tests/integration/servicemesh/service_mesh_setup_features_int_test.go

apis/dscinitialization/v1/serverless_types.go Outdated Show resolved Hide resolved
apis/dscinitialization/v1/servicemesh_types.go Outdated Show resolved Hide resolved
apis/dscinitialization/v1/servicemesh_types.go Outdated Show resolved Hide resolved
apis/dscinitialization/v1/dscinitialization_types.go Outdated Show resolved Hide resolved
pkg/feature/servicemesh/conditions.go Outdated Show resolved Hide resolved
pkg/feature/templates/servicemesh/base/member-roll.tmpl Outdated Show resolved Hide resolved
apis/dscinitialization/v1/serverless_types.go Outdated Show resolved Hide resolved
@israel-hdez israel-hdez force-pushed the kserve-mesh-serverless-prereq branch 2 times, most recently from b949942 to c241750 Compare November 7, 2023 00:53
* Implement EnsureServerlessAbsent func.
* Fix PR feedback.
* Revert get_all_manifests.sh

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
@israel-hdez
Copy link
Contributor Author

FWIW: There are some tests you might want to include as part of this PR (and perhaps write some more) - have a look here https://github.com/maistra/opendatahub-operator/blob/service-mesh-integration/tests/integration/servicemesh/service_mesh_setup_features_int_test.go

@bartoszmajsak Definitely! Let me work on the tests on my Tuesday.

Although, I'll let the operator team to decide if we are good to merge while I work on the tests. I know there is a scheduled release planned for this week.

@israel-hdez
Copy link
Contributor Author

/retest

@israel-hdez
Copy link
Contributor Author

@Jooho this is ready for testing.
I will work on units, in the meantime.

Comment on lines 30 to 31
// +kubebuilder:validation:Enum=Kubernetes;ThirdParty
// +kubebuilder:default=Kubernetes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can't pull maistra/api for this purpose? https://github.com/maistra/istio-operator/blob/d922eece2343343e27016ea0d498bda36bfa64d8/pkg/apis/maistra/v2/security.go#L202-L209

One big question I have in mind is that with these kind of settings, we are drifting away from a potential support of upstream Istio and locking it to Opesnhift Service Mesh.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This options seems to be related to this: https://istio.io/latest/docs/ops/best-practices/security/#configure-third-party-service-account-tokens.

If it is, I think it still would be applicable to upstream.

Upstream docs seem to have a hint about automating this: looks like a check against the TokenRequest API would tell if we can use ThirdParty. Perhaps we should set this to ThirdParty unless the TokenRequest API is not available. From those upstream docs, it looks like ThirdParty should be preferred over first party. Although, I'm not sure what really would be the recommendation on Maistra.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is related, but wrapped in Maistra config/objects. How did you test it thus far without this feature as it is defaulting to Kubernetes (first-party-jwt)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly use a non-ROSA cluster, so first party tokens are OK.
And... well... I haven't tested this on ROSA. I'm just trusting this change would be enough on ROSA.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i set it to ThirdParty in my ROSA cluster and it created the 3 featuretrackers ( not enable monitoring one)
previously when not set it (fall back to Kubernetes) had issue to set FTer up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conclusion: we will document this scenario rather than expose it as a config in DSCI. Relevant docs on OSSM side: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.8/html-single/service_mesh/index#ossm-install-rosa_ossm-create-smcp

/cc @danielezonca

@bartoszmajsak
Copy link
Contributor

bartoszmajsak commented Nov 7, 2023

FWIW: There are some tests you might want to include as part of this PR (and perhaps write some more) - have a look here maistra/opendatahub-operator@service-mesh-integration/tests/integration/servicemesh/service_mesh_setup_features_int_test.go

@bartoszmajsak Definitely! Let me work on the tests on my Tuesday.

Although, I'll let the operator team to decide if we are good to merge while I work on the tests. I know there is a scheduled release planned for this week.

@zdtsw @VaishnaviHire It's essential that we establish a consensus on key principles moving forward. From my perspective, every significant modification should come with automated tests. Without this, we not only waste valuable time on manual verification but also do not build safety net against potential regressions. The V2 overhaul presents us with an excellent chance to lay down a foundation of high-quality codebase. This, in turn, will enable us to develop exceptional tools for OpenDataHub moving forward.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
@openshift-ci openshift-ci bot removed the lgtm label Nov 13, 2023
@israel-hdez
Copy link
Contributor Author

Pushed commit to fix @zdtsw comments.

@zdtsw
Copy link
Member

zdtsw commented Nov 14, 2023

When creating a DSC, the spec by default does not contain kserve.serving which is needed for enabling Kserve/Service-Mesh (i.e. kserve.serving.managementState: Managed). It would be more efficient to include kserve.serving by default.

image

I see 2 approaches:

  1. have serverless config under Kserve by default but set to Removed to both by default in ODH => what we have in the PR
  2. do not have serverless config under Kserve by default, this will ask users to do the configs of serverless by themselves (not just a flip of Removed to Managed in the state) => as Matt suggested to make the enablement easiler

@danielezonca @israel-hdez @VaishnaviHire @bartoszmajsak
i dont think this will be a "must-done" to get the PR in, we can have a follow-up if everyone feels (2) is better for user.

@bartoszmajsak
Copy link
Contributor

bartoszmajsak commented Nov 14, 2023

When creating a DSC, the spec by default does not contain kserve.serving which is needed for enabling Kserve/Service-Mesh (i.e. kserve.serving.managementState: Managed). It would be more efficient to include kserve.serving by default.
image

I see 2 approaches:

  1. have serverless config under Kserve by default but set to Removed to both by default in ODH => what we have in the PR
  2. do not have serverless config under Kserve by default, this will ask users to do the configs of serverless by themselves (not just a flip of Removed to Managed in the state) => as Matt suggested to make the enablement easiler

@danielezonca @israel-hdez @VaishnaviHire @bartoszmajsak i dont think this will be a "must-done" to get the PR in, we can have a follow-up if everyone feels (2) is better for user.

If we decide to make it all enabled by default it will mean having service mesh config enabled by default in DSCI too. I cannot answer from the requirements perspective what is preferred way. Perhaps @danielezonca can.

With that said I would rather make an issue and raise another PR as it seems like an improvement to me rather than an important fix.

@zdtsw
Copy link
Member

zdtsw commented Nov 14, 2023

to update:
from my live-build (of the latest commits from 13th Nov evening)
default-dsc was auto created with config like this:

Screenshot from 2023-11-14 08-30-26

Copy link
Member

@zdtsw zdtsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot removed the lgtm label Nov 14, 2023
@zdtsw
Copy link
Member

zdtsw commented Nov 14, 2023

Screenshot from 2023-11-14 12-09-27
forgot to add my screenshot after the test
to make it easier, only enabled kserve here

Screenshot from 2023-11-14 12-10-38

Screenshot from 2023-11-14 12-11-49

Screenshot from 2023-11-14 12-12-28

Copy link
Member

@VaishnaviHire VaishnaviHire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link

openshift-ci bot commented Nov 14, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bartoszmajsak, cam-garrison, spolti, VaishnaviHire, zdtsw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@israel-hdez israel-hdez merged commit 0e086b4 into opendatahub-io:incubation Nov 14, 2023
6 of 7 checks passed
@israel-hdez israel-hdez deleted the kserve-mesh-serverless-prereq branch November 14, 2023 18:45
VaishnaviHire pushed a commit to VaishnaviHire/opendatahub-operator that referenced this pull request Nov 15, 2023
…equisites (opendatahub-io#691)

* Implement installation and configuration of KServe prerequisites

KServe pre-requisites are:
* Service Mesh (Istio)
  * A minimal Control Plane is configured for KServe with only Pilot and default gateways.
  * An additional knative: ingressgateway is set for the Istio Ingress gateway workload.
  * Some ports are excluded from envoy to allow for metrics collection and KNative hooks.
  * Metrics collection is configured for Pilot and the gateways.
* Serverless (KNative)
  * Only serving components are needed from KNative.
  * For the most part, a typical Serving deployment is configured, with Istio as networking layer.
  * By default, a self-signed certificate is generated using the OpenShift Ingress domain. Users can provide their own secret with a production ready TLS certificate.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* chore: updates apis/dscinitialization/v1/servicemesh_types.go

Co-authored-by: Cameron Garrison <cgarriso@redhat.com>

* wip: serverless as part of kserve

* chore: removes identitytype from CRD

* feat: ensures that serverless config is removed when KServe is enabled

* chore: checks if service mesh is configured when serving is enabled

* fix: returns error on fail to remove serverless

* chore(dsci): attempts to reduce test flakyness

* chore: removes misleading TODO

* Fix Dockerfile

Add new COPY statement, due to the new `infrastructure` directory

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* chore: regenerates bundle

---------

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
Co-authored-by: Bartosz Majsak <bartosz.majsak@gmail.com>
Co-authored-by: Cameron Garrison <cgarriso@redhat.com>
VaishnaviHire pushed a commit to VaishnaviHire/opendatahub-operator that referenced this pull request Nov 15, 2023
…equisites (opendatahub-io#691)

* Implement installation and configuration of KServe prerequisites

KServe pre-requisites are:
* Service Mesh (Istio)
  * A minimal Control Plane is configured for KServe with only Pilot and default gateways.
  * An additional knative: ingressgateway is set for the Istio Ingress gateway workload.
  * Some ports are excluded from envoy to allow for metrics collection and KNative hooks.
  * Metrics collection is configured for Pilot and the gateways.
* Serverless (KNative)
  * Only serving components are needed from KNative.
  * For the most part, a typical Serving deployment is configured, with Istio as networking layer.
  * By default, a self-signed certificate is generated using the OpenShift Ingress domain. Users can provide their own secret with a production ready TLS certificate.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* chore: updates apis/dscinitialization/v1/servicemesh_types.go

Co-authored-by: Cameron Garrison <cgarriso@redhat.com>

* wip: serverless as part of kserve

* chore: removes identitytype from CRD

* feat: ensures that serverless config is removed when KServe is enabled

* chore: checks if service mesh is configured when serving is enabled

* fix: returns error on fail to remove serverless

* chore(dsci): attempts to reduce test flakyness

* chore: removes misleading TODO

* Fix Dockerfile

Add new COPY statement, due to the new `infrastructure` directory

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* chore: regenerates bundle

---------

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
Co-authored-by: Bartosz Majsak <bartosz.majsak@gmail.com>
Co-authored-by: Cameron Garrison <cgarriso@redhat.com>
zdtsw pushed a commit to red-hat-data-services/rhods-operator that referenced this pull request Nov 15, 2023
* feat(kserve): automates installation and configuration of KServe prerequisites (opendatahub-io#691)

* Implement installation and configuration of KServe prerequisites

KServe pre-requisites are:
* Service Mesh (Istio)
  * A minimal Control Plane is configured for KServe with only Pilot and default gateways.
  * An additional knative: ingressgateway is set for the Istio Ingress gateway workload.
  * Some ports are excluded from envoy to allow for metrics collection and KNative hooks.
  * Metrics collection is configured for Pilot and the gateways.
* Serverless (KNative)
  * Only serving components are needed from KNative.
  * For the most part, a typical Serving deployment is configured, with Istio as networking layer.
  * By default, a self-signed certificate is generated using the OpenShift Ingress domain. Users can provide their own secret with a production ready TLS certificate.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* chore: updates apis/dscinitialization/v1/servicemesh_types.go

Co-authored-by: Cameron Garrison <cgarriso@redhat.com>

* wip: serverless as part of kserve

* chore: removes identitytype from CRD

* feat: ensures that serverless config is removed when KServe is enabled

* chore: checks if service mesh is configured when serving is enabled

* fix: returns error on fail to remove serverless

* chore(dsci): attempts to reduce test flakyness

* chore: removes misleading TODO

* Fix Dockerfile

Add new COPY statement, due to the new `infrastructure` directory

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* chore: regenerates bundle

---------

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
Co-authored-by: Bartosz Majsak <bartosz.majsak@gmail.com>
Co-authored-by: Cameron Garrison <cgarriso@redhat.com>

* Disable Service Mesh IOR (opendatahub-io#738)

This disables automatic OpenShift routes creation.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
(cherry picked from commit 2c6f48c)

* Add Defaults for new installs

---------

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
Co-authored-by: Edgar Hernández <ehernand@redhat.com>
Co-authored-by: Bartosz Majsak <bartosz.majsak@gmail.com>
Co-authored-by: Cameron Garrison <cgarriso@redhat.com>
@zdtsw
Copy link
Member

zdtsw commented Nov 18, 2023

Btw, did someone try to run the operator build from this PR with the following DSC?

apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: example
  labels:
    app.kubernetes.io/name: datasciencecluster
spec:
  components:
    kserve:
      managementState: Managed

Results it:

2023-11-10T17:23:21Z DEBUG events DataScienceCluster instance example created, but have some failures in component 1 error occurred:
* DSCInitialization.dscinitialization.opendatahub.io "default" not found

The operator is creating the dsci as default-dsci. It seems to be changed by this pr: #708
Maybe this PR needs a rebase or?

correct. we need to to rebase this PR before testing it.

@spolti @zdtsw @VaishnaviHire The error is not affecting functionality of this PR. Yet, the error is present despite fetching changes from incubation branch.

I debugged a little bit, and found the hard-coded constant here:

.
Probably, it is worth to fix, before the freeze.

not sure why i got this message last night, but the code in the quote has been fixed in the incubation branch, unless you still see it in any of ODH nightly/downstream 2.5 build?

@bdattoma
Copy link

bdattoma commented Nov 29, 2023

@israel-hdez

Users can provide their own secret with a production ready TLS certificate.

how?
EDIT. I guess by setting a secret name like done in the instruction godc

@israel-hdez
Copy link
Contributor Author

@israel-hdez

Users can provide their own secret with a production ready TLS certificate.

how? EDIT. I guess by setting a secret name like done in the instruction godc

Yes, the secretName shown in the document is the default name.
You would need to use type: Provided if you want to use your own. Then, you will need to manually create the TLS secret with the specified name containing the certificate.

certificate:
  secretName: knative-serving-cert
  type: SelfSigned

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Servicemesh Kserve integration New Manifests for required CRs installation
10 participants