Helm Upgrade Process #890

Closed
ukclivecox opened this issue Sep 26, 2019 · 9 comments · Fixed by #1229

Comments

@ukclivecox
Contributor

Check that upgrades between versions have a documented process and have tests.

@ukclivecox ukclivecox added this to the 0.5.x milestone Sep 26, 2019
@ryandawsonuk
Contributor

ryandawsonuk commented Oct 11, 2019

I just did a helm upgrade from 0.4.2-SNAPSHOT to 0.5.0-SNAPSHOT. The CRD did momentarily disappear, so my watch of kubectl get sdep -w was briefly broken. However, the Pods all kept running and the Deployments stayed up. They all went through a normal rolling update and serve predictions fine, so the process looks good.
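
For reference, a minimal sketch of the kind of check being described (assuming the local snapshot chart path and namespace used later in this thread):

# in one terminal: watch the SeldonDeployments through the upgrade
kubectl get sdep -w

# in another terminal: upgrade the operator release in place (local snapshot chart assumed)
helm upgrade seldon-core ../../helm-charts/seldon-core-operator/ --namespace seldon-system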

@ryandawsonuk
Contributor

ryandawsonuk commented Nov 6, 2019

I just tried this again. This time I did an install of 0.4.1 with:

helm install seldon-core-operator --name seldon-core --repo https://storage.googleapis.com/seldon-charts --set usageMetrics.enabled=true --namespace seldon-system --version 0.4.1 --set istio.gateway="kubeflow-gateway.kubeflow.svc.cluster.local" --set istio.enabled="true" --set engine.logMessagesExternally="true"

I then tried to upgrade to master by running the following from the seldon-core/examples/centralized-logging/ dir:

helm upgrade --install seldon-core ../../helm-charts/seldon-core-operator/ --namespace seldon-system --set istio.gateway="kubeflow-gateway.kubeflow.svc.cluster.local" --set istio.enabled="true" --set engine.logMessagesExternally="true"

Unfortunately kubectl get sdep then failed with Error from server (NotFound): Unable to list "machinelearning.seldon.io/v1alpha3, Resource=seldondeployments": the server could not find the requested resource (get seldondeployments.machinelearning.seldon.io). The pods all got a mysterious rolling update but seemed to stay up.
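
A quick way to see which versions the installed CRD actually declares (a diagnostic sketch, not something from the original report):

# list the versions declared on the installed CRD
kubectl get crd seldondeployments.machinelearning.seldon.io -o yaml | grep -A 2 'name: v1alpha'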

I then deleted the helm release and installed it again. Even then, when I submitted a SeldonDeployment, I got this error:

Error): Internal error occurred: failed calling webhook "mutating-create-update-seldondeployment.seldon.io": Post https://webhook-server-service.seldon-system.svc:443/mutating-create-update-seldondeployment?timeout=30s: service "webhook-server-service" not found

This seems to be related to the webhook changing its name. The helm upgrade doesn't seem to handle that, and neither did helm delete. I had to manually run:

kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io mutating-webhook-configuration -n seldon-system
kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io validating-webhook-configuration -n seldon-system

I am not sure how to avoid these issues in future except to always install the already-released version and attempt an upgrade to master.

@ryandawsonuk
Contributor

ryandawsonuk commented Nov 6, 2019

The best I've been able to do is the procedure below, but it does result in a window of downtime.

helm install seldon-core-operator --name seldon-core --repo https://storage.googleapis.com/seldon-charts --set usageMetrics.enabled=true --namespace seldon-system --version 0.4.1 --set istio.gateway="kubeflow-gateway.kubeflow.svc.cluster.local" --set istio.enabled="true" --set engine.logMessagesExternally="true"
<INSTALL SOME SDEPS, WAIT FOR PODS AND WATCH>
kubectl delete crd seldondeployments.machinelearning.seldon.io
kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io mutating-webhook-configuration -n seldon-system
kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io validating-webhook-configuration -n seldon-system
helm upgrade --install seldon-core ../../helm-charts/seldon-core-operator/ --namespace seldon-system --set istio.gateway="kubeflow-gateway.kubeflow.svc.cluster.local" --set istio.enabled="true" --set engine.logMessagesExternally="true"
<REINSTALL THE SDEPS>

It's not exactly clear why I have to remove the CRD, but in master it does look quite different from the one in 0.4.1. It could maybe be the addition of the labels.
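
One way to pin down where the two CRDs differ would be to render both charts and diff the output (a sketch, assuming Helm 2 and the chart repo/paths used above; file layout inside the chart may differ):

# fetch the released 0.4.1 chart locally (repo and version from the install command above)
helm fetch seldon-core-operator --repo https://storage.googleapis.com/seldon-charts --version 0.4.1 --untar --untardir /tmp/seldon-0.4.1

# render both charts and diff the output to spot CRD changes
helm template /tmp/seldon-0.4.1/seldon-core-operator > /tmp/rendered-0.4.1.yaml
helm template ../../helm-charts/seldon-core-operator/ > /tmp/rendered-master.yaml
diff /tmp/rendered-0.4.1.yaml /tmp/rendered-master.yaml | less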

@axsaucedo
Contributor

Here are some notes from our meeting discussing the different types of "upgrades" that could be supported. These are subject to re-scoping as we decide which one would be most feasible to start with, as it may end up being a more iterative approach:

Different types of "handling versions" of Seldon Core across clusters:

  • Only able to install one operator globally in the cluster
  • Able to install one operator globally in the cluster and restrict with RBAC towards specific namespaces
  • Able to install an operator in a single namespace, but all have to be same CRD version
  • Able to install multiple operators with multiple different CRDs, but leveraging a central / backwards compatible CRD
  • Able to restrict an operator to look after specific Seldon Deployments that have specific labels

Separate from this is the ability to do rolling upgrades of each individual operator, with backwards compatibility for all the deployed models.

Action points:

  • Test Upgrading from 0.5.0 to 0.5.1
  • Scope work as deliverable task
  • Explore feasibility of operators that can have access across multiple namespaces
  • Take what we’ve discussed and convert into tasks
  • Scope work for integration tests that run rolling updates with models

As a heads up, it was pointed out by @cliveseldon that you cannot have multiple separate CRDs for different versions, but you can have a single backwards compatible CRD. He pointed to the kubebuilder book for an example of how this can be achieved with a central CRD that serves the different versions.
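
For illustration, here is a rough sketch of the multi-version CRD pattern described in the kubebuilder book (the conversion webhook service and path are placeholders, not the actual contents of the Seldon charts):

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: seldondeployments.machinelearning.seldon.io
spec:
  group: machinelearning.seldon.io
  names:
    kind: SeldonDeployment
    plural: seldondeployments
    shortNames:
    - sdep
  scope: Namespaced
  versions:
  - name: v1alpha2   # "hub"/storage version; objects are persisted as this version
    served: true
    storage: true
  - name: v1alpha3   # "spoke" version, still served to clients
    served: true
    storage: false
  conversion:
    strategy: Webhook          # the operator converts between versions on the fly
    webhookClientConfig:       # placeholder service, not the real chart values
      service:
        name: webhook-server-service
        namespace: seldon-system
        path: /convert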

@adriangonz
Contributor

@ryandawsonuk I just found out that the v0.4.1 charts specified a version v1alpha3 of the CRD:

- name: v1alpha2
  served: true
  storage: true
- name: v1alpha3
  served: true
  storage: false

This version was then removed in v0.5.0.

The error during the upgrade to v0.5.0 (and onwards) seems to point to this missing v1alpha3. I'm still not sure how Kubernetes treats these versions. Should we try to "phase out" v1alpha3 by setting served: false above?
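
That phase-out would look roughly like this in the versions block above (a sketch of the suggestion only, not a tested change):

- name: v1alpha2
  served: true
  storage: true
- name: v1alpha3
  served: false   # still declared, but no longer served to clients
  storage: false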

@ukclivecox
Contributor Author

Yes, it was a mistake that v1alpha3 got into a release. I'm not sure if we want to add it back in now that we have removed it in 0.5.x?

@adriangonz
Contributor

@cliveseldon it seems to break the upgrade path from v0.4.1 to v0.5.0+. We could add it back as a "non-served" version so that Kubernetes doesn't think it has suddenly disappeared?

I'm gonna run a couple tests first to see if that actually solves the issue @ryandawsonuk described above.

@adriangonz
Contributor

As an update on this, it seems that v0.4.1 stores the models as v1alpha3 regardless of having storage: false set:

apiVersion: machinelearning.seldon.io/v1alpha3
kind: SeldonDeployment
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"machinelearning.seldon.io/v1alpha2","kind":"SeldonDeployment","metadata":{"annotations":{},"name":"sklearn","namespace":"seldon"},"spec":{"name":"iris","predictors":[{"graph":{"children":[],"implementation":"SKLEARN_SERVER","modelUri":"gs://seldon-models/sklearn/iris","name":"classifier"},"name":"default","replicas":1,"svcOrchSpec":{"env":[{"name":"SELDON_LOG_LEVEL","value":"DEBUG"}]}}]}}
  creationTimestamp: "2019-11-26T17:34:19Z"
  generation: 1
  name: sklearn
  namespace: seldon
  resourceVersion: "2051"
  selfLink: /apis/machinelearning.seldon.io/v1alpha3/namespaces/seldon/seldondeployments/sklearn
  uid: 571caa4e-8e64-4d98-8559-363038cd1eee
...

@adriangonz
Contributor

adriangonz commented Nov 27, 2019

It seems that Kubernetes always tries to operate with the most recent version according to the version naming convention. Therefore, by default, it will try to use v1alpha3, even when it's not the storage version.

This is at least what I get from kubectl get sdep. The only way to get the deployed resources as v1alpha2 is to fetch them explicitly as kubectl get sdep.v1alpha2.machinelearning.seldon.io.
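
In other words (these are the same commands mentioned above):

# returns the resources as the newest served version, v1alpha3
kubectl get sdep

# explicitly requests the v1alpha2 representation
kubectl get sdep.v1alpha2.machinelearning.seldon.io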

Based on this, I think that the cleanest solution to ensure a safe upgrade path is to keep v1alpha3 as it was in v0.4.1. This would just be a temporary solution until we release version v1.

What are your thoughts on this @ryandawsonuk @cliveseldon?

PS: I can confirm that adding v1alpha3 as a CRD version on the 0.5.1-SNAPSHOT charts fixes the "could not find the requested resource" error.

@ukclivecox ukclivecox changed the title from "Helm and Kustomize Upgrade Process" to "Helm Upgrade Process" Dec 5, 2019