Helm Upgrade Process #890

Closed
ukclivecox opened this issue Sep 26, 2019 · 9 comments · Fixed by #1229

Comments

@ukclivecox
Contributor

Check that upgrades between versions have a documented process and have tests.

@ukclivecox ukclivecox added this to the 0.5.x milestone Sep 26, 2019
@ryandawsonuk
Contributor

ryandawsonuk commented Oct 11, 2019

I just did a helm upgrade from 0.4.2-SNAPSHOT to 0.5.0-SNAPSHOT. The CRD did momentarily disappear, so my watch of kubectl get sdep -w was briefly broken. However, the Pods all kept running and the Deployments stayed up. They all went through a normal rolling update and serve predictions fine, so the process looks good.
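
For reference, a minimal sketch of the kind of check being described (assuming the local snapshot chart path and namespace used later in this thread):

# in one terminal: watch the SeldonDeployments through the upgrade
kubectl get sdep -w

# in another terminal: upgrade the operator release in place (local snapshot chart assumed)
helm upgrade seldon-core ../../helm-charts/seldon-core-operator/ --namespace seldon-system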

@ryandawsonuk
Contributor

ryandawsonuk commented Nov 6, 2019

I just tried this again. This time I did an install of 0.4.1 with:

helm install seldon-core-operator --name seldon-core --repo https://storage.googleapis.com/seldon-charts --set usageMetrics.enabled=true --namespace seldon-system --version 0.4.1 --set istio.gateway="kubeflow-gateway.kubeflow.svc.cluster.local" --set istio.enabled="true" --set engine.logMessagesExternally="true"

I then tried to upgrade to master by running the following from the seldon-core/examples/centralized-logging/ dir:

helm upgrade --install seldon-core ../../helm-charts/seldon-core-operator/ --namespace seldon-system --set istio.gateway="kubeflow-gateway.kubeflow.svc.cluster.local" --set istio.enabled="true" --set engine.logMessagesExternally="true"

Unfortunately kubectl get sdep then failed with Error from server (NotFound): Unable to list "machinelearning.seldon.io/v1alpha3, Resource=seldondeployments": the server could not find the requested resource (get seldondeployments.machinelearning.seldon.io). The pods all got a mysterious rolling update but seemed to stay up.
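
A quick way to see which versions the installed CRD actually declares (a diagnostic sketch, not something from the original report):

# list the versions declared on the installed CRD
kubectl get crd seldondeployments.machinelearning.seldon.io -o yaml | grep -A 2 'name: v1alpha'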

I then deleted the helm release and installed it again. Even then, when I submitted a SeldonDeployment, I got this error:

Error): Internal error occurred: failed calling webhook "mutating-create-update-seldondeployment.seldon.io": Post https://webhook-server-service.seldon-system.svc:443/mutating-create-update-seldondeployment?timeout=30s: service "webhook-server-service" not found

This seems to be related to the webhook changing its name. The helm upgrade doesn't seem to handle that, and neither did helm delete. I had to manually run:

kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io mutating-webhook-configuration -n seldon-system
kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io validating-webhook-configuration -n seldon-system

I am not sure how to avoid these issues in future except to always install the already-released version and attempt an upgrade to master.

@ryandawsonuk
Contributor

ryandawsonuk commented Nov 6, 2019

The best I've been able to do is the procedure below, but it does result in a window of downtime.

helm install seldon-core-operator --name seldon-core --repo https://storage.googleapis.com/seldon-charts --set usageMetrics.enabled=true --namespace seldon-system --version 0.4.1 --set istio.gateway="kubeflow-gateway.kubeflow.svc.cluster.local" --set istio.enabled="true" --set engine.logMessagesExternally="true"
<INSTALL SOME SDEPS, WAIT FOR PODS AND WATCH>
kubectl delete crd seldondeployments.machinelearning.seldon.io
kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io mutating-webhook-configuration -n seldon-system
kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io validating-webhook-configuration -n seldon-system
helm upgrade --install seldon-core ../../helm-charts/seldon-core-operator/ --namespace seldon-system --set istio.gateway="kubeflow-gateway.kubeflow.svc.cluster.local" --set istio.enabled="true" --set engine.logMessagesExternally="true"
<REINSTALL THE SDEPS>

It's not exactly clear why I have to remove the CRD, but in master it does look quite different from the one in 0.4.1. It could maybe be the addition of the labels.
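
One way to pin down where the two CRDs differ would be to render both charts and diff the output (a sketch, assuming Helm 2 and the chart repo/paths used above; file layout inside the chart may differ):

# fetch the released 0.4.1 chart locally (repo and version from the install command above)
helm fetch seldon-core-operator --repo https://storage.googleapis.com/seldon-charts --version 0.4.1 --untar --untardir /tmp/seldon-0.4.1

# render both charts and diff the output to spot CRD changes
helm template /tmp/seldon-0.4.1/seldon-core-operator > /tmp/rendered-0.4.1.yaml
helm template ../../helm-charts/seldon-core-operator/ > /tmp/rendered-master.yaml
diff /tmp/rendered-0.4.1.yaml /tmp/rendered-master.yaml | less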

@axsaucedo
Contributor

Here are some notes from our meeting discussing the different types of "upgrades" that could be supported. These are subject to re-scoping as we decide which one would be most feasible to start with, as it may end up being a more iterative approach:

Different types of "handling versions" of Seldon Core across clusters:

  • Only able to install one operator globally in the cluster
  • Able to install one operator globally in the cluster and restrict with RBAC towards specific namespaces
  • Able to install an operator in a single namespace, but all have to be same CRD version
  • Able to install multiple operators with multiple different CRDs, but leveraging a central / backwards compatible CRD
  • Able to restrict an operator to look after specific Seldon Deployments that have specific labels

Separate from this is the ability to do rolling upgrades of each individual operator, with backwards compatibility for all the deployed models.

Action points:

  • Test Upgrading from 0.5.0 to 0.5.1
  • Scope work as deliverable task
  • Explore feasibility of operators that can have access across multiple namespaces
  • Take what we’ve discussed and convert into tasks
  • Scope work for integration tests that run rolling updates with models

As a heads up, it was pointed out by @cliveseldon that you cannot have multiple separate CRDs for different versions, but you can have a single backwards compatible CRD. He pointed to the kubebuilder book for an example of how this can be achieved with a central CRD that serves the different versions.
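
For illustration, here is a rough sketch of the multi-version CRD pattern described in the kubebuilder book (the conversion webhook service and path are placeholders, not the actual contents of the Seldon charts):

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: seldondeployments.machinelearning.seldon.io
spec:
  group: machinelearning.seldon.io
  names:
    kind: SeldonDeployment
    plural: seldondeployments
    shortNames:
    - sdep
  scope: Namespaced
  versions:
  - name: v1alpha2   # "hub"/storage version; objects are persisted as this version
    served: true
    storage: true
  - name: v1alpha3   # "spoke" version, still served to clients
    served: true
    storage: false
  conversion:
    strategy: Webhook          # the operator converts between versions on the fly
    webhookClientConfig:       # placeholder service, not the real chart values
      service:
        name: webhook-server-service
        namespace: seldon-system
        path: /convert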

@adriangonz
Contributor

@ryandawsonuk I just found out that the v0.4.1 charts specified a version v1alpha3 of the CRD:

- name: v1alpha2
  served: true
  storage: true
- name: v1alpha3
  served: true
  storage: false

This version was then removed in v0.5.0.

The error during the upgrade to v0.5.0 (and onwards) seems to point to this missing v1alpha3. I'm still not sure how Kubernetes treats these versions. Should we try to "phase out" v1alpha3 by setting served: false above?
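
That phase-out would look roughly like this in the versions block above (a sketch of the suggestion only, not a tested change):

- name: v1alpha2
  served: true
  storage: true
- name: v1alpha3
  served: false   # still declared, but no longer served to clients
  storage: false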

@ukclivecox
Contributor Author

Yes, it was a mistake that v1alpha3 got into a release. I'm not sure if we want to add it back in now that we have removed it in 0.5.x?

@adriangonz
Contributor

@cliveseldon it seems to break the upgrade path from v0.4.1 to v0.5.0+. We could add it back as a "non-served" version so that Kubernetes doesn't think it has suddenly disappeared?

I'm gonna run a couple tests first to see if that actually solves the issue @ryandawsonuk described above.

@adriangonz
Contributor

As an update on this, it seems that v0.4.1 stores the models as v1alpha3 regardless of having storage: false set:

apiVersion: machinelearning.seldon.io/v1alpha3
kind: SeldonDeployment
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"machinelearning.seldon.io/v1alpha2","kind":"SeldonDeployment","metadata":{"annotations":{},"name":"sklearn","namespace":"seldon"},"spec":{"name":"iris","predictors":[{"graph":{"children":[],"implementation":"SKLEARN_SERVER","modelUri":"gs://seldon-models/sklearn/iris","name":"classifier"},"name":"default","replicas":1,"svcOrchSpec":{"env":[{"name":"SELDON_LOG_LEVEL","value":"DEBUG"}]}}]}}
  creationTimestamp: "2019-11-26T17:34:19Z"
  generation: 1
  name: sklearn
  namespace: seldon
  resourceVersion: "2051"
  selfLink: /apis/machinelearning.seldon.io/v1alpha3/namespaces/seldon/seldondeployments/sklearn
  uid: 571caa4e-8e64-4d98-8559-363038cd1eee
...

@adriangonz
Contributor

adriangonz commented Nov 27, 2019

It seems that Kubernetes always tries to operate with the most recent version according to the version naming convention. Therefore, by default, it will try to use v1alpha3, even when it's not the storage version.

This is at least what I get from kubectl get sdep. The only way to get the deployed resources as v1alpha2 is to fetch them explicitly as kubectl get sdep.v1alpha2.machinelearning.seldon.io.
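
In other words (these are the same commands mentioned above):

# returns the resources as the newest served version, v1alpha3
kubectl get sdep

# explicitly requests the v1alpha2 representation
kubectl get sdep.v1alpha2.machinelearning.seldon.io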

Based on this, I think that the cleanest solution to ensure a safe upgrade path is to keep v1alpha3 as it was in v0.4.1. This would just be a temporary solution until we release version v1.

What are your thoughts on this @ryandawsonuk @cliveseldon?

PS: I can confirm that adding v1alpha3 as a CRD version on the 0.5.1-SNAPSHOT charts fixes the "could not find the requested resource" error.

@ukclivecox ukclivecox changed the title from "Helm and Kustomize Upgrade Process" to "Helm Upgrade Process" Dec 5, 2019