argo workflow not detecting seldon deployment is available #668

Closed

mbron opened this issue Jul 2, 2019 · 10 comments


mbron commented Jul 2, 2019

Hi, we're having an issue with our Argo workflow sometimes not detecting the 'status.state' == 'Available' condition in the SeldonDeployment resource definition. We're using Seldon Core v0.2.7.

Below is an abbreviated Argo workflow definition; note the successCondition: status.state == Available condition.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
spec:
  templates:
  - name: main            # entrypoint template name assumed; omitted in the abbreviated original
    dag:
      tasks:
      - name: deploy-model
        template: deploy-model

  - name: deploy-model
    resource:
      action: apply
      successCondition: status.state == Available
      manifest: |
        apiVersion: machinelearning.seldon.io/v1alpha2
        kind: SeldonDeployment
        metadata:
          labels:
            app: seldon

So when the workflow successfully detects that the deployment is available, the workflow logs contain the following JSON (snipped):

"{
  \"apiVersion\": \"machinelearning.seldon.io/v1alpha2\",
  \"kind\": \"SeldonDeployment\",
  \"metadata\": {
    \"annotations\": {
      \"kubectl.kubernetes.io/last-applied-configuration\": \"{\\\"apiVersion\\\":\\\"machinelearning.seldon.io/v1alpha2\\\",\\\"kind\\\":\\\"SeldonDeployment\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"labels\\\":{\\\"app\\\":\\\"seldon\\\"}, ... }\\n\"},
      \"creationTimestamp\": \"2019-07-01T10:44:59Z\",
      \"generation\": 1,
      \"labels\": {\"app\": \"seldon\"},
      \"name\": \"seldon-prediction-service\",
      \"resourceVersion\": \"15667665\",
      \"spec\": {
        \"predictors\": [...]
},

\"status\": {\"predictorStatus\": [{\"replicas\": 1,\"replicasAvailable\": 1}],\"state\": \"Available\"}}"

Note the last line: at the outer level of the resource description JSON there is a "status" key whose state has the value "Available".

Now let's look at the argo log output when the workflow does not detect that the deployment is available (snipped):

"{
  \"apiVersion\": \"machinelearning.seldon.io/v1alpha2\",
  \"kind\": \"SeldonDeployment\",
  \"metadata\": {
    \"annotations\": {
    
      \"kubectl.kubernetes.io/last-applied-configuration\": \"{\\\"apiVersion\\\":\\\"machinelearning.seldon.io/v1alpha2\\\",\\\"kind\\\":\\\"SeldonDeployment\\\",\\\"metadata\\\":{\\\"name\\\":\\\"seldon-prediction-service\\\",\\\"generation\\\":2,\\\"creationTimestamp\\\":\\\"2019-06-29T17:32:40Z\\\",\\\"labels\\\":{\\\"app\\\":\\\"seldon\\\"}},\\\"spec\\\":{\\\"predictors\\\":[{\\\"graph\\\":{\\\"type\\\":\\\"MODEL\\\",\\\"endpoint\\\":{\\\"type\\\":\\\"REST\\\"}},\\\"componentSpecs\\\":[{\\\"spec\\\":{...}}],\\\"containers\\\":[{\\\"resources\\\":{\\\"requests\\\":{\\\"memory\\\":\\\"1Mi\\\"}},\\\"imagePullPolicy\\\":\\\"Always\\\"}],\\\"terminationGracePeriodSeconds\\\":20}}],\\\"replicas\\\":1,\\\"annotations\\\":{\\\"predictor_version\\\":\\\"0.0.1\\\"}}],\\\"annotations\\\":{\\\"deployment_version\\\":\\\"0.1\\\"}},\\\"status\\\":{\\\"state\\\":\\\"Available\\\",\\\"predictorStatus\\\":[{\\\"replicas\\\":1,\\\"replicasAvailable\\\":1}]}}\\n\"},
      
    \"creationTimestamp\": \"2019-06-29T17:32:40Z\",
    \"generation\": 2,
    \"labels\": {\"app\": \"seldon\"},
    \"name\": \"seldon-prediction-service\",
    \"resourceVersion\": \"15382733\",
    \"uid\": \"e4e3c2db-9a93-11e9-bfff-6c0b8465aea7\"},
  \"spec\": {\"predictors\": [...]}
}"

Note that the {"status": {"state": "Available"}} condition is actually in the description; however, it is no longer at the outer level but part of the value of the "kubectl.kubernetes.io/last-applied-configuration" key, which is a string (not JSON).

So the question is: why does the condition appear in different places some of the time?

@ukclivecox (Contributor)

The last-applied-configuration annotation is added automatically by k8s when a resource is updated. If the status is not there, then there was an issue with the update.
Is there no status field at all? Can you check the logs of the cluster-manager?
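
For example, something like this should show whether the status is populated at all on the live object (resource and deployment names taken from your snippets; the cluster-manager namespace/name will depend on your install):

# top-level status field, which is what the successCondition is evaluated against
kubectl get sdep seldon-prediction-service -o jsonpath='{.status.state}'

# tail the cluster-manager logs (deployment name and namespace assumed, adjust to your install)
kubectl logs deploy/seldon-cluster-manager -n seldon-system --tail=100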


mbron commented Jul 3, 2019

Hi, thanks for getting back so quickly! The issue is flaky, but I got the output of the seldon-cluster-manager. The seldon deployments are updated from status Creating to Available.
However, in cases where our Argo workflow fails, the update (both the Creating and Available status) is in the string value of the kubectl.kubernetes.io/last-applied-configuration field, while on successful completion of the workflow the status updates are accessible via a status key on the same level as apiVersion.
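
The two locations can be compared directly on the live object, e.g. (jq paths as in the snippets above; jq usage is just illustrative):

# status at the outer level, which is what the workflow condition checks
kubectl get sdep seldon-prediction-service -o json | jq '.status'

# status buried inside the last-applied-configuration annotation (a JSON string)
kubectl get sdep seldon-prediction-service -o json \
  | jq -r '.metadata.annotations["kubectl.kubernetes.io/last-applied-configuration"]' \
  | jq '.status'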

Below is the log output when the update ends up in the "kubectl.kubernetes.io/last-applied-configuration" field (snipped):

2019-07-03 14:22:44.490  INFO 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Skipping deletion of app-2320bac svcOrchOnly:true
2019-07-03 14:22:44.490  INFO 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Failed to delete anything from first stage delete so will delete all unsed deployments for app
2019-07-03 14:22:44.541  INFO 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Skipping deletion of app-2320bac svcOrchOnly:false
2019-07-03 14:22:44.542 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.KubeCRDHandlerImpl             : Updating seldondeployment seldon-prediction-service with status state: "Available"
predictorStatus {
  name: "app-2320bac"
  replicas: 1
  replicasAvailable: 1
}

2019-07-03 14:22:54.016 DEBUG 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Only updated cache for seldon-prediction-service
2019-07-03 14:22:54.017 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.SeldonDeploymentWatcher        : MODIFIED
 : {
"apiVersion":"machinelearning.seldon.io/v1alpha2",
"kind":"SeldonDeployment",
"metadata":{
  "annotations":{
    "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"machinelearning.seldon.io/v1alpha2\",
 \"kind\":\"SeldonDeployment\",
 \"metadata\":{
  \"name\":\"seldon-prediction-service\",
  \"status\":{\"state\":\"Available\",
              \"predictorStatus\":[{
                                    \"name\":\"app-2320bac\",
                                    \"replicas\":1,\"replicasAvailable\":1}]}}\n"
}...

@ukclivecox (Contributor)

But are you sure the last-applied-configuration does not refer to the previous state? As I say, this is added automatically by kubectl, I think.
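
You can see exactly what kubectl recorded in that annotation for the most recent apply, e.g.:

# prints only the configuration from the last kubectl apply
kubectl apply view-last-applied sdep/seldon-prediction-service -o yaml

# compare against the full live object
kubectl get sdep seldon-prediction-service -o yaml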


mbron commented Jul 3, 2019

Yeah, last-applied-configuration seems to be added through kubectl apply. But it gets into the Available state anyway. Or do you think it may have been taken from a different deployment? As this is part of our CI/CD, we're running several deployments at the same time (in different namespaces, though).

How does the cluster-manager update the state, i.e., how are these steps executed:

2019-07-03 15:29:26.828 DEBUG 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Pushing updated SeldonDeployment seldon-prediction-service back to kubectl
2019-07-03 15:29:26.831 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.KubeCRDHandlerImpl             : Updating seldondeployment seldon-prediction-service with status state: "Creating"

I restarted the cluster-manager and now the status field is added at the outer level of the deployment definition again, but strangely not in the last-applied-configuration field:

2019-07-03 15:29:58.965  INFO 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Skipping deletion of app-2320bac svcOrchOnly:true
2019-07-03 15:29:58.965  INFO 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Failed to delete anything from first stage delete so will delete all unsed deployments for app
2019-07-03 15:29:59.026  INFO 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Skipping deletion of app-2320bac svcOrchOnly:false
2019-07-03 15:29:59.031 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.KubeCRDHandlerImpl             : Updating seldondeployment seldon-prediction-service with status state: "Available"
predictorStatus {
  name: "app-2320bac"
  replicas: 1
  replicasAvailable: 1
}

2019-07-03 15:30:08.598 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.DeploymentWatcher              : Updating processed resource version to 16140303
2019-07-03 15:30:08.599 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.SeldonDeploymentWatcher        : The time is now 15:30:08
2019-07-03 15:30:08.657 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.SeldonDeploymentWatcher        : MODIFIED
 : {
"apiVersion":"machinelearning.seldon.io/v1alpha2",
"kind":"SeldonDeployment",
"metadata":{
  "annotations":{
    "kubectl.kubernetes.io/last-applied-configuration":" 
      {\"apiVersion\":\"machinelearning.seldon.io/v1alpha2\",
       \"kind\":\"SeldonDeployment\",
       \"metadata\":{\"name\":\"seldon-prediction-service\"}}\n"},
"status":{"predictorStatus":[{"name":"app-2320bac","replicas":1.0,"replicasAvailable":1.0}],
          "state":"Available"}
}


mbron commented Jul 3, 2019

Actually, sometimes we redeploy the cluster manager as well as part of the tests (i.e., helm delete then helm install), but some of the prediction services deployed by a previous instance of the cluster manager might still be running. Could the new cluster manager be caching the older deployment definitions somehow? (Just thinking out loud here.)

@ukclivecox (Contributor)

That is an interesting case. It should be viable to restart the cluster-manager with no ill effects.

@ryandawsonuk (Contributor)

You could try doing a ‘helm upgrade’ instead of deleting and installing again. Also you could add a ‘kubectl rollout status -n seldon-system statefulset/seldon-operator-controller-manager’ to make sure the cluster manager is up.
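
Roughly something like this in the CI script (release and chart names are just placeholders for whatever you use):

# upgrade in place rather than delete + install
helm upgrade --install seldon-core seldon-core-operator --namespace seldon-system

# wait until the operator / cluster-manager is actually ready before deploying models
kubectl rollout status -n seldon-system statefulset/seldon-operator-controller-manager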

@ryandawsonuk (Contributor)

Oh I see your point. I think if the CRD is removed then the deployments of that CRD will be too. You could add some ‘kubectl get sdep’ statements to check whether the CRD is still registered or removed and whether there are seldondeployments hanging around.
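
For example, checks along these lines could go between the helm steps:

# is the SeldonDeployment CRD still registered? (CRD name inferred from the apiVersion above)
kubectl get crd seldondeployments.machinelearning.seldon.io

# are there seldondeployments hanging around from a previous operator instance?
kubectl get sdep --all-namespaces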


mbron commented Jul 4, 2019

Ok, I couldn't recreate this on my local cluster. You're right that all deployments get torn down when removing the CRD. We must be hitting some edge case on our CI/CD cluster.

We'll start using helm upgrade and see if it persists.

We'll also start the process of moving to Seldon3.

I'll report back if it shows up again.

@ukclivecox (Contributor)

Please reopen if still an issue.
