argo workflow not detecting seldon deployment is available #668

Closed

mbron opened this issue Jul 2, 2019 · 10 comments


mbron commented Jul 2, 2019

Hi, we're having an issue with our Argo workflow sometimes not detecting the 'status.state' == 'Available' condition in the SeldonDeployment resource definition. We're using Seldon Core v0.2.7.

Below is an abbreviated Argo workflow definition; note the successCondition: status.state == Available condition.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
spec:
  templates:
  - name: main            # entrypoint template name assumed; omitted in the abbreviated original
    dag:
      tasks:
      - name: deploy-model
        template: deploy-model

  - name: deploy-model
    resource:
      action: apply
      successCondition: status.state == Available
      manifest: |
        apiVersion: machinelearning.seldon.io/v1alpha2
        kind: SeldonDeployment
        metadata:
          labels:
            app: seldon

So when the workflow successfully detects that the deployment is available, the workflow logs contain the following JSON (snipped):

"{
  \"apiVersion\": \"machinelearning.seldon.io/v1alpha2\",
  \"kind\": \"SeldonDeployment\",
  \"metadata\": {
    \"annotations\": {
      \"kubectl.kubernetes.io/last-applied-configuration\": \"{\\\"apiVersion\\\":\\\"machinelearning.seldon.io/v1alpha2\\\",\\\"kind\\\":\\\"SeldonDeployment\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"labels\\\":{\\\"app\\\":\\\"seldon\\\"}, ... }\\n\"},
      \"creationTimestamp\": \"2019-07-01T10:44:59Z\",
      \"generation\": 1,
      \"labels\": {\"app\": \"seldon\"},
      \"name\": \"seldon-prediction-service\",
      \"resourceVersion\": \"15667665\",
      \"spec\": {
        \"predictors\": [...]
},

\"status\": {\"predictorStatus\": [{\"replicas\": 1,\"replicasAvailable\": 1}],\"state\": \"Available\"}}"

Note the last line: at the outer level of the resource description JSON there is a "status" key whose state has the value "Available".

Now let's look at the argo log output when the workflow does not detect that the deployment is available (snipped):

"{
  \"apiVersion\": \"machinelearning.seldon.io/v1alpha2\",
  \"kind\": \"SeldonDeployment\",
  \"metadata\": {
    \"annotations\": {
    
      \"kubectl.kubernetes.io/last-applied-configuration\": \"{\\\"apiVersion\\\":\\\"machinelearning.seldon.io/v1alpha2\\\",\\\"kind\\\":\\\"SeldonDeployment\\\",\\\"metadata\\\":{\\\"name\\\":\\\"seldon-prediction-service\\\",\\\"generation\\\":2,\\\"creationTimestamp\\\":\\\"2019-06-29T17:32:40Z\\\",\\\"labels\\\":{\\\"app\\\":\\\"seldon\\\"}},\\\"spec\\\":{\\\"predictors\\\":[{\\\"graph\\\":{\\\"type\\\":\\\"MODEL\\\",\\\"endpoint\\\":{\\\"type\\\":\\\"REST\\\"}},\\\"componentSpecs\\\":[{\\\"spec\\\":{...}}],\\\"containers\\\":[{\\\"resources\\\":{\\\"requests\\\":{\\\"memory\\\":\\\"1Mi\\\"}},\\\"imagePullPolicy\\\":\\\"Always\\\"}],\\\"terminationGracePeriodSeconds\\\":20}}],\\\"replicas\\\":1,\\\"annotations\\\":{\\\"predictor_version\\\":\\\"0.0.1\\\"}}],\\\"annotations\\\":{\\\"deployment_version\\\":\\\"0.1\\\"}},\\\"status\\\":{\\\"state\\\":\\\"Available\\\",\\\"predictorStatus\\\":[{\\\"replicas\\\":1,\\\"replicasAvailable\\\":1}]}}\\n\"},
      
    \"creationTimestamp\": \"2019-06-29T17:32:40Z\",
    \"generation\": 2,
    \"labels\": {\"app\": \"seldon\"},
    \"name\": \"seldon-prediction-service\",
    \"resourceVersion\": \"15382733\",
    \"uid\": \"e4e3c2db-9a93-11e9-bfff-6c0b8465aea7\"},
  \"spec\": {\"predictors\": [...]}
}"

Note that the {"status": {"state": "Available"}} condition is actually in the description; however, it is no longer at the outer level but part of the value of the "kubectl.kubernetes.io/last-applied-configuration" key, which is a string (not JSON).

So the question is: why does the condition appear in different places some of the time?

@ukclivecox (Contributor)

The last-applied-configuration annotation is added automatically by k8s when a resource is updated. If the status is not there, then there was an issue with the update.
Is there no status field at all? Can you check the logs of the cluster-manager?
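
For example, something like this should show whether the status is populated at all on the live object (resource and deployment names taken from your snippets; the cluster-manager namespace/name will depend on your install):

# top-level status field, which is what the successCondition is evaluated against
kubectl get sdep seldon-prediction-service -o jsonpath='{.status.state}'

# tail the cluster-manager logs (deployment name and namespace assumed, adjust to your install)
kubectl logs deploy/seldon-cluster-manager -n seldon-system --tail=100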


mbron commented Jul 3, 2019

Hi, thanks for getting back so quickly! The issue is flaky, but I got the output of the seldon-cluster-manager. The seldon deployments are updated from status Creating to Available.
However, in cases where our Argo workflow fails, the update (both the Creating and Available status) is in the string value of the kubectl.kubernetes.io/last-applied-configuration field, while on successful completion of the workflow the status updates are accessible via a status key on the same level as apiVersion.
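
The two locations can be compared directly on the live object, e.g. (jq paths as in the snippets above; jq usage is just illustrative):

# status at the outer level, which is what the workflow condition checks
kubectl get sdep seldon-prediction-service -o json | jq '.status'

# status buried inside the last-applied-configuration annotation (a JSON string)
kubectl get sdep seldon-prediction-service -o json \
  | jq -r '.metadata.annotations["kubectl.kubernetes.io/last-applied-configuration"]' \
  | jq '.status'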

Below is the log output when the update ends up in the "kubectl.kubernetes.io/last-applied-configuration" field (snipped):

2019-07-03 14:22:44.490  INFO 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Skipping deletion of app-2320bac svcOrchOnly:true
2019-07-03 14:22:44.490  INFO 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Failed to delete anything from first stage delete so will delete all unsed deployments for app
2019-07-03 14:22:44.541  INFO 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Skipping deletion of app-2320bac svcOrchOnly:false
2019-07-03 14:22:44.542 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.KubeCRDHandlerImpl             : Updating seldondeployment seldon-prediction-service with status state: "Available"
predictorStatus {
  name: "app-2320bac"
  replicas: 1
  replicasAvailable: 1
}

2019-07-03 14:22:54.016 DEBUG 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Only updated cache for seldon-prediction-service
2019-07-03 14:22:54.017 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.SeldonDeploymentWatcher        : MODIFIED
 : {
"apiVersion":"machinelearning.seldon.io/v1alpha2",
"kind":"SeldonDeployment",
"metadata":{
  "annotations":{
    "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"machinelearning.seldon.io/v1alpha2\",
 \"kind\":\"SeldonDeployment\",
 \"metadata\":{
  \"name\":\"seldon-prediction-service\",
  \"status\":{\"state\":\"Available\",
              \"predictorStatus\":[{
                                    \"name\":\"app-2320bac\",
                                    \"replicas\":1,\"replicasAvailable\":1}]}}\n"
}...

@ukclivecox (Contributor)

But are you sure the last-applied-configuration does not refer to the previous state? As I say, this is added automatically by kubectl, I think.
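
You can see exactly what kubectl recorded in that annotation for the most recent apply, e.g.:

# prints only the configuration from the last kubectl apply
kubectl apply view-last-applied sdep/seldon-prediction-service -o yaml

# compare against the full live object
kubectl get sdep seldon-prediction-service -o yaml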


mbron commented Jul 3, 2019

Yeah, last-applied-configuration seems to be added through kubectl apply. But it gets into the Available state anyway. Or do you think it may have been taken from a different deployment? As this is part of our CI/CD, we're running several deployments at the same time (in different namespaces, though).

How does the cluster-manager update the state, i.e., how are these steps executed:

2019-07-03 15:29:26.828 DEBUG 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Pushing updated SeldonDeployment seldon-prediction-service back to kubectl
2019-07-03 15:29:26.831 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.KubeCRDHandlerImpl             : Updating seldondeployment seldon-prediction-service with status state: "Creating"

I restarted the cluster-manager and now the status field is added at the outer level of the deployment definition again, but strangely not in the last-applied-configuration field:

2019-07-03 15:29:58.965  INFO 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Skipping deletion of app-2320bac svcOrchOnly:true
2019-07-03 15:29:58.965  INFO 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Failed to delete anything from first stage delete so will delete all unsed deployments for app
2019-07-03 15:29:59.026  INFO 1 --- [pool-1-thread-1] i.s.c.k.SeldonDeploymentControllerImpl   : Skipping deletion of app-2320bac svcOrchOnly:false
2019-07-03 15:29:59.031 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.KubeCRDHandlerImpl             : Updating seldondeployment seldon-prediction-service with status state: "Available"
predictorStatus {
  name: "app-2320bac"
  replicas: 1
  replicasAvailable: 1
}

2019-07-03 15:30:08.598 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.DeploymentWatcher              : Updating processed resource version to 16140303
2019-07-03 15:30:08.599 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.SeldonDeploymentWatcher        : The time is now 15:30:08
2019-07-03 15:30:08.657 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.SeldonDeploymentWatcher        : MODIFIED
 : {
"apiVersion":"machinelearning.seldon.io/v1alpha2",
"kind":"SeldonDeployment",
"metadata":{
  "annotations":{
    "kubectl.kubernetes.io/last-applied-configuration":" 
      {\"apiVersion\":\"machinelearning.seldon.io/v1alpha2\",
       \"kind\":\"SeldonDeployment\",
       \"metadata\":{\"name\":\"seldon-prediction-service\"}}\n"},
"status":{"predictorStatus":[{"name":"app-2320bac","replicas":1.0,"replicasAvailable":1.0}],
          "state":"Available"}
}


mbron commented Jul 3, 2019

Actually, sometimes we redeploy the cluster manager as well as part of the tests (i.e., helm delete then helm install), but some of the prediction services deployed by a previous instance of the cluster manager might still be running. Could the new cluster manager be caching the older deployment definitions somehow? (Just thinking out loud here.)

@ukclivecox (Contributor)

That is an interesting case. It should be viable to restart the cluster-manager with no ill effects.

@ryandawsonuk (Contributor)

You could try doing a ‘helm upgrade’ instead of deleting and installing again. Also you could add a ‘kubectl rollout status -n seldon-system statefulset/seldon-operator-controller-manager’ to make sure the cluster manager is up.
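
Roughly something like this in the CI script (release and chart names are just placeholders for whatever you use):

# upgrade in place rather than delete + install
helm upgrade --install seldon-core seldon-core-operator --namespace seldon-system

# wait until the operator / cluster-manager is actually ready before deploying models
kubectl rollout status -n seldon-system statefulset/seldon-operator-controller-manager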

@ryandawsonuk (Contributor)

Oh I see your point. I think if the CRD is removed then the deployments of that CRD will be too. You could add some ‘kubectl get sdep’ statements to check whether the CRD is still registered or removed and whether there are seldondeployments hanging around.
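
For example, checks along these lines could go between the helm steps:

# is the SeldonDeployment CRD still registered? (CRD name inferred from the apiVersion above)
kubectl get crd seldondeployments.machinelearning.seldon.io

# are there seldondeployments hanging around from a previous operator instance?
kubectl get sdep --all-namespaces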


mbron commented Jul 4, 2019

Ok, I couldn't recreate this on my local cluster. You're right that all deployments get torn down when removing the CRD. We must be hitting some edge case on our CI/CD cluster.

We'll start using helm upgrade and see if it persists.

We'll also start the process of moving to Seldon3.

I'll report back if it shows up again.

@ukclivecox (Contributor)

Please reopen if still an issue.
