Workflows can get stuck in Running with no backing pod #3006

wreed4 · 2020-05-12T13:41:28Z

Checklist:

I've included the version.
I've included reproduction steps.
I've included the workflow YAML.
I've included the logs.

What happened:
Several times now we have seen a workflow stuck in the Running state, waiting for pods that are nowhere to be found.

....
   └-◷ envs(0)          get-envs   code-redeploy-vc5q2-4265142100  10d       ContainerCreating

(ins)-> k get pods code-redeploy-vc5q2-4265142100
Error from server (NotFound): pods "code-redeploy-vc5q2-4265142100" not found

What you expected to happen:
The workflow should probably be marked as Failed, or the pods should be restarted.

How to reproduce it (as minimally and precisely as possible):
I'm not sure exactly how to reach this state, but it may be related to our cluster not being very stable at the moment, so pods are not always able to launch. They sometimes fail to start because there aren't enough IP addresses. Additionally, we're using spot instances, so a node may disappear out from under the workflow, taking the pod with it. In either case, I would expect the workflow to be resilient to this, either by failing when the pod is killed (if no retry behavior is specified) or by restarting the pod if retry behavior is specified.
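
The expectation described above can be written down as a small decision rule. This is only a sketch of the behavior the reporter asks for, not the Argo controller's actual logic. The `NodeView` type, its fields, and the action names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Phases treated as finished in this sketch.
TERMINAL_PHASES = {"Succeeded", "Failed", "Error", "Skipped"}


@dataclass
class NodeView:
    """Hypothetical, simplified view of one step of a workflow."""
    phase: str                  # phase recorded in the workflow status, e.g. "Pending"
    pod_exists: bool            # whether a pod with the node's name was found
    retry_limit: Optional[int]  # retryStrategy.limit, or None if no retryStrategy is set
    retries_done: int           # retries already made for this step (0 on the first try)


def expected_action(node: NodeView) -> str:
    """Return the action the reporter would expect for this node."""
    if node.phase in TERMINAL_PHASES:
        return "none"   # already finished, nothing to do
    if node.pod_exists:
        return "wait"   # the pod is still there, keep waiting for it
    if node.retry_limit is not None and node.retries_done < node.retry_limit:
        return "retry"  # pod vanished, but retries remain
    return "fail"       # pod vanished and no retries are left (or none configured)
```

For the `envs(0)` step shown earlier (Pending, pod not found, `retryStrategy.limit: 4`, no retries made yet), this rule yields "retry"; for a step with no `retryStrategy`, it yields "fail".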

Anything else we need to know?:

Environment:

Argo version:

$ argo version
argo: v2.7.6+70facdb.dirty
  BuildDate: 2020-04-28T17:08:37Z
  GitCommit: 70facdb67207dbe115a9029e365f8e974e6156bc
  GitTreeState: dirty
  GitTag: v2.7.6
  GoVersion: go1.13.4
  Compiler: gc
  Platform: linux/amd64

Kubernetes version:

$ kubectl version -o yaml
clientVersion:
  buildDate: "2020-04-21T01:25:41Z"
  compiler: gc
  gitCommit: 52c56ce7a8272c798dbc29846288d7cd9fbae032
  gitTreeState: clean
  gitVersion: v1.18.2
  goVersion: go1.13.10
  major: "1"
  minor: "18"
  platform: linux/amd64
serverVersion:
  buildDate: "2019-12-23T08:58:45Z"
  compiler: gc
  gitCommit: eb1860579253bb5bf83a5a03eb0330307ae26d18
  gitTreeState: clean
  gitVersion: v1.13.12-eks-eb1860
  goVersion: go1.11.13
  major: "1"
  minor: 13+
  platform: linux/amd64

Other debugging information (if applicable):

These are old workflows that have been around for days. Unfortunately, I don't think the logs here will help. Also, some of the logs contain sensitive information.

Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

simster7 · 2020-05-12T13:59:24Z

Something that is invaluable when debugging similar errors is the full Workflow object that causes this error after it finishes running (or in this case stops running any further).

You can get it by running kubectl get wf <NAME> -o yaml. If it contains any sensitive company data and you're not comfortable sharing it publicly, you can share it with me privately on the project Slack (@simon).
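
If sharing publicly is preferred, one option is to mask sensitive values before posting. The sketch below is only an illustration, assuming the workflow was exported as JSON (kubectl get wf <NAME> -o json) and loaded with `json.load`. The `redact_workflow` helper and the choice of fields to mask are hypothetical; a real workflow may carry sensitive data in other places (labels, manifests, node messages) that this does not touch.

```python
import copy
from typing import AbstractSet, Any, Dict


def redact_workflow(
    workflow: Dict[str, Any],
    keep_params: AbstractSet[str] = frozenset(),
) -> Dict[str, Any]:
    """Return a copy of the workflow with the namespace and parameter values masked.

    Parameters whose names are listed in keep_params keep their values.
    The input dictionary is left unmodified.
    """
    redacted = copy.deepcopy(workflow)

    # Mask the namespace if one is recorded in the metadata.
    metadata = redacted.get("metadata") or {}
    if "namespace" in metadata:
        metadata["namespace"] = "REDACTED"

    # Mask the values of workflow-level parameters under spec.arguments.parameters.
    arguments = (redacted.get("spec") or {}).get("arguments") or {}
    for param in arguments.get("parameters") or []:
        if "value" in param and param.get("name") not in keep_params:
            param["value"] = "REDACTED"

    return redacted
```

Keeping a short allow-list (for example `{"gitRef", "phpVersion"}`) rather than a deny-list means a newly added parameter is masked by default.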

blkperl · 2020-05-21T23:47:23Z

@simster7 This is the workflow @wreed4 was referring to. I've changed a few lines to REDACTED, but the majority is present.

k get wf code-redeploy-vc5q2 -n REDACTED -o yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  creationTimestamp: "2020-05-01T20:46:49Z"
  finalizers:
  - REDACTED/send-complete
  generateName: code-redeploy-
  generation: 5
  labels:
    REDACTED/correlationUUID: 27f14c3b-619d-4196-b6d4-de6a7cadb5ea
    REDACTED/responseTopic: REDACTED
    REDACTED/taskUUID: 7a04c01a-9734-4858-93ac-5812c6b2afdd
    REDACTED/type: CodeRedeploy
    workflows.argoproj.io/phase: Running
  name: code-redeploy-vc5q2
  namespace: REDACTED
  resourceVersion: "55021876"
  selfLink: /apis/argoproj.io/v1alpha1/namespaces/REDACTED/workflows/code-redeploy-vc5q2
  uid: e179b3cf-8bec-11ea-9a72-02e84e9ba5f5
spec:
  arguments:
    parameters:
    - name: taskstepImage
      value: REDACTED:latest
    - name: namespace
      value: REDACTED
    - name: gitRef
      value: refs/heads/d8-install
    - name: imageRepo
      value: REDACTED/customer/REDACTED/REDACTED
    - name: imageTag
      value: refs..heads..d8-install..00aaab246a93a73359d93d8aa94858eea9ca7bb3
    - name: phpVersion
      value: "7.3"
    - name: repoURL
      value: REDACTED@REDACTED:REDACTED.git
  entrypoint: main
  onExit: exit
  serviceAccountName: task-code-redeploy
  templates:
  - arguments: {}
    inputs: {}
    metadata: {}
    name: main
    outputs: {}
    steps:
    - - arguments: {}
        name: run-cib
        template: run-cib
      - arguments: {}
        name: envs
        template: get-envs
      - arguments: {}
        name: crons
        template: get-crons
    - - arguments:
          parameters:
          - name: name
            value: '{{item.name}}'
          - name: namespace
            value: '{{item.namespace}}'
          - name: suspend
            value: "true"
        name: suspend-crons
        template: suspend-crons
        withParam: '{{steps.crons.outputs.parameters.json}}'
    - - arguments:
          parameters:
          - name: name
            value: '{{item.name}}'
          - name: namespace
            value: '{{item.namespace}}'
        name: redeploy-drenvs
        template: redeploy-env
        withParam: '{{steps.envs.outputs.parameters.json}}'
  - arguments: {}
    dag:
      tasks:
      - arguments: {}
        name: delete-cib
        template: clean-up-cib
      - arguments:
          parameters:
          - name: name
            value: '{{item.name}}'
          - name: namespace
            value: '{{item.namespace}}'
          - name: suspend
            value: '{{item.suspend}}'
        name: restore-crons
        template: suspend-crons
        withParam: '{{workflow.outputs.parameters.crons}}'
    inputs: {}
    metadata: {}
    name: exit
    outputs: {}
  - arguments: {}
    inputs:
      parameters:
      - name: gitSshSecret
        value: git-ssh
    metadata: {}
    name: run-cib
    outputs: {}
    resource:
      action: create
      failureCondition: status.job.failed > 0
      manifest: |
        apiVersion: REDACTED/v1alpha1
        kind: CustomerImageBuild
        metadata:
          name: code-redeploy-{{workflow.uid}}
          namespace: {{workflow.parameters.namespace}}
        spec:
          gitRef: {{workflow.parameters.gitRef}}
          gitSshSecret: {{inputs.parameters.gitSshSecret}}
          imageRepo: {{workflow.parameters.imageRepo}}
          imageTag: {{workflow.parameters.imageTag}}
          phpVersion: "{{workflow.parameters.phpVersion}}"
          repoURL: {{workflow.parameters.repoURL}}
      successCondition: status.job.succeeded > 0
  - arguments: {}
    container:
      args:
      - get-envs-by-git
      - '{{workflow.parameters.repoURL}}'
      - '{{workflow.parameters.gitRef}}'
      image: '{{workflow.parameters.taskstepImage}}'
      name: ""
      resources: {}
    inputs: {}
    metadata: {}
    name: get-envs
    outputs:
      parameters:
      - name: json
        valueFrom:
          path: /tmp/out.json
    retryStrategy:
      limit: 4
  - arguments: {}
    container:
      args:
      - get-crons-by-git
      - '{{workflow.parameters.repoURL}}'
      - '{{workflow.parameters.gitRef}}'
      image: '{{workflow.parameters.taskstepImage}}'
      name: ""
      resources: {}
    inputs: {}
    metadata: {}
    name: get-crons
    outputs:
      parameters:
      - globalName: crons
        name: json
        valueFrom:
          path: /tmp/out.json
    retryStrategy:
      limit: 4
  - activeDeadlineSeconds: 180
    arguments: {}
    inputs:
      parameters:
      - name: name
      - name: namespace
    metadata: {}
    name: redeploy-env
    outputs: {}
    resource:
      action: patch
      manifest: |
        apiVersion: REDACTED/v1alpha1
        kind: DrupalEnvironment
        metadata:
          name: {{inputs.parameters.name}}
          namespace: {{inputs.parameters.namespace}}
        spec:
          drupal:
            tag: {{workflow.parameters.imageTag}}
      mergeStrategy: merge
      successCondition: status.status=Synced
  - activeDeadlineSeconds: 180
    arguments: {}
    inputs:
      parameters:
      - name: name
      - name: namespace
      - name: suspend
    metadata: {}
    name: suspend-crons
    outputs: {}
    resource:
      action: patch
      manifest: |
        apiVersion: REDACTED/v1alpha1
        kind: Command
        metadata:
          name: {{inputs.parameters.name}}
          namespace: {{inputs.parameters.namespace}}
        spec:
          suspend: {{inputs.parameters.suspend}}
      mergeStrategy: merge
    retryStrategy:
      limit: 2
  - arguments: {}
    inputs: {}
    metadata: {}
    name: clean-up-cib
    outputs: {}
    resource:
      action: delete
      manifest: |
        apiVersion: REDACTED/v1alpha1
        kind: CustomerImageBuild
        metadata:
          name: code-redeploy-{{workflow.uid}}
          namespace: {{workflow.parameters.namespace}}
  ttlSecondsAfterFinished: 300
  ttlStrategy:
    secondsAfterCompletion: 300
    secondsAfterFailure: 259200
    secondsAfterSuccess: 259200
status:
  finishedAt: null
  nodes:
    code-redeploy-vc5q2:
      children:
      - code-redeploy-vc5q2-2656111545
      displayName: code-redeploy-vc5q2
      finishedAt: null
      id: code-redeploy-vc5q2
      name: code-redeploy-vc5q2
      phase: Running
      startedAt: "2020-05-01T20:46:49Z"
      templateName: main
      type: Steps
    code-redeploy-vc5q2-426105205:
      boundaryID: code-redeploy-vc5q2
      children:
      - code-redeploy-vc5q2-4265142100
      displayName: envs
      finishedAt: null
      id: code-redeploy-vc5q2-426105205
      name: code-redeploy-vc5q2[0].envs
      phase: Running
      startedAt: "2020-05-01T20:46:50Z"
      templateName: get-envs
      type: Retry
    code-redeploy-vc5q2-1002029982:
      boundaryID: code-redeploy-vc5q2
      children:
      - code-redeploy-vc5q2-1446995917
      displayName: crons
      finishedAt: null
      id: code-redeploy-vc5q2-1002029982
      name: code-redeploy-vc5q2[0].crons
      phase: Running
      startedAt: "2020-05-01T20:46:50Z"
      templateName: get-crons
      type: Retry
    code-redeploy-vc5q2-1446995917:
      boundaryID: code-redeploy-vc5q2
      displayName: crons(0)
      finishedAt: null
      id: code-redeploy-vc5q2-1446995917
      message: ContainerCreating
      name: code-redeploy-vc5q2[0].crons(0)
      phase: Pending
      startedAt: "2020-05-01T20:46:50Z"
      templateName: get-crons
      type: Pod
    code-redeploy-vc5q2-2656111545:
      boundaryID: code-redeploy-vc5q2
      children:
      - code-redeploy-vc5q2-3252789097
      - code-redeploy-vc5q2-426105205
      - code-redeploy-vc5q2-1002029982
      displayName: '[0]'
      finishedAt: null
      id: code-redeploy-vc5q2-2656111545
      name: code-redeploy-vc5q2[0]
      phase: Running
      startedAt: "2020-05-01T20:46:49Z"
      templateName: main
      type: StepGroup
    code-redeploy-vc5q2-3252789097:
      boundaryID: code-redeploy-vc5q2
      displayName: run-cib
      finishedAt: null
      id: code-redeploy-vc5q2-3252789097
      inputs:
        parameters:
        - name: gitSshSecret
          value: git-ssh
      message: ContainerCreating
      name: code-redeploy-vc5q2[0].run-cib
      phase: Pending
      startedAt: "2020-05-01T20:46:49Z"
      templateName: run-cib
      type: Pod
    code-redeploy-vc5q2-4265142100:
      boundaryID: code-redeploy-vc5q2
      displayName: envs(0)
      finishedAt: null
      id: code-redeploy-vc5q2-4265142100
      message: ContainerCreating
      name: code-redeploy-vc5q2[0].envs(0)
      phase: Pending
      startedAt: "2020-05-01T20:46:50Z"
      templateName: get-envs
      type: Pod
  phase: Running
  startedAt: "2020-05-01T20:46:49Z"
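
The mismatch in this dump (Pod-type nodes still Pending while kubectl reports the pod as NotFound) can be checked mechanically. The following is a minimal sketch, assuming the workflow has been loaded into a dictionary and that the pod name equals the node id, as it does for `code-redeploy-vc5q2-4265142100` in this issue. The `orphaned_pod_nodes` helper is hypothetical, and the list of existing pod names would have to come from the cluster.

```python
from typing import Any, Dict, Iterable, List

# Phases treated as finished in this sketch.
TERMINAL_PHASES = {"Succeeded", "Failed", "Error", "Skipped"}


def orphaned_pod_nodes(
    workflow: Dict[str, Any],
    existing_pods: Iterable[str],
) -> List[str]:
    """Return ids of unfinished Pod nodes that have no pod with the same name."""
    existing = set(existing_pods)
    nodes = (workflow.get("status") or {}).get("nodes") or {}
    orphans = []
    for node_id, node in nodes.items():
        if node.get("type") != "Pod":
            continue  # Steps, StepGroup and Retry nodes have no pod of their own
        if node.get("phase") in TERMINAL_PHASES:
            continue  # finished nodes may legitimately have had their pod removed
        if node_id not in existing:
            orphans.append(node_id)
    return sorted(orphans)


# Trimmed-down version of the status shown above.
example = {
    "status": {
        "nodes": {
            "code-redeploy-vc5q2": {"type": "Steps", "phase": "Running"},
            "code-redeploy-vc5q2-426105205": {"type": "Retry", "phase": "Running"},
            "code-redeploy-vc5q2-4265142100": {"type": "Pod", "phase": "Pending"},
        }
    }
}
print(orphaned_pod_nodes(example, existing_pods=[]))
# prints ['code-redeploy-vc5q2-4265142100']
```

Passing an empty pod list mirrors the situation in this issue, where none of the Pending nodes had a backing pod.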

antoniomo · 2020-05-22T09:10:49Z

We have experienced this as well. We think it's related to the cluster not having enough resources to run the pods, or to pods getting OOM-killed. The workflow controller seems to think the workflows are still running, but of course they aren't, and they'll never complete.

antoniomo · 2020-05-22T09:30:41Z

In our case, we were running Argo 2.7.0, so it might be related to #2711. We are now testing with 2.8.0, after #2721.

simster7 · 2020-05-22T15:30:27Z

Thanks, I'll take a look at this

blkperl · 2020-05-22T15:49:03Z

We will try updating to 2.8 as well.

simster7 · 2020-06-02T20:23:42Z

> In our case, we were running Argo 2.7.0, so it might be related to #2711. We are now testing with 2.8.0, after #2721.

> We will try updating to 2.8 as well.

Any news with trying 2.8 before I begin investigating?

antoniomo · 2020-06-03T05:48:24Z

I haven't seen the issue of Running workflows with no pod again: all my "stuck in Running" workflows now do have a pod. This is anecdotal, though. We see this in a test cluster that is very starved of resources, so workflow pods get OOM-killed quite often. However, the pods are there now as far as I can see.

alexec · 2020-07-16T04:11:08Z

@antoniomo is this still an issue please?

antoniomo · 2020-07-17T07:16:36Z

Hi, as reported, we haven't seen this again as described. However, by "we" I mean my team; we aren't the original poster of the issue (that would be @wreed4). Sorry if I caused confusion. For my team, the version update solved an issue that seemed to be the same.

blkperl · 2020-07-17T19:39:30Z

@wreed4's team hasn't seen it since upgrading either. I think we can close this out.

wreed4 added the type/bug label May 12, 2020

simster7 self-assigned this May 22, 2020

simster7 removed their assignment Jul 10, 2020

alexec added the need information label Jul 16, 2020

alexec added investigate and removed need information labels Jul 17, 2020

simster7 closed this as completed Jul 17, 2020
