
DAG hanging #3739

Closed
eestewart opened this issue Aug 11, 2020 · 9 comments · Fixed by #3740
Labels: type/bug, type/regression (Regression from previous behavior, a specific type of bug)


eestewart commented Aug 11, 2020

Summary

What happened/what you expected to happen?
The workflow DAG hangs when one task fans out over a list of scheduled jobs, no jobs are scheduled (the list is empty), and a subsequent task depends on that task. The expectation is that the workflow completes; instead the DAG node stays Running even after all pods succeed.

example_dag_workflow.yaml

Diagnostics

What version of Argo Workflows are you running? v2.9.5

Paste the workflow here, including status:
kubectl get wf -o yaml ${workflow} 

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  creationTimestamp: "2020-08-11T18:19:28Z"
  generateName: dag-hang-
  generation: 9
  labels:
    workflows.argoproj.io/phase: Running
  name: dag-hang-pcwmr
  namespace: argo
  resourceVersion: "246946368"
  selfLink: /apis/argoproj.io/v1alpha1/namespaces/argo/workflows/dag-hang-pcwmr
  uid: 7a6b50fa-afdd-4843-b226-54170ea849fa
spec:
  arguments: {}
  entrypoint: dag
  templates:
  - arguments: {}
    dag:
      tasks:
      - arguments: {}
        name: scheduler
        template: job-scheduler
      - arguments:
          parameters:
          - name: job_name
            value: '{{item.job_name}}'
        dependencies:
        - scheduler
        name: children
        template: whalesay
        withParam: '{{tasks.scheduler.outputs.parameters.scheduled-jobs}}'
      - arguments: {}
        dependencies:
        - children
        name: postprocess
        template: whalesay
    inputs: {}
    metadata: {}
    name: dag
    outputs: {}
  - arguments: {}
    container:
      args:
      - echo Decided not to schedule any jobs
      command:
      - sh
      - -c
      image: alpine:latest
      name: ""
      resources: {}
    inputs: {}
    metadata: {}
    name: job-scheduler
    outputs:
      parameters:
      - name: scheduled-jobs
        value: '[]'
  - arguments: {}
    container:
      args:
      - hello world
      command:
      - cowsay
      image: docker/whalesay
      name: ""
      resources: {}
    inputs: {}
    metadata: {}
    name: whalesay
    outputs: {}
status:
  finishedAt: null
  nodes:
    dag-hang-pcwmr:
      children:
      - dag-hang-pcwmr-1789179473
      displayName: dag-hang-pcwmr
      finishedAt: null
      id: dag-hang-pcwmr
      name: dag-hang-pcwmr
      phase: Running
      startedAt: "2020-08-11T18:19:28Z"
      templateName: dag
      templateScope: local/dag-hang-pcwmr
      type: DAG
    dag-hang-pcwmr-1415348083:
      boundaryID: dag-hang-pcwmr
      displayName: postprocess
      finishedAt: "2020-08-11T18:19:40Z"
      hostNodeName: dech117
      id: dag-hang-pcwmr-1415348083
      name: dag-hang-pcwmr.postprocess
      outputs:
        exitCode: "0"
      phase: Succeeded
      resourcesDuration:
        cpu: 3
        memory: 3
      startedAt: "2020-08-11T18:19:36Z"
      templateName: whalesay
      templateScope: local/dag-hang-pcwmr
      type: Pod
    dag-hang-pcwmr-1738428083:
      boundaryID: dag-hang-pcwmr
      children:
      - dag-hang-pcwmr-1415348083
      displayName: children
      finishedAt: "2020-08-11T18:19:35Z"
      id: dag-hang-pcwmr-1738428083
      name: dag-hang-pcwmr.children
      phase: Succeeded
      startedAt: "2020-08-11T18:19:35Z"
      templateName: whalesay
      templateScope: local/dag-hang-pcwmr
      type: TaskGroup
    dag-hang-pcwmr-1789179473:
      boundaryID: dag-hang-pcwmr
      children:
      - dag-hang-pcwmr-1738428083
      displayName: scheduler
      finishedAt: "2020-08-11T18:19:33Z"
      hostNodeName: dech113
      id: dag-hang-pcwmr-1789179473
      name: dag-hang-pcwmr.scheduler
      outputs:
        exitCode: "0"
        parameters:
        - name: scheduled-jobs
          value: '[]'
      phase: Succeeded
      resourcesDuration:
        cpu: 4
        memory: 4
      startedAt: "2020-08-11T18:19:28Z"
      templateName: job-scheduler
      templateScope: local/dag-hang-pcwmr
      type: Pod
  phase: Running
  startedAt: "2020-08-11T18:19:28Z"
Paste the logs from the workflow controller:
kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name) | grep ${workflow}
time="2020-08-11T18:19:28Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:28Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:28Z" level=info msg="DAG node dag-hang-pcwmr initialized Running" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:28Z" level=info msg="All of node dag-hang-pcwmr.scheduler dependencies [] completed" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:28Z" level=info msg="Pod node dag-hang-pcwmr-1789179473 initialized Pending" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:28Z" level=info msg="Created pod: dag-hang-pcwmr.scheduler (dag-hang-pcwmr-1789179473)" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:28Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946246 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:29Z" level=info msg="insignificant pod change" key=argo/dag-hang-pcwmr-1789179473
time="2020-08-11T18:19:29Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:29Z" level=info msg="Updating node dag-hang-pcwmr-1789179473 message: ContainerCreating"
time="2020-08-11T18:19:29Z" level=info msg="Skipped pod dag-hang-pcwmr.scheduler (dag-hang-pcwmr-1789179473) creation: already exists" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:29Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946257 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:30Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:30Z" level=info msg="Skipped pod dag-hang-pcwmr.scheduler (dag-hang-pcwmr-1789179473) creation: already exists" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:33Z" level=info msg="insignificant pod change" key=argo/dag-hang-pcwmr-1789179473
time="2020-08-11T18:19:34Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:34Z" level=info msg="Setting node dag-hang-pcwmr-1789179473 outputs"
time="2020-08-11T18:19:34Z" level=info msg="Updating node dag-hang-pcwmr-1789179473 status Pending -> Running"
time="2020-08-11T18:19:34Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946296 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:35Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:35Z" level=info msg="Updating node dag-hang-pcwmr-1789179473 status Running -> Succeeded"
time="2020-08-11T18:19:35Z" level=info msg="TaskGroup node dag-hang-pcwmr-1738428083 initialized Running (message: )" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:35Z" level=info msg="node dag-hang-pcwmr-1738428083 phase Running -> Succeeded" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:35Z" level=info msg="node dag-hang-pcwmr-1738428083 finished: 2020-08-11 18:19:35.424262116 +0000 UTC" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:35Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946305 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:36Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:36Z" level=info msg="All of node dag-hang-pcwmr.postprocess dependencies [children] completed" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:36Z" level=info msg="Pod node dag-hang-pcwmr-1415348083 initialized Pending" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:36Z" level=info msg="Labeled pod argo/dag-hang-pcwmr-1789179473 completed"
time="2020-08-11T18:19:36Z" level=info msg="Created pod: dag-hang-pcwmr.postprocess (dag-hang-pcwmr-1415348083)" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:36Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946318 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:37Z" level=info msg="insignificant pod change" key=argo/dag-hang-pcwmr-1415348083
time="2020-08-11T18:19:37Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:37Z" level=info msg="Labeled pod argo/dag-hang-pcwmr-1789179473 completed"
time="2020-08-11T18:19:37Z" level=info msg="Updating node dag-hang-pcwmr-1415348083 message: ContainerCreating"
time="2020-08-11T18:19:37Z" level=info msg="Skipped pod dag-hang-pcwmr.postprocess (dag-hang-pcwmr-1415348083) creation: already exists" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:37Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946328 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:38Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:38Z" level=info msg="Skipped pod dag-hang-pcwmr.postprocess (dag-hang-pcwmr-1415348083) creation: already exists" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:40Z" level=info msg="insignificant pod change" key=argo/dag-hang-pcwmr-1415348083
time="2020-08-11T18:19:41Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:41Z" level=info msg="Setting node dag-hang-pcwmr-1415348083 outputs"
time="2020-08-11T18:19:41Z" level=info msg="Updating node dag-hang-pcwmr-1415348083 status Pending -> Running"
time="2020-08-11T18:19:41Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946358 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:42Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:42Z" level=info msg="Updating node dag-hang-pcwmr-1415348083 status Running -> Succeeded"
time="2020-08-11T18:19:42Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=246946368 workflow=dag-hang-pcwmr
time="2020-08-11T18:19:43Z" level=info msg="Processing workflow" namespace=argo workflow=dag-hang-pcwmr
time="2020-08-11T18:19:43Z" level=info msg="Labeled pod argo/dag-hang-pcwmr-1415348083 completed"

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec alexec added the type/regression Regression from previous behavior (a specific type of bug) label Aug 11, 2020
alexec (Contributor) commented Aug 11, 2020

@simster7 created a fix that was released in v2.9.5.

Can you please run the controller with the environment variable ALL_POD_CHANGES_SIGNIFICANT=true and see if that addresses this issue?
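
For reference, a minimal sketch of how that variable could be set on the controller (the Deployment name and the argo namespace are assumptions based on a default install; adjust to your setup, e.g. via kubectl edit deployment workflow-controller -n argo):

# Sketch: env entry added to the workflow-controller container
# in the controller Deployment (assumed default install)
spec:
  template:
    spec:
      containers:
      - name: workflow-controller
        env:
        - name: ALL_POD_CHANGES_SIGNIFICANT
          value: "true"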

@simster7 simster7 self-assigned this Aug 11, 2020
simster7 (Member) commented:

I'll investigate this today

simster7 (Member) commented:

Confirmed this as a bug, working on a fix

alexec (Contributor) commented Aug 11, 2020

Thank you @simster7

simster7 (Member) commented:

Note: this bug is a very unusual edge case. For this bug to happen, all of these conditions must be satisfied (a minimal sketch follows the list):

  1. Task A is a TaskGroup (i.e. it uses withItems, withParam, etc.)
  2. Task A expands to zero tasks (e.g. withParam evaluates to '[]' or an empty string)
  3. Task B depends on Task A
  4. Task B is a target task (i.e. it is a leaf task in the DAG)
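
For illustration, here is a minimal workflow that satisfies all four conditions (hypothetical names, modeled on the reproductions in this thread); on affected versions a DAG like this stays in the Running phase even after its pods succeed:

# Sketch only: task "a" is a TaskGroup that expands over an empty list,
# and leaf task "b" depends on it
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: empty-taskgroup-
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: generate                # emits an empty JSON list
        template: emit-empty-list
      - name: a                       # conditions 1 and 2: TaskGroup with zero items
        dependencies: [generate]
        template: echo
        withParam: '{{tasks.generate.outputs.result}}'
      - name: b                       # conditions 3 and 4: leaf task depending on Task A
        dependencies: [a]
        template: echo
  - name: emit-empty-list
    script:
      image: python:3
      command: [python]
      source: |
        print("[]")
  - name: echo
    container:
      image: alpine:3.7
      command: [sh, -c, "echo hello"]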

fangpenlin commented:

I just happened to run into the same bug: my generator script returns an empty JSON payload. Here's the YAML I used to reproduce it:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: reproduce-bug-
spec:
  entrypoint: start
  templates:
  - name: start
    dag:
      tasks:
      - name: first-step
        template: first-step
      - name: second-step
        dependencies: [first-step]
        template: second-step
        arguments:
          parameters:
          - name: chunk
            value: "{{item}}"
        withParam: "{{tasks.first-step.outputs.result}}"
      - name: third-step
        dependencies: [second-step]
        template: third-step

  - name: first-step
    script:
      image: python:3
      command: [bash]
      source: |
        set -e
        python3 -c 'import os, json; print(json.dumps([]))'
      resources:
        requests:
          cpu: 250m

  - name: second-step
    inputs:
      parameters:
      - name: chunk
    script:
      image: python:3
      command: [bash]
      source: |
        set -e
        echo hi {{inputs.parameters.chunk}}
      resources:
        requests:
          cpu: 100m

  - name: third-step
    script:
      image: busybox
      command: [sh]
      source: |
        echo fin
      resources:
        requests:
          cpu: 100m

I return empty JSON on purpose: my real workflow has a finish-marker detection step, and I want to skip the whole DAG when the finish marker file is already in place in cloud storage. A JSON result is required even when the when condition doesn't pass, and it turned out that returning empty JSON makes the DAG hang forever. So I changed the step to return a dummy value and skip the downstream tasks with a when condition as well.

I guess it may make sense to have a feature to skip the whole DAG without even providing a JSON result. Anyway, just adding a possible case where people may want to return empty JSON on purpose.
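
For anyone copying that workaround, here is a rough sketch of the dummy-value approach (adapted from the reproduction above; the "dummy" marker and the per-item when expression are my own guesses at the details, not the author's exact YAML):

  # Generator template: always emit at least one item; "dummy" means "nothing to do"
  - name: first-step
    script:
      image: python:3
      command: [python]
      source: |
        import json
        print(json.dumps(["dummy"]))

  # Corresponding entry in dag.tasks: fan out as before, but skip the placeholder
  - name: second-step
    dependencies: [first-step]
    template: second-step
    arguments:
      parameters:
      - name: chunk
        value: "{{item}}"
    withParam: "{{tasks.first-step.outputs.result}}"
    when: "{{item}} != dummy"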

tarunpantqb commented:

I'm using v2.11.7 and running into the same issue again. All DAGs keep running indefinitely and hang.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        dependencies: [B, C]
        template: echo
        arguments:
          parameters: [{name: message, value: D}]

alexec (Contributor) commented Nov 5, 2020

Can you please try :latest?

simster7 (Member) commented Nov 6, 2020

@tarunpantqb I just ran the sample workflow you posted with no issues. I also see that it's one of our examples, which means that if there was a bug that prevented it from running successfully, our testing infrastructure would have caught it already.

Lastly, as you can see in #3739 (comment), this issue is only relevant for DAGs that contain a specific edge case when using withItems, etc. The workflow you posted does not use them.

Can you please ensure that you don't have an environment or similar issue that's preventing you from running workflows? If you are sure that it's not an issue on your end, feel free to open a new issue with a detailed bug report.
