
PNS Executor Doesn't Work Occasionally #2671

Closed
4 tasks done
dtaniwaki opened this issue Apr 13, 2020 · 13 comments · Fixed by #4253
dtaniwaki commented Apr 13, 2020

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML.
  • I've included the logs.

What happened:

I set up v2.7.2 in minikube with the PNS executor and found that output-parameter.yaml fails because the wait container cannot get the PID of the main container. I'm not sure whether this only occurs in minikube.

I checked the Kubernetes events of the pod and found that pulling the main container's image took longer than pulling the wait container's. I suspect this is why the executor cannot determine the PID of the main container: the wait container starts polling before the main container's process exists, and the polling deadline expires first.

I think the deadline in the following code should be configurable:
https://github.com/argoproj/argo/blob/79217bc89e892ee82bdd5018b2bba65425924d36/workflow/executor/pns/pns.go#L149

I will make a fix soon.

What you expected to happen:

output-parameter.yaml runs successfully.

How to reproduce it (as minimally and precisely as possible):

Run output-parameter.yaml with v2.7.2 in minikube.

Anything else we need to know?:
N/A

Environment:

  • Argo version:
$ argo version
argo: v2.7.2
  BuildDate: 2020-04-10T19:25:22Z
  GitCommit: c52a65aa62426f5e874e1d3f1058af15c43eb35f
  GitTreeState: clean
  GitTag: v2.7.2
  GoVersion: go1.13.4
  Compiler: gc
  Platform: darwin/amd64
  • Kubernetes version:
$ kubectl version -o yaml
clientVersion:
  buildDate: "2020-04-10T21:53:58Z"
  compiler: gc
  gitCommit: 7879fc12a63337efff607952a323df90cdc7a335
  gitTreeState: clean
  gitVersion: v1.18.1
  goVersion: go1.14.2
  major: "1"
  minor: "18"
  platform: darwin/amd64
serverVersion:
  buildDate: "2018-11-26T12:46:57Z"
  compiler: gc
  gitCommit: 435f92c719f279a3a67808c80521ea17d5715c66
  gitTreeState: clean
  gitVersion: v1.12.3
  goVersion: go1.10.4
  major: "1"
  minor: "12"
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
$ argo get output-parameter-ms85z
Name:                output-parameter-ms85z
Namespace:           default
ServiceAccount:      default
Status:              Failed
Message:             child 'output-parameter-ms85z-1252809517' failed
Created:             Tue Apr 14 00:14:34 +0900 (23 seconds ago)
Started:             Tue Apr 14 00:14:34 +0900 (23 seconds ago)
Finished:            Tue Apr 14 00:14:39 +0900 (18 seconds ago)
Duration:            5 seconds

STEP                       TEMPLATE          PODNAME                            DURATION  MESSAGE
 ✖ output-parameter-ms85z  output-parameter                                               child 'output-parameter-ms85z-1252809517' failed
 └---⚠ generate-parameter  whalesay          output-parameter-ms85z-1252809517  5s        failed to save outputs: Failed to determine pid for containerID 3e76e9a21e2dad264e3e71c06ccfc5fe9966ce0bf95a2ebfe19ba3f2a8690911: container may have exited too quickly
  • executor logs:
$ kubectl logs output-parameter-ms85z-1252809517 -c wait
time="2020-04-13T15:14:35Z" level=info msg="Starting Workflow Executor" version=vv2.7.2+c52a65a.dirty
time="2020-04-13T15:14:35Z" level=info msg="Creating PNS executor (namespace: default, pod: output-parameter-ms85z-1252809517, pid: 6, hasOutputs: true)"
time="2020-04-13T15:14:35Z" level=info msg="Executor (version: vv2.7.2+c52a65a.dirty, build_date: 2020-04-10T19:30:34Z) initialized (pod: default/output-parameter-ms85z-1252809517) with template:\n{\"name\":\"whalesay\",\"arguments\":{},\"inputs\":{},\"outputs\":{\"parameters\":[{\"name\":\"hello-param\",\"valueFrom\":{\"path\":\"/tmp/hello_world.txt\",\"default\":\"Foobar\"}}]},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"docker/whalesay:latest\",\"command\":[\"sh\",\"-c\"],\"args\":[\"sleep 1; echo -n hello world \\u003e /tmp/hello_world.txt\"],\"resources\":{}}}"
time="2020-04-13T15:14:35Z" level=info msg="Waiting on main container"
time="2020-04-13T15:14:35Z" level=warning msg="Polling root processes (1m0s)"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 53000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=info msg="Secured filehandle on /proc/20/root"
time="2020-04-13T15:14:38Z" level=info msg="containerID docker-3e76e9a21e2dad264e3e71c06ccfc5fe9966ce0bf95a2ebfe19ba3f2a8690911 mapped to pid 20"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 53000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 249000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 249000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 249000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 249000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 249000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 249000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 249000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 249000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 249000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 249000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=info msg="main container started with container ID: 3e76e9a21e2dad264e3e71c06ccfc5fe9966ce0bf95a2ebfe19ba3f2a8690911"
time="2020-04-13T15:14:38Z" level=info msg="Starting annotations monitor"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 249000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=warning msg="Failed to wait for container id '3e76e9a21e2dad264e3e71c06ccfc5fe9966ce0bf95a2ebfe19ba3f2a8690911': Failed to determine pid for containerID 3e76e9a21e2dad264e3e71c06ccfc5fe9966ce0bf95a2ebfe19ba3f2a8690911: container may have exited too quickly"
time="2020-04-13T15:14:38Z" level=info msg="/argo/podmetadata/annotations updated"
time="2020-04-13T15:14:38Z" level=error msg="executor error: Failed to determine pid for containerID 3e76e9a21e2dad264e3e71c06ccfc5fe9966ce0bf95a2ebfe19ba3f2a8690911: container may have exited too quickly\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.Errorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:55\ngithub.com/argoproj/argo/errors.InternalErrorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:65\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).getContainerPID\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:289\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:162\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait.func1\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:829\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.16.7-beta.0/pkg/util/wait/wait.go:292\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:828\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:40\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
time="2020-04-13T15:14:38Z" level=info msg="Saving output parameters"
time="2020-04-13T15:14:38Z" level=info msg="Saving path output parameter: hello-param"
time="2020-04-13T15:14:38Z" level=info msg="Copying /tmp/hello_world.txt from base image layer"
time="2020-04-13T15:14:38Z" level=info msg="Successfully saved output parameter: hello-param"
time="2020-04-13T15:14:38Z" level=info msg="No output artifacts"
time="2020-04-13T15:14:38Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2020-04-13T15:14:38Z" level=info msg="Annotating pod with output"
time="2020-04-13T15:14:38Z" level=info msg="Starting deadline monitor"
time="2020-04-13T15:14:38Z" level=info msg="Deadline monitor stopped"
time="2020-04-13T15:14:38Z" level=info msg="Annotations monitor stopped"
time="2020-04-13T15:14:38Z" level=info msg="pid 20: &{root 4096 2147484141 {53000000 63722387678 0x2d10800} {1048599 4188280 1 16877 0 0 0 0 4096 4096 8 {1586790878 249000000} {1586790878 53000000} {1586790878 168000000} [0 0 0]}}"
time="2020-04-13T15:14:38Z" level=info msg="Killing sidecars"
time="2020-04-13T15:14:38Z" level=info msg="Alloc=5794 TotalAlloc=13026 Sys=70080 NumGC=4 Goroutines=8"
time="2020-04-13T15:14:38Z" level=fatal msg="Failed to determine pid for containerID 3e76e9a21e2dad264e3e71c06ccfc5fe9966ce0bf95a2ebfe19ba3f2a8690911: container may have exited too quickly\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.Errorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:55\ngithub.com/argoproj/argo/errors.InternalErrorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:65\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).getContainerPID\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:289\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:162\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait.func1\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:829\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.16.7-beta.0/pkg/util/wait/wait.go:292\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:828\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:40\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
  • workflow-controller logs:
$ kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name)
time="2020-04-13T15:14:10Z" level=info msg="config map" name=workflow-controller-configmap
time="2020-04-13T15:14:10Z" level=info msg="Configuration:\nartifactRepository: {}\ncontainerRuntimeExecutor: pns\nfeatureFlags: {}\nmetricsConfig:\n  disableLegacy: false\n  enabled: true\n  path: /metrics\n  port: \"9090\"\npersistence:\n  archive: true\n  clusterName: local\n  connectionPool:\n    maxIdleConns: 100\n    maxOpenConns: 0\n  nodeStatusOffLoad: true\n  postgresql:\n    database: postgres\n    host: postgres\n    passwordSecret:\n      key: password\n      name: argo-postgres-config\n    port: \"5432\"\n    tableName: argo_workflows\n    userNameSecret:\n      key: username\n      name: argo-postgres-config\npodSpecLogStrategy: {}\ntelemetryConfig:\n  disableLegacy: false\n"
time="2020-04-13T15:14:10Z" level=info msg="Persistence configuration enabled"
time="2020-04-13T15:14:10Z" level=info msg="Creating DB session"
time="2020-04-13T15:14:10Z" level=info msg="Persistence Session created successfully"
time="2020-04-13T15:14:10Z" level=info msg="Migrating database schema" clusterName=local dbType=postgres
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="create table if not exists argo_workflows (\n    id varchar(128) ,\n    name varchar(256),\n    phase varchar(25),\n    namespace varchar(256),\n    workflow text,\n    startedat timestamp,\n    finishedat timestamp,\n    primary key (id, namespace)\n)" changeSchemaVersion=0
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="create unique index idx_name on argo_workflows (name)" changeSchemaVersion=1
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="create table if not exists argo_workflow_history (\n    id varchar(128) ,\n    name varchar(256),\n    phase varchar(25),\n    namespace varchar(256),\n    workflow text,\n    startedat timestamp,\n    finishedat timestamp,\n    primary key (id, namespace)\n)" changeSchemaVersion=2
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="alter table argo_workflow_history rename to argo_archived_workflows" changeSchemaVersion=3
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="drop index idx_name" changeSchemaVersion=4
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="create unique index idx_name on argo_workflows(name, namespace)" changeSchemaVersion=5
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="alter table argo_workflows drop constraint argo_workflows_pkey" changeSchemaVersion=6
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="alter table argo_workflows add primary key(name,namespace)" changeSchemaVersion=7
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="alter table argo_archived_workflows drop constraint argo_workflow_history_pkey" changeSchemaVersion=8
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="alter table argo_archived_workflows add primary key(id)" changeSchemaVersion=9
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="alter table argo_archived_workflows rename column id to uid" changeSchemaVersion=10
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="alter table argo_archived_workflows alter column uid set not null" changeSchemaVersion=11
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="alter table argo_archived_workflows alter column phase set not null" changeSchemaVersion=12
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="alter table argo_archived_workflows alter column namespace set not null" changeSchemaVersion=13
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="alter table argo_archived_workflows alter column workflow set not null" changeSchemaVersion=14
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="alter table argo_archived_workflows alter column startedat set not null" changeSchemaVersion=15
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="alter table argo_archived_workflows alter column finishedat set not null" changeSchemaVersion=16
time="2020-04-13T15:14:10Z" level=info msg="applying database change" change="alter table argo_archived_workflows add clustername varchar(64)" changeSchemaVersion=17
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="update argo_archived_workflows set clustername = 'local' where clustername is null" changeSchemaVersion=18
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_archived_workflows alter column clustername set not null" changeSchemaVersion=19
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_archived_workflows drop constraint argo_archived_workflows_pkey" changeSchemaVersion=20
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_archived_workflows add primary key(clustername,uid)" changeSchemaVersion=21
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="create index argo_archived_workflows_i1 on argo_archived_workflows (clustername,namespace)" changeSchemaVersion=22
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows drop column phase" changeSchemaVersion=23
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows drop column startedat" changeSchemaVersion=24
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows drop column finishedat" changeSchemaVersion=25
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows rename column id to uid" changeSchemaVersion=26
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows alter column uid set not null" changeSchemaVersion=27
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows alter column namespace set not null" changeSchemaVersion=28
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows add column clustername varchar(64)" changeSchemaVersion=29
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="update argo_workflows set clustername = 'local' where clustername is null" changeSchemaVersion=30
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows alter column clustername set not null" changeSchemaVersion=31
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows add column version varchar(64)" changeSchemaVersion=32
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows add column nodes text" changeSchemaVersion=33
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="backfillNodes{argo_workflows}" changeSchemaVersion=34
time="2020-04-13T15:14:11Z" level=info msg="Backfill node status"
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows alter column nodes set not null" changeSchemaVersion=35
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows drop column workflow" changeSchemaVersion=36
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows add column updatedat timestamp not null default current_timestamp" changeSchemaVersion=37
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows drop constraint argo_workflows_pkey" changeSchemaVersion=38
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="drop index idx_name" changeSchemaVersion=39
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows drop column name" changeSchemaVersion=40
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows add primary key(clustername,uid,version)" changeSchemaVersion=41
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="create index argo_workflows_i1 on argo_workflows (clustername,namespace)" changeSchemaVersion=42
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_archived_workflows alter column workflow type json using workflow::json" changeSchemaVersion=43
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_archived_workflows alter column name set not null" changeSchemaVersion=44
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="create index argo_workflows_i2 on argo_workflows (clustername,namespace,updatedat)" changeSchemaVersion=45
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="create table if not exists argo_archived_workflows_labels (\n\tclustername varchar(64) not null,\n\tuid varchar(128) not null,\n    name varchar(317) not null,\n    value varchar(63) not null,\n    primary key (clustername, uid, name),\n \tforeign key (clustername, uid) references argo_archived_workflows(clustername, uid) on delete cascade\n)" changeSchemaVersion=46
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_workflows alter column nodes type json using nodes::json" changeSchemaVersion=47
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_archived_workflows add column instanceid varchar(64)" changeSchemaVersion=48
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="update argo_archived_workflows set instanceid = '' where instanceid is null" changeSchemaVersion=49
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="alter table argo_archived_workflows alter column instanceid set not null" changeSchemaVersion=50
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="drop index argo_archived_workflows_i1" changeSchemaVersion=51
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="create index argo_archived_workflows_i1 on argo_archived_workflows (clustername,instanceid,namespace)" changeSchemaVersion=52
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="drop index argo_workflows_i1" changeSchemaVersion=53
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="drop index argo_workflows_i2" changeSchemaVersion=54
time="2020-04-13T15:14:11Z" level=info msg="applying database change" change="create index argo_workflows_i1 on argo_workflows (clustername,namespace,updatedat)" changeSchemaVersion=55
time="2020-04-13T15:14:11Z" level=info msg="Node status offloading config" ttl=5m0s
time="2020-04-13T15:14:11Z" level=info msg="Node status offloading is enabled"
time="2020-04-13T15:14:11Z" level=info msg="Workflow archiving is enabled"
time="2020-04-13T15:14:11Z" level=info msg="Starting prometheus metrics server at 0.0.0.0:9090/metrics"
time="2020-04-13T15:14:11Z" level=info msg="Starting Workflow Controller" version=vv2.7.2+c52a65a.dirty
time="2020-04-13T15:14:11Z" level=info msg="Workers: workflow: 8, pod: 8"
time="2020-04-13T15:14:11Z" level=info msg="Starting CronWorkflow controller"
time="2020-04-13T15:14:11Z" level=info msg="Starting workflow TTL controller (resync 20m0s)"
time="2020-04-13T15:14:11Z" level=info msg="Performing periodic GC every 5m0s"
time="2020-04-13T15:14:11Z" level=info msg="Started workflow TTL worker"
time="2020-04-13T15:14:34Z" level=info msg="Processing workflow" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:34Z" level=info msg="Updated phase  -> Running" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:34Z" level=info msg="Steps node {output-parameter-ms85z output-parameter-ms85z output-parameter-ms85z Steps output-parameter nil    Running   2020-04-13 15:14:34.923722135 +0000 UTC 0001-01-01 00:00:00 +0000 UTC   <nil> nil nil [] []} initialized Running" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:34Z" level=info msg="StepGroup node {output-parameter-ms85z-3009655378 output-parameter-ms85z[0] [0] StepGroup output-parameter nil    Running output-parameter-ms85z  2020-04-13 15:14:34.923765636 +0000 UTC 0001-01-01 00:00:00 +0000 UTC   <nil> nil nil [] []} initialized Running" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:34Z" level=info msg="Pod node {output-parameter-ms85z-1252809517 output-parameter-ms85z[0].generate-parameter generate-parameter Pod whalesay nil    Pending output-parameter-ms85z  2020-04-13 15:14:34.923959764 +0000 UTC 0001-01-01 00:00:00 +0000 UTC   <nil> nil nil [] []} initialized Pending" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:34Z" level=info msg="Created pod: output-parameter-ms85z[0].generate-parameter (output-parameter-ms85z-1252809517)" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:34Z" level=info msg="Workflow step group node &NodeStatus{ID:output-parameter-ms85z-3009655378,Name:output-parameter-ms85z[0],DisplayName:[0],Type:StepGroup,TemplateName:output-parameter,TemplateRef:nil,Phase:Running,BoundaryID:output-parameter-ms85z,Message:,StartedAt:2020-04-13 15:14:34.923765636 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[output-parameter-ms85z-1252809517],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} not yet completed" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:35Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=3259 workflow=output-parameter-ms85z
time="2020-04-13T15:14:36Z" level=info msg="Processing workflow" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:36Z" level=info msg="Updating node &NodeStatus{ID:output-parameter-ms85z-1252809517,Name:output-parameter-ms85z[0].generate-parameter,DisplayName:generate-parameter,Type:Pod,TemplateName:whalesay,TemplateRef:nil,Phase:Pending,BoundaryID:output-parameter-ms85z,Message:,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} message: ContainerCreating"
time="2020-04-13T15:14:36Z" level=info msg="Workflow step group node &NodeStatus{ID:output-parameter-ms85z-3009655378,Name:output-parameter-ms85z[0],DisplayName:[0],Type:StepGroup,TemplateName:output-parameter,TemplateRef:nil,Phase:Running,BoundaryID:output-parameter-ms85z,Message:,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[output-parameter-ms85z-1252809517],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} not yet completed" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:36Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=3265 workflow=output-parameter-ms85z
time="2020-04-13T15:14:37Z" level=info msg="Processing workflow" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:37Z" level=info msg="Workflow step group node &NodeStatus{ID:output-parameter-ms85z-3009655378,Name:output-parameter-ms85z[0],DisplayName:[0],Type:StepGroup,TemplateName:output-parameter,TemplateRef:nil,Phase:Running,BoundaryID:output-parameter-ms85z,Message:,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[output-parameter-ms85z-1252809517],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} not yet completed" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:38Z" level=info msg="Processing workflow" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:38Z" level=info msg="Updating node &NodeStatus{ID:output-parameter-ms85z-1252809517,Name:output-parameter-ms85z[0].generate-parameter,DisplayName:generate-parameter,Type:Pod,TemplateName:whalesay,TemplateRef:nil,Phase:Pending,BoundaryID:output-parameter-ms85z,Message:ContainerCreating,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} status Pending -> Running"
time="2020-04-13T15:14:38Z" level=info msg="Workflow step group node &NodeStatus{ID:output-parameter-ms85z-3009655378,Name:output-parameter-ms85z[0],DisplayName:[0],Type:StepGroup,TemplateName:output-parameter,TemplateRef:nil,Phase:Running,BoundaryID:output-parameter-ms85z,Message:,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[output-parameter-ms85z-1252809517],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} not yet completed" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:38Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=3273 workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="Processing workflow" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="Setting node &NodeStatus{ID:output-parameter-ms85z-1252809517,Name:output-parameter-ms85z[0].generate-parameter,DisplayName:generate-parameter,Type:Pod,TemplateName:whalesay,TemplateRef:nil,Phase:Running,BoundaryID:output-parameter-ms85z,Message:,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} outputs"
time="2020-04-13T15:14:39Z" level=info msg="Updating node &NodeStatus{ID:output-parameter-ms85z-1252809517,Name:output-parameter-ms85z[0].generate-parameter,DisplayName:generate-parameter,Type:Pod,TemplateName:whalesay,TemplateRef:nil,Phase:Running,BoundaryID:output-parameter-ms85z,Message:,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:&Outputs{Parameters:[]Parameter{Parameter{Name:hello-param,Default:nil,Value:*Foobar,ValueFrom:&ValueFrom{Path:/tmp/hello_world.txt,JSONPath:,JQFilter:,Parameter:,Default:Foobar,},GlobalName:,},},Artifacts:[]Artifact{},Result:nil,},Children:[],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} status Running -> Error"
time="2020-04-13T15:14:39Z" level=info msg="Updating node &NodeStatus{ID:output-parameter-ms85z-1252809517,Name:output-parameter-ms85z[0].generate-parameter,DisplayName:generate-parameter,Type:Pod,TemplateName:whalesay,TemplateRef:nil,Phase:Error,BoundaryID:output-parameter-ms85z,Message:,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:&Outputs{Parameters:[]Parameter{Parameter{Name:hello-param,Default:nil,Value:*Foobar,ValueFrom:&ValueFrom{Path:/tmp/hello_world.txt,JSONPath:,JQFilter:,Parameter:,Default:Foobar,},GlobalName:,},},Artifacts:[]Artifact{},Result:nil,},Children:[],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} message: failed to save outputs: Failed to determine pid for containerID 3e76e9a21e2dad264e3e71c06ccfc5fe9966ce0bf95a2ebfe19ba3f2a8690911: container may have exited too quickly"
time="2020-04-13T15:14:39Z" level=info msg="Step group node &NodeStatus{ID:output-parameter-ms85z-3009655378,Name:output-parameter-ms85z[0],DisplayName:[0],Type:StepGroup,TemplateName:output-parameter,TemplateRef:nil,Phase:Running,BoundaryID:output-parameter-ms85z,Message:,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[output-parameter-ms85z-1252809517],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} deemed failed: child 'output-parameter-ms85z-1252809517' failed" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="node &NodeStatus{ID:output-parameter-ms85z-3009655378,Name:output-parameter-ms85z[0],DisplayName:[0],Type:StepGroup,TemplateName:output-parameter,TemplateRef:nil,Phase:Running,BoundaryID:output-parameter-ms85z,Message:,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[output-parameter-ms85z-1252809517],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} phase Running -> Failed" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="node &NodeStatus{ID:output-parameter-ms85z-3009655378,Name:output-parameter-ms85z[0],DisplayName:[0],Type:StepGroup,TemplateName:output-parameter,TemplateRef:nil,Phase:Failed,BoundaryID:output-parameter-ms85z,Message:,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[output-parameter-ms85z-1252809517],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} message: child 'output-parameter-ms85z-1252809517' failed" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="node &NodeStatus{ID:output-parameter-ms85z-3009655378,Name:output-parameter-ms85z[0],DisplayName:[0],Type:StepGroup,TemplateName:output-parameter,TemplateRef:nil,Phase:Failed,BoundaryID:output-parameter-ms85z,Message:child 'output-parameter-ms85z-1252809517' failed,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:2020-04-13 15:14:39.808738163 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[output-parameter-ms85z-1252809517],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} finished: 2020-04-13 15:14:39.808738163 +0000 UTC" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="step group output-parameter-ms85z-3009655378 was unsuccessful: child 'output-parameter-ms85z-1252809517' failed" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="Outbound nodes of output-parameter-ms85z-1252809517 is [output-parameter-ms85z-1252809517]" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="Outbound nodes of output-parameter-ms85z is [output-parameter-ms85z-1252809517]" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="node &NodeStatus{ID:output-parameter-ms85z,Name:output-parameter-ms85z,DisplayName:output-parameter-ms85z,Type:Steps,TemplateName:output-parameter,TemplateRef:nil,Phase:Running,BoundaryID:,Message:,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[output-parameter-ms85z-3009655378],OutboundNodes:[output-parameter-ms85z-1252809517],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} phase Running -> Failed" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="node &NodeStatus{ID:output-parameter-ms85z,Name:output-parameter-ms85z,DisplayName:output-parameter-ms85z,Type:Steps,TemplateName:output-parameter,TemplateRef:nil,Phase:Failed,BoundaryID:,Message:,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[output-parameter-ms85z-3009655378],OutboundNodes:[output-parameter-ms85z-1252809517],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} message: child 'output-parameter-ms85z-1252809517' failed" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="node &NodeStatus{ID:output-parameter-ms85z,Name:output-parameter-ms85z,DisplayName:output-parameter-ms85z,Type:Steps,TemplateName:output-parameter,TemplateRef:nil,Phase:Failed,BoundaryID:,Message:child 'output-parameter-ms85z-1252809517' failed,StartedAt:2020-04-13 15:14:34 +0000 UTC,FinishedAt:2020-04-13 15:14:39.808827328 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[output-parameter-ms85z-3009655378],OutboundNodes:[output-parameter-ms85z-1252809517],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:,ResourcesDuration:ResourcesDuration{},} finished: 2020-04-13 15:14:39.808827328 +0000 UTC" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="Checking daemoned children of output-parameter-ms85z" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="Updated phase Running -> Failed" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="Updated message  -> child 'output-parameter-ms85z-1252809517' failed" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="Marking workflow completed" namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="Checking daemoned children of " namespace=default workflow=output-parameter-ms85z
time="2020-04-13T15:14:39Z" level=info msg="Workflow update successful" namespace=default phase=Failed resourceVersion=3279 workflow=output-parameter-ms85z
time="2020-04-13T15:14:40Z" level=info msg="Labeled pod default/output-parameter-ms85z-1252809517 completed"

Kubernetes pod events from kubectl describe.

$ kubectl describe pod output-parameter-ms85z-1252809517
Name:           output-parameter-ms85z-1252809517
Namespace:      default
Priority:       0
Node:           pfn-argo/192.168.99.107
Start Time:     Tue, 14 Apr 2020 00:14:34 +0900
Labels:         workflows.argoproj.io/completed=true
                workflows.argoproj.io/workflow=output-parameter-ms85z
Annotations:    workflows.argoproj.io/node-message:
                  Failed to determine pid for containerID 3e76e9a21e2dad264e3e71c06ccfc5fe9966ce0bf95a2ebfe19ba3f2a8690911: container may have exited too qu...
                workflows.argoproj.io/node-name: output-parameter-ms85z[0].generate-parameter
                workflows.argoproj.io/outputs:
                  {"parameters":[{"name":"hello-param","value":"Foobar","valueFrom":{"path":"/tmp/hello_world.txt","default":"Foobar"}}]}
                workflows.argoproj.io/template:
                  {"name":"whalesay","arguments":{},"inputs":{},"outputs":{"parameters":[{"name":"hello-param","valueFrom":{"path":"/tmp/hello_world.txt","d...
Status:         Failed
IP:             172.17.0.5
IPs:            <none>
Controlled By:  Workflow/output-parameter-ms85z
Containers:
  wait:
    Container ID:  docker://e5f0e1af6b453b2c98af21a141a815f442bab3e3591be6704a3abd9716cb37c2
    Image:         argoproj/argoexec:v2.7.2
    Image ID:      docker-pullable://argoproj/argoexec@sha256:8e9c6440e7ef62977077632ca56acf0eb4d86b29ba5842a531a53e85ececc5ff
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      wait
    State:          Terminated
      Reason:       Error
      Message:      Failed to determine pid for containerID 3e76e9a21e2dad264e3e71c06ccfc5fe9966ce0bf95a2ebfe19ba3f2a8690911: container may have exited too quickly
      Exit Code:    1
      Started:      Tue, 14 Apr 2020 00:14:35 +0900
      Finished:     Tue, 14 Apr 2020 00:14:38 +0900
    Ready:          False
    Restart Count:  0
    Environment:
      ARGO_POD_NAME:                    output-parameter-ms85z-1252809517 (v1:metadata.name)
      ARGO_CONTAINER_RUNTIME_EXECUTOR:  pns
    Mounts:
      /argo/podmetadata from podmetadata (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-2zw64 (ro)
  main:
    Container ID:  docker://3e76e9a21e2dad264e3e71c06ccfc5fe9966ce0bf95a2ebfe19ba3f2a8690911
    Image:         docker/whalesay:latest
    Image ID:      docker-pullable://docker/whalesay@sha256:178598e51a26abbc958b8a2e48825c90bc22e641de3d31e18aaf55f3258ba93b
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      sleep 1; echo -n hello world > /tmp/hello_world.txt
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 14 Apr 2020 00:14:38 +0900
      Finished:     Tue, 14 Apr 2020 00:14:39 +0900
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-2zw64 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  podmetadata:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  default-token-2zw64:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-2zw64
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  3m2s   default-scheduler  Successfully assigned default/output-parameter-ms85z-1252809517 to pfn-argo
  Normal  Pulled     3m1s   kubelet, pfn-argo  Container image "argoproj/argoexec:v2.7.2" already present on machine
  Normal  Created    3m1s   kubelet, pfn-argo  Created container
  Normal  Started    3m1s   kubelet, pfn-argo  Started container
  Normal  Pulling    3m1s   kubelet, pfn-argo  pulling image "docker/whalesay:latest"
  Normal  Pulled     2m58s  kubelet, pfn-argo  Successfully pulled image "docker/whalesay:latest"
  Normal  Created    2m58s  kubelet, pfn-argo  Created container
  Normal  Started    2m58s  kubelet, pfn-argo  Started container

Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@dtaniwaki
Member Author

I think the deadline in the following code should be configurable.

I realized my assumption is wrong.

The PNS executor waits while init containers are initializing.

https://github.com/argoproj/argo/blob/79217bc89e892ee82bdd5018b2bba65425924d36/workflow/executor/pns/pns.go#L145-L158

However, in certain situations no init containers exist at all.

https://github.com/argoproj/argo/blob/6c685c5baf281116340b7b0708f8a29764d72c47/workflow/controller/workflowpod.go#L193-L196

@dtaniwaki
Member Author

I got it. Init containers have nothing to do with the PNS executor.

WaitInit doesn't block at all here, even though the function's name suggests it waits.
https://github.com/argoproj/argo/blob/79217bc89e892ee82bdd5018b2bba65425924d36/workflow/executor/pns/pns.go#L149

Is it expected behavior?

@sarabala1979
Member

It looks like WaitInit is the initialization step for the Wait operation in the wait container. The function's comments are clear. In the PNSExecutor case, it is trying to secure the "FS" for outputs.

// WaitInit is called before Wait() to signal the executor about an impending Wait call.
// For most executors this is a noop, and is only used by the the PNS executor
WaitInit() error

@dtaniwaki
Member Author

I think the goroutine it starts doesn't actually wait for the main container.

@wreed4
Contributor

wreed4 commented Apr 29, 2020

I've encountered this problem now even when storing outputs in emptyDirs. The problem for me isn't actually saving the outputs; the logs say the outputs were saved successfully. Here are the wait container logs.

time="2020-04-29T12:24:53Z" level=info msg="Starting Workflow Executor" version=vv2.7.1+2a3f59c.dirty
time="2020-04-29T12:24:53Z" level=info msg="Creating PNS executor (namespace: acquia-polaris-system, pod: code-redeploy-qsmpx-2720976418, pid: 6, hasOutputs: true)"
time="2020-04-29T12:24:53Z" level=info msg="Executor (version: vv2.7.1+2a3f59c.dirty, build_date: 2020-04-07T16:44:22Z) initialized (pod: acquia-polaris-system/code-redeploy-qsmpx-2720976418) with templat
e:\n{\"name\":\"get-envs\",\"arguments\":{},\"inputs\":{},\"outputs\":{\"parameters\":[{\"name\":\"json\",\"valueFrom\":{\"path\":\"/tmp/out.json\"}}]},\"metadata\":{},\"container\":{\"name\":\"\",\"image
\":\"881217801864.dkr.ecr.us-east-1.amazonaws.com/fn-taskstep:latest\",\"args\":[\"get-envs-by-git\",\"git@github.com:acquia/fn-example-repos.git\",\"e2e-d8-build-code-deploy\"],\"resources\":{},\"volumeM
ounts\":[{\"name\":\"out\",\"mountPath\":\"/tmp\"}]},\"volumes\":[{\"name\":\"out\",\"emptyDir\":{}}],\"retryStrategy\":{\"limit\":4}}"
time="2020-04-29T12:24:53Z" level=info msg="Waiting on main container"
time="2020-04-29T12:24:53Z" level=warning msg="Polling root processes (1m0s)"
time="2020-04-29T12:24:53Z" level=warning msg="Failed to stat /proc/27/root: stat /proc/27/root: permission denied"
time="2020-04-29T12:25:08Z" level=info msg="main container started with container ID: 0e63f624ba94dbf5b9dc69ee2ad28d2183ce7d2dbea4bfcd50a534ce343bdce9"
time="2020-04-29T12:25:08Z" level=info msg="Starting annotations monitor"
time="2020-04-29T12:25:08Z" level=warning msg="Failed to wait for container id '0e63f624ba94dbf5b9dc69ee2ad28d2183ce7d2dbea4bfcd50a534ce343bdce9': Failed to determine pid for containerID 0e63f624ba94dbf5b9dc69ee2ad28d2183ce7d2dbea4bfcd50a534ce343bdce9: container may have exited too quickly"
time="2020-04-29T12:25:08Z" level=error msg="executor error: Failed to determine pid for containerID 0e63f624ba94dbf5b9dc69ee2ad28d2183ce7d2dbea4bfcd50a534ce343bdce9: container may have exited too quickly
\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.Errorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:55\ngithub.com/ar
goproj/argo/errors.InternalErrorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:65\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).getContainerPID\n\t/go/src/github.com/argoproj/argo
/workflow/executor/pns/pns.go:289\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:162\ngithub.com/argoproj/argo/workflo
w/executor.(*WorkflowExecutor).Wait.func1\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:829\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.
16.7-beta.0/pkg/util/wait/wait.go:292\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:828\ngithub.com/argoproj/argo/c
md/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:40\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/
argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command
).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.2018102114111
4-fe5e611709b0/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm
_amd64.s:1357"
time="2020-04-29T12:25:08Z" level=info msg="Saving output parameters"
time="2020-04-29T12:25:08Z" level=info msg="Saving path output parameter: json"
time="2020-04-29T12:25:08Z" level=info msg="Copying /tmp/out.json from from volume mount"
time="2020-04-29T12:25:08Z" level=info msg="Successfully saved output parameter: json"
time="2020-04-29T12:25:08Z" level=info msg="No output artifacts"
time="2020-04-29T12:25:08Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2020-04-29T12:25:08Z" level=info msg="Annotating pod with output"
time="2020-04-29T12:25:08Z" level=info msg="/argo/podmetadata/annotations updated"
time="2020-04-29T12:25:08Z" level=info msg="Starting deadline monitor"
time="2020-04-29T12:25:08Z" level=info msg="Deadline monitor stopped"
time="2020-04-29T12:25:08Z" level=info msg="Killing sidecars"
time="2020-04-29T12:25:08Z" level=info msg="Alloc=4301 TotalAlloc=20985 Sys=71360 NumGC=7 Goroutines=9"
time="2020-04-29T12:25:08Z" level=fatal msg="Failed to determine pid for containerID 0e63f624ba94dbf5b9dc69ee2ad28d2183ce7d2dbea4bfcd50a534ce343bdce9: container may have exited too quickly\ngithub.com/arg
oproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.Errorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:55\ngithub.com/argoproj/argo/erro
rs.InternalErrorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:65\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).getContainerPID\n\t/go/src/github.com/argoproj/argo/workflow/execut
or/pns/pns.go:289\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:162\ngithub.com/argoproj/argo/workflow/executor.(*Wor
kflowExecutor).Wait.func1\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:829\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.16.7-beta.0/pkg/
util/wait/wait.go:292\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:828\ngithub.com/argoproj/argo/cmd/argoexec/comm
ands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:40\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexe
c/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/g
o/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/c
ommand.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"

The failure reason on the workflow does state it couldn't save outputs, but that doesn't seem right given the logs above.

failed to save outputs: Failed to determine pid for containerID 0e63f624ba94dbf5b9dc69ee2ad28d2183ce7d2dbea4bfcd50a534ce343bdce9: container may have exited too quickly

It looks like the issue is at https://github.com/argoproj/argo/blob/master/workflow/executor/pns/pns.go#L333. For some reason, intermittently, the Stat fails with "permission denied". The next time the map is checked, the container still isn't found, so the same error that is raised when outputs can't be saved gets thrown, but the real issue in my case is the intermittent permission-denied error.

If it helps, I'm running on microk8s locally on Ubuntu. All permissions are set up normally on a microk8s install, my user is in the microk8s group, and the pod I'm launching does not have a securityContext set. Also, I'm running version 2.7.1; I can upgrade and see if that helps.

@stale

stale bot commented Jul 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 1, 2020
@stale stale bot closed this as completed Jul 8, 2020
@guoweis-work

I'm still hitting this issue.

@alexec
Contributor

alexec commented Oct 1, 2020

I'm getting this with runAsNonRoot.

@alexec alexec reopened this Oct 1, 2020
@stale stale bot removed the wontfix label Oct 1, 2020
@cyang49

cyang49 commented Oct 2, 2020

I'm also having a similar issue.

@alexec
Contributor

alexec commented Oct 2, 2020

The PNS executor needs to run as root. Have you checked that?

@juliusvonkohout
Contributor

juliusvonkohout commented Oct 5, 2020

The PNS executor needs to run as root. Have you checked that?

It does not if you are on containerd or CRI-O and modify the executor image a bit. AFAIK, Docker has some bugs regarding capabilities without root. I am using it rootless with pod security policies. Do you want me to make a pull request with the necessary changes? It depends a bit on
#3785 (comment)

@alexec
Contributor

alexec commented Oct 5, 2020

@juliusvonkohout can I get you to add your security policy to #4186 as the first step so we can understand the solution?

@alexec
Contributor

alexec commented Oct 9, 2020

@wreed @juliusvonkohout I've noted that when one runs PNS with runAsNonRoot and output artifacts, you get the permission-denied error: files are copied, but the wait container cannot determine when the main container has exited.
