Surface K8s Events about Nodes to UI #3673

Closed
hadim opened this issue Aug 4, 2020 · 7 comments · Fixed by #3726
Labels
type/feature Feature request

Comments

@hadim

hadim commented Aug 4, 2020

We run Argo on GKE with the cluster-autoscaler. Most of the time our workflows trigger a scale-up event, since the powerful machines are usually down when not in use.

Usually, the scale-up takes a few minutes. During this time the workflow is in a pending state with a message similar to this:

[screenshot: workflow pending status message in the Argo UI]

Once the node has been created the workflow starts and all is good!

But sometimes the cluster is not able to scale up (often due to an error in the resources and nodeSelector configuration).

Admins have access to the cluster event logs and can spot the issue quickly:

[screenshot: cluster event log showing the failed scale-up]

But most of our staff only use the Argo dashboard, so they can only rely on the workflow status message to understand what is going on. The scale-up error reported by the cluster-autoscaler is not propagated to it:

pod didn't trigger scale-up (it wouldn't fit if a new node is added): 16 Insufficient memory, 1 node(s) didn't match node selector, 15 Insufficient cpu

Would it be possible to report cluster events related to a specific workflow, or to pods managed by a workflow, in the status field?

[screenshot: workflow status message field in the Argo UI]

I'm not sure it's technically possible (because of ServiceAccounts and permissions), but I am asking just in case.
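As a side note, surfacing these events would presumably only need read access to Events in the workflow namespaces. A minimal RBAC sketch, assuming a hypothetical role name events-reader and the argo namespace (this is illustrative, not part of any existing Argo manifest):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: events-reader    # hypothetical name, for illustration only
  namespace: argo
rules:
- apiGroups: [""]        # Events live in the core API group
  resources: ["events"]
  verbs: ["get", "list", "watch"]

A RoleBinding would then attach this to whichever ServiceAccount the component reading the events runs as.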

@hadim hadim added the type/feature Feature request label Aug 4, 2020
@alexec
Contributor

alexec commented Aug 5, 2020

We watch the pods of a workflow to compute state. If this information is available on the pod's status, we could surface it.

Are you able to attach the YAML of a pod that was involved? Or is this only available in events? (If so, please attach the event YAML.)
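For reference, the events referring to a specific pod can be dumped with a field selector, something like this (namespace and pod name are placeholders):

kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> -o yaml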

@hadim
Author

hadim commented Aug 5, 2020

Indeed, the only thing I see in the pod YAML is what Argo reports in the UI:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-08-05T16:53:27Z"
    message: '0/21 nodes are available: 21 Insufficient cpu.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

The source of that message is default-scheduler. But the message "pod didn't trigger scale-up (it wouldn't fit if a new node is added): 17 Insufficient cpu", whose source is cluster-autoscaler, is not shown.

$ kubectl get event -n argo --field-selector involvedObject.name=wonderful-tiger
LAST SEEN   TYPE      REASON              OBJECT                     MESSAGE
25s         Normal    WorkflowRunning     workflow/wonderful-tiger   Workflow Running
25s         Warning   FailedScheduling    pod/wonderful-tiger        0/21 nodes are available: 21 Insufficient cpu.
22s         Normal    NotTriggerScaleUp   pod/wonderful-tiger        pod didn't trigger scale-up (it wouldn't fit if a new node is added): 17 Insufficient cpu

and

$ kubectl get event -n argo --field-selector involvedObject.name=wonderful-tiger -o yaml
apiVersion: v1
items:
- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: "2020-08-05T17:01:09Z"
  involvedObject:
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    name: wonderful-tiger
    namespace: argo
    resourceVersion: "14645697"
    uid: f4bb68ae-84a2-4e40-8d5b-71f0a9ab91f3
  kind: Event
  lastTimestamp: "2020-08-05T17:01:09Z"
  message: Workflow Running
  metadata:
    creationTimestamp: "2020-08-05T17:01:09Z"
    name: wonderful-tiger.16286dde6befd70c
    namespace: argo
    resourceVersion: "135777"
    selfLink: /api/v1/namespaces/argo/events/wonderful-tiger.16286dde6befd70c
    uid: 57c2bd09-11c6-4dc8-8ada-b33346764566
  reason: WorkflowRunning
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: workflow-controller
  type: Normal
- apiVersion: v1
  count: 3
  eventTime: null
  firstTimestamp: "2020-08-05T17:01:09Z"
  involvedObject:
    apiVersion: v1
    kind: Pod
    name: wonderful-tiger
    namespace: argo
    resourceVersion: "14645699"
    uid: a6c471b1-a31e-46fa-b53e-27554a81d328
  kind: Event
  lastTimestamp: "2020-08-05T17:02:33Z"
  message: '0/21 nodes are available: 21 Insufficient cpu.'
  metadata:
    creationTimestamp: "2020-08-05T17:01:09Z"
    name: wonderful-tiger.16286dde6d37eca9
    namespace: argo
    resourceVersion: "135781"
    selfLink: /api/v1/namespaces/argo/events/wonderful-tiger.16286dde6d37eca9
    uid: d256c9fe-40fd-4b97-a7e6-62ec0470eb4f
  reason: FailedScheduling
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: default-scheduler
  type: Warning
- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: "2020-08-05T17:01:12Z"
  involvedObject:
    apiVersion: v1
    kind: Pod
    name: wonderful-tiger
    namespace: argo
    resourceVersion: "14645700"
    uid: a6c471b1-a31e-46fa-b53e-27554a81d328
  kind: Event
  lastTimestamp: "2020-08-05T17:01:12Z"
  message: 'pod didn''t trigger scale-up (it wouldn''t fit if a new node is added):
    17 Insufficient cpu'
  metadata:
    creationTimestamp: "2020-08-05T17:01:12Z"
    name: wonderful-tiger.16286ddf00db261b
    namespace: argo
    resourceVersion: "135780"
    selfLink: /api/v1/namespaces/argo/events/wonderful-tiger.16286ddf00db261b
    uid: ac79183e-aaff-4f8d-a604-10a8c4b254f6
  reason: NotTriggerScaleUp
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: cluster-autoscaler
  type: Normal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

So maybe the cluster-autoscaler event was too short-lived to be displayed? If that's the case, is there a way to access the previous status messages of a workflow?
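For reference, past events for the pod can still be listed sorted by time, at least until they expire from the cluster's event retention (pod name taken from the example above):

kubectl get events -n argo --field-selector involvedObject.name=wonderful-tiger --sort-by=.metadata.creationTimestamp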

@alexec
Contributor

alexec commented Aug 5, 2020

Would it be enough to list events related to the pod in the UI?

@hadim
Author

hadim commented Aug 5, 2020

I think it would do the job, yeah. Being able to see the history of events would probably be useful for other things beyond this specific issue as well. Similar to what GKE is doing:

[screenshot: event history view in the GKE console]

I guess if this is available in the UI it will also be available in the status field of the Workflow YAML spec. Then we could also consume this information programmatically for automatic processing. So yeah, that would be a very nice feature!

@alexec
Contributor

alexec commented Aug 5, 2020

As an MVP, I don't think we would make it available in the YAML. Instead, we would make it available in the UI only.

That would exclude the YAML and the CLI.

Do you think this should be MVP?

@hadim
Author

hadim commented Aug 5, 2020

For us, the most important is to have this information available on the UI.

Having it in the YAML is just a bonus and a cool enhancement IMO. I like the idea of having all the information related to a workflow in a single YAML object: its state, spec, and status history.

So my answer is yes, it should go in the MVP, but if this requires too much work then having the information in the UI only is fine.

@jessesuen jessesuen added the ui label Aug 5, 2020
@jessesuen jessesuen changed the title Propagate cluster-autoscaler event messages to workflow status message Surface K8s Events about Nodes to UI Aug 5, 2020
@alexec
Contributor

alexec commented Sep 2, 2020

Available for testing in v2.11.0-rc1.
