Surface K8s Events about Nodes to UI #3673

Closed
hadim opened this issue Aug 4, 2020 · 7 comments · Fixed by #3726
Labels
type/feature Feature request

Comments

@hadim

hadim commented Aug 4, 2020

We run Argo on GKE with the cluster-autoscaler. Most of the time our workflows trigger a scale-up event, since the powerful machines are usually down when not in use.

Usually, the scale-up takes a few minutes. During this time the workflow is in a pending state with a message similar to this:

[screenshot: workflow pending status message in the Argo UI]

Once the node has been created the workflow starts and all is good!

But sometimes the cluster is not able to scale up (often due to an error in the resources and nodeSelector configuration).

Admins have access to the cluster event logs and can spot the issue quickly:

[screenshot: cluster event log showing the failed scale-up]

But most of our staff only use the Argo dashboard, so they can only rely on the workflow status message to understand what is going on. The scale-up error reported by the cluster-autoscaler is not propagated to it:

pod didn't trigger scale-up (it wouldn't fit if a new node is added): 16 Insufficient memory, 1 node(s) didn't match node selector, 15 Insufficient cpu

Would it be possible to report cluster events related to a specific workflow, or to pods managed by a workflow, in the status field?

[screenshot: workflow status message field in the Argo UI]

I'm not sure it's technically possible (because of ServiceAccounts and permissions), but I am asking just in case.
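As a side note, surfacing these events would presumably only need read access to Events in the workflow namespaces. A minimal RBAC sketch, assuming a hypothetical role name events-reader and the argo namespace (this is illustrative, not part of any existing Argo manifest):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: events-reader    # hypothetical name, for illustration only
  namespace: argo
rules:
- apiGroups: [""]        # Events live in the core API group
  resources: ["events"]
  verbs: ["get", "list", "watch"]

A RoleBinding would then attach this to whichever ServiceAccount the component reading the events runs as.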

@hadim hadim added the type/feature Feature request label Aug 4, 2020
@alexec
Contributor

alexec commented Aug 5, 2020

We watch the pods of a workflow to compute state. If this information is available on the pod's status, we could surface it.

Are you able to attach the YAML of a pod that was involved? Or is this only available in events? (If so, please attach the event YAML.)
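For reference, the events referring to a specific pod can be dumped with a field selector, something like this (namespace and pod name are placeholders):

kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> -o yaml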

@hadim
Author

hadim commented Aug 5, 2020

Indeed, the only thing I see in the pod YAML is what Argo reports in the UI:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-08-05T16:53:27Z"
    message: '0/21 nodes are available: 21 Insufficient cpu.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

The source of that message is default-scheduler. But the message "pod didn't trigger scale-up (it wouldn't fit if a new node is added): 17 Insufficient cpu", whose source is cluster-autoscaler, is not shown.

$ kubectl get event -n argo --field-selector involvedObject.name=wonderful-tiger
LAST SEEN   TYPE      REASON              OBJECT                     MESSAGE
25s         Normal    WorkflowRunning     workflow/wonderful-tiger   Workflow Running
25s         Warning   FailedScheduling    pod/wonderful-tiger        0/21 nodes are available: 21 Insufficient cpu.
22s         Normal    NotTriggerScaleUp   pod/wonderful-tiger        pod didn't trigger scale-up (it wouldn't fit if a new node is added): 17 Insufficient cpu

and

$ kubectl get event -n argo --field-selector involvedObject.name=wonderful-tiger -o yaml
apiVersion: v1
items:
- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: "2020-08-05T17:01:09Z"
  involvedObject:
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    name: wonderful-tiger
    namespace: argo
    resourceVersion: "14645697"
    uid: f4bb68ae-84a2-4e40-8d5b-71f0a9ab91f3
  kind: Event
  lastTimestamp: "2020-08-05T17:01:09Z"
  message: Workflow Running
  metadata:
    creationTimestamp: "2020-08-05T17:01:09Z"
    name: wonderful-tiger.16286dde6befd70c
    namespace: argo
    resourceVersion: "135777"
    selfLink: /api/v1/namespaces/argo/events/wonderful-tiger.16286dde6befd70c
    uid: 57c2bd09-11c6-4dc8-8ada-b33346764566
  reason: WorkflowRunning
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: workflow-controller
  type: Normal
- apiVersion: v1
  count: 3
  eventTime: null
  firstTimestamp: "2020-08-05T17:01:09Z"
  involvedObject:
    apiVersion: v1
    kind: Pod
    name: wonderful-tiger
    namespace: argo
    resourceVersion: "14645699"
    uid: a6c471b1-a31e-46fa-b53e-27554a81d328
  kind: Event
  lastTimestamp: "2020-08-05T17:02:33Z"
  message: '0/21 nodes are available: 21 Insufficient cpu.'
  metadata:
    creationTimestamp: "2020-08-05T17:01:09Z"
    name: wonderful-tiger.16286dde6d37eca9
    namespace: argo
    resourceVersion: "135781"
    selfLink: /api/v1/namespaces/argo/events/wonderful-tiger.16286dde6d37eca9
    uid: d256c9fe-40fd-4b97-a7e6-62ec0470eb4f
  reason: FailedScheduling
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: default-scheduler
  type: Warning
- apiVersion: v1
  count: 1
  eventTime: null
  firstTimestamp: "2020-08-05T17:01:12Z"
  involvedObject:
    apiVersion: v1
    kind: Pod
    name: wonderful-tiger
    namespace: argo
    resourceVersion: "14645700"
    uid: a6c471b1-a31e-46fa-b53e-27554a81d328
  kind: Event
  lastTimestamp: "2020-08-05T17:01:12Z"
  message: 'pod didn''t trigger scale-up (it wouldn''t fit if a new node is added):
    17 Insufficient cpu'
  metadata:
    creationTimestamp: "2020-08-05T17:01:12Z"
    name: wonderful-tiger.16286ddf00db261b
    namespace: argo
    resourceVersion: "135780"
    selfLink: /api/v1/namespaces/argo/events/wonderful-tiger.16286ddf00db261b
    uid: ac79183e-aaff-4f8d-a604-10a8c4b254f6
  reason: NotTriggerScaleUp
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: cluster-autoscaler
  type: Normal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

So maybe the cluster-autoscaler event was too short-lived to be displayed? If that's the case, is there a way to access the previous status messages of a workflow?
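For reference, past events for the pod can still be listed sorted by time, at least until they expire from the cluster's event retention (pod name taken from the example above):

kubectl get events -n argo --field-selector involvedObject.name=wonderful-tiger --sort-by=.metadata.creationTimestamp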

@alexec
Contributor

alexec commented Aug 5, 2020

Would it be enough to list events related to the pod in the UI?

@hadim
Author

hadim commented Aug 5, 2020

I think it would do the job, yeah. Being able to see the history of events would probably be useful for other things beyond this specific issue as well. Similar to what GKE is doing:

[screenshot: event history view in the GKE console]

I guess if this is available in the UI it will also be available in the status field of the Workflow YAML spec. Then we could also consume this information programmatically for automatic processing. So yeah, that would be a very nice feature!

@alexec
Contributor

alexec commented Aug 5, 2020

As an MVP, I don't think we would make it available in the YAML. Instead, we would make it available in the UI only.

That would exclude the YAML and the CLI.

Do you think this should be MVP?

@hadim
Author

hadim commented Aug 5, 2020

For us, the most important is to have this information available on the UI.

Having it in the YAML is just a bonus and a cool enhancement IMO. I like the idea of having all the information related to a workflow in a single YAML object: its state, spec, and status history.

So my answer is yes, it should go in the MVP, but if this requires too much work then having the information in the UI only is fine.

@jessesuen jessesuen added the ui label Aug 5, 2020
@jessesuen jessesuen changed the title Propagate cluster-autoscaler event messages to workflow status message Surface K8s Events about Nodes to UI Aug 5, 2020
@alexec
Contributor

alexec commented Sep 2, 2020

Available for testing in v2.11.0-rc1.
