kube-state alert routing in multi tenant cluster #1231

Closed
nrobert13 opened this issue Aug 5, 2021 · 4 comments

nrobert13 commented Aug 5, 2021

Hey guys,

I'm facing the following challenge. We are running a cluster with multiple tenants, each supporting their own workloads. Each tenant has a separate alerting channel, so we need to route the alerts generated from the kube-state-metrics rules based on the source namespace. As far as I can tell, the current chart does not allow such dynamic routing of alerts based on the source namespace.

We came up with the following solution:

  1. Add a label "alerts" to each namespace, holding the name of the channel the alerts are supposed to be routed to:

     kind: Namespace
     metadata:
       labels:
         alerts: channel

  2. Leverage the kube_namespace_labels metric, which contains all the labels attached to the namespaces (see label_alerts):

     kube_namespace_labels{container="kube-state-metrics", endpoint="http", instance="10.32.3.33:8080", job="kube-state-metrics", label_alerts="channel", namespace="service", pod="prometheus-kube-state-metrics-748b59796-t4s29", service="prometheus-kube-state-metrics"}

  3. Group the two metrics on namespace in order to pick up the value of the namespace's alerts label from step 1 and use it as a label on the alert itself:

     name: KubeJobFailed
     expr: ( kube_job_status_failed{job="kube-state-metrics",namespace=~"{{ $targetNamespace }}"} > 0 ) * on(namespace) group_left(label_alerts) kube_namespace_labels
     for: 1h
     labels:
       severity: warning
       team: {{ $labels.label_alerts }}
     annotations:
       message: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
       runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobfailed

  4. This way the label is propagated from the namespace to the alerts, and it works as expected (see the Alertmanager routing sketch below).
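
For completeness, a minimal Alertmanager routing sketch that consumes the propagated team label (the receiver names are placeholders, not part of our actual setup):

    route:
      receiver: default
      routes:
        - match:
            team: channel           # value propagated from the namespace's alerts label
          receiver: tenant-channel  # placeholder per-tenant receiver
    receivers:
      - name: default
      - name: tenant-channel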

The problem is that we need to copy/paste all the kube-state-metrics rules just to add the grouping * on(namespace) group_left(label_alerts) kube_namespace_labels and the surrounding parentheses to the expressions. I imagine others may face the same challenge, so I thought about adding two extra values, .Values.defaultRules.expressionPrefix and .Values.defaultRules.expressionSuffix, to feed the grouping into all the rules without copy/pasting and maintaining them ourselves. The rule would then become something like:

    - alert: KubeJobFailed
      annotations:
        description: Job {{`{{`}} $labels.namespace {{`}}`}}/{{`{{`}} $labels.job_name {{`}}`}} failed to complete. Removing failed job after investigation should clear this alert.
        runbook_url: {{ .Values.defaultRules.runbookUrl }}alert-name-kubejobfailed
        summary: Job failed to complete.
      expr: {{ .Values.defaultRules.expressionPrefix }} kube_job_failed{job="kube-state-metrics", namespace=~"{{ $targetNamespace }}"} > 0 {{ .Values.defaultRules.expressionSuffix }}
      for: 15m
      labels:
        severity: warning
{{- if .Values.defaultRules.additionalRuleLabels }}
{{ toYaml .Values.defaultRules.additionalRuleLabels | indent 8 }}
{{- end }}

and the values:

defaultRules:
  expressionPrefix: "("
  expressionSuffix: ") * on(namespace) group_left(label_alerts) kube_namespace_labels"
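
For illustration, after substituting just the prefix and suffix (leaving the other template variables untouched), the expr above would render to:

    expr: ( kube_job_failed{job="kube-state-metrics", namespace=~"{{ $targetNamespace }}"} > 0 ) * on(namespace) group_left(label_alerts) kube_namespace_labels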

Please let me know WDYT, or whether you have another alternative to achieve the same goal.

Cheers,
Robert

nrobert13 (Author) commented

Hey guys, what do you think about this proposal? Any input is highly appreciated.
Cheers,
Robert

martykuentzel commented

I just saw this request and I highly endorse it. In our company we have a similar setup via Slack, as described here, and it's currently not easy to route generic Prometheus alerts to separate channels.

jorik90 commented Sep 3, 2021

You can use templates inside a label value, and inside the template you can execute a query. Support for additional labels is already present in the chart. I have used the following to apply the correct team/squad label to alerts that carry a namespace label.

In the values.yaml of the chart:

defaultRules:
  additionalRuleLabels:
    squad: '{{ with printf `kube_namespace_labels{namespace="%s"}` .Labels.namespace | query }}{{ with (. | first).Labels.label_squad }}{{ . }}{{else}}nosquad{{end}}{{else}}nosquad{{end}}'

kube-state-metrics:
  # make sure kube-state-metrics includes the label
  metricLabelsAllowlist: namespaces=[squad]

Each alert will get a squad label containing the squad that is defined on the alert's namespace. If no such label is defined, the value will be nosquad.
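
For reference, a namespace labeled like this (the squad name below is just an example) would surface as label_squad on kube_namespace_labels and end up on the alert:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: service
      labels:
        squad: payments   # example value; exposed as label_squad="payments" once allowlisted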

nrobert13 (Author) commented

Thanks @jorik90, this is amazing. Closing the issue.
