
Prometheus AlertManager wrapper APIs

Alan Malta Rodrigues edited this page Mar 17, 2021 · 3 revisions

This page documents the Prometheus AlertManager service and how WMCore wraps some of its APIs in https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/AlertManager/AlertManagerAPI.py . It is based on the gdoc documentation created by Erik.

Background

The email alerting we typically use in WMCore (sendmail via localhost) does not work in Kubernetes pods; see https://github.com/dmwm/WMCore/issues/10234 . Since we do not want to put sendmail in each of our pods, we looked into alternatives for alerting. We plan to switch to the MONIT/Prometheus AlertManager API to handle alerts that require notification via email or Slack, or that need to be viewed in a dashboard. This service is supported by CMS Monitoring, and many other services have already adopted it.

AlertManager

AlertManager is highly flexible: alerts are sent with an alertname and a set of labels that determine how the alert is routed. The alertname, or a combination of alertname and labels, uniquely identifies an alert. For example, if ms-transferor needs to alert on two transfers, the two alerts need either unique names or the same name with a unique set of labels; otherwise, the second alert will overwrite the first. Alerts can also be sent with annotations that provide additional information about the alert. Annotations do not uniquely identify an alert.
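The identity rule above can be illustrated with a small in-memory model. This is only a sketch of the behaviour described here, not AlertManager's actual implementation: the `alert_key` and `post_alert` helpers, the `store` dictionary and the `workflow` label are all hypothetical names chosen for the example.

```python
def alert_key(alert):
    """Identity of an alert in this toy model: its full label set (alertname included)."""
    return frozenset(alert["labels"].items())


def post_alert(store, alert):
    """Store the alert; an alert with the same label set replaces the previous one."""
    store[alert_key(alert)] = alert


store = {}  # hypothetical stand-in for the alerts held by the service

first = {
    "labels": {"alertname": "transfer_failed", "service": "ms-transferor"},
    "annotations": {"summary": "transfer A failed"},
}
# Same labels, different annotations: annotations are not part of the identity.
second = {
    "labels": {"alertname": "transfer_failed", "service": "ms-transferor"},
    "annotations": {"summary": "transfer B failed"},
}
# Same alertname plus an extra (hypothetical) label: a distinct alert.
third = {
    "labels": {"alertname": "transfer_failed", "service": "ms-transferor",
               "workflow": "wf_B"},
    "annotations": {"summary": "transfer B failed"},
}

post_alert(store, first)
post_alert(store, second)   # overwrites `first`
count_after_two = len(store)
post_alert(store, third)    # kept alongside, because its label set differs
count_after_three = len(store)
```

In this model the second alert replaces the first because only the annotations differ, while the third is kept as a separate alert.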

The routing and type of notifications (email, Slack, Grafana dashboard, etc.) have to be configured by CMS Monitoring, and a JIRA ticket needs to be opened for such requests.

Alerts can be seen at: https://cms-monitoring.cern.ch/alertmanager/#/alerts and further information can be found at: https://prometheus.io/docs/alerting/latest/alertmanager/

WMCore alert structure

Alerts can have a fairly complex structure. This section describes the fields and how WMCore uses them (and how they have been implemented in our service wrapper). An example is as follows:

{
    "annotations": {
        "hostname": "esg-dmwm-dev1.cern.ch",
        "summary": "[MSOutput] Campaign test not found in central CouchDB",
        "description": "Dataset: test cannot have an output transfer rule because its campaign: test cannot be found in central CouchDB."
    },
    "labels": {
        "tag": "wmcore",
        "service": "ms-output",
        "alertname": "a_random_workflow_name",
        "severity": "high"
    },
    "endsAt": "2021-03-05T17:57:47.256281+01:00",
    "generatorURL": "https://cmsweb.cern.ch"
}

where:

  • annotations.hostname: the host triggering this alert (filled in automatically when an instance of the AlertManagerAPI is created).
  • annotations.summary: a very short summary of the problem (meant to be the email title, if email notification has been set up).
  • annotations.description: a longer description of the problem (meant to be the body of the email, if email notification has been set up).
  • labels.tag: wmcore by default, so that all these alerts can be grouped and routed properly.
  • labels.service: the name of the service triggering this alert (workqueue, reqmgr2, ms-transferor, etc.). We will need to add some validation/sanitization in the future.
  • labels.alertname: will likely be a workflow name (if the alert concerns a workflow failure) or a service name (if it concerns a general problem with a component/service).
  • labels.severity: one of ["high", "medium", "low"]. For WMCore usage, medium severity triggers a Slack notification only, while high triggers both Slack and email notifications.

From the AlertManagerAPI implementation, the following parameters are mandatory when sending an alert: alertName, severity, summary, description, service.
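As a rough illustration of the structure and the mandatory parameters listed above, the sketch below builds an alert dictionary with the same shape as the example. It is not the WMCore AlertManagerAPI code: the function name `build_alert`, the `ends_in_secs` and `hostname` parameters, the 600-second default and the validation rules are assumptions made for this example; the `generatorURL` default is simply copied from the example above.

```python
import json
import socket
from datetime import datetime, timedelta, timezone

VALID_SEVERITIES = ("high", "medium", "low")


def build_alert(alert_name, severity, summary, description, service,
                tag="wmcore", ends_in_secs=600,
                generator_url="https://cmsweb.cern.ch", hostname=None):
    """Return an alert dict shaped like the example on this page (hypothetical helper)."""
    mandatory = {"alertName": alert_name, "severity": severity,
                 "summary": summary, "description": description,
                 "service": service}
    missing = [name for name, value in mandatory.items() if not value]
    if missing:
        raise ValueError("missing mandatory parameters: %s" % ", ".join(missing))
    if severity not in VALID_SEVERITIES:
        raise ValueError("severity must be one of %s" % (VALID_SEVERITIES,))

    # Assumption: the alert expires `ends_in_secs` seconds from now (UTC).
    ends_at = datetime.now(timezone.utc) + timedelta(seconds=ends_in_secs)
    return {
        "annotations": {
            "hostname": hostname or socket.gethostname(),
            "summary": summary,
            "description": description,
        },
        "labels": {
            "tag": tag,
            "service": service,
            "alertname": alert_name,
            "severity": severity,
        },
        "endsAt": ends_at.isoformat(),
        "generatorURL": generator_url,
    }


alert = build_alert(
    alert_name="a_random_workflow_name",
    severity="high",
    summary="[MSOutput] Campaign test not found in central CouchDB",
    description="Dataset: test cannot have an output transfer rule because "
                "its campaign: test cannot be found in central CouchDB.",
    service="ms-output",
)
payload = json.dumps([alert], indent=4)  # serialized here as a list of alerts
```

Rejecting an unknown severity or an empty mandatory field before sending keeps malformed alerts from reaching the routing configuration, where they might silently match no receiver.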
