
Routes don't inherit their parent route's grouping when "group_by: [...]" #1857

Closed
ingshtrom opened this issue Apr 24, 2019 · 5 comments · Fixed by #2154
Comments

@ingshtrom

What did you do?
I have two alerts that do not have the same labels, and when they make it to PagerDuty (via the v1 API) they show up as a single alert.

What did you expect to see?
I expected the alerts to come through separately.

What did you see instead? Under which circumstances?
The alerts seem to be merged into the same aggregation group.

Environment

  • System information:

Linux 4.4.0-109-generic x86_64 (it's the prom/alertmanager:0.16.1 Docker Image)

  • Alertmanager version:
alertmanager, version 0.16.1 (branch: HEAD, revision: 571caec278be1f0dbadfdf5effd0bbea16562cfc)
  build user:       root@3000aa3a06c5
  build date:       20190131-15:05:40
  go version:       go1.11.5
  • Prometheus version:

I don't think this is relevant to the issue, but it's 2.7.0.

  • Alertmanager configuration file:
global:
  pagerduty_url: https://events.pagerduty.com/generic/2010-04-15/create_event.json
# https://github.com/prometheus/alertmanager/blob/master/doc/examples/simple.yml
route:
  # effectively turn off grouping because we let PagerDuty do that, if we ever
  # actually want it
  group_by: ['...']

  # A default receiver
  receiver: 'infra_low'

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 5m

  # The child route trees.
  ## All alerts that do not match the following child routes
  ## will remain at the root node and be dispatched to 'infra_low'.
  routes:
    - match:
        pd_service: infra_low
      receiver: 'infra_low'

# Receivers
receivers:
  - name: infra_low
    pagerduty_configs:
    - service_key: <pd_events_v1_api_key>
  • Prometheus configuration file:
    n/a

  • Logs:
    Notice how SwarmMembers and SwarmManagers are aggregated into the same group and then sent to PagerDuty that way?

level=debug ts=2019-04-24T18:49:56.260395675Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=SwarmMembers[a5d427e][active]
level=debug ts=2019-04-24T18:49:56.291808767Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=SwarmManagers[afec9b4][active]
level=debug ts=2019-04-24T18:49:56.311212482Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=SwarmMembers[a5d427e][active]
level=debug ts=2019-04-24T18:50:23.157521362Z caller=dispatch.go:343 component=dispatcher aggrGroup="{}:{alertname=\"LogstashDown\", name=\"logstash\"}" msg=flushing alerts=[LogstashDown[d8cf207][active]]
level=debug ts=2019-04-24T18:50:23.157715593Z caller=impl.go:647 msg="Notifying PagerDuty" incident="{}:{alertname=\"LogstashDown\", name=\"logstash\"}" eventType=trigger
level=debug ts=2019-04-24T18:50:23.157521287Z caller=dispatch.go:343 component=dispatcher aggrGroup="{}:{alertname=\"InfraTestRoutingServiceDown\", pd_service=\"infra\"}" msg=flushing alerts=[InfraTestRoutingServiceDown[c1526b1][active]]
level=debug ts=2019-04-24T18:50:23.15874876Z caller=impl.go:647 msg="Notifying PagerDuty" incident="{}:{alertname=\"InfraTestRoutingServiceDown\", pd_service=\"infra\"}" eventType=trigger
level=debug ts=2019-04-24T18:50:23.630065422Z caller=dispatch.go:343 component=dispatcher aggrGroup="{}/{pd_service=\"infra_low\"}:{}" msg=flushing alerts="[SwarmManagers[afec9b4][active] SwarmMembers[a5d427e][active]]"
level=debug ts=2019-04-24T18:50:23.63041893Z caller=impl.go:647 msg="Notifying PagerDuty" incident="{}/{pd_service=\"infra_low\"}:{}" eventType=trigger
level=debug ts=2019-04-24T18:51:11.128676163Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=InfraTestRoutingServiceDown[c1526b1][active]
level=debug ts=2019-04-24T18:51:11.129535358Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=LogstashDown[d8cf207][active]
level=debug ts=2019-04-24T18:51:11.130395572Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=InfraTestRoutingServiceDown[c1526b1][active]
level=debug ts=2019-04-24T18:51:11.132252652Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=LogstashDown[d8cf207][active]
level=debug ts=2019-04-24T18:51:11.132536422Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeCpuSteal[7a886d3][resolved]
level=debug ts=2019-04-24T18:51:11.134124758Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=NodeCpuSteal[7a886d3][resolved]
level=debug ts=2019-04-24T18:51:11.235076972Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=SwarmManagers[afec9b4][active]
level=debug ts=2019-04-24T18:51:11.237976255Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=SwarmManagers[afec9b4][active]
level=debug ts=2019-04-24T18:51:11.256561706Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=SwarmMembers[a5d427e][active]
level=debug ts=2019-04-24T18:51:11.284542852Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=SwarmMembers[a5d427e][active]

Here is a screenshot of the alerts that you see in the logs:
(screenshot: Screen Shot 2019-04-24 at 2 55 49 PM)

@ingshtrom
Author

I should note, this specific case can be worked around by setting group_by: ['alertname', 'pd_service'], but we have over 100 alerts across several teams, and we'd have to go through all of them to make sure we covered every necessary label. We would also need to repeat this exercise every time a new alert was added. :(

@simonpasquier
Member

Have you tried setting group_by: ['...'] on the sub-route(s)? IIUC this happens because the GroupByAll field isn't inherited from the parent route, as we can't distinguish between the boolean value being true, false, or unset.

opts.GroupByAll = cr.GroupByAll

@ingshtrom
Author

I was not aware of that. Things have been really busy for me, so I'll try this next week and, if it works, submit a docs PR to make this clearer. Thank you for the quick response!

@ingshtrom
Author

I was able to test this out, and it works as expected when you specify group_by: ['...'] on a per-sub-route basis. It worked perfectly, and then I realized it wasn't what I wanted. :) Thank you for your help!
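For reference, the confirmed workaround amounts to repeating group_by: ['...'] explicitly on each child route. A sketch based on the configuration from the original report (the pd_service matcher and receiver names are taken from that example):

```yaml
route:
  group_by: ['...']
  receiver: 'infra_low'
  repeat_interval: 5m
  routes:
    - match:
        pd_service: infra_low
      receiver: 'infra_low'
      # Repeated explicitly: child routes do not inherit
      # group_by: ['...'] from the parent route.
      group_by: ['...']
```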

@simonpasquier simonpasquier changed the title Alerts are Aggregated when GroupByAll = true Routes don't inherit their parent route's grouping when "group_by: [...]" May 3, 2019
@simonpasquier
Member

simonpasquier commented May 3, 2019

Re-opening and updating the title as I think that sub-routes should default to their parent route's group_by in all cases.
