Better handling of event consumer lag alerts in Datadog #538

Closed
3 tasks done
robrap opened this issue Jan 24, 2024 · 3 comments

@robrap
Contributor

robrap commented Jan 24, 2024

The consumer lag alerts from Datadog are all going to arch-bom. Unlike the New Relic alerts, they have not been split by team and consumer, with safety-net alerts as a backstop.

Acceptance Criteria:

  • A new example monitor of a consumer lag issue that is filtered by the consumer_group_id
    • Docs currently discuss splitting by topic, but we should probably split by consumer_group_id
  • A catch-all monitor that will alert us when any particular topic falls behind for 2 hours
    • We can update the existing monitor to behave this way.
  • Update docs with new approach
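
As a rough illustration of the two monitors described above, the sketch below builds Datadog-style monitor definitions as plain Python dicts. Everything specific in it is an assumption rather than the team's actual configuration: the metric name `confluent_cloud.kafka.consumer_lag_offsets`, the consumer group value, the thresholds, the shorter evaluation window, and the notification handles are placeholders. The `min(last_2h)` form is used for the catch-all because it only exceeds the threshold when the lag stayed above it for the whole window.

```python
# Assumed metric name; check what the Confluent integration actually reports.
LAG_METRIC = "confluent_cloud.kafka.consumer_lag_offsets"


def lag_query(window, scope, group_by, threshold, metric=LAG_METRIC):
    """Build a Datadog metric-monitor query string for consumer lag.

    min(<window>) is above the threshold only if every point in the
    window is, i.e. the lag stayed high for the entire window.
    """
    return f"min({window}):avg:{metric}{{{scope}}} by {{{group_by}}} > {threshold}"


def monitor(name, query, message):
    """Return a dict shaped like a Datadog metric-alert monitor definition."""
    return {"name": name, "type": "metric alert", "query": query, "message": message}


# Team-owned monitor, filtered by consumer_group_id (placeholder values).
team_monitor = monitor(
    "Consumer lag: edxapp consumers",
    lag_query("last_15m", "consumer_group_id:edxapp-placeholder", "topic", 1000),
    "Lag is high on {{topic.name}}. @hypothetical-owning-team",
)

# Catch-all: any topic that has been behind for the full 2 hours.
catch_all_monitor = monitor(
    "Consumer lag: catch-all (2 hours)",
    lag_query("last_2h", "*", "topic", 0),
    "{{topic.name}} has been behind for 2 hours. @hypothetical-arch-bom",
)

print(team_monitor["query"])
print(catch_all_monitor["query"])
```

Keeping the query construction in one helper means the team-specific and catch-all monitors differ only in scope, window, and threshold, which makes the safety-net relationship between them easier to see.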

Note:

@dianakhuang
Member

New alert split on consumer group id for edxapp: https://app.datadoghq.com/monitors/139126405?view=spans

@dianakhuang
Member

We believe that the existing monitor already works as a catch-all. It didn't trigger on a particular error because Confluent wasn't sending us the data needed to trigger the alert.

@dianakhuang dianakhuang moved this from In Progress to Done in Arch-BOM Feb 8, 2024
@dianakhuang dianakhuang moved this from Done to In Progress in Arch-BOM Feb 9, 2024
@dianakhuang
Member

dianakhuang commented Feb 9, 2024

  • Create a new opsgenie integration to alert Phoenix on the course discovery lag monitor
  • Update documentation to have examples of both topic-based and consumer group alerting (edx-platform vs. separate service)
  • Update runbooks so that they match the actual alerts.
  • Add a warning box noting that this approach may change as we understand Datadog better.
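
One way the per-team routing in the tasks above could look is a single monitor message that uses Datadog's `{{#is_match}}` template conditionals to notify a different handle per consumer group. This is only a sketch: the consumer group names and the `@opsgenie-...` handles below are hypothetical placeholders, and the real integration names would come from the Opsgenie setup.

```python
def routed_message(routes, fallback_handle):
    """Build a monitor message that notifies one handle per consumer group.

    routes maps a consumer_group_id value to a notification handle.
    fallback_handle sits outside any conditional, so it is always notified.
    """
    lines = ["Consumer lag alert for {{consumer_group_id.name}}."]
    for group, handle in routes.items():
        # Quadruple braces in the f-string emit literal '{{' and '}}'.
        lines.append(
            f'{{{{#is_match "consumer_group_id.name" "{group}"}}}} {handle} {{{{/is_match}}}}'
        )
    lines.append(fallback_handle)
    return "\n".join(lines)


# Hypothetical group names and handles, for illustration only.
message = routed_message(
    {"course-discovery-placeholder": "@opsgenie-phoenix-placeholder"},
    "@opsgenie-arch-bom-placeholder",
)
print(message)
```

This assumes the monitor is grouped by `consumer_group_id`, so that the `{{consumer_group_id.name}}` template variable is available in the message.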

@dianakhuang dianakhuang moved this from In Progress to Done in Arch-BOM Feb 12, 2024
@jristau1984 jristau1984 moved this from Done to Done - Long Term Storage in Arch-BOM Sep 30, 2024