Enriching Info in Slack alerts for cloudwatch alarms #8666

Closed
3 tasks done
richgreen-moj opened this issue Dec 3, 2024 · 3 comments
Labels
firebreak Mod Platform skunk works

richgreen-moj commented Dec 3, 2024

User Story

As an MP Engineer
I would like to enrich the information provided in our low priority alerts channel
So that I can get more pertinent insights into the alerts being raised before having to open PagerDuty or query CloudTrail logs etc.

Value / Purpose

The idea with this one is basically to save some clicks. We receive alerts in our channels via PagerDuty, e.g. in the #modernisation-platform-low-priority-alarms channel, and the only information available at a glance is the name of the CloudWatch alarm that has been triggered.

At a minimum it would be useful to see the name/alias of the account that triggered the alert. Anything extra would be a bonus, e.g. links to a CloudWatch Logs Insights query to speed up interrogation of logs.

Context / Background

Idea for a firebreak ticket that will improve our ability to respond to alerts.

AWS Chatbot comes with some of this out of the box, but unfortunately it requires manual steps to set up, so it might not yet be viable to roll out at scale.

Useful Contacts

No response

Additional Information

No response

Definition of Done

  • Identify a solution for enriching alerts
  • PoC solution to enrich alerts for a test account (e.g. Sprinkler)
  • If deemed beneficial, raise ticket to deploy to the rest of the platform
richgreen-moj added firebreak Mod Platform skunk works needs refining labels Dec 3, 2024
richgreen-moj changed the title from "Enriching Info in Slack Alerts for cloudwatch alarms" to "Enriching Info in Slack alerts for cloudwatch alarms" Dec 3, 2024
richgreen-moj moved this from To Do to In Progress in Modernisation Platform Dec 11, 2024
richgreen-moj commented Jan 7, 2025

I've tested a solution whereby we can include some extra detail in the alerts we get in the #modernisation-platform-low-priority-alarms channel.

My branch with changes can be seen here: main...feature/8666-enriching-alerts

This does the following:

  1. Removes the PagerDuty subscription from the SNS topic, so PagerDuty is no longer triggered directly by the SNS event (the direct subscription is the current/default setup)
  2. Creates a Lambda function, with the necessary permissions, that is subscribed to the SNS topic
  3. The Lambda function is invoked when a CloudWatch alarm is triggered
  4. It parses the information in the message and extracts certain details to include in the PagerDuty event; one of its functions looks up the account alias
  5. It posts an event to PagerDuty via an API, with retries and exponential backoff in case of errors.
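
As a rough illustration of steps 4 and 5, here is a minimal sketch. It is not the code on the branch. It assumes the standard CloudWatch alarm fields in the SNS message (`AlarmName`, `AWSAccountId`, `NewStateValue`, `NewStateReason`) and the PagerDuty Events API v2 endpoint. The function names, the `warning` severity and the retry settings are all hypothetical choices made for the example.

```python
import json
import time
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def parse_alarm(sns_record):
    """Pull the CloudWatch alarm fields out of one SNS record."""
    message = json.loads(sns_record["Sns"]["Message"])
    return {
        "alarm_name": message["AlarmName"],
        "account_id": message["AWSAccountId"],
        "state": message.get("NewStateValue", ""),
        "reason": message.get("NewStateReason", ""),
    }


def build_event(alarm, routing_key, account_alias):
    """Build a PagerDuty Events API v2 trigger event (severity is an assumption)."""
    summary = f"{alarm['alarm_name']} in {account_alias} ({alarm['account_id']})"
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": alarm["account_id"],
            "severity": "warning",
            "custom_details": {"state": alarm["state"], "reason": alarm["reason"]},
        },
    }


def http_post(event):
    """POST the event as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def send_with_retries(event, post=http_post, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry on network/HTTP errors, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return post(event)
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A real handler would loop over `event["Records"]`, call `parse_alarm` on each record, and pass the result through `build_event` and `send_with_retries`. Note that this sketch retries on every HTTP error; a production version would probably want to skip retries for 4xx responses other than rate limiting.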

The results of this can be seen in this Slack alert of the admin-role-usage alarm being triggered for sprinkler. Here is the associated PagerDuty event.

The lambda function has allowed me to edit the contents of the event summary, which is the main bit of detail you see via Slack, with the account alias and the account number. This means that at a glance we can see which account the alarm relates to, which speeds up the process of any further investigation required.
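
For anyone curious how the alias lookup might work, here is a hedged sketch using boto3's `iam.list_account_aliases()`, which returns the alias of the account the credentials belong to. It assumes the Lambda runs in the same account as the alarm and has the `iam:ListAccountAliases` permission. The function names and the summary wording are hypothetical, not what is on the branch.

```python
def get_account_alias(iam_client=None):
    """Return the first alias of the current account, or None if it has none."""
    if iam_client is None:
        import boto3  # imported lazily so the function can be tested with a stub client
        iam_client = boto3.client("iam")
    aliases = iam_client.list_account_aliases().get("AccountAliases", [])
    return aliases[0] if aliases else None


def format_summary(alarm_name, account_id, alias=None):
    """Build the one-line summary shown in Slack; fall back to the account number alone."""
    where = f"{alias} ({account_id})" if alias else account_id
    return f"{alarm_name} triggered in {where}"
```

Falling back to the bare account number keeps the alert useful even if the alias lookup returns nothing.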

Previously, from the Slack alert alone, we could see only that the alarm had been triggered, but not where. With some extra clicks, e.g. clicking "View Details" or following the link into PagerDuty, we could find the account number, which we would then need to cross-reference elsewhere. An example of that can be seen here

Limitations:

  • Slack alert formatting: PagerDuty has a Common Event Format (PD-CEF, https://support.pagerduty.com/main/docs/pd-cef), a standardized alert format. This is good, but it restricts how events are displayed in Slack. Bypassing PagerDuty entirely would allow more customisation of the message formatting in Slack. So far I've only been able to customise the event summary, and it could look better.

richgreen-moj commented

After getting this working I spent some extra time trying to add functions to the script that interrogate the CloudTrail logs around the timestamp of the alarm (e.g. +/- 15 mins), to establish which user identities may have triggered the alarm and include this info in the alert.

Unfortunately I was unable to make this work, although it could be expanded on in future.
It might be difficult to include this in the alerting functionality; it may be better to provide a quick link to run the query in the console or observability platform.
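
For future reference, one possible shape for that lookup is sketched below. It is not the attempt described above. It assumes boto3's CloudTrail `lookup_events` call with a +/- 15 minute window; the function names and the choice of filtering by event name are hypothetical. One caveat: CloudTrail events can take some minutes to appear, so a lookup run the moment the alarm fires may well come back empty.

```python
from datetime import datetime, timedelta, timezone


def lookup_window(alarm_time, minutes=15):
    """Return the (start, end) datetimes surrounding the alarm timestamp."""
    delta = timedelta(minutes=minutes)
    return alarm_time - delta, alarm_time + delta


def find_identities(cloudtrail_client, alarm_time, event_name, minutes=15):
    """Return the sorted, de-duplicated usernames seen for event_name in the window."""
    start, end = lookup_window(alarm_time, minutes)
    response = cloudtrail_client.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": event_name}],
        StartTime=start,
        EndTime=end,
        MaxResults=50,  # single page only; a fuller version would paginate
    )
    events = response.get("Events", [])
    return sorted({e["Username"] for e in events if e.get("Username")})
```

The client would come from `boto3.client("cloudtrail")`, and the Lambda role would need `cloudtrail:LookupEvents`. Which event name to filter on depends on the alarm and is left to the caller.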

richgreen-moj commented

#8871 has been raised to look at deploying this into production.

Projects
Status: In Progress