Toggle CloudWatch alarms actions during Data Refresh #3652

krysal · 2024-01-11T16:20:50Z

Fixes

Description

This PR adds steps for disabling and re-enabling the actions of the most sensitive alarms when a Data Refresh process is running. I made the most basic changes to achieve this but maybe it can be improved to make these steps not required to continue the DAG execution and make sure that at the end the alarms actions are always re-enabled. Requesting @stacimc's review specifically for that.

We'll need to create AWS credentials for Airflow as well I believe.

Testing Instructions

Set the AWS environment variables with the credentials of your AWS account. They can be found in the ~/.aws/credentials file if you have configured the AWS CLI.

# .env file
AWS_ACCESS_KEY_ID=<your_key>
#AWS_SECRET_ACCESS_KEY=<your_secret>
#AWS_DEFAULT_REGION=us-east-1

🚨 This will modify production alarms actions 🚨 (if you have access to this credentials) but since a local data refresh finishes quickly and it's not destroying resources I thought this is acceptable if you want to see it working.

Checklist

My pull request has a descriptive title (not a vague title likeUpdate index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.
I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

dhruvkb

LGTM, it's very cool that the DAG will automatically disable and re-enable the alarms!

± 1 change: The description of the tasks variable (lines 222 to 229), before calling chain needs updating.

catalog/dags/common/cloudwatch.py

stacimc

This is so cool -- really clever idea, and a welcome change since we investigate those false alarms weekly!

I have two blockers: first, the list of alarms being silenced. Can we instead only silence the ES Prod CPU Utilization? That one can 100% be ignored during the data refresh. For the response time I don't think instability in those should be silenced, especially since this part of the image data refresh takes two days, is generally running during the part of the week where the API is deployed, and those alarms are pretty critical.

Second and in response to your question in the PR description: we can update this to ensure that enable_alarms runs even if the DAG fails during ingestion! To do that you need to add trigger_rule=TriggerRule.ALL_DONE to the enable_alarms task. This will cause that task to run as soon as all of its upstream tasks have completed, even if some of them failed.

The catch is that trigger rules only evaluate the status of a task's immediately upstream tasks, so if the tasks were chained like this:

create_filtered_index >> enable_alarms >> promote_tasks

The promote_tasks will try to run as long as enable_alarms runs, even if create_filtered_index (or a further upstream task) failed. We can fix that by ensuring that both are upstream of the promotion steps:

tasks.append([enable_alarms, promote_tasks])

catalog/dags/common/cloudwatch.py

catalog/dags/data_refresh/data_refresh_task_factory.py

catalog/env.template

catalog/dags/data_refresh/data_refresh_task_factory.py

openverse-bot · 2024-01-18T00:00:14Z

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@obulat
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend¹ days, this PR was ready for review 4 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)².

@krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Specifically, Saturday and Sunday. ↩
For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range. ↩

stacimc · 2024-01-18T18:18:40Z

catalog/dags/data_refresh/data_refresh_task_factory.py

        # Finally, promote the filtered index.
-        tasks.append(promote_filtered_index)
+        tasks.append([enable_alarms, promote_filtered_index])


This also works, but still puts enable_alarms in between the promotion/deletion steps for the media index, and the promotion/deletion for the filtered index. I don't think that will cause any problems in terms of task execution, but it makes the flow of the DAG more confusing, and ideally we'd re-enable the alarm ASAP. I'd prefer to define this task earlier and then append it along with the first set of promote tasks on L199.

I see I chose the wrong promote task to pair it with. Gotcha, fixed.

stacimc

🥳 Tested all the failure states to be sure and the alarms are always re-enabled :) LGTM!

krysal · 2024-01-18T20:18:01Z

Thanks!

AetherUnbound · 2024-01-18T23:58:19Z

@krysal We run the data refresh DAG locally quite often for testing. Given that we don't have a Cloudwatch service available locally for testing, any local runs will immediately fail at the alarm step (I tried and confirmed this locally)

Do you mind adding in a mechanism, perhaps similar to our SLACK_MESSAGE_OVERRIDE, which checks the environment to make sure the step only happens in production unless a Variable is explicitly set to True?

openverse/catalog/dags/common/slack.py

Line 304 in ee77ef5

"SLACK_MESSAGE_OVERRIDE", default_var=False, deserialize_json=True

This reverts commit ce5424f.

…#3689) This reverts commit ce5424f.

krysal assigned stacimc Jan 11, 2024

krysal requested a review from a team as a code owner January 11, 2024 16:20

krysal requested review from obulat and stacimc January 11, 2024 16:20

krysal added the 🧱 stack: mgmt Related to repo management and automations label Jan 11, 2024

krysal unassigned stacimc Jan 11, 2024

dhruvkb approved these changes Jan 16, 2024

View reviewed changes

catalog/dags/common/cloudwatch.py Show resolved Hide resolved

krysal force-pushed the add/alarms-toggling-steps branch from c8dc5d8 to 3bc2faf Compare January 16, 2024 23:19

krysal requested a review from a team as a code owner January 16, 2024 23:19

stacimc requested changes Jan 17, 2024

View reviewed changes

krysal force-pushed the add/alarms-toggling-steps branch from 3bc2faf to fd5e116 Compare January 18, 2024 00:58

krysal requested a review from stacimc January 18, 2024 01:28

stacimc reviewed Jan 18, 2024

View reviewed changes

krysal requested a review from stacimc January 18, 2024 19:20

krysal added 8 commits January 18, 2024 15:21

Disable and enable CloudWatch alarm during Data Refresh

644d78f

Update AWS variables in env.template

83ef9fe

Silence only ES CPU utilization alarm

36669a8

Re-enable alarms after deleting old index

a96c8b0

Always re-enable CloudWatch alarms

d2b8dab

Update environment variables in tests

ff05b45

Expand tasks comments to highlight enabling the alarms

58589c3

Move enable_alarms task earlier in the Data Refresh

71db055

krysal force-pushed the add/alarms-toggling-steps branch from ddef6a8 to 71db055 Compare January 18, 2024 19:21

stacimc approved these changes Jan 18, 2024

View reviewed changes

krysal merged commit ce5424f into main Jan 18, 2024
39 checks passed

krysal deleted the add/alarms-toggling-steps branch January 18, 2024 20:18

This was referenced Jan 19, 2024

Add check to skip toggling CloudWatch alarms #3682

Merged

ECS Alarms for anomalous behavior #2344

Closed

AetherUnbound added a commit that referenced this pull request Jan 22, 2024

Revert "Toggle CloudWatch alarms actions during Data Refresh (#3652)"

ee53e52

This reverts commit ce5424f.

AetherUnbound added a commit that referenced this pull request Jan 22, 2024

Revert "Toggle CloudWatch alarms actions during Data Refresh (#3652)" (…

9a85e05

…#3689) This reverts commit ce5424f.

krysal added a commit that referenced this pull request Jan 31, 2024

Re-apply changes of #3652

a3fbd45

krysal added a commit that referenced this pull request Feb 1, 2024

Re-apply changes of #3652

671d270

krysal added a commit that referenced this pull request Feb 1, 2024

Re-apply changes of #3652

4efa9a0

krysal added a commit that referenced this pull request Feb 1, 2024

Re-apply changes of #3652

afcd707

krysal added a commit that referenced this pull request Feb 2, 2024

Re-apply changes of #3652

5305cee

krysal added a commit that referenced this pull request Feb 2, 2024

Re-apply changes of #3652

94cf1d6

krysal added a commit that referenced this pull request Feb 7, 2024

Re-apply changes of #3652

a8ffc17

krysal added a commit that referenced this pull request Feb 8, 2024

Re-apply changes of #3652

554b4ab

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Toggle CloudWatch alarms actions during Data Refresh #3652

Toggle CloudWatch alarms actions during Data Refresh #3652

krysal commented Jan 11, 2024

dhruvkb left a comment

stacimc left a comment

openverse-bot commented Jan 18, 2024

stacimc Jan 18, 2024

krysal Jan 18, 2024

stacimc left a comment

krysal commented Jan 18, 2024

AetherUnbound commented Jan 18, 2024

Toggle CloudWatch alarms actions during Data Refresh #3652

Toggle CloudWatch alarms actions during Data Refresh #3652

Conversation

krysal commented Jan 11, 2024

Fixes

Description

Testing Instructions

Checklist

Developer Certificate of Origin

dhruvkb left a comment

Choose a reason for hiding this comment

stacimc left a comment

Choose a reason for hiding this comment

openverse-bot commented Jan 18, 2024

Footnotes

stacimc Jan 18, 2024

Choose a reason for hiding this comment

krysal Jan 18, 2024

Choose a reason for hiding this comment

stacimc left a comment

Choose a reason for hiding this comment

krysal commented Jan 18, 2024

AetherUnbound commented Jan 18, 2024