Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Toggle CloudWatch alarms actions during Data Refresh #3652

Merged
merged 8 commits into from
Jan 18, 2024

Conversation

krysal
Copy link
Member

@krysal krysal commented Jan 11, 2024

Fixes

Fixes #3647 by @krysal

Description

This PR adds steps for disabling and re-enabling the actions of the most sensitive alarms when a Data Refresh process is running. I made the most basic changes to achieve this but maybe it can be improved to make these steps not required to continue the DAG execution and make sure that at the end the alarms actions are always re-enabled. Requesting @stacimc's review specifically for that.

We'll need to create AWS credentials for Airflow as well I believe.

Testing Instructions

Set the AWS environment variables with the credentials of your AWS account. They can be found in the ~/.aws/credentials file if you have configured the AWS CLI.

# .env file
AWS_ACCESS_KEY_ID=<your_key>
#AWS_SECRET_ACCESS_KEY=<your_secret>
#AWS_DEFAULT_REGION=us-east-1

🚨 This will modify production alarms actions 🚨 (if you have access to this credentials) but since a local data refresh finishes quickly and it's not destroying resources I thought this is acceptable if you want to see it working.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@krysal krysal added 🟧 priority: high Stalls work on the project or its dependents 🤖 aspect: dx Concerns developers' experience with the codebase 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Jan 11, 2024
@krysal krysal requested a review from a team as a code owner January 11, 2024 16:20
@krysal krysal requested review from obulat and stacimc January 11, 2024 16:20
@krysal krysal added the 🧱 stack: mgmt Related to repo management and automations label Jan 11, 2024
Copy link
Member

@dhruvkb dhruvkb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, it's very cool that the DAG will automatically disable and re-enable the alarms!

± 1 change: The description of the tasks variable (lines 222 to 229), before calling chain needs updating.

catalog/dags/common/cloudwatch.py Show resolved Hide resolved
@krysal krysal force-pushed the add/alarms-toggling-steps branch from c8dc5d8 to 3bc2faf Compare January 16, 2024 23:19
@krysal krysal requested a review from a team as a code owner January 16, 2024 23:19
Copy link
Collaborator

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so cool -- really clever idea, and a welcome change since we investigate those false alarms weekly!

I have two blockers: first, the list of alarms being silenced. Can we instead only silence the ES Prod CPU Utilization? That one can 100% be ignored during the data refresh. For the response time I don't think instability in those should be silenced, especially since this part of the image data refresh takes two days, is generally running during the part of the week where the API is deployed, and those alarms are pretty critical.

Second and in response to your question in the PR description: we can update this to ensure that enable_alarms runs even if the DAG fails during ingestion! To do that you need to add trigger_rule=TriggerRule.ALL_DONE to the enable_alarms task. This will cause that task to run as soon as all of its upstream tasks have completed, even if some of them failed.

The catch is that trigger rules only evaluate the status of a task's immediately upstream tasks, so if the tasks were chained like this:

create_filtered_index >> enable_alarms >> promote_tasks

The promote_tasks will try to run as long as enable_alarms runs, even if create_filtered_index (or a further upstream task) failed. We can fix that by ensuring that both are upstream of the promotion steps:

tasks.append([enable_alarms, promote_tasks])

catalog/dags/common/cloudwatch.py Show resolved Hide resolved
catalog/dags/common/cloudwatch.py Show resolved Hide resolved
catalog/dags/data_refresh/data_refresh_task_factory.py Outdated Show resolved Hide resolved
catalog/env.template Show resolved Hide resolved
catalog/dags/data_refresh/data_refresh_task_factory.py Outdated Show resolved Hide resolved
@openverse-bot
Copy link
Collaborator

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@obulat
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend1 days, this PR was ready for review 4 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)2.

@krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@krysal krysal force-pushed the add/alarms-toggling-steps branch from 3bc2faf to fd5e116 Compare January 18, 2024 00:58
@krysal krysal requested a review from stacimc January 18, 2024 01:28
# Finally, promote the filtered index.
tasks.append(promote_filtered_index)
tasks.append([enable_alarms, promote_filtered_index])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also works, but still puts enable_alarms in between the promotion/deletion steps for the media index, and the promotion/deletion for the filtered index. I don't think that will cause any problems in terms of task execution, but it makes the flow of the DAG more confusing, and ideally we'd re-enable the alarm ASAP. I'd prefer to define this task earlier and then append it along with the first set of promote tasks on L199.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see I chose the wrong promote task to pair it with. Gotcha, fixed.

@krysal krysal requested a review from stacimc January 18, 2024 19:20
@krysal krysal force-pushed the add/alarms-toggling-steps branch from ddef6a8 to 71db055 Compare January 18, 2024 19:21
Copy link
Collaborator

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥳 Tested all the failure states to be sure and the alarms are always re-enabled :) LGTM!

@krysal
Copy link
Member Author

krysal commented Jan 18, 2024

Thanks!

@krysal krysal merged commit ce5424f into main Jan 18, 2024
39 checks passed
@krysal krysal deleted the add/alarms-toggling-steps branch January 18, 2024 20:18
@AetherUnbound
Copy link
Collaborator

@krysal We run the data refresh DAG locally quite often for testing. Given that we don't have a Cloudwatch service available locally for testing, any local runs will immediately fail at the alarm step (I tried and confirmed this locally)
image

Do you mind adding in a mechanism, perhaps similar to our SLACK_MESSAGE_OVERRIDE, which checks the environment to make sure the step only happens in production unless a Variable is explicitly set to True?

"SLACK_MESSAGE_OVERRIDE", default_var=False, deserialize_json=True

AetherUnbound added a commit that referenced this pull request Jan 22, 2024
krysal added a commit that referenced this pull request Jan 31, 2024
krysal added a commit that referenced this pull request Feb 1, 2024
krysal added a commit that referenced this pull request Feb 1, 2024
krysal added a commit that referenced this pull request Feb 1, 2024
krysal added a commit that referenced this pull request Feb 2, 2024
krysal added a commit that referenced this pull request Feb 2, 2024
krysal added a commit that referenced this pull request Feb 7, 2024
krysal added a commit that referenced this pull request Feb 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖 aspect: dx Concerns developers' experience with the codebase 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: mgmt Related to repo management and automations
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Disable alarm notifications during ES index creation
5 participants