Toggle CloudWatch alarms actions during Data Refresh #3652
Conversation
LGTM, it's very cool that the DAG will automatically disable and re-enable the alarms!
± 1 change: the description of the `tasks` variable (lines 222 to 229), before calling `chain`, needs updating.
This is so cool -- really clever idea, and a welcome change since we investigate those false alarms weekly!
I have two blockers. First, the list of alarms being silenced: can we instead only silence the ES Prod CPU Utilization alarm? That one can 100% be ignored during the data refresh. For the response time alarms, I don't think instability should be silenced, especially since this part of the image data refresh takes two days, generally runs during the part of the week when the API is deployed, and those alarms are pretty critical.
Second, and in response to your question in the PR description: we can update this to ensure that `enable_alarms` runs even if the DAG fails during ingestion! To do that you need to add `trigger_rule=TriggerRule.ALL_DONE` to the `enable_alarms` task. This will cause that task to run as soon as all of its upstream tasks have completed, even if some of them failed.
The catch is that trigger rules only evaluate the status of a task's immediately upstream tasks, so if the tasks were chained like this:
create_filtered_index >> enable_alarms >> promote_tasks
then `promote_tasks` will try to run as long as `enable_alarms` runs, even if `create_filtered_index` (or a further upstream task) failed. We can fix that by ensuring that both are upstream of the promotion steps:
tasks.append([enable_alarms, promote_tasks])
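To make the wiring concrete, here is a minimal runnable sketch of that idea. The `EmptyOperator` placeholders, task names, and DAG id are illustrative stand-ins, not the actual data refresh tasks:

```python
# Minimal sketch only; EmptyOperator stands in for the real data refresh
# tasks, and the DAG id/schedule are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG("alarm_toggle_sketch", start_date=datetime(2024, 1, 1), schedule=None):
    disable_alarms = EmptyOperator(task_id="disable_alarms")
    ingest_records = EmptyOperator(task_id="ingest_records")
    create_filtered_index = EmptyOperator(task_id="create_filtered_index")
    # ALL_DONE lets this task run once every upstream task has finished,
    # even if some of them failed.
    enable_alarms = EmptyOperator(
        task_id="enable_alarms", trigger_rule=TriggerRule.ALL_DONE
    )
    promote_tasks = EmptyOperator(task_id="promote_tasks")

    tasks = [disable_alarms, ingest_records, create_filtered_index]
    # Appending the pair as a single chain element makes both tasks direct
    # downstreams of create_filtered_index: enable_alarms always runs, while
    # promote_tasks keeps the default trigger rule and is still blocked if
    # create_filtered_index fails.
    tasks.append([enable_alarms, promote_tasks])
    chain(*tasks)
```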
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @obulat. Excluding weekend days, this PR was ready for review 4 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s). @krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
# Finally, promote the filtered index.
-    tasks.append(promote_filtered_index)
+    tasks.append([enable_alarms, promote_filtered_index])
This also works, but it still puts `enable_alarms` in between the promotion/deletion steps for the media index and the promotion/deletion for the filtered index. I don't think that will cause any problems in terms of task execution, but it makes the flow of the DAG more confusing, and ideally we'd re-enable the alarm ASAP. I'd prefer to define this task earlier and then append it along with the first set of promote tasks on L199.
I see I chose the wrong promote task to pair it with. Gotcha, fixed.
🥳 Tested all the failure states to be sure and the alarms are always re-enabled :) LGTM!
Thanks!
@krysal We run the data refresh DAG locally quite often for testing. Given that we don't have a CloudWatch service available locally, any local runs will immediately fail at the alarm step (I tried and confirmed this locally). Do you mind adding in a mechanism, perhaps similar to our openverse/catalog/dags/common/slack.py (line 304 in ee77ef5), to skip these steps when running locally?
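For illustration only, a skip guard along those lines might look something like the sketch below. The task name, the environment-variable check, and the boto3 plumbing are hypothetical and not taken from this PR or from slack.py:

```python
# Hypothetical sketch of a "skip when not configured" guard, loosely in the
# spirit of the Slack notifier check referenced above. Names and structure
# are illustrative, not the code actually added in this PR.
import os

import boto3
from airflow.decorators import task
from airflow.exceptions import AirflowSkipException


@task
def toggle_alarm_actions(enable: bool, alarm_names: list[str]) -> None:
    """Enable or disable CloudWatch alarm actions, skipping entirely when
    no AWS credentials are configured (e.g. local test runs)."""
    if not os.getenv("AWS_ACCESS_KEY_ID"):
        # Mark the task as skipped rather than failed so that local data
        # refresh runs can continue past the alarm steps.
        raise AirflowSkipException("No AWS credentials; skipping alarm toggle.")

    client = boto3.client("cloudwatch")
    if enable:
        client.enable_alarm_actions(AlarmNames=alarm_names)
    else:
        client.disable_alarm_actions(AlarmNames=alarm_names)
```

An Airflow Variable or connection check could serve the same purpose as the environment-variable check, whichever matches the existing pattern better.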
This reverts commit ce5424f.
Fixes
Fixes #3647 by @krysal
Description
This PR adds steps for disabling and re-enabling the actions of the most sensitive alarms while a Data Refresh process is running. I made the most basic changes to achieve this, but it could be improved so that these steps are not required for the DAG execution to continue, and so that the alarm actions are always re-enabled at the end. Requesting @stacimc's review specifically for that.
We'll also need to create AWS credentials for Airflow, I believe.
Testing Instructions
Set the AWS environment variables with the credentials of your AWS account. They can be found in the ~/.aws/credentials file if you have configured the AWS CLI.
🚨 This will modify production alarm actions 🚨 (if you have access to these credentials), but since a local data refresh finishes quickly and doesn't destroy resources, I thought this was acceptable if you want to see it working.