Health monitoring and alerting for the Airflow scheduler #2335

Open
zackkrida opened this issue Jun 6, 2023 · 7 comments
Assignees
Labels
🤖 aspect: dx Concerns developers' experience with the codebase ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: mgmt Related to repo management and automations

Comments

@zackkrida
Member

zackkrida commented Jun 6, 2023

Problem

Currently, when Airflow encounters issues, we only become aware of them by happenstance: if someone peeks at the Airflow UI or a DAG reports an error. Recent issues we didn't detect include:

  • the Airflow scheduler failed
  • the DB connection to Postgres was lost after a DB reset

Description

Implement a healthcheck for the Airflow scheduler that sends an alert in AWS. Also look into the current state of Airflow logs in CloudWatch.

Ping /health on the webserver box and send a Slack ping to the alerts channel if any component is unhealthy. Add a cron job alongside the dag-sync script to run this check every 30 seconds.
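
A minimal sketch of what such a check could look like, assuming the /health endpoint returns a JSON object with one entry per component, each carrying a `status` field (as described in the Airflow health-check docs), and assuming a Slack incoming webhook is available. The URLs, function names, and message wording are placeholders for illustration, not anything defined in this repository.

```python
import json
import urllib.request

# Assumptions: both URLs are placeholders, not values from this project.
HEALTH_URL = "http://localhost:8080/health"
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/REPLACE/ME"


def unhealthy_components(health: dict) -> list:
    """Return the sorted names of components whose status is not "healthy".

    Assumption: a null status means the component is not deployed, so it
    is skipped rather than reported.
    """
    failing = []
    for name, info in health.items():
        status = info.get("status") if isinstance(info, dict) else None
        if status is None:
            continue
        if status != "healthy":
            failing.append(name)
    return sorted(failing)


def build_slack_payload(components: list) -> dict:
    """Build the JSON body for a Slack incoming-webhook message."""
    return {"text": "Airflow health check failed: " + ", ".join(components)}


def fetch_health(url: str = HEALTH_URL, timeout: float = 10.0) -> dict:
    """Fetch and decode the /health response."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def post_to_slack(payload: dict, webhook_url: str = SLACK_WEBHOOK_URL) -> None:
    """POST the payload to the webhook as JSON."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10.0).close()


def main() -> None:
    """Entry point for the cron job: alert if anything looks unhealthy."""
    try:
        failing = unhealthy_components(fetch_health())
    except (OSError, ValueError) as error:
        # An unreachable webserver or unparseable body is also worth a ping.
        failing = [f"webserver unreachable ({error})"]
    if failing:
        post_to_slack(build_slack_payload(failing))
```

Standard cron schedules at one-minute granularity, so a 30-second interval would need something like two entries, one of them prefixed with `sleep 30`, each calling `main()`.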

@zackkrida zackkrida added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 🤖 aspect: dx Concerns developers' experience with the codebase 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: mgmt Related to repo management and automations labels Jun 6, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Jun 6, 2023
@AetherUnbound
Collaborator

I know the webserver monitors a heartbeat on the scheduler, and I wonder if that's something we could tap into. An alternative would be to have the scheduler exit so that the container restarts automatically. I'm not sure why the DB shutdown didn't cause the scheduler to stop (the container was still running recently when it encountered the "database is shutting down" exception).
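
One way to tap into that heartbeat, sketched under the assumption that the /health payload exposes it as an ISO 8601 timestamp under `scheduler.latest_scheduler_heartbeat`. The threshold and function name are hypothetical; a wrapper script could turn the result into a non-zero exit code for whatever supervises the container.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; tune it to the scheduler's heartbeat interval.
MAX_HEARTBEAT_AGE = timedelta(seconds=60)


def scheduler_is_stale(
    health: dict,
    now: datetime,
    max_age: timedelta = MAX_HEARTBEAT_AGE,
) -> bool:
    """Return True if the last scheduler heartbeat is missing or too old."""
    heartbeat = health.get("scheduler", {}).get("latest_scheduler_heartbeat")
    if not heartbeat:
        # No heartbeat recorded at all counts as stale.
        return True
    last_seen = datetime.fromisoformat(heartbeat)
    if last_seen.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        last_seen = last_seen.replace(tzinfo=timezone.utc)
    return now - last_seen > max_age
```

This catches the case described above, where the container is still running but the scheduler inside it has stopped making progress.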

@sarayourfriend
Collaborator

Here are the airflow docs on health monitoring: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html

We can't access Airflow over HTTP from outside our VPC, so we cannot rely on, for example, UptimeRobot to make requests to /health.

We could, however, add a cron job to the webserver box that calls the /health endpoint and sends a Slack ping (or something) if any of the status fields are unhealthy. If we worked on https://github.com/WordPress/openverse-infrastructure/issues/482 to enable tunnelling through Cloudflare Access so that we could make HTTP requests to Airflow, we could implement the job outside the box, in a GitHub cron action, for example, and write our own mini-uptime HTTP check. Theoretically this could be more stable/reliable than the box reporting its own health, but I don't think that's necessary, because we do run Airflow in Docker, so it isn't likely for the entire EC2 instance to crash.

@sarayourfriend
Collaborator

To leverage AWS monitoring tools for this we'd need to put Airflow boxes into an ASG or target group + LB. Individual EC2 instances do not have "health checks" in the same way as those meta-resources, as far as I can tell.

@AetherUnbound
Collaborator

I think a simple Slack ping makes sense! We do something similar for the dag-sync on that box:

https://github.com/WordPress/openverse-infrastructure/blob/5bb2a1d9046a9734e66ec33cb734abdac1cc0503/modules/services/catalog-airflow/init.tpl#L130

@AetherUnbound AetherUnbound self-assigned this Jun 27, 2023
@sarayourfriend sarayourfriend moved this from 📋 Backlog to 📅 To do in Openverse Backlog Jun 30, 2023
@AetherUnbound
Collaborator

AetherUnbound commented Jul 17, 2023

The scheduler went down again recently because the upstream database restarted during the maintenance window. What confuses & frustrates me is that the scheduler clearly failed, but the container was still running and hadn't exited (which would have restarted it and thus meant that the scheduler came back online). Perhaps we can look into why that's happening as well as part of this effort.

Edit: I'm going to make a separate issue for that actually and investigate it.

@AetherUnbound AetherUnbound added 🟨 priority: medium Not blocking but should be addressed soon and removed 🟧 priority: high Stalls work on the project or its dependents labels Oct 4, 2023
@AetherUnbound AetherUnbound moved this from 📅 To do to 📋 Backlog in Openverse Backlog Oct 4, 2023
@AetherUnbound AetherUnbound changed the title from "Implement health monitoring and alerting for the Airflow scheduler" to "Health monitoring and alerting for the Airflow scheduler" Dec 26, 2023
@AetherUnbound
Collaborator

> To leverage AWS monitoring tools for this we'd need to put Airflow boxes into an ASG or target group + LB.

Looking at this again, it appears that we do have Airflow behind a target group + LB: https://github.com/WordPress/openverse-infrastructure/blob/754fc882b93c41c4085f668880f197d7c89bc893/modules/services/catalog-airflow/load-balancer.tf#L47-L74

We also have the unhealthy host count alarm, which could be leveraged here too. It says it's for ECS, but the only ECS-specific piece appears to be the log link.
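
For illustration only, here is roughly what an unhealthy host count alarm amounts to, expressed as the parameters for CloudWatch's `put_metric_alarm` via boto3. The actual alarm lives in the Terraform linked above; this sketch assumes an Application Load Balancer (hence the `AWS/ApplicationELB` namespace), and the function name, alarm name, period, and threshold are hypothetical.

```python
def unhealthy_host_alarm_params(
    service_name: str,
    target_group_suffix: str,
    load_balancer_suffix: str,
    alarm_topic_arn: str,
) -> dict:
    """Build put_metric_alarm keyword arguments for an unhealthy-host alarm.

    Nothing here is ECS-specific: the metric is keyed only on the target
    group and load balancer, so an EC2-only service fits the same shape.
    """
    return {
        "AlarmName": f"{service_name}-unhealthy-host-count",
        "Namespace": "AWS/ApplicationELB",  # assumption: ALB, not NLB
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [
            {"Name": "TargetGroup", "Value": target_group_suffix},
            {"Name": "LoadBalancer", "Value": load_balancer_suffix},
        ],
        "Statistic": "Maximum",
        "Period": 60,  # seconds; hypothetical value
        "EvaluationPeriods": 2,  # hypothetical value
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [alarm_topic_arn],
    }


def create_alarm(params: dict) -> None:
    """Create or update the alarm; requires boto3 and AWS credentials."""
    import boto3  # third-party, imported lazily so the builder runs without it

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

The alarm only fires if the target group's health check actually reflects scheduler health, so the health check path and success criteria on the target group matter as much as the alarm itself.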

It might be possible to hook this up now, but it seems advisable to wait until #2037 is complete. That project may alter the way Airflow is defined in Terraform, so it might be better to let the dust from that settle rather than add an alarm for Airflow before it can be moved to next/.

@sarayourfriend
Collaborator

Let's use the unhealthy host count alarm and expand it to include EC2-only services as you suggested 👍

Projects
Status: 📋 Backlog