Health monitoring and alerting for the Airflow scheduler #2335

zackkrida · 2023-06-06T16:22:28Z

Problem

Currently, when Airflow encounters issues, we only become aware of them by happenstance: if someone peeks at the Airflow UI or a DAG reports an error. Recent issues we didn't detect include:

the Airflow scheduler failed
the DB connection to Postgres was lost after a DB reset

Description

Implement a health check for the Airflow scheduler which sends an alert in AWS. Also look into the current state of Airflow logs in CloudWatch.

Ping /health on the webserver box and send a Slack ping to the alerts channel if any component is unhealthy. Add a cron job alongside the dag-sync script to run this check every 30 seconds.

AetherUnbound · 2023-06-06T20:30:10Z

I know the webserver monitors a heartbeat on the scheduler, and I wonder if that's something we could tap into. Alternatively, we could have the scheduler exit and have the container restart automatically. I'm not sure why the DB shutdown didn't cause the scheduler to stop (the container was still running recently when it encountered the "database is shutting down" exception).
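As a rough illustration of tapping into that heartbeat: the /health response includes a `latest_scheduler_heartbeat` timestamp for the scheduler, so a check could compute its age directly. This is only a sketch; the function name and the 60-second default are hypothetical, not something from this thread.

```python
from datetime import datetime, timedelta, timezone


def scheduler_heartbeat_is_stale(payload, max_age=timedelta(seconds=60), now=None):
    """Return True if the scheduler heartbeat is missing or older than max_age.

    `payload` is the parsed JSON body of Airflow's /health endpoint.
    The 60-second default is an illustrative assumption.
    """
    heartbeat = payload.get("scheduler", {}).get("latest_scheduler_heartbeat")
    if not heartbeat:
        # No heartbeat recorded at all: treat as stale.
        return True
    last_seen = datetime.fromisoformat(heartbeat)
    if last_seen.tzinfo is None:
        # Assumption: a timestamp without an offset is in UTC.
        last_seen = last_seen.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - last_seen > max_age
```

Passing `now` explicitly makes the function easy to test; in normal use it can be omitted.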

sarayourfriend · 2023-06-06T23:17:43Z

Here are the airflow docs on health monitoring: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html

We can't access Airflow over HTTP from outside our VPC, so we cannot rely on, for example, UptimeRobot to make requests to /health.

We could, however, add a cron job to the webserver box that calls the /health endpoint and sends a Slack ping (or something) if any of the status fields are unhealthy. If we worked on https://github.com/WordPress/openverse-infrastructure/issues/482 to enable tunnelling through Cloudflare Access so that we could make HTTP requests to Airflow, we could implement the job outside the box, in a GitHub cron action, for example, and write our own mini-uptime HTTP check. Theoretically this could be more stable/reliable than the box reporting its own health, but I don't think that's necessary, because we do run Airflow in Docker, so it isn't likely for the entire EC2 instance to crash.
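A minimal sketch of what such a cron-driven check might look like, using only the Python standard library. The webserver address, the webhook URL, and all function names here are assumptions for illustration; the only parts taken from Airflow itself are the /health endpoint and its per-component `status` field.

```python
import json
import urllib.request

# Assumptions (not from this thread): the webserver listens on localhost:8080
# and SLACK_WEBHOOK_URL is a Slack incoming-webhook URL supplied by the operator.
HEALTH_URL = "http://localhost:8080/health"
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/REPLACE/ME"


def unhealthy_components(payload):
    """Return the sorted names of components whose status is not "healthy".

    Components reporting a null status are skipped, on the assumption that
    they are simply not deployed.
    """
    problems = []
    for name, info in payload.items():
        status = info.get("status") if isinstance(info, dict) else None
        if status is not None and status != "healthy":
            problems.append(name)
    return sorted(problems)


def fetch_health(url=HEALTH_URL, timeout=10):
    """Fetch and parse the JSON body of the /health endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def notify_slack(text, webhook_url=SLACK_WEBHOOK_URL):
    """Post a plain-text message to a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=10).close()


def main():
    """Entry point a cron job would call."""
    try:
        problems = unhealthy_components(fetch_health())
    except (OSError, ValueError) as error:
        # Connection failures raise OSError subclasses; bad JSON raises ValueError.
        notify_slack(f"Airflow /health check failed: {error}")
        return
    if problems:
        notify_slack("Airflow components unhealthy: " + ", ".join(problems))
```

Note that cron schedules at one-minute granularity, so a 30-second interval would need something extra, such as two entries with one delayed by `sleep 30`. A real job would probably also want to avoid re-sending the same alert on every run.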

sarayourfriend · 2023-06-06T23:22:16Z

To leverage AWS monitoring tools for this we'd need to put Airflow boxes into an ASG or target group + LB. Individual EC2 instances do not have "health checks" in the same way as those meta-resources, as far as I can tell.

AetherUnbound · 2023-06-08T08:35:51Z

I think a simple Slack ping makes sense! We do something similar for the dag-sync on that box:

https://github.com/WordPress/openverse-infrastructure/blob/5bb2a1d9046a9734e66ec33cb734abdac1cc0503/modules/services/catalog-airflow/init.tpl#L130

AetherUnbound · 2023-07-17T18:54:27Z

The scheduler went down again recently because the upstream database restarted during the maintenance window. What confuses & frustrates me is that the scheduler clearly failed, but the container was still running and hadn't exited (which would have restarted it and thus meant that the scheduler came back online). Perhaps we can look into why that's happening as well as part of this effort.

Edit: I'm going to make a separate issue for that actually and investigate it.

AetherUnbound · 2023-12-26T21:01:48Z

> To leverage AWS monitoring tools for this we'd need to put Airflow boxes into an ASG or target group + LB.

Looking at this again, it appears that we do have Airflow behind a target group + LB: https://github.com/WordPress/openverse-infrastructure/blob/754fc882b93c41c4085f668880f197d7c89bc893/modules/services/catalog-airflow/load-balancer.tf#L47-L74

We also have the unhealthy host count alarm which could be leveraged here too - this says it's for ECS, but the only piece that's ECS-specific appears to be the log link.

It might be possible to hook this up now, but it seems advisable to wait until #2037 is complete. That project may alter the way Airflow is defined in Terraform, so it might be ideal to let the dust from that settle rather than adding an alarm for Airflow before it can be moved to next/.

sarayourfriend · 2024-01-02T20:31:21Z

Let's use the unhealthy host count alarm and expand it to include EC2-only services as you suggested 👍
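For reference, here is a hedged sketch of the shape such an alarm takes, written as the parameters for CloudWatch's `put_metric_alarm`. In this repository the alarm would presumably be defined in Terraform instead; the Python is only to show the metric and dimensions involved. It assumes the load balancer is an Application Load Balancer (hence the `AWS/ApplicationELB` namespace), and the alarm name, period, evaluation periods, and function names are all illustrative choices.

```python
def unhealthy_host_alarm_params(
    service, target_group_suffix, load_balancer_suffix, sns_topic_arn
):
    """Build keyword arguments for CloudWatch's put_metric_alarm.

    The suffix arguments are the trailing parts of the resource ARNs, e.g.
    "targetgroup/<name>/<id>" and "app/<name>/<id>".
    """
    return {
        "AlarmName": f"{service}-unhealthy-host-count",  # hypothetical naming
        "Namespace": "AWS/ApplicationELB",  # assumes an Application Load Balancer
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [
            {"Name": "TargetGroup", "Value": target_group_suffix},
            {"Name": "LoadBalancer", "Value": load_balancer_suffix},
        ],
        "Statistic": "Maximum",
        "Period": 60,  # seconds; illustrative
        "EvaluationPeriods": 3,  # illustrative
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Create or update the alarm; requires boto3 and AWS credentials."""
    import boto3  # third-party; imported lazily so the builder has no dependencies

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Nothing in the parameters is ECS-specific, which matches the observation above that only the log link in the existing alarm is tied to ECS.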

github-project-automation bot added this to Openverse Backlog Jun 6, 2023

github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Jun 6, 2023

AetherUnbound self-assigned this Jun 27, 2023

sarayourfriend moved this from 📋 Backlog to 📅 To do in Openverse Backlog Jun 30, 2023

AetherUnbound mentioned this issue Jul 17, 2023

Airflow scheduler will crash when connection to the database drops, but container will not stop #2661

Closed

AetherUnbound added 🟨 priority: medium Not blocking but should be addressed soon and removed 🟧 priority: high Stalls work on the project or its dependents labels Oct 4, 2023

AetherUnbound moved this from 📅 To do to 📋 Backlog in Openverse Backlog Oct 4, 2023

AetherUnbound changed the title ~~Implement health monitoring and alerting for the Airflow scheduler~~ Health monitoring and alerting for the Airflow scheduler Dec 26, 2023
