-
Notifications
You must be signed in to change notification settings - Fork 219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Health monitoring and alerting for the Airflow scheduler #2335
Comments
I know the webserver monitors a heartbeat on the scheduler, I have to wonder if that's something we could tap into. Additionally, an alternative scenario would be to have the scheduler exit and have the container restart automatically. I'm not sure why the DB shutdown would not cause the scheduler to stop (the container was still running recently when it encountered the |
Here are the airflow docs on health monitoring: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html We can't access HTTP Airflow outside of our VPC, so we cannot rely on, for example, UptimeRobot to make requests to We could, however, add a cron job to the webserver box that calls the |
To leverage AWS monitoring tools for this we'd need to put Airflow boxes into an ASG or target group + LB. Individual EC2 instances do not have "health checks" in the same way as those meta-resources, as far as I can tell. |
I think a simple slack ping makes sense! We do something similar for the dag-sync on that box: |
The scheduler went down again recently because the upstream database restarted during the maintenance window. What confuses & frustrates me is that the scheduler clearly failed, but the container was still running and hadn't exited (which would have restarted it and thus meant that the scheduler came back online). Perhaps we can look into why that's happening as well as part of this effort. Edit: I'm going to make a separate issue for that actually and investigate it. |
Looking at this again, it appears that we do have Airflow behind a target group + LB: https://github.com/WordPress/openverse-infrastructure/blob/754fc882b93c41c4085f668880f197d7c89bc893/modules/services/catalog-airflow/load-balancer.tf#L47-L74 We also have the unhealthy host count alarm which could be leveraged here too - this says it's for ECS, but the only piece that's ECS-specific appears to be the log link. It might be possible to hook this up now, but it seems advisable to wait until #2037 is complete. That project may alter the way Airflow is defined in Terraform, so it might be ideal to let the dust from that settle before adding an alarm for Airflow before it can be moved to |
Let's use the unhealthy host count alarm and expand it to include EC2-only services as you suggested 👍 |
Problem
Currently, Airflow encounters issues we only become aware of them by happenstance; if someone peeks at the Airflow UI or a DAG reports an error. Recent issues we didn't detect include:
Description
Implement a healthcheck for the Airflow scheduler which sends an Alert in AWS. Also look into the current situation concerning airflow logs in CloudWatch.
Ping
/health
on the webserver box and send a slack ping to the alerts channel if any component is unhealthy. Add a cron job alongside thedag-sync
script to run this check every 30 seconds.The text was updated successfully, but these errors were encountered: