Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sygnal healthcheck returns 200 when notification-sending is 500ing. #168

Open
lampholder opened this issue Feb 24, 2021 · 2 comments
Open

Comments

@lampholder
Copy link
Member

I've received a report from a customer that Sygnal is failing to send notifications returning a 500, while its healthcheck on /health still returns 200, causing problems because no automatic kubernetes failover restart is being triggered.

They've pointed the finger at a faulty database connection pool : sygnal psycopg2.InterfaceError: connection already closed

They've provided logs, but I'll keep those off this issue for now - let me know if you want to review the logs to investigate,

@callahad
Copy link
Contributor

callahad commented Mar 2, 2021

The /health endpoint currently only verifies that Sygnal itself is able to receive and respond to requests.

It's not unreasonable to expect /health to also verify database connectivity, but doing so in a liveness probe could inadvertently lead to cascading failures in the face of a database hiccup.

If absolutely necessary, we could implement a separate readiness probe which does verify database connectivity, but I think it's more important to enable Sygnal itself to gracefully handle a lost database connection, e.g., via #171.

We could also calcualte a moving average of successful notifications and using that to determine a health threshold, but that's likely sufficiently subjective to be out of scope for the /health endpoint. Hopefully the exposed metrics would be sufficient to implement whatever policy anyone might need as part of their own monitoring and alerting systems.

@anoadragon453
Copy link
Member

anoadragon453 commented Mar 2, 2021

For clarity, the customer was seeing all notifications 500'ing while the healthcheck continued to return 200. Advice for fixing the database connection pool errors was suggested (#171).

We might consider setting an unhealthy state if X% of notifications fail over a period of Y seconds, perhaps even allowing these values to be configurable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants