Sygnal healthcheck returns 200 when notification-sending is 500ing. #168

lampholder · 2021-02-24T13:52:45Z

I've received a report from a customer that Sygnal is failing to send notifications returning a 500, while its healthcheck on /health still returns 200, causing problems because no automatic kubernetes failover restart is being triggered.

They've pointed the finger at a faulty database connection pool : sygnal psycopg2.InterfaceError: connection already closed

They've provided logs, but I'll keep those off this issue for now - let me know if you want to review the logs to investigate,

callahad · 2021-03-02T11:52:22Z

The /health endpoint currently only verifies that Sygnal itself is able to receive and respond to requests.

It's not unreasonable to expect /health to also verify database connectivity, but doing so in a liveness probe could inadvertently lead to cascading failures in the face of a database hiccup.

If absolutely necessary, we could implement a separate readiness probe which does verify database connectivity, but I think it's more important to enable Sygnal itself to gracefully handle a lost database connection, e.g., via #171.

We could also calcualte a moving average of successful notifications and using that to determine a health threshold, but that's likely sufficiently subjective to be out of scope for the /health endpoint. Hopefully the exposed metrics would be sufficient to implement whatever policy anyone might need as part of their own monitoring and alerting systems.

anoadragon453 · 2021-03-02T11:52:33Z

For clarity, the customer was seeing all notifications 500'ing while the healthcheck continued to return 200. Advice for fixing the database connection pool errors was suggested (#171).

We might consider setting an unhealthy state if X% of notifications fail over a period of Y seconds, perhaps even allowing these values to be configurable.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sygnal healthcheck returns 200 when notification-sending is 500ing. #168

Sygnal healthcheck returns 200 when notification-sending is 500ing. #168

lampholder commented Feb 24, 2021

callahad commented Mar 2, 2021

anoadragon453 commented Mar 2, 2021 •

edited

Loading

Sygnal healthcheck returns 200 when notification-sending is 500ing. #168

Sygnal healthcheck returns 200 when notification-sending is 500ing. #168

Comments

lampholder commented Feb 24, 2021

callahad commented Mar 2, 2021

anoadragon453 commented Mar 2, 2021 • edited Loading

anoadragon453 commented Mar 2, 2021 •

edited

Loading