I've received a report from a customer that Sygnal is failing to send notifications, returning a 500, while its health check on /health still returns 200. This causes problems because no automatic Kubernetes failover restart is triggered.
They've pointed the finger at a faulty database connection pool: sygnal psycopg2.InterfaceError: connection already closed
They've provided logs, but I'll keep those off this issue for now. Let me know if you want to review the logs to investigate.
The /health endpoint currently only verifies that Sygnal itself is able to receive and respond to requests.
It's not unreasonable to expect /health to also verify database connectivity, but doing so in a liveness probe could inadvertently lead to cascading failures in the face of a database hiccup.
If absolutely necessary, we could implement a separate readiness probe which does verify database connectivity, but I think it's more important to enable Sygnal itself to gracefully handle a lost database connection, e.g., via #171.
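To make the readiness idea concrete, here is a minimal sketch of a connectivity check kept separate from the liveness check. It is not Sygnal's code: the function name `readiness_status` is hypothetical, and `sqlite3` stands in for the real PostgreSQL connection only so the example runs on its own. It assumes nothing more than a DB-API style connection object.

```python
import sqlite3


def readiness_status(conn):
    """Return (http_status, body) for a hypothetical readiness endpoint.

    `conn` is assumed to be any DB-API 2.0 connection. A trivial query is
    run to confirm the connection is still usable.
    """
    try:
        cur = conn.cursor()
        try:
            cur.execute("SELECT 1")
            cur.fetchone()
        finally:
            cur.close()
    except Exception as exc:
        # Any database error marks the instance as not ready (503)
        # without affecting the separate liveness check.
        return 503, f"database unavailable: {type(exc).__name__}"
    return 200, "ok"


# sqlite3 is only a stand-in here so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
print(readiness_status(conn))
conn.close()
print(readiness_status(conn))
```

Because a failed readiness probe only takes the pod out of rotation rather than restarting it, a brief database outage would not by itself trigger a restart loop the way a failing liveness probe could.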
We could also calculate a moving average of successful notifications and use that to determine a health threshold, but that's likely sufficiently subjective to be out of scope for the /health endpoint. Hopefully the exposed metrics would be sufficient to implement whatever policy anyone might need as part of their own monitoring and alerting systems.
For clarity, the customer was seeing all notifications 500'ing while the healthcheck continued to return 200. Advice for fixing the database connection pool errors was suggested (#171).
We might consider setting an unhealthy state if X% of notifications fail over a period of Y seconds, perhaps even allowing these values to be configurable.
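As a rough illustration of the "X% over Y seconds" idea, the sketch below tracks notification outcomes in a sliding time window. The class name `FailureRateHealth`, its parameters, and the `min_samples` guard are all assumptions made for this example, not existing Sygnal configuration options.

```python
import time
from collections import deque


class FailureRateHealth:
    """Tracks recent notification outcomes in a sliding time window."""

    def __init__(self, max_failure_ratio=0.5, window_seconds=60.0,
                 min_samples=10, clock=time.monotonic):
        self.max_failure_ratio = max_failure_ratio  # "X%", as a fraction
        self.window_seconds = window_seconds        # "Y seconds"
        self.min_samples = min_samples  # avoid flapping on very little traffic
        self._clock = clock
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first
        self._failures = 0

    def _expire(self, now):
        # Drop outcomes that have fallen out of the window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            _, succeeded = self._events.popleft()
            if not succeeded:
                self._failures -= 1

    def record(self, succeeded):
        now = self._clock()
        self._expire(now)
        self._events.append((now, succeeded))
        if not succeeded:
            self._failures += 1

    def is_healthy(self):
        self._expire(self._clock())
        total = len(self._events)
        if total < self.min_samples:
            return True  # too few samples to judge
        return self._failures / total <= self.max_failure_ratio
```

A health handler could call `record()` after each notification attempt and return a non-200 status when `is_healthy()` is false. Treating a low-traffic window as healthy is a design choice made for this sketch; a deployment might reasonably prefer a different default.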