
Update health/liveness probes for tenant-proxy #692

Merged
merged 7 commits into main from chore/tenant-proxy-healthz on Jul 10, 2023

Conversation

@i5okie (Contributor) commented Jul 7, 2023

  • Add /healthz to the NGINX configuration; it returns 200 with {"status":"up"} (see the sketch below)
  • Update the liveness and readiness probes in the tenant-proxy deployment
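
For illustration, a minimal sketch of the kind of location block this adds (the exact file and directives in the chart may differ):

# Lightweight health endpoint answered directly by nginx,
# without proxying to the upstream ACA-Py service.
location /healthz {
    access_log off;
    default_type application/json;
    return 200 '{"status":"up"}';
}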

Signed-off-by: Ivan Polchenko <2119240+i5okie@users.noreply.github.com>
@WadeBarnes (Member) commented:

A general comment on implementing health endpoints for proxies: it's best practice to have the proxy's health endpoint pass through to the health endpoint of the application being proxied. That way the proxy will not accept connections until its associated application is ready to receive traffic.
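
As an illustration of that pattern only (the upstream name and port below are placeholders, and ACA-Py's /status/ready is one example of such an endpoint), the proxy's health location would simply forward to the application's own readiness endpoint:

# Pass-through health check: nginx only reports ready if the upstream does.
location /health/ready {
    proxy_pass http://acapy-upstream:8030/status/ready;   # placeholder upstream host/port
}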

@loneil (Collaborator) commented Jul 7, 2023

Should health here be confirming that acapy is up and receiving traffic as well?
Edit: Wade beat me to it.

@i5okie (Contributor, Author) commented Jul 7, 2023

> A general comment on implementing health endpoints for proxies: it's best practice to have the proxy's health endpoint pass through to the health endpoint of the application being proxied. That way the proxy will not accept connections until its associated application is ready to receive traffic.

We believe that is the cause of the problem we've been observing, with probes killing the tenant-proxy pods.

x.x.x.x - - [06/Jul/2023:23:08:43 +0000] "GET / HTTP/1.1" 499 0 "-" "kube-probe/1.25" "-"
x.x.x.x - - [06/Jul/2023:23:08:43 +0000] "GET / HTTP/1.1" 499 0 "-" "kube-probe/1.25" "-"
2023/07/06 23:08:43 [notice] 1#1: signal 3 (SIGQUIT) received, shutting down
2023/07/06 23:08:43 [notice] 34#34: gracefully shutting down
2023/07/06 23:08:43 [notice] 41#41: gracefully shutting down

@i5okie (Contributor, Author) commented Jul 7, 2023

ACA-Py has its own health checks and doesn't accept connections if it's not in a Ready state.
If you're hosting a product behind your reverse proxy and that product sends back an error it can't prettify itself, then you have an opportunity to show a custom error page from your reverse proxy.

@esune (Member) commented Jul 7, 2023

@WadeBarnes @loneil this is really just meant to be a readiness/liveness probe for k8s to know when the nginx container is up, and to leave it alone. The issue we're facing right now is that the readiness/liveness probes try to GET /, which proxies to ACA-Py and, for some reason, that request is timing out, causing k8s to think nginx is not ready and forcing the pod to shut down and re-create (with the related sysdig notification/noise).

If we implement a health check that depends on the upstream service AND it is used by the k8s probes, this will likely make the problem even more widespread (aca-py is not ready, therefore nginx is considered not ready, k8s sends SIGTERM to both services, they restart, new iteration).

I don't think this is what we want, but I may also be missing something obvious: would you mind elaborating more on how you would approach/resolve this problem?

Additionally, if the proxy is just proxying requests as is, isn't the upstream service not accepting connections and responding appropriately basically doing what we need/want?

@WadeBarnes (Member) commented Jul 7, 2023

In the case of a pure proxy the proxy's readiness should track the upstream application's readiness. Otherwise you can run into a situation where your proxy is ready and routing traffic to the upstream application when it is not ready. We do this with the Aries Mediator Instances for this exact reason.

@loneil (Collaborator) commented Jul 7, 2023

> If we implement a health check that depends on the upstream service AND it is used by the k8s probes, this will likely make the problem even more widespread (aca-py is not ready, therefore nginx is considered not ready, k8s sends SIGTERM to both services, they restart, new iteration).

Is that mixing liveness and readiness? I'm no expert, but isn't it:
liveness - this pod's infra is up and running, don't restart it
readiness - the container is good to go, accept traffic

So readiness should be that acapy is good to go as well?

@esune (Member) commented Jul 7, 2023

> If we implement a health check that depends on the upstream service AND it is used by the k8s probes, this will likely make the problem even more widespread (aca-py is not ready, therefore nginx is considered not ready, k8s sends SIGTERM to both services, they restart, new iteration).

> Is that mixing liveness and readiness? I'm no expert, but isn't it: liveness - this pod's infra is up and running, don't restart it; readiness - the container is good to go, accept traffic

> So readiness should be that acapy is good to go as well?

I think this is the key. We need two endpoints, /health/alive and /health/ready: the former to tell k8s the pod itself is doing well, and the latter to track ACA-Py's readiness.
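
Something along those lines, as a sketch only (endpoint names follow the suggestion above; the port is a placeholder):

livenessProbe:
  httpGet:
    path: /health/alive    # served by nginx itself: the container is up
    port: 8080             # placeholder port
readinessProbe:
  httpGet:
    path: /health/ready    # passed through to ACA-Py: only route traffic once it is ready
    port: 8080             # placeholder port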

@WadeBarnes where can we find the mediator example for @i5okie to take inspiration from?

@i5okie (Contributor, Author) commented Jul 7, 2023

> If we implement a health check that depends on the upstream service AND it is used by the k8s probes, this will likely make the problem even more widespread (aca-py is not ready, therefore nginx is considered not ready, k8s sends SIGTERM to both services, they restart, new iteration).

> Is that mixing liveness and readiness? I'm no expert, but isn't it: liveness - this pod's infra is up and running, don't restart it; readiness - the container is good to go, accept traffic

> So readiness should be that acapy is good to go as well?

Every deployment has liveness and readiness probes that you are encouraged to configure.
Here we have two deployments, acapy and tenant-proxy, and each has its own service.
When the acapy pod's readiness probe determines it is ready, the pod's IP is added to the acapy service's load balancer.
If the pod is determined to be not ready, the IP address is removed, so traffic is no longer sent to that pod.
If the pod fails its liveness probe, kubelet restarts it in an attempt to heal the service.

The same applies to the tenant-proxy pod.
The liveness and readiness probes are meant for their respective deployments.
So the liveness and readiness probes for tenant-proxy are meant to make sure that the tenant-proxy pods themselves are healthy and ready to accept traffic.

However, if we configure the readiness probe of the tenant-proxy pods to GET /, which nginx sends upstream to acapy in this case, then whenever that probe fails, regardless of whether acapy is healthy or not, kubelet will start restarting the tenant-proxy pod. This is what is happening in our case: the acapy pod appears to be healthy and is in a Ready state, yet the readiness probe for tenant-proxy keeps failing, causing it to be restarted by kubelet.

This circumvents the Kubernetes mechanism meant to check the readiness of the pod itself.

In a production scenario, let's say we have multiple acapy pods and multiple tenant-proxy pods.
We would likely see a lot of dropped connections because kubelet would be killing some tenant-proxy pods regardless of whether the upstream acapy pod is healthy or not, and we could end up with all tenant-proxy pods stuck in CrashLoopBackOff, not serving any traffic at all to the healthy acapy pods.
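
For reference, the problematic probe looks roughly like this (the path and port are illustrative placeholders, not copied from the chart):

# Problematic sketch: the proxy's readiness depends on GET /,
# which nginx forwards upstream, so an acapy timeout fails the
# tenant-proxy pod's own probe and kubelet restarts it.
readinessProbe:
  httpGet:
    path: /        # proxied upstream, coupling the probe to acapy
    port: 8080     # placeholder port
  timeoutSeconds: 1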

@i5okie i5okie marked this pull request as ready for review July 7, 2023 18:30
@i5okie (Contributor, Author) commented Jul 10, 2023

  • Updated the liveness probe of the tenant-proxy deployment to use tcpSocket
  • Updated the readiness probe to point to /status/ready, which is proxied to the ACA-Py deployment (see the sketch below)
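
A minimal sketch of what those probes could look like on the tenant-proxy container; the port and timing values are placeholders, not necessarily what the chart uses:

# Liveness: only checks that the nginx container accepts TCP connections.
livenessProbe:
  tcpSocket:
    port: 8080            # placeholder container port
  initialDelaySeconds: 3
  periodSeconds: 10
# Readiness: /status/ready is proxied through to ACA-Py, so the tenant-proxy
# pod only receives traffic once ACA-Py reports itself ready.
readinessProbe:
  httpGet:
    path: /status/ready
    port: 8080            # placeholder container port
  initialDelaySeconds: 3
  periodSeconds: 10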

@i5okie i5okie temporarily deployed to development July 10, 2023 16:10 — with GitHub Actions Inactive

@i5okie i5okie merged commit 18c0499 into main Jul 10, 2023
8 of 9 checks passed
@i5okie i5okie temporarily deployed to development July 10, 2023 16:51 — with GitHub Actions Inactive
@i5okie i5okie deleted the chore/tenant-proxy-healthz branch July 10, 2023 16:51