
Update health/liveness probes for tenant-proxy #692

Merged
merged 7 commits into main from chore/tenant-proxy-healthz on Jul 10, 2023

Conversation

@i5okie (Contributor) commented Jul 7, 2023

  • Add /healthz to the NGINX configuration; it returns 200 with {"status":"up"} (see the sketch below)
  • Update the liveness and readiness probes in the tenant-proxy deployment
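
For illustration, a minimal sketch of the kind of location block this adds (the exact file and directives in the chart may differ):

# Lightweight health endpoint answered directly by nginx,
# without proxying to the upstream ACA-Py service.
location /healthz {
    access_log off;
    default_type application/json;
    return 200 '{"status":"up"}';
}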

Signed-off-by: Ivan Polchenko <2119240+i5okie@users.noreply.github.com>
@WadeBarnes (Member) commented:

A general comment on implementing health endpoints for proxies: it's best practice to have the proxy's health endpoint pass through to the health endpoint of the application being proxied. That way the proxy will not accept connections until its associated application is ready to receive traffic.
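
As an illustration of that pattern only (the upstream name and port below are placeholders, and ACA-Py's /status/ready is one example of such an endpoint), the proxy's health location would simply forward to the application's own readiness endpoint:

# Pass-through health check: nginx only reports ready if the upstream does.
location /health/ready {
    proxy_pass http://acapy-upstream:8030/status/ready;   # placeholder upstream host/port
}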

@loneil (Collaborator) commented Jul 7, 2023

Should health here be confirming that acapy is up and receiving traffic as well?
Edit: Wade beat me to it.

@i5okie (Contributor, Author) commented Jul 7, 2023

> A general comment on implementing health endpoints for proxies: it's best practice to have the proxy's health endpoint pass through to the health endpoint of the application being proxied. That way the proxy will not accept connections until its associated application is ready to receive traffic.

We believe that is the cause of the problem we've been observing, with probes killing the tenant-proxy pods.

x.x.x.x - - [06/Jul/2023:23:08:43 +0000] "GET / HTTP/1.1" 499 0 "-" "kube-probe/1.25" "-"
x.x.x.x - - [06/Jul/2023:23:08:43 +0000] "GET / HTTP/1.1" 499 0 "-" "kube-probe/1.25" "-"
2023/07/06 23:08:43 [notice] 1#1: signal 3 (SIGQUIT) received, shutting down
2023/07/06 23:08:43 [notice] 34#34: gracefully shutting down
2023/07/06 23:08:43 [notice] 41#41: gracefully shutting down

@i5okie (Contributor, Author) commented Jul 7, 2023

ACA-Py has its own health checks and doesn't accept connections if it's not in a Ready state.
If you're hosting a product behind your reverse proxy and that product sends back an error it can't prettify itself, then you have an opportunity to show a custom error page from your reverse proxy.

@esune (Member) commented Jul 7, 2023

@WadeBarnes @loneil this is really just meant to be a readiness/liveness probe for k8s to know when the nginx container is up, and to leave it alone. The issue we're facing right now is that the readiness/liveness probes try to GET /, which proxies to ACA-Py and, for some reason, that request is timing out, causing k8s to think nginx is not ready and forcing the pod to shut down and re-create (with the related sysdig notification/noise).

If we implement a health check that depends on the upstream service AND it is used by the k8s probes, this will likely make the problem even more widespread (aca-py is not ready, therefore nginx is considered not ready, k8s sends SIGTERM to both services, they restart, new iteration).

I don't think this is what we want, but I may also be missing something obvious: would you mind elaborating more on how you would approach/resolve this problem?

Additionally, if the proxy is just proxying requests as is, isn't the upstream service not accepting connections and responding appropriately basically doing what we need/want?

@WadeBarnes (Member) commented Jul 7, 2023

In the case of a pure proxy the proxy's readiness should track the upstream application's readiness. Otherwise you can run into a situation where your proxy is ready and routing traffic to the upstream application when it is not ready. We do this with the Aries Mediator Instances for this exact reason.

@loneil (Collaborator) commented Jul 7, 2023

> If we implement a health check that depends on the upstream service AND it is used by the k8s probes, this will likely make the problem even more widespread (aca-py is not ready, therefore nginx is considered not ready, k8s sends SIGTERM to both services, they restart, new iteration).

Is that mixing liveness and readiness? I'm no expert, but isn't it:
liveness - this pod's infra is up and running, don't restart it
readiness - the container is good to go, accept traffic

So readiness should be that acapy is good to go as well?

@esune (Member) commented Jul 7, 2023

> If we implement a health check that depends on the upstream service AND it is used by the k8s probes, this will likely make the problem even more widespread (aca-py is not ready, therefore nginx is considered not ready, k8s sends SIGTERM to both services, they restart, new iteration).

> Is that mixing liveness and readiness? I'm no expert, but isn't it: liveness - this pod's infra is up and running, don't restart it; readiness - the container is good to go, accept traffic

> So readiness should be that acapy is good to go as well?

I think this is the key. We need two endpoints, /health/alive and /health/ready: the former to tell k8s the pod itself is doing well, and the latter to track ACA-Py's readiness.
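
Something along those lines, as a sketch only (endpoint names follow the suggestion above; the port is a placeholder):

livenessProbe:
  httpGet:
    path: /health/alive    # served by nginx itself: the container is up
    port: 8080             # placeholder port
readinessProbe:
  httpGet:
    path: /health/ready    # passed through to ACA-Py: only route traffic once it is ready
    port: 8080             # placeholder port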

@WadeBarnes where can we find the mediator example for @i5okie to take inspiration from?

@i5okie (Contributor, Author) commented Jul 7, 2023

> If we implement a health check that depends on the upstream service AND it is used by the k8s probes, this will likely make the problem even more widespread (aca-py is not ready, therefore nginx is considered not ready, k8s sends SIGTERM to both services, they restart, new iteration).

> Is that mixing liveness and readiness? I'm no expert, but isn't it: liveness - this pod's infra is up and running, don't restart it; readiness - the container is good to go, accept traffic

> So readiness should be that acapy is good to go as well?

Every deployment has liveness and readiness probes that you are encouraged to configure.
Here we have two deployments, acapy and tenant-proxy, and each has its own service.
When the acapy pod's readiness probe determines it is ready, the pod's IP is added to the acapy service's load balancer.
If the pod is determined to be not ready, the IP address is removed, so traffic is no longer sent to that pod.
If the pod fails its liveness probe, kubelet restarts it in an attempt to heal the service.

The same applies to the tenant-proxy pod.
The liveness and readiness probes are meant for their respective deployments.
So the liveness and readiness probes for tenant-proxy are meant to make sure that the tenant-proxy pods themselves are healthy and ready to accept traffic.

However, if we configure the readiness probe of the tenant-proxy pods to GET /, which nginx sends upstream to acapy in this case, then whenever that probe fails, regardless of whether acapy is healthy or not, kubelet will start restarting the tenant-proxy pod. This is what is happening in our case: the acapy pod appears to be healthy and is in a Ready state, yet the readiness probe for tenant-proxy keeps failing, causing it to be restarted by kubelet.

This circumvents the Kubernetes mechanism meant to check the readiness of the pod itself.

In a production scenario, let's say we have multiple acapy pods and multiple tenant-proxy pods.
We would likely see a lot of dropped connections because kubelet would be killing some tenant-proxy pods regardless of whether the upstream acapy pod is healthy or not, and we could end up with all tenant-proxy pods stuck in CrashLoopBackOff, not serving any traffic at all to the healthy acapy pods.
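
For reference, the problematic probe looks roughly like this (the path and port are illustrative placeholders, not copied from the chart):

# Problematic sketch: the proxy's readiness depends on GET /,
# which nginx forwards upstream, so an acapy timeout fails the
# tenant-proxy pod's own probe and kubelet restarts it.
readinessProbe:
  httpGet:
    path: /        # proxied upstream, coupling the probe to acapy
    port: 8080     # placeholder port
  timeoutSeconds: 1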

@i5okie i5okie marked this pull request as ready for review July 7, 2023 18:30
@i5okie (Contributor, Author) commented Jul 10, 2023

  • Updated the liveness probe of the tenant-proxy deployment to use tcpSocket
  • Updated the readiness probe to point to /status/ready, which is proxied to the ACA-Py deployment (see the sketch below)
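
A minimal sketch of what those probes could look like on the tenant-proxy container; the port and timing values are placeholders, not necessarily what the chart uses:

# Liveness: only checks that the nginx container accepts TCP connections.
livenessProbe:
  tcpSocket:
    port: 8080            # placeholder container port
  initialDelaySeconds: 3
  periodSeconds: 10
# Readiness: /status/ready is proxied through to ACA-Py, so the tenant-proxy
# pod only receives traffic once ACA-Py reports itself ready.
readinessProbe:
  httpGet:
    path: /status/ready
    port: 8080            # placeholder container port
  initialDelaySeconds: 3
  periodSeconds: 10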

@i5okie i5okie temporarily deployed to development July 10, 2023 16:10 — with GitHub Actions Inactive

@i5okie i5okie merged commit 18c0499 into main Jul 10, 2023
8 of 9 checks passed
@i5okie i5okie temporarily deployed to development July 10, 2023 16:51 — with GitHub Actions Inactive
@i5okie i5okie deleted the chore/tenant-proxy-healthz branch July 10, 2023 16:51