[Feature] Readiness Probe in multi-node etcd #288

Closed
ishan16696 opened this issue Jan 28, 2022 · 3 comments
Labels
kind/enhancement Enhancement, improvement, extension

Comments

@ishan16696
Member

Feature (What you would like to be added):
Currently, the readinessProbe of etcd is set to the /healthz endpoint of the HTTP server running in the backup sidecar.
This behaviour needs to be updated: the readinessProbe of a clustered etcd should depend on whether an etcd leader is present, since only then can the cluster serve incoming write requests.

Motivation (Why is this needed?):

Approach/Hint to implement the solution (optional):
Approaches:

  1. ETCDCTL_API=3 etcdctl endpoint health --endpoints=${ENDPOINTS} --command-timeout=Xs
     The etcdctl endpoint health command performs a GET on the "health" key (source). It fails when there is no etcd leader or when quorum is lost, as I think the GET request will fail if no etcd leader is present. A sketch of how this could be wired into the readinessProbe is shown after this list.

     Advantage of this method (etcdctl endpoint health):

     • We don't have to worry about scenarios where a snapshotter failure causes an outage, because a failing snapshotter will no longer fail the readinessProbe of etcd.

     Disadvantages of this method (etcdctl endpoint health):

     • If there is no quorum, the kubelet will also mark the etcd followers as NotReady.
     • The owner check feature depends on the /healthz endpoint of the HTTP server: when the owner check fails, it fails the readinessProbe of etcd by setting the HTTP status to 503. However, the owner check in the multi-node scenario is already being discussed here.

  2. Use the /healthz endpoint of the HTTP server running in the backup sidecar, modified so that whenever a backup-restore leader is elected it sets the HTTP server status to 200 for itself as well as for all backup-restore followers.
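As a rough illustration of approach 1, here is a minimal sketch of an exec-based readinessProbe expressed with the k8s.io/api Go types (assuming a recent k8s.io/api version where the handler field is named ProbeHandler). The endpoint, timeout, and threshold values are placeholders, not values agreed in this issue, and TLS flags are omitted:

```go
package health

import corev1 "k8s.io/api/core/v1"

// readinessProbeViaEtcdctl sketches approach 1: an exec probe that runs
// `etcdctl endpoint health` against the local member, so readiness reflects
// whether a linearizable read (and therefore a leader/quorum) is possible.
func readinessProbeViaEtcdctl() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{
					"/bin/sh", "-ec",
					// Placeholder endpoint and timeout; TLS flags omitted for brevity.
					"ETCDCTL_API=3 etcdctl endpoint health --endpoints=https://localhost:2379 --command-timeout=5s",
				},
			},
		},
		InitialDelaySeconds: 15,
		PeriodSeconds:       5,
		FailureThreshold:    3,
	}
}
```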
@timuthy
Member

timuthy commented Jan 28, 2022

fails when there is no etcd leader or when Quorum is lost as I think GET request will fail if there is no etcd leader present.

I can confirm that, as long as the reads happen with linearizable consistency (the default, see https://etcd.io/docs/v3.3/learning/api_guarantees/#linearizability).
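For illustration, a minimal sketch with the etcd Go client showing why this holds (the endpoint and key are placeholders): the default linearizable Get must be confirmed through the leader and therefore errors out when there is no leader or quorum, while a serializable Get is answered from the local member's store and can still succeed on a follower.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; in the etcd pod this would be the local member.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Linearizable read (the default): it must go through the leader,
	// so it fails when there is no leader or quorum is lost.
	if _, err := cli.Get(ctx, "health"); err != nil {
		fmt.Println("linearizable read failed (no leader / quorum lost):", err)
	}

	// Serializable read: answered from the local member's store, so it can
	// still succeed on a follower even while the cluster has no leader.
	if _, err := cli.Get(ctx, "health", clientv3.WithSerializable()); err != nil {
		fmt.Println("serializable read failed:", err)
	}
}
```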

If there is no Quorum present, kubelet will also mark the etcd-followers as NotReady.

I'm not sure if this is really a problem, because when we need to recover from a quorum loss we have to start from scratch anyway, i.e. create a new cluster from a backup and scale out again (details tbd).

@timuthy
Member

timuthy commented Jan 28, 2022

@ishan16696 can we move this issue to https://github.com/gardener/etcd-backup-restore as the implementation is very specific to the etcd-br component?

@ishan16696
Member Author

Hi @timuthy,

I have opened an issue at https://github.com/gardener/etcd-backup-restore with the updated points.
PTAL!
Closing this as we have moved to gardener/etcd-wrapper#7.
/close
