[Feature] Readiness Probe in multi-node etcd #288

Closed
ishan16696 opened this issue Jan 28, 2022 · 3 comments
Labels
kind/enhancement Enhancement, improvement, extension

Comments

@ishan16696
Member

Feature (What you would like to be added):
Currently, the readinessProbe of etcd is set to the /healthz endpoint of the HTTP server running in the backup sidecar.
This behaviour needs to be updated: the readinessProbe of a clustered etcd should depend on whether an etcd leader is present, since only then can the cluster serve incoming write requests.

Motivation (Why is this needed?):

Approach/Hint to implement the solution (optional):
Approaches:

  1. ETCDCTL_API=3 etcdctl endpoint health --endpoints=${ENDPOINTS} --command-timeout=Xs
     The etcdctl endpoint health command performs a GET on the "health" key (source). It fails when there is no etcd leader or when quorum is lost, as I think the GET request will fail if no etcd leader is present. A sketch of how this could be wired into the readinessProbe is shown after this list.

     Advantage of this method (etcdctl endpoint health):

     • We don't have to worry about scenarios where a snapshotter failure causes an outage, because a failing snapshotter will no longer fail the readinessProbe of etcd.

     Disadvantages of this method (etcdctl endpoint health):

     • If there is no quorum, the kubelet will also mark the etcd followers as NotReady.
     • The owner check feature depends on the /healthz endpoint of the HTTP server: when the owner check fails, it fails the readinessProbe of etcd by setting the HTTP status to 503. However, the owner check in the multi-node scenario is already being discussed here.

  2. Use the /healthz endpoint of the HTTP server running in the backup sidecar, modified so that whenever a backup-restore leader is elected it sets the HTTP server status to 200 for itself as well as for all backup-restore followers.
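As a rough illustration of approach 1, here is a minimal sketch of an exec-based readinessProbe expressed with the k8s.io/api Go types (assuming a recent k8s.io/api version where the handler field is named ProbeHandler). The endpoint, timeout, and threshold values are placeholders, not values agreed in this issue, and TLS flags are omitted:

```go
package health

import corev1 "k8s.io/api/core/v1"

// readinessProbeViaEtcdctl sketches approach 1: an exec probe that runs
// `etcdctl endpoint health` against the local member, so readiness reflects
// whether a linearizable read (and therefore a leader/quorum) is possible.
func readinessProbeViaEtcdctl() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{
					"/bin/sh", "-ec",
					// Placeholder endpoint and timeout; TLS flags omitted for brevity.
					"ETCDCTL_API=3 etcdctl endpoint health --endpoints=https://localhost:2379 --command-timeout=5s",
				},
			},
		},
		InitialDelaySeconds: 15,
		PeriodSeconds:       5,
		FailureThreshold:    3,
	}
}
```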
@timuthy
Member

timuthy commented Jan 28, 2022

fails when there is no etcd leader or when Quorum is lost as I think GET request will fail if there is no etcd leader present.

I can confirm that, as long as the reads happen with linearizable consistency (the default, see https://etcd.io/docs/v3.3/learning/api_guarantees/#linearizability).
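For illustration, a minimal sketch with the etcd Go client showing why this holds (the endpoint and key are placeholders): the default linearizable Get must be confirmed through the leader and therefore errors out when there is no leader or quorum, while a serializable Get is answered from the local member's store and can still succeed on a follower.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; in the etcd pod this would be the local member.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Linearizable read (the default): it must go through the leader,
	// so it fails when there is no leader or quorum is lost.
	if _, err := cli.Get(ctx, "health"); err != nil {
		fmt.Println("linearizable read failed (no leader / quorum lost):", err)
	}

	// Serializable read: answered from the local member's store, so it can
	// still succeed on a follower even while the cluster has no leader.
	if _, err := cli.Get(ctx, "health", clientv3.WithSerializable()); err != nil {
		fmt.Println("serializable read failed:", err)
	}
}
```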

If there is no Quorum present, kubelet will also mark the etcd-followers as NotReady.

I'm not sure if this is really a problem, because when we need to recover from a quorum loss we have to start from scratch anyway, i.e. create a new cluster from a backup and scale out again (details tbd).

@timuthy
Member

timuthy commented Jan 28, 2022

@ishan16696 can we move this issue to https://github.com/gardener/etcd-backup-restore as the implementation is very specific to the etcd-br component?

@ishan16696
Member Author

Hi @timuthy,

I have opened an issue at https://github.com/gardener/etcd-backup-restore with the updated points.
PTAL!
Closing this as we have moved to gardener/etcd-wrapper#7.
/close
