[Feature] Failed backups to not block incoming traffic and trigger high prio alert instead #147

Closed
2 of 3 tasks
Tracked by #107
amshuman-kr opened this issue Mar 10, 2021 · 11 comments
Labels
kind/enhancement Enhancement, improvement, extension

Comments

@amshuman-kr
Collaborator

amshuman-kr commented Mar 10, 2021

Feature (What you would like to be added):
Currently, the health check of the etcd pods is linked to the backup health (whether the last backup upload succeeded) in addition to the etcd health itself. But as long as etcd data is backed by persistent volumes (as it is now), we can afford to let etcd continue serving incoming requests even when backup uploads fail, provided that high-priority alerts are triggered on backup upload failure and a follow-up is done to resolve the issue.
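
To make the intended decoupling concrete, here is a minimal, hypothetical sketch (not the actual etcd-backup-restore implementation; the endpoint path, port, and etcd URL are assumptions): the pod's readiness endpoint answers based solely on etcd's own health and deliberately ignores the backup status.

```go
// Hypothetical sketch only, not the actual etcd-backup-restore code.
// The /healthz handler answers the readiness probe based solely on etcd's own
// /health endpoint; a failed backup upload no longer makes the pod unready.
package main

import (
	"net/http"
	"time"
)

const etcdHealthURL = "http://127.0.0.1:2379/health" // assumed local etcd client endpoint

func healthzHandler(w http.ResponseWriter, _ *http.Request) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(etcdHealthURL)
	if err != nil {
		http.Error(w, "etcd unreachable", http.StatusServiceUnavailable)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		// Only an unhealthy etcd turns the pod unready.
		http.Error(w, "etcd unhealthy", http.StatusServiceUnavailable)
		return
	}
	// Backup upload status is intentionally NOT consulted here; it would be
	// surfaced via metrics/alerts instead (see the sketch further below).
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/healthz", healthzHandler)
	http.ListenAndServe(":8080", nil)
}
```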

Motivation (Why is this needed?):
Avoid bringing down the whole shoot cluster control plane when a backup upload fails, as that basically brings the cluster to a grinding halt. The risk is acceptable if etcd data is backed by persistent volumes, because an actual data loss would additionally require corruption of the persistent volumes while backup uploads are failing.

See also https://github.tools.sap/kubernetes-canary/issues-canary/issues/599

Approach/Hint to implement the solution (optional):
The following tasks might have to be checked/evaluated.
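
Independently of the concrete task breakdown, the high-priority alert could, for example, be driven by a metric that an alerting rule watches. Below is a hedged sketch using the Prometheus Go client; the metric name and the wiring are hypothetical, not existing etcd-backup-restore code.

```go
// Hypothetical sketch; the metric name is an assumption, not an existing
// etcd-backup-restore metric.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// lastBackupSucceeded is 1 when the most recent snapshot upload succeeded, 0 otherwise.
// A high-priority alert (e.g. a Prometheus rule that fires when the value stays 0
// for some time) replaces the current behaviour of failing the readiness probe.
var lastBackupSucceeded = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "etcdbr_last_backup_upload_success", // hypothetical name
	Help: "1 if the last backup upload succeeded, 0 if it failed.",
})

func init() {
	prometheus.MustRegister(lastBackupSucceeded)
}

// recordBackupResult would be called after every snapshot upload attempt.
func recordBackupResult(uploadErr error) {
	if uploadErr != nil {
		lastBackupSucceeded.Set(0)
		return
	}
	lastBackupSucceeded.Set(1)
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2121", nil)
}
```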

@amshuman-kr amshuman-kr added the kind/enhancement Enhancement, improvement, extension label Mar 10, 2021
@gardener-robot

@amshuman-kr You have mentioned internal references in the public. Please check.

@vlerenc
Member

vlerenc commented May 27, 2021

As we have seen lately with the off-by-one (32/33 chunks) on GCP, doesn't it make sense to give this one a higher prio @amshuman-kr?

@ishan16696
Member

But as long as etcd data is backed by persistent volumes (as it is now), we can afford to let etcd continue serving incoming requests even when backup uploads fail

Just wanted to mention one point: in multi-node etcd we have a plan to use ephemeral volumes. If we choose to go with ephemeral volumes, then in the worst case we might lose data that wasn't backed up.

@timuthy
Member

timuthy commented Nov 5, 2021

But as long as etcd data is backed by persistent volumes (as it is now), we can afford to let etcd continue serving incoming requests even when backup uploads fail

Just wanted to mention one point: in multi-node etcd we have a plan to use ephemeral volumes. If we choose to go with ephemeral volumes, then in the worst case we might lose data that wasn't backed up.

Wouldn't it make sense to consider the ephemeral volume use-case as an optimized option for the future, in order to keep things simple for now? The multi-node project involves more urgent points, and requirements like these would only delay the feature roll-out.
WDYT? (/cc @vlerenc)

@dguendisch
Member

to consider the ephemeral volume use-case as an optimized option for the future in order to keep things simple for now?

Fully agree. Ephemeral volumes need, I think, a certain confidence in overall etcd cluster stability, and probably a substantial amount of performance testing and tuning (i.e. what does it mean to now put 20 etcds' data on one single volume and thus share its IOPS).

@ishan16696
Member

Wouldn't it make sense to consider the ephemeral volume use-case as an optimized option for the future in order to keep things simple for now?

agreed

@vlerenc
Member

vlerenc commented Nov 6, 2021

Yes, I would also vote to focus on multi-node/clustered etcd in the form that can be achieved "best" (low complexity, low coupling). Ephemeral volumes, as @dguendisch pointed out, require having a solution first and gaining trust in it next, before going there last. Already including them in the challenging task we have at hand, letting that pull in the backup question and thereby raising complexity/coupling even more (a leader with failed backups losing leadership sounds like yet another level of complexity/coupling), seems like a bit too much too early.

@gardener-robot

@timuthy You have mentioned internal references in the public. Please check.

@timuthy
Member

timuthy commented Feb 16, 2022

We will follow up with #280 for the readiness probe.
/close
