[Feature] Failed backups to not block incoming traffic and trigger high prio alert instead #147

Closed
2 of 3 tasks
Tracked by #107
amshuman-kr opened this issue Mar 10, 2021 · 11 comments
Labels
kind/enhancement Enhancement, improvement, extension

Comments

@amshuman-kr
Collaborator

amshuman-kr commented Mar 10, 2021

Feature (What you would like to be added):
Currently, the health check of the etcd pods is linked to the backup health (whether the last backup upload succeeded) in addition to the etcd health itself. But as long as etcd data is backed by persistent volumes (as it is now), we can afford to let etcd continue serving incoming requests even when backup uploads fail, provided that high-priority alerts are triggered on backup upload failure and a follow-up is done to resolve the issue.
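
To make the intended decoupling concrete, here is a minimal, hypothetical sketch (not the actual etcd-backup-restore implementation; the endpoint path, port, and etcd URL are assumptions): the pod's readiness endpoint answers based solely on etcd's own health and deliberately ignores the backup status.

```go
// Hypothetical sketch only, not the actual etcd-backup-restore code.
// The /healthz handler answers the readiness probe based solely on etcd's own
// /health endpoint; a failed backup upload no longer makes the pod unready.
package main

import (
	"net/http"
	"time"
)

const etcdHealthURL = "http://127.0.0.1:2379/health" // assumed local etcd client endpoint

func healthzHandler(w http.ResponseWriter, _ *http.Request) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(etcdHealthURL)
	if err != nil {
		http.Error(w, "etcd unreachable", http.StatusServiceUnavailable)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		// Only an unhealthy etcd turns the pod unready.
		http.Error(w, "etcd unhealthy", http.StatusServiceUnavailable)
		return
	}
	// Backup upload status is intentionally NOT consulted here; it would be
	// surfaced via metrics/alerts instead (see the sketch further below).
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/healthz", healthzHandler)
	http.ListenAndServe(":8080", nil)
}
```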

Motivation (Why is this needed?):
Avoid bringing down the whole shoot cluster control plane when a backup upload fails, as that basically brings the cluster to a grinding halt. The risk is acceptable if etcd data is backed by persistent volumes, because an actual data loss would additionally require corruption of the persistent volumes while backup uploads are failing.

See also https://github.tools.sap/kubernetes-canary/issues-canary/issues/599

Approach/Hint to implement the solution (optional):
The following tasks might have to be checked/evaluated.
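
Independently of the concrete task breakdown, the high-priority alert could, for example, be driven by a metric that an alerting rule watches. Below is a hedged sketch using the Prometheus Go client; the metric name and the wiring are hypothetical, not existing etcd-backup-restore code.

```go
// Hypothetical sketch; the metric name is an assumption, not an existing
// etcd-backup-restore metric.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// lastBackupSucceeded is 1 when the most recent snapshot upload succeeded, 0 otherwise.
// A high-priority alert (e.g. a Prometheus rule that fires when the value stays 0
// for some time) replaces the current behaviour of failing the readiness probe.
var lastBackupSucceeded = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "etcdbr_last_backup_upload_success", // hypothetical name
	Help: "1 if the last backup upload succeeded, 0 if it failed.",
})

func init() {
	prometheus.MustRegister(lastBackupSucceeded)
}

// recordBackupResult would be called after every snapshot upload attempt.
func recordBackupResult(uploadErr error) {
	if uploadErr != nil {
		lastBackupSucceeded.Set(0)
		return
	}
	lastBackupSucceeded.Set(1)
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2121", nil)
}
```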

@amshuman-kr amshuman-kr added the kind/enhancement Enhancement, improvement, extension label Mar 10, 2021
@gardener-robot

@amshuman-kr You have mentioned internal references in the public. Please check.

@vlerenc
Member

vlerenc commented May 27, 2021

As we have seen lately with the off-by-one (32/33 chunks) on GCP, doesn't it make sense to give this one a higher prio @amshuman-kr?

@ishan16696
Member

But as long as etcd data is backed by persistent volumes (as it is now), we can afford to let etcd continue serving incoming requests even when backup uploads fail

Just wanted to mention one point: in multi-node etcd we have a plan to use ephemeral volumes. If we choose to go with ephemeral volumes, then in the worst case we might lose data that wasn't backed up.

@timuthy
Member

timuthy commented Nov 5, 2021

But as long as etcd data is backed by persistent volumes (as it is now), we can afford to let etcd continue serving incoming requests even when backup uploads fail

Just wanted to mention one point: in multi-node etcd we have a plan to use ephemeral volumes. If we choose to go with ephemeral volumes, then in the worst case we might lose data that wasn't backed up.

Wouldn't it make sense to consider the ephemeral volume use-case as an optimized option for the future, in order to keep things simple for now? The multi-node project involves more urgent points, and requirements like these would only delay the feature roll-out.
WDYT? (/cc @vlerenc)

@dguendisch
Member

to consider the ephemeral volume use-case as an optimized option for the future in order to keep things simple for now?

Fully agree. Ephemeral volumes need, I think, a certain confidence in overall etcd cluster stability, and probably a substantial amount of performance testing and tuning (i.e. what does it mean to now put 20 etcds' data on one single volume and thus share its IOPS).

@ishan16696
Member

Wouldn't it make sense to consider the ephemeral volume use-case as an optimized option for the future in order to keep things simple for now?

agreed

@vlerenc
Member

vlerenc commented Nov 6, 2021

Yes, I would also vote to focus on multi-node/clustered etcd in the form that can be achieved "best" (low complexity, low coupling). Ephemeral volumes, as @dguendisch pointed out, require having a solution first and gaining trust in it next, before going there last. Already including them in the challenging task we have at hand, letting that pull in the backup question and thereby raising complexity/coupling even more (a leader with failed backups losing leadership sounds like yet another level of complexity/coupling), seems like a bit too much too early.

@gardener-robot

@timuthy You have mentioned internal references in the public. Please check.

@timuthy
Member

timuthy commented Feb 16, 2022

We will follow up with #280 for the readiness probe.
/close
