☂️ Gardener ETCD Operator a.k.a. ETCD Druid #2

amshuman-kr · 2019-09-27T10:07:17Z

Feature (What you would like to be added):
Summarise the roadmap for etcd-druid with links to the corresponding issues.

Motivation (Why is this needed?):
A central place to collect the roadmap as well as the progress.

Approach/Hint to implement the solution (optional):

Basic Controller
- Define CRD types
- Implement basic controller to deploy StatefulSet (with replicas: 1) with the containers for etcd and etcd-backup-restore the same way it is being done now.
- Unit tests
- Integration tests
Propagate etcd defragmentation schedule from the CRD to etcd-backup-restore sidecar container.
Trigger full snapshot before hibernation/scale down.
Backup compaction
- Incremental/continuous backup is used for finer-granularity backup (on the order of minutes), with full snapshots being taken at much larger intervals (on the order of hours). This makes the backup efficient in terms of disk, network bandwidth and backup storage space utilization, as well as compute resource utilization during backup.
- If the proportion of changes in the incremental backups is large, restoration takes longer, because incremental backups can only be restored in sequence.
- #61@etcd-backup-restore.
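
To illustrate why a long chain of incremental backups slows restoration, here is a minimal sketch in Go. The `Snapshot` type, the `RestorePlan` and `NeedsCompaction` functions, and the size-ratio threshold are all hypothetical and exist only for illustration; they are not taken from etcd-backup-restore. The sketch assumes each stored object records whether it is a full snapshot, the last etcd revision it covers, and its size.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Snapshot is a hypothetical, simplified view of one object in the backup store.
type Snapshot struct {
	Full         bool  // true for a full snapshot, false for a delta
	LastRevision int64 // last etcd revision covered by this object
	SizeBytes    int64
}

// RestorePlan returns the latest full snapshot followed by every later delta,
// ordered by revision. Deltas must be applied one after another in this order.
func RestorePlan(snaps []Snapshot) ([]Snapshot, error) {
	base := -1
	for i, s := range snaps {
		if s.Full && (base < 0 || s.LastRevision > snaps[base].LastRevision) {
			base = i
		}
	}
	if base < 0 {
		return nil, errors.New("no full snapshot available")
	}
	plan := []Snapshot{snaps[base]}
	for _, s := range snaps {
		if !s.Full && s.LastRevision > snaps[base].LastRevision {
			plan = append(plan, s)
		}
	}
	deltas := plan[1:]
	sort.Slice(deltas, func(i, j int) bool {
		return deltas[i].LastRevision < deltas[j].LastRevision
	})
	return plan, nil
}

// NeedsCompaction reports whether the deltas in the plan are large relative to
// the full snapshot. The ratio threshold is an assumed heuristic.
func NeedsCompaction(plan []Snapshot, maxDeltaRatio float64) bool {
	if len(plan) == 0 || plan[0].SizeBytes <= 0 {
		return false
	}
	var deltaBytes int64
	for _, s := range plan[1:] {
		deltaBytes += s.SizeBytes
	}
	return float64(deltaBytes)/float64(plan[0].SizeBytes) > maxDeltaRatio
}

func main() {
	plan, err := RestorePlan([]Snapshot{
		{Full: true, LastRevision: 100, SizeBytes: 1000},
		{Full: false, LastRevision: 130, SizeBytes: 300},
		{Full: false, LastRevision: 110, SizeBytes: 300},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range plan {
		fmt.Printf("apply full=%v up to revision %d\n", s.Full, s.LastRevision)
	}
	fmt.Println("compaction advisable:", NeedsCompaction(plan, 0.5))
}
```

Compaction, in this picture, would fold the deltas into a new full snapshot so that the plan shrinks back to a single element.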
Multi-node etcd cluster
- All etcd nodes within the same Kubernetes cluster.
  - I.e., one CRD instance would provision multiple etcd nodes in the same Kubernetes cluster/namespace as the CRD instance.
  - Enhance CRD types to address the use-case
  - Scale sub-resource implementation for the current CRD
  - Add/promote etcd learners/members during scale up, including quorum adjustment.
  - Remove etcd members during scale down, including quorum adjustment.
  - Handle backup/restore in the different states of the etcd cluster
  - Multi-AZ support
    - I.e. etcd nodes distributed across availability zones in the hosting Kubernetes cluster
- Each etcd node in a different Kubernetes cluster.
  - I.e. each etcd node will be provisioned via a separate CRD instance in a different Kubernetes cluster but these nodes will be configured to find each other to form an etcd cluster.
  - There will be as many CRD instances as the number of nodes in the etcd cluster.
  - #233@gardener.
  - Enhance CRD types to address the use-case
  - Add/promote etcd learners/members during scale up, including quorum adjustment.
  - Remove etcd members during scale down, including quorum adjustment.
  - Handle backup/restore in the different states of the etcd cluster
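
The quorum adjustment mentioned for scale up and scale down follows the usual majority rule: a cluster of n voting members needs floor(n/2) + 1 of them to agree. The short Go sketch below only illustrates that arithmetic; the function names `Quorum` and `FaultTolerance` are hypothetical and not part of etcd-druid.

```go
package main

import "fmt"

// Quorum returns the majority size for a cluster with the given number of
// voting members: floor(n/2) + 1. It returns 0 for an empty cluster.
func Quorum(members int) int {
	if members < 1 {
		return 0
	}
	return members/2 + 1
}

// FaultTolerance returns how many members may fail while the cluster
// still retains quorum.
func FaultTolerance(members int) int {
	if members < 1 {
		return 0
	}
	return members - Quorum(members)
}

func main() {
	for n := 1; n <= 5; n++ {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, Quorum(n), FaultTolerance(n))
	}
}
```

Note that going from 3 to 4 members raises the quorum from 2 to 3 without tolerating any additional failure, which is one reason odd cluster sizes are normally preferred.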
Non-disruptive Autoscaling
- The VerticalPodAutoscaler supports multiple update policies including recreate, initial and off.
- The recreate policy is clearly not suitable for single-node etcd instances because it implies frequent, unpredictable and unmanaged downtime.
- The initial policy does not make sense for etcd considering the longer database verification time for non-graceful shutdown.
- For a single-node etcd instance, vertical scaling via the VerticalPodAutoscaler would always be disruptive because of the way scaling is done by VPA. It gives no opportunity to take action before the etcd pod(s) are disrupted for scaling.
- A controller can co-ordinate the etcd-specific steps to mitigate the disruption during (vertical) scaling if the scaling is applied to the CRD instance instead of to the individual pods directly.
Non-disruptive Updates
- For a single-node etcd instance, updates would be disruptive.
- A controller can co-ordinate the etcd-specific steps to mitigate the disruption during updates.
Database Restoration
- Currently, database restoration is also done on startup or restart (if database verification fails), within the main process of the same backup-restore sidecar.
- Introducing a controller enables the option to perform database restoration as a separate job.
- The main advantage of this approach is to decouple the memory requirement of a database restoration from the regular backup (full and delta) tasks.
- This could be of particular interest because delta snapshot restoration requires an embedded etcd instance, which most likely makes the memory requirement for database restoration proportionate to the database size. The memory requirement for backup (full and delta), however, need not be proportionate to the database size at all. In fact, it is realistic to expect the memory requirement for backup to be more or less independent of the database size.
Migration for major updates
- Data and/or backup migration during major updates which change the data and/or backup format or location.
Backup Health Verification
- Currently, we rely on the database backups in the storage provider to remain healthy. There are no additional checks to verify if the backups are still healthy after upload.
- A controller can be used to perform such backup health verification asynchronously.
[Feature] Enhance scope of druid to create/manage additional resources done elsewhere #505


amshuman-kr added the kind/epic Large multi-story topic label Sep 27, 2019

swapnilgm mentioned this issue Sep 27, 2019

A case for etcd controller. gardener/etcd-backup-restore#135

Closed

ghost added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Nov 29, 2019

ghost added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jan 29, 2020

ghost added the lifecycle/stale Nobody worked on this for 6 months (will further age) label May 6, 2020

shreyas-s-rao mentioned this issue Jun 16, 2020

Enhance etcd scale down behavior #59

Closed

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 6, 2020

amshuman-kr added roadmap/team-internal and removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) labels Oct 23, 2020

vlerenc changed the title ~~[Feature] Roadmap~~ ☂️ Gardener ETCD Operator a.k.a. ETCD Druid Nov 5, 2020

vlerenc added this to the 2021-Q4 milestone Nov 10, 2020

gardener-robot added roadmap/internal and removed roadmap/team-internal labels May 21, 2021

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Nov 18, 2021

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels May 18, 2022

abdasgupta mentioned this issue Jan 9, 2023

Update Operator #91

Closed

gardener-robot added kind/roadmap Roadmap BLI and removed roadmap/internal labels Mar 23, 2023

ashwani2k added the priority/2 Priority (lower number equals higher priority) label May 26, 2023

unmarshall mentioned this issue Apr 3, 2024

upgraded go version to 1.22.1 #780

Closed

unmarshall mentioned this issue Jul 3, 2024

Update golang version to 1.22 and adapt loop var improvements #826

Merged
