☂️ Gardener ETCD Operator a.k.a. ETCD Druid #2

Open
7 of 27 tasks
amshuman-kr opened this issue Sep 27, 2019 · 0 comments
Labels
kind/epic Large multi-story topic kind/roadmap Roadmap BLI lifecycle/rotten Nobody worked on this for 12 months (final aging stage) priority/2 Priority (lower number equals higher priority)
amshuman-kr commented Sep 27, 2019

Feature (What you would like to be added):
Summarise the roadmap for etcd-druid with links to the corresponding issues.

Motivation (Why is this needed?):
A central place to collect the roadmap as well as the progress.

Approach/Hint to implement the solution (optional):

  • Basic Controller
    • Define CRD types
    • Implement basic controller to deploy StatefulSet (with replicas: 1) with the containers for etcd and etcd-backup-restore the same way it is being done now.
    • Unit tests
    • Integration tests
  • Propagate etcd defragmentation schedule from the CRD to etcd-backup-restore sidecar container.
  • Trigger full snapshot before hibernation/scale down.
  • Backup compaction
    • Incremental/continuous backup is used for finer-granularity backup (in the order of minutes), with full snapshots being taken at much larger intervals (in the order of hours). This makes the backup efficient in terms of disk, network bandwidth and backup storage space utilisation as well as compute resource utilisation during backup.
    • If the proportion of changes in the incremental backups is large, restoration times suffer because incremental backups can only be restored in sequence.
    • #61@etcd-backup-restore.
  • Multi-node etcd cluster
    • All etcd nodes within the same Kubernetes cluster.
      • I.e., one CRD instance would provision multiple etcd nodes in the same Kubernetes cluster/namespace as the CRD instance.
      • Enhance CRD types to address the use-case
      • Scale sub-resource implementation for the current CRD
      • Add/promote etcd learners/members during scale up, including quorum adjustment.
      • Remove etcd members during scale down, including quorum adjustment.
      • Handle backup/restore in the different states of the etcd cluster
      • Multi-AZ support
        • I.e. etcd nodes distributed across availability zones in the hosting Kubernetes cluster
    • Each etcd node in a different Kubernetes cluster.
      • I.e. each etcd node will be provisioned via a separate CRD instance in a different Kubernetes cluster but these nodes will be configured to find each other to form an etcd cluster.
      • There will be as many CRD instances as the number of nodes in the etcd cluster.
      • #233@gardener.
      • Enhance CRD types to address the use-case
      • Add/promote etcd learners/members during scale up, including quorum adjustment.
      • Remove etcd members during scale down, including quorum adjustment.
      • Handle backup/restore in the different states of the etcd cluster
  • Non-disruptive Autoscaling
    • The VerticalPodAutoscaler supports multiple update policies including recreate, initial and off.
    • The recreate policy is clearly not suitable for single-node etcd instances because it implies frequent, unpredictable and unmanaged downtime.
    • The initial policy does not make sense for etcd considering the longer database verification time for non-graceful shutdown.
    • For a single-node etcd instance, vertical scaling via the VerticalPodAutoscaler would always be disruptive because of the way scaling is done by VPA. It gives no opportunity to take action before the etcd pod(s) are disrupted for scaling.
    • A controller can co-ordinate the etcd-specific steps to mitigate the disruption during (vertical) scaling if an alternative way is used to vertically scale a CRD instead of the individual pods directly.
  • Non-disruptive Updates
    • For a single-node etcd instance, updates would be disruptive.
    • A controller can co-ordinate the etcd-specific steps to mitigate the disruption during updates.
  • Database Restoration
    • Database restoration is currently done on startup (or on a restart), if database verification fails, within the backup-restore sidecar's main process.
    • Introducing a controller enables the option to perform database restoration as a separate job.
    • The main advantage of this approach is to decouple the memory requirement of a database restoration from the regular backup (full and delta) tasks.
    • This could be especially of interest because delta snapshot restoration requires an embedded etcd instance, which means the memory requirement for database restoration is likely to be proportionate to the database size. However, the memory requirement for backup (full and delta) need not be proportionate to the database size at all. In fact, it is realistic to expect the memory requirement for backup to be more or less independent of the database size.
  • Migration for major updates
    • Data and/or backup migration during major updates which change the data and/or backup format or location.
  • Backup Health Verification
    • Currently, we rely on the database backups in the storage provider to remain healthy. There are no additional checks to verify if the backups are still healthy after upload.
    • A controller can be used to perform such backup health verification asynchronously.
  • [Feature] Enhance scope of druid to create/manage additional resources done elsewhere #505
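
The quorum adjustment mentioned for scale up and scale down in the multi-node items above follows etcd's majority rule: a cluster with n voting members needs floor(n/2) + 1 of them to agree, and learners do not count towards quorum. A minimal sketch of that arithmetic follows; the function names are hypothetical and not part of etcd-druid.

```python
def quorum(voting_members: int) -> int:
    """Majority needed among voting members (learners are not counted)."""
    if voting_members < 1:
        raise ValueError("a cluster needs at least one voting member")
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """How many voting members can fail while quorum is still reachable."""
    return voting_members - quorum(voting_members)


if __name__ == "__main__":
    # Print the quorum and tolerated failures for small cluster sizes.
    for n in (1, 2, 3, 4, 5):
        print(f"members={n} quorum={quorum(n)} tolerates={fault_tolerance(n)}")
```

Going from 3 to 4 voting members raises the quorum from 2 to 3 without improving fault tolerance, which is one reason to add a new node as a learner first and promote it only once it has caught up.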
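
The restoration cost described under Backup compaction can be illustrated with a small sketch: a restore starts from the latest full snapshot and then applies every later incremental (delta) snapshot in order, so the length of that chain drives restoration time. The `Snapshot` type and `restore_plan` function below are hypothetical and do not reflect the actual etcd-backup-restore data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    kind: str       # "full" or "delta" (hypothetical labels)
    revision: int   # last etcd revision contained in the snapshot


def restore_plan(snapshots: list[Snapshot]) -> list[Snapshot]:
    """Return the latest full snapshot followed by all later deltas, in order."""
    fulls = [s for s in snapshots if s.kind == "full"]
    if not fulls:
        raise ValueError("no full snapshot to restore from")
    base = max(fulls, key=lambda s: s.revision)
    # Deltas older than the base are already covered by the full snapshot.
    deltas = sorted(
        (s for s in snapshots if s.kind == "delta" and s.revision > base.revision),
        key=lambda s: s.revision,
    )
    return [base, *deltas]


if __name__ == "__main__":
    backups = [
        Snapshot("full", 100),
        Snapshot("delta", 150),
        Snapshot("full", 200),
        Snapshot("delta", 260),
        Snapshot("delta", 230),
    ]
    for step in restore_plan(backups):
        print(step.kind, step.revision)
```

Under these assumptions, compaction amounts to folding the delta chain into a fresh full snapshot so that the plan collapses to a single step.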
@amshuman-kr amshuman-kr added the kind/epic Large multi-story topic label Sep 27, 2019
@ghost ghost added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Nov 29, 2019
@ghost ghost added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jan 29, 2020
@ghost ghost added the lifecycle/stale Nobody worked on this for 6 months (will further age) label May 6, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 6, 2020
@amshuman-kr amshuman-kr added roadmap/team-internal and removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) labels Oct 23, 2020
@vlerenc vlerenc changed the title [Feature] Roadmap ☂️ Gardener ETCD Operator a.k.a. ETCD Druid Nov 5, 2020
@vlerenc vlerenc added this to the 2021-Q4 milestone Nov 10, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Nov 18, 2021
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels May 18, 2022
@ashwani2k ashwani2k added the priority/2 Priority (lower number equals higher priority) label May 26, 2023