
A case for etcd controller. #135

Closed
wants to merge 1 commit into from

Conversation

amshuman-kr
Collaborator

What this PR does / why we need it:
A proposal for using a controller as part of etcd life-cycle management. This is intended as complementary to much of the current implementation, not as an alternative.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:


Member

@vlerenc vlerenc left a comment


Thanks for the document, and let me please say: I am neither opposing nor proposing a controller at the moment. I am sure that at some point we will need one (again) – maybe that time has come now, I don’t know. I definitely see it coming when we tackle multi-node, for whatever reason: Gardener ring, smooth rolling updates with hand-over between two etcd instances, multi-zonal control planes. I am just saying that there are very pressing needs we must attend to, and if controllers can help with those, good. If not, then let's not introduce them right now, but first make some space/time by resolving those more pressing needs before we open that box.

For backups, a better option might be to run the backup sidecar as a separate pod for the whole etcd cluster rather than as a sidecar container in each etcd instance of the cluster.
This could be used optionally, only in the multi-node scenario, while we perhaps continue with the sidecar container approach for the single-node scenario.
This could be the first step, and we could think about introducing the controller to co-ordinate other life-cycle operations later.

Member


Yes, multi-node is definitely something that will require a coordinator/controller. So far I see three use cases:

  • Cross-cluster Gardener (whether it’s a self-managed ring is a separate topic, but not relevant for this question)
  • Smooth rolling update with hand-over between two etcd instances (what you call, more succinctly, “non-disruptive maintenance”)
  • Multi-zonal control planes (desired from customer PoV, but less so from technical/our PoV, so with the lowest priority of the three at present)

The additional co-ordination steps to scale down the statefulset, trigger the restoration job, wait for the restoration to complete and then scale the statefulset back up would have to be encoded in a controller.

This could be a second step, because we could introduce only a lean controller for restoration, and it might even be possible to do this without introducing a CRD.
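
A minimal sketch of that co-ordination sequence using plain client-go (illustrative only, not the actual implementation; the statefulset name and the Job passed in are assumptions):

```go
package coordination

import (
	"context"
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// restoreWithScaleDown encodes the sequence: scale the statefulset down,
// run a restoration Job, wait for it, then scale the statefulset back up.
func restoreWithScaleDown(ctx context.Context, cs kubernetes.Interface, ns, sts string, job *batchv1.Job) error {
	scale := func(replicas int32) error {
		s, err := cs.AppsV1().StatefulSets(ns).GetScale(ctx, sts, metav1.GetOptions{})
		if err != nil {
			return err
		}
		s.Spec.Replicas = replicas
		_, err = cs.AppsV1().StatefulSets(ns).UpdateScale(ctx, sts, s, metav1.UpdateOptions{})
		return err
	}

	// 1. Scale the etcd statefulset down so the PV is released.
	if err := scale(0); err != nil {
		return err
	}
	// 2. Trigger the restoration job.
	created, err := cs.BatchV1().Jobs(ns).Create(ctx, job, metav1.CreateOptions{})
	if err != nil {
		return err
	}
	// 3. Wait for the restoration to complete.
	if err := wait.PollImmediate(10*time.Second, 30*time.Minute, func() (bool, error) {
		j, err := cs.BatchV1().Jobs(ns).Get(ctx, created.Name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if j.Status.Failed > 0 {
			return false, fmt.Errorf("restoration job %s failed", j.Name)
		}
		return j.Status.Succeeded > 0, nil
	}); err != nil {
		return err
	}
	// 4. Scale the statefulset back up.
	return scale(1)
}
```

A real controller would run this from a reconcile loop and requeue rather than block, but the sequence of steps stays the same.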

Member


You argue that the restore operation, if executed in one pod, will result in high memory demand (etcd plus the full restoration all happening in one pod). Yes, I see that if we can decouple that, it’s better for us, i.e. detect the need before having scheduled etcd, restore when necessary and only then schedule etcd. By that we can decouple everything more nicely and solve the PV-can-only-be-mounted-once issue. Then again, it feels to me as if this could be ranked lower, wouldn’t you agree? The effort is very high, we have a solution that works today, and it’s a rather rare operation anyway.

Collaborator Author


The effort is not really that high. In fact, the current backup-restore image already supports a subcommand for restoration. So, technically, we can use the existing backup-restore image as a job to perform the restoration. All that would be required is a controller to co-ordinate the sequence of scaling down the statefulset, triggering the job, waiting for completion and then scaling the statefulset back up. The existing functionality could be fully re-used, even in terms of the binary Docker image.
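
To illustrate the reuse, here is a sketch of such a one-off restoration Job that could be fed into the co-ordination sequence above (the image reference, restore flags and volume names are assumptions for illustration, not the exact etcdbrctl invocation):

```go
package coordination

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newRestorationJob builds a Job that reuses the existing backup-restore image
// to restore the database onto the etcd PVC before etcd is scheduled again.
func newRestorationJob(ns, pvcName string) *batchv1.Job {
	backoff := int32(0)
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-restoration", Namespace: ns},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoff,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "restore",
						Image: "backup-restore-image:<tag>", // placeholder for the existing image
						// Hypothetical invocation of the existing restore subcommand;
						// the actual flags would mirror the current sidecar setup.
						Command: []string{"etcdbrctl", "restore"},
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "etcd-data",
							MountPath: "/var/etcd/data",
						}},
					}},
					Volumes: []corev1.Volume{{
						Name: "etcd-data",
						VolumeSource: corev1.VolumeSource{
							PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
								ClaimName: pvcName,
							},
						},
					}},
				},
			},
		},
	}
}
```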

Collaborator Author


I agree that restoration is rare, but given the likelihood that our customers might hit this problem before us, and the impact being quite high when the problem is encountered, I feel it is important to optimise database restoration both in terms of memory and time. At least prio-2, if not prio-1.

A controller can be used to periodically perform compaction of the incremental backups in the backup storage. This can optimize both restoration time and backup storage space utilization without affecting regular backup performance, because such compaction would be done asynchronously.

This feature could be implemented as an enhancement of the [database restoration](#database-restoration) functionality.
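
A minimal sketch of such asynchronous, periodic compaction triggered via a CronJob (the `compact` subcommand, image reference and schedule are assumptions, not existing functionality; batch/v1 CronJob types are used, older clusters would need batch/v1beta1):

```go
package coordination

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newCompactionCronJob schedules background compaction of the incremental
// backups, fully decoupled from the regular backup path.
func newCompactionCronJob(ns string) *batchv1.CronJob {
	return &batchv1.CronJob{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-backup-compaction", Namespace: ns},
		Spec: batchv1.CronJobSpec{
			Schedule:          "0 */6 * * *",           // illustrative: every 6 hours
			ConcurrencyPolicy: batchv1.ForbidConcurrent, // never run two compactions in parallel
			JobTemplate: batchv1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							RestartPolicy: corev1.RestartPolicyNever,
							Containers: []corev1.Container{{
								Name:    "compact",
								Image:   "backup-restore-image:<tag>",       // placeholder
								Command: []string{"etcdbrctl", "compact"}, // hypothetical subcommand
							}},
						},
					},
				},
			},
		},
	}
}
```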

Member


Optimization via incremental backup compaction and backup health verification could be a use case, though an expensive one (it needs disk, compute and network to compact what is already in store, for the rare case of a restoration). E.g., I am not sure whether restore optimisation is (already) that important. Also, shoot cluster migration to another seed should be a rare operation; it’s basically a DR measure. And as for backup health verification, I have not heard that people have reason not to trust S3 or similar services. We also write new full backups again and again. I really don’t see that as an argument until we learn that it’s not trustworthy.

Collaborator Author


The incremental backup compaction need not be that expensive at all. Yes, it would require disk, compute and network. But since it would be done in a separate job, it would be possible to run it as a low-priority job with throttled resources, possibly on dedicated low-cost hardware (even spot instances). The jobs could also be run serially (effectively so, due to the throttled resources).
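
A sketch of how the compaction job's pod could be throttled and steered to low-cost capacity (the priority class, node labels, taints and resource values are illustrative assumptions):

```go
package coordination

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// lowPriorityCompactionPodSpec wraps the compaction container in a pod spec that
// is throttled (so it runs slowly but cheaply) and scheduled onto a dedicated,
// low-cost (e.g. spot/preemptible) node pool.
func lowPriorityCompactionPodSpec(container corev1.Container) corev1.PodSpec {
	container.Resources = corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("200m"),
			corev1.ResourceMemory: resource.MustParse("256Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("500m"),
			corev1.ResourceMemory: resource.MustParse("1Gi"),
		},
	}
	return corev1.PodSpec{
		RestartPolicy:     corev1.RestartPolicyNever,
		PriorityClassName: "low-priority-background", // assumed PriorityClass
		// Steer the job to an assumed spot/preemptible node pool.
		NodeSelector: map[string]string{"pool": "spot"},
		Tolerations: []corev1.Toleration{{
			Key:      "spot",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoSchedule,
		}},
		Containers: []corev1.Container{container},
	}
}
```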

Collaborator Author


There might be alternative solutions to the problem of optimising the restoration time (other than compaction in the background), e.g. increasing the full-snapshot frequency to reduce the number of incremental backup files. These should be pursued too. But my (not very deep) calculations indicate that such approaches may not fully address the issue. I could be wrong, of course.
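
For illustration only (this rough model is an addition for this write-up, not part of the original calculation): if a restoration replays the latest full snapshot plus every incremental snapshot taken since it, then roughly

```latex
T_{\text{restore}} \approx T_{\text{full}} + N_{\Delta} \cdot \bar{t}_{\text{apply}},
\qquad
N_{\Delta} \propto (\text{write rate}) \times (\text{full-snapshot interval})
```

Increasing the full-snapshot frequency shrinks N_Δ, but it adds full-snapshot load on the running etcd and on the network in every cycle, whereas background compaction bounds N_Δ without touching the regular backup path.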

Collaborator Author


I have addressed the rareness of restoration above.

Collaborator Author


Health check and major version upgrade were minor points. I agree they are not urgent at all. The health check we might not want to do at all. I have updated the document to reflect this more clearly.

Another alternative is to create a custom image for the etcd container that includes the co-ordination logic. This is also not very desirable.

As mentioned in the case of [database restoration](#database-restoration), database verification is part of a subcommand of the existing backup-restore component. So, the existing verification functionality can be fully reused, even at the level of the binary Docker image.

Member


Yes, I mentioned database verification on top as well, but how can you decouple it from the etcd restart? You need the former before you can do the latter. So I only see the memory aspect (like above), and only that, as a pro argument. Or do you mean an internal optimisation because of the current hand-over between the two? Then it is not a runtime improvement, but just an implementation improvement. OK, agreed, but still not a compelling reason to change everything right now, especially since we have too much on our plate with etcd at present (the pressing needs, but also the etcd VPA topic).

Collaborator Author


I agree with the priority. But I would like to address the effort aspect. As in the case of restoration, this is already part of a subcommand in backup-restore. So, we can re-use practically everything in the sidecar as-is, even at the binary Docker image level. The only thing that would change is the shell script doing the co-ordination, which would have to be moved to the controller.

The `initial` policy makes more sense when coupled with unset/unlimited resource limits (but very clear, autoscaled resource requests).

With a controller, even the `off` option becomes feasible, because the point in time at which the `VerticalPodAutoscaler`'s recommendations are applied could be decided in a very custom way, while still relying on the `VerticalPodAutoscaler` for the recommendations themselves.
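
A minimal sketch of such a recommendation-only `VerticalPodAutoscaler` (assuming the upstream VPA v1 Go types; the target names are illustrative), leaving the timing of applying the recommendations to an external controller:

```go
package coordination

import (
	autoscalingv1 "k8s.io/api/autoscaling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	vpav1 "k8s.io/autoscaler/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1"
)

// etcdVPA creates a VPA that only records recommendations (UpdateMode "Off");
// a controller can then decide when and how to apply them.
func etcdVPA(namespace string) *vpav1.VerticalPodAutoscaler {
	off := vpav1.UpdateModeOff
	return &vpav1.VerticalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-main", Namespace: namespace},
		Spec: vpav1.VerticalPodAutoscalerSpec{
			TargetRef: &autoscalingv1.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "StatefulSet",
				Name:       "etcd-main", // illustrative target name
			},
			UpdatePolicy: &vpav1.PodUpdatePolicy{UpdateMode: &off},
		},
	}
}
```

Switching to `vpav1.UpdateModeInitial` would have VPA apply the recommendations only at pod (re)creation, which matches the `initial` step described above.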

Member


As for VPA, if it lacks something, shouldn’t we rather help it get the missing feature? It needs to be crystal clear what’s missing, and that we won’t get it, before deciding to replicate/redo what VPA already has. So today this is no argument, at least for me.

Collaborator Author


I am not advocating an alternative to VPA. VPA is being evaluated with `updatePolicy` `off` to begin with (just like for Prometheus), and the intention is to use it with `updatePolicy` `initial` as the next step. There is no question of replicating or redoing VPA whatsoever.

But that is for now. It is very clear what is missing in VPA for non-disruptive autoscaling of etcd. I see no way to make VPA's current mutating-webhook approach work for non-disruptive autoscaling. The recommender part of VPA would still work very well.

This is the line of reasoning. VPA with anything other than `off` or `initial` (i.e. `recreate` or `auto`) would kill the pod directly and rely on the mutating webhook to inject the right resource recommendations. But this would be disruptive unless we have multi-node support. If we go for an always-multi-node etcd approach, this would not be such a problem. But our inclination has so far been to stick to single-node etcd for most scenarios and go for multi-node only if necessary.

So, if we want to go temporarily multi-node during VPA scaling, then we need to sit between VPA’s recommendation and the updater. One way to do this would be to use a CRD for etcd with a pod template (and hence, resource recommendations). If we avoid VPA's mutating webhook, we can still rely on VPA and its recommender (pretty much fully), but use a custom updater (or enhance VPA’s own updater) to update our CRD’s resource recommendations. The controller can then scale out the statefulset temporarily with the newer resource requirements (or permanently, if we opt for the always-multi-node approach) in a rolling manner, and then scale it back in similarly.

This is all theory, of course, with a big if. Practice will have its say when we try it.
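
A purely hypothetical sketch of such a CRD's Go types (names and fields are assumptions for illustration, not an existing API): a custom updater would write the VPA recommender's values into `Resources`, and the controller would roll them out, temporarily scaling out if needed.

```go
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Etcd is a hypothetical custom resource describing a (possibly temporarily
// multi-node) etcd cluster managed by the controller.
type Etcd struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              EtcdSpec `json:"spec"`
}

type EtcdSpec struct {
	// Replicas is normally 1; the controller may raise it temporarily to roll
	// out a disruptive change (e.g. new resource recommendations) node by node.
	Replicas int32 `json:"replicas"`
	// Resources holds the desired etcd container resources; a custom updater
	// (or an enhanced VPA updater) writes the recommender's values here.
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`
	// BackupResources similarly holds the backup-restore container resources.
	BackupResources corev1.ResourceRequirements `json:"backupResources,omitempty"`
}
```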


One way to make such changes less disruptive could be to temporarily scale the cluster into a multi-node (3) `etcd` cluster, perform the disruptive change in a rolling manner on each of the individual nodes of the multi-node `etcd` cluster, and then scale the `etcd` cluster back down to a single instance.
This kind of multi-step process is better implemented as a controller.

Member


Just a minor comment: the non-disruptive maintenance can also be implemented with two etcds and a controller. I would not involve three etcd cluster nodes if not necessary.

Collaborator Author


That could be possible. But, as you mentioned, it would still require a controller, whether with 2 nodes or 3.

So far, the `etcd` versions we have been using have provided backward compatibility for the database. But it is possible that some future version breaks compatibility and some data migration is required. Without a controller, this would have to be done in an ad hoc manner. A controller would be a good place to encode such logic.

This feature could be implemented on demand when a new etcd version requires such data migration.

Member


Yes, migration will most likely always require a coordinator of sorts.

A controller can be used to perform such backup health verification asynchronously.

This is a minor point which may not be important to implement.

Member


See above, comment on "Database Incremental Backup Compaction".


Because of the disruptive nature of scaling single-node `etcd` instances, it might make sense to restrict some of the low-priority life-cycle operations (for example, scaling down) to the maintenance window of the `Shoot` cluster which is backed by the given `etcd` instance.
It would be possible to implement this with the current sidecar approach, but it might be cleaner to do it in a controller (along with, possibly, a `CustomResourceDefinition` to capture co-ordination information such as the maintenance window).
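
A minimal sketch of how a controller could defer such low-priority operations to the maintenance window (the window format and helper are assumptions for illustration):

```go
package coordination

import "time"

// MaintenanceWindow is a hypothetical daily window in UTC that could be carried
// on the custom resource, e.g. 22:00–23:00.
type MaintenanceWindow struct {
	Begin time.Duration // offset from midnight UTC, e.g. 22 * time.Hour
	End   time.Duration // offset from midnight UTC, e.g. 23 * time.Hour
}

// InWindow reports whether now falls inside the daily window, handling windows
// that wrap around midnight.
func (w MaintenanceWindow) InWindow(now time.Time) bool {
	utc := now.UTC()
	midnight := time.Date(utc.Year(), utc.Month(), utc.Day(), 0, 0, 0, 0, time.UTC)
	offset := utc.Sub(midnight)
	if w.Begin <= w.End {
		return offset >= w.Begin && offset < w.End
	}
	return offset >= w.Begin || offset < w.End
}
```

The reconcile loop would simply requeue a low-priority operation such as scale-down until `InWindow` returns true, while urgent operations bypass the check.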

Member


Yes, more advanced functions like scale-down during maintenance windows could be better implemented by means of a controller. Then again, I am not sure whether this alone would be necessary. We will roll seed cluster nodes fairly regularly, so there are natural windows of opportunity.

Collaborator Author


Yes. But it might be better to make this possible even independently of such events.

@swapnilgm swapnilgm added component/etcd-backup-restore ETCD Backup & Restore area/scalability exp/expert Issue that requires expert level knowledge platform/all kind/api-change API change with impact on API users status/under-investigation Issue is under investigation size/l Size of pull request is large (see gardener-robot robot/bots/size.py) area/high-availability High availability related labels Mar 29, 2019
@swapnilgm
Contributor

@amshuman-kr Will you please update the PR with the conclusions from the discussion, and rename the file to doc/proposals/controller.md, so that I can merge it?

@swapnilgm
Contributor

swapnilgm commented Sep 27, 2019

Closing the discussion thread, since we have already started with https://github.com/gardener/etcd-druid, which will incorporate the conclusions from this discussion.
Ref: gardener/etcd-druid#2

@swapnilgm swapnilgm closed this Sep 27, 2019
@amshuman-kr amshuman-kr deleted the proposal-controller branch September 27, 2019 10:33
@ghost ghost added the kind/epic Large multi-story topic label Mar 7, 2020
@gardener-robot gardener-robot added the area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related label Jun 5, 2020
@gardener-robot gardener-robot added priority/3 Priority (lower number equals higher priority) and removed priority/3 Priority (lower number equals higher priority) labels Mar 8, 2021