
A case for etcd controller. #135

Closed
wants to merge 1 commit into from

Conversation

amshuman-kr
Collaborator

What this PR does / why we need it:
A proposal for using a controller as part of etcd life-cycle management. This is intended as complementary to much of the current implementation, not as an alternative.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:


Member

@vlerenc vlerenc left a comment


Thanks for the document, and let me please say: I am neither opposing nor proposing a controller at the moment. I am sure that at some point we will need one (again) – maybe that time has come now, I don’t know. I definitely see it coming when we tackle multi-node, for whatever reason: Gardener ring, smooth rolling updates with hand-over between two etcd instances, multi-zonal control planes. I am just saying that there are very pressing needs we must attend to, and if controllers can help with those, good. If not, then let's not introduce them right now, but first make some space/time by resolving those more pressing needs before we open that box.

For backups, a better option might be to run the backup sidecar as a separate pod for the whole etcd cluster rather than as a sidecar container in each etcd instance of the cluster.
This could be used optionally, only in the multi-node scenario, while we perhaps continue with the sidecar container approach for the single-node scenario.
This could be the first step, and we could think about introducing the controller to co-ordinate other life-cycle operations later.

Member


Yes, multi-node is definitely something that will require a coordinator/controller. So far I see three use cases:

  • Cross-cluster Gardener (whether it’s a self-managed ring is a separate topic, but not relevant for this question)
  • Smooth rolling update with hand-over between two etcd instances (what you call, more succinctly, “non-disruptive maintenance”)
  • Multi-zonal control planes (desired from customer PoV, but less so from technical/our PoV, so with the lowest priority of the three at present)

The additional co-ordination steps to scale down the statefulset, trigger the restoration job, wait for the restoration to complete and then scale the statefulset back up would have to be encoded in a controller.

This could be a second step, because we could introduce only a lean controller for restoration, and it might even be possible to do this without introducing a CRD.
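
A minimal sketch of that co-ordination sequence using plain client-go (illustrative only, not the actual implementation; the statefulset name and the Job passed in are assumptions):

```go
package coordination

import (
	"context"
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// restoreWithScaleDown encodes the sequence: scale the statefulset down,
// run a restoration Job, wait for it, then scale the statefulset back up.
func restoreWithScaleDown(ctx context.Context, cs kubernetes.Interface, ns, sts string, job *batchv1.Job) error {
	scale := func(replicas int32) error {
		s, err := cs.AppsV1().StatefulSets(ns).GetScale(ctx, sts, metav1.GetOptions{})
		if err != nil {
			return err
		}
		s.Spec.Replicas = replicas
		_, err = cs.AppsV1().StatefulSets(ns).UpdateScale(ctx, sts, s, metav1.UpdateOptions{})
		return err
	}

	// 1. Scale the etcd statefulset down so the PV is released.
	if err := scale(0); err != nil {
		return err
	}
	// 2. Trigger the restoration job.
	created, err := cs.BatchV1().Jobs(ns).Create(ctx, job, metav1.CreateOptions{})
	if err != nil {
		return err
	}
	// 3. Wait for the restoration to complete.
	if err := wait.PollImmediate(10*time.Second, 30*time.Minute, func() (bool, error) {
		j, err := cs.BatchV1().Jobs(ns).Get(ctx, created.Name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if j.Status.Failed > 0 {
			return false, fmt.Errorf("restoration job %s failed", j.Name)
		}
		return j.Status.Succeeded > 0, nil
	}); err != nil {
		return err
	}
	// 4. Scale the statefulset back up.
	return scale(1)
}
```

A real controller would run this from a reconcile loop and requeue rather than block, but the sequence of steps stays the same.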

Member


You argue that the restore operation, if executed in one pod, will result in high memory demand (etcd plus the full restoration all happening in one pod). Yes, I see that if we can decouple that, it’s better for us, i.e. detect the need before having scheduled etcd, restore when necessary and only then schedule etcd. By that we can decouple everything more nicely and solve the PV-can-only-be-mounted-once issue. Then again, it feels to me as if this could be ranked lower, wouldn’t you agree? The effort is very high, we have a solution that works today, and it’s a rather rare operation anyway.

Collaborator Author


The effort is not really that high. In fact, the current backup-restore image already supports a subcommand for restoration. So, technically, we can use the existing backup-restore image as a job to perform the restoration. All that would be required is a controller to co-ordinate the sequence of scaling down the statefulset, triggering the job, waiting for completion and then scaling the statefulset back up. The existing functionality could be fully re-used, even in terms of the binary Docker image.
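
To illustrate the reuse, here is a sketch of such a one-off restoration Job that could be fed into the co-ordination sequence above (the image reference, restore flags and volume names are assumptions for illustration, not the exact etcdbrctl invocation):

```go
package coordination

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newRestorationJob builds a Job that reuses the existing backup-restore image
// to restore the database onto the etcd PVC before etcd is scheduled again.
func newRestorationJob(ns, pvcName string) *batchv1.Job {
	backoff := int32(0)
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-restoration", Namespace: ns},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoff,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "restore",
						Image: "backup-restore-image:<tag>", // placeholder for the existing image
						// Hypothetical invocation of the existing restore subcommand;
						// the actual flags would mirror the current sidecar setup.
						Command: []string{"etcdbrctl", "restore"},
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "etcd-data",
							MountPath: "/var/etcd/data",
						}},
					}},
					Volumes: []corev1.Volume{{
						Name: "etcd-data",
						VolumeSource: corev1.VolumeSource{
							PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
								ClaimName: pvcName,
							},
						},
					}},
				},
			},
		},
	}
}
```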

Collaborator Author


I agree that restoration is rare, but given the likelihood that our customers might hit this problem before us, and the impact being quite high when the problem is encountered, I feel it is important to optimise database restoration both in terms of memory and time. At least prio-2, if not prio-1.

A controller can be used to periodically perform compaction of the incremental backups in the backup storage. This can optimize both restoration time and backup storage space utilization without affecting regular backup performance, because such compaction would be done asynchronously.

This feature could be implemented as an enhancement of the [database restoration](#database-restoration) functionality.
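
A minimal sketch of such asynchronous, periodic compaction triggered via a CronJob (the `compact` subcommand, image reference and schedule are assumptions, not existing functionality; batch/v1 CronJob types are used, older clusters would need batch/v1beta1):

```go
package coordination

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newCompactionCronJob schedules background compaction of the incremental
// backups, fully decoupled from the regular backup path.
func newCompactionCronJob(ns string) *batchv1.CronJob {
	return &batchv1.CronJob{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-backup-compaction", Namespace: ns},
		Spec: batchv1.CronJobSpec{
			Schedule:          "0 */6 * * *",           // illustrative: every 6 hours
			ConcurrencyPolicy: batchv1.ForbidConcurrent, // never run two compactions in parallel
			JobTemplate: batchv1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							RestartPolicy: corev1.RestartPolicyNever,
							Containers: []corev1.Container{{
								Name:    "compact",
								Image:   "backup-restore-image:<tag>",       // placeholder
								Command: []string{"etcdbrctl", "compact"}, // hypothetical subcommand
							}},
						},
					},
				},
			},
		},
	}
}
```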

Member


Optimization via incremental backup compaction and backup health verification could be a use case, though an expensive one (it needs disk, compute and network to compact what is already in store, for the rare case of a restoration). E.g., I am not sure whether restore optimisation is (already) that important. Also, shoot cluster migration to another seed should be a rare operation; it’s basically a DR measure. And as for backup health verification, I have not heard that people have reason not to trust S3 or similar services. We also write new full backups again and again. I really don’t see that as an argument until we learn that it’s not trustworthy.

Collaborator Author


The incremental backup compaction need not be that expensive at all. Yes, it would require disk, compute and network. But since it would be done in a separate job, it would be possible to run it as a low-priority job with throttled resources, possibly on dedicated low-cost hardware (even spot instances). The jobs could also be run serially (effectively so, due to the throttled resources).
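
A sketch of how the compaction job's pod could be throttled and steered to low-cost capacity (the priority class, node labels, taints and resource values are illustrative assumptions):

```go
package coordination

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// lowPriorityCompactionPodSpec wraps the compaction container in a pod spec that
// is throttled (so it runs slowly but cheaply) and scheduled onto a dedicated,
// low-cost (e.g. spot/preemptible) node pool.
func lowPriorityCompactionPodSpec(container corev1.Container) corev1.PodSpec {
	container.Resources = corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("200m"),
			corev1.ResourceMemory: resource.MustParse("256Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("500m"),
			corev1.ResourceMemory: resource.MustParse("1Gi"),
		},
	}
	return corev1.PodSpec{
		RestartPolicy:     corev1.RestartPolicyNever,
		PriorityClassName: "low-priority-background", // assumed PriorityClass
		// Steer the job to an assumed spot/preemptible node pool.
		NodeSelector: map[string]string{"pool": "spot"},
		Tolerations: []corev1.Toleration{{
			Key:      "spot",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoSchedule,
		}},
		Containers: []corev1.Container{container},
	}
}
```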

Collaborator Author


There might be alternative solutions to the problem of optimising the restoration time (other than compaction in the background), e.g. increasing the full-snapshot frequency to reduce the number of incremental backup files. These should be pursued too. But my (not very deep) calculations indicate that such approaches may not fully address the issue. I could be wrong, of course.
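
For illustration only (this rough model is an addition for this write-up, not part of the original calculation): if a restoration replays the latest full snapshot plus every incremental snapshot taken since it, then roughly

```latex
T_{\text{restore}} \approx T_{\text{full}} + N_{\Delta} \cdot \bar{t}_{\text{apply}},
\qquad
N_{\Delta} \propto (\text{write rate}) \times (\text{full-snapshot interval})
```

Increasing the full-snapshot frequency shrinks N_Δ, but it adds full-snapshot load on the running etcd and on the network in every cycle, whereas background compaction bounds N_Δ without touching the regular backup path.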

Collaborator Author


I have addressed the rareness of restoration above.

Collaborator Author


Health check and major version upgrade were minor points. I agree they are not urgent at all. The health check we might not want to do at all. I have updated the document to reflect this more clearly.

Another alternative is to create a custom image for the etcd container that includes the co-ordination logic. This is also not very desirable.

As mentioned in the case of [database restoration](#database-restoration), database verification is part of a subcommand of the existing backup-restore component. So, the existing verification functionality can be fully reused, even at the level of the binary Docker image.

Member


Yes, I mentioned database verification on top as well, but how can you decouple it from the etcd restart? You need the former before you can do the latter. So I only see the memory aspect (like above), and only that, as a pro argument. Or do you mean an internal optimisation because of the current hand-over between the two? Then it is not a runtime improvement, but just an implementation improvement. OK, agreed, but still not a compelling reason to change everything right now, especially since we have too much on our plate with etcd at present (the pressing needs, but also the etcd VPA topic).

Collaborator Author


I agree with the priority. But I would like to address the effort aspect. As in the case of restoration, this is already part of a subcommand in backup-restore. So, we can re-use practically everything in the sidecar as-is, even at the binary Docker image level. The only thing that would change is the shell script doing the co-ordination, which would have to be moved to the controller.

The `initial` policy makes more sense when coupled with unset/unlimited resource limits (but very clear, autoscaled resource requests).

With a controller, even the `off` option becomes feasible, because the point in time at which the `VerticalPodAutoscaler`'s recommendations are applied could be decided in a very custom way, while still relying on the `VerticalPodAutoscaler` for the recommendations themselves.
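
A minimal sketch of such a recommendation-only `VerticalPodAutoscaler` (assuming the upstream VPA v1 Go types; the target names are illustrative), leaving the timing of applying the recommendations to an external controller:

```go
package coordination

import (
	autoscalingv1 "k8s.io/api/autoscaling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	vpav1 "k8s.io/autoscaler/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1"
)

// etcdVPA creates a VPA that only records recommendations (UpdateMode "Off");
// a controller can then decide when and how to apply them.
func etcdVPA(namespace string) *vpav1.VerticalPodAutoscaler {
	off := vpav1.UpdateModeOff
	return &vpav1.VerticalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-main", Namespace: namespace},
		Spec: vpav1.VerticalPodAutoscalerSpec{
			TargetRef: &autoscalingv1.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "StatefulSet",
				Name:       "etcd-main", // illustrative target name
			},
			UpdatePolicy: &vpav1.PodUpdatePolicy{UpdateMode: &off},
		},
	}
}
```

Switching to `vpav1.UpdateModeInitial` would have VPA apply the recommendations only at pod (re)creation, which matches the `initial` step described above.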

Member


As for VPA, if it lacks something, shouldn’t we rather help it get the missing feature? It needs to be crystal clear what’s missing, and that we won’t get it, before deciding to replicate/redo what VPA already has. So today this is no argument, at least for me.

Collaborator Author


I am not advocating an alternative to VPA. VPA is being evaluated with `updatePolicy` `off` to begin with (just like for Prometheus), and the intention is to use it with `updatePolicy` `initial` as the next step. There is no question of replicating or redoing VPA whatsoever.

But that is for now. It is very clear what is missing in VPA for non-disruptive autoscaling of etcd. I see no way to make VPA's current mutating-webhook approach work for non-disruptive autoscaling. The recommender part of VPA would still work very well.

This is the line of reasoning. VPA with anything other than `off` or `initial` (i.e. `recreate` or `auto`) would kill the pod directly and rely on the mutating webhook to inject the right resource recommendations. But this would be disruptive unless we have multi-node support. If we go for an always-multi-node etcd approach, this would not be such a problem. But our inclination has so far been to stick to single-node etcd for most scenarios and go for multi-node only if necessary.

So, if we want to go temporarily multi-node during VPA scaling, then we need to sit between VPA’s recommendation and the updater. One way to do this would be to use a CRD for etcd with a pod template (and hence, resource recommendations). If we avoid VPA's mutating webhook, we can still rely on VPA and its recommender (pretty much fully), but use a custom updater (or enhance VPA’s own updater) to update our CRD’s resource recommendations. The controller can then scale out the statefulset temporarily with the newer resource requirements (or permanently, if we opt for the always-multi-node approach) in a rolling manner, and then scale it back in similarly.

This is all theory, of course, with a big if. Practice will have its say when we try it.
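
A purely hypothetical sketch of such a CRD's Go types (names and fields are assumptions for illustration, not an existing API): a custom updater would write the VPA recommender's values into `Resources`, and the controller would roll them out, temporarily scaling out if needed.

```go
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Etcd is a hypothetical custom resource describing a (possibly temporarily
// multi-node) etcd cluster managed by the controller.
type Etcd struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              EtcdSpec `json:"spec"`
}

type EtcdSpec struct {
	// Replicas is normally 1; the controller may raise it temporarily to roll
	// out a disruptive change (e.g. new resource recommendations) node by node.
	Replicas int32 `json:"replicas"`
	// Resources holds the desired etcd container resources; a custom updater
	// (or an enhanced VPA updater) writes the recommender's values here.
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`
	// BackupResources similarly holds the backup-restore container resources.
	BackupResources corev1.ResourceRequirements `json:"backupResources,omitempty"`
}
```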


One way to make such changes less disruptive could be to temporarily scale the cluster into a multi-node (3) `etcd` cluster, perform the disruptive change in a rolling manner on each of the individual nodes of the multi-node `etcd` cluster, and then scale the `etcd` cluster back down to a single instance.
This kind of multi-step process is better implemented as a controller.

Member


Just a minor comment: the non-disruptive maintenance can also be implemented with two etcds and a controller. I would not involve three etcd cluster nodes if not necessary.

Collaborator Author


That could be possible. But, as you mentioned, it would still require a controller, whether with 2 nodes or 3.

So far, the `etcd` versions we have been using have provided backward compatibility for the database. But it is possible that some future version breaks compatibility and some data migration is required. Without a controller, this would have to be done in an ad hoc manner. A controller would be a good place to encode such logic.

This feature could be implemented on demand when a new etcd version requires such data migration.

Member


Yes, migration will most likely always require a coordinator of sorts.

A controller can be used to perform such backup health verification asynchronously.

This is a minor point which may not be important to implement.

Member


See above, comment on "Database Incremental Backup Compaction".


Because of the disruptive nature of scaling single-node `etcd` instances, it might make sense to restrict some of the low-priority life-cycle operations (for example, scaling down) to the maintenance window of the `Shoot` cluster which is backed by the given `etcd` instance.
It would be possible to implement this with the current sidecar approach, but it might be cleaner to do it in a controller (along with, possibly, a `CustomResourceDefinition` to capture co-ordination information such as the maintenance window).
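
A minimal sketch of how a controller could defer such low-priority operations to the maintenance window (the window format and helper are assumptions for illustration):

```go
package coordination

import "time"

// MaintenanceWindow is a hypothetical daily window in UTC that could be carried
// on the custom resource, e.g. 22:00–23:00.
type MaintenanceWindow struct {
	Begin time.Duration // offset from midnight UTC, e.g. 22 * time.Hour
	End   time.Duration // offset from midnight UTC, e.g. 23 * time.Hour
}

// InWindow reports whether now falls inside the daily window, handling windows
// that wrap around midnight.
func (w MaintenanceWindow) InWindow(now time.Time) bool {
	utc := now.UTC()
	midnight := time.Date(utc.Year(), utc.Month(), utc.Day(), 0, 0, 0, 0, time.UTC)
	offset := utc.Sub(midnight)
	if w.Begin <= w.End {
		return offset >= w.Begin && offset < w.End
	}
	return offset >= w.Begin || offset < w.End
}
```

The reconcile loop would simply requeue a low-priority operation such as scale-down until `InWindow` returns true, while urgent operations bypass the check.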

Member


Yes, more advanced functions like scale-down during maintenance windows could be better implemented by means of a controller. Then again, I am not sure whether this alone would be necessary. We will roll seed cluster nodes fairly regularly, so there are natural windows of opportunity.

Collaborator Author


Yes. But it might be better to make this possible even independently of such events.

@swapnilgm swapnilgm added component/etcd-backup-restore ETCD Backup & Restore area/scalability exp/expert Issue that requires expert level knowledge platform/all kind/api-change API change with impact on API users status/under-investigation Issue is under investigation size/l Size of pull request is large (see gardener-robot robot/bots/size.py) area/high-availability High availability related labels Mar 29, 2019
@swapnilgm
Contributor

@amshuman-kr Will you please update the PR with the conclusions from the discussion, and rename the file to doc/proposals/controller.md, so that I can merge it?

@swapnilgm
Contributor

swapnilgm commented Sep 27, 2019

Closing the discussion thread, since we have already started with https://github.com/gardener/etcd-druid, which will incorporate the conclusions from this discussion.
Ref: gardener/etcd-druid#2

@swapnilgm swapnilgm closed this Sep 27, 2019
@amshuman-kr amshuman-kr deleted the proposal-controller branch September 27, 2019 10:33
@ghost ghost added the kind/epic Large multi-story topic label Mar 7, 2020
@gardener-robot gardener-robot added the area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related label Jun 5, 2020
@gardener-robot gardener-robot added priority/3 Priority (lower number equals higher priority) and removed priority/3 Priority (lower number equals higher priority) labels Mar 8, 2021