Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce HA shoot control planes in alpha state #5741

Merged

Conversation

shreyas-s-rao
Copy link
Contributor

@shreyas-s-rao shreyas-s-rao commented Apr 5, 2022

How to categorize this PR?

/area control-plane
/area high-availability
/kind enhancement

What this PR does / why we need it:
This PR introduces high availability control planes for shoots in an MVP or alpha state.

  • New feature gate HAControlPlanes introduced for gardenlet and gardener-scheduler to enable support for the feature on landscapes/seeds. Disabled by default.
  • Introduce annotation alpha.control-plane.shoot.gardener.cloud/high-availability with supported values single-zone and multi-zone, to enable HA control plane for individual shoot on a HA-supported seed
  • Currently etcd-main, etcd-events and kube-apiserver are spread across topologies based on this feature
  • Shoots that request multi-zonal control planes can now be scheduled onto multi-zonal seeds (matched based on shoot annotation alpha.control-plane.shoot.gardener.cloud/high-availability='multi-zone' and seed label seed.gardener.cloud/multi-zonal)
  • If a shoot requests a HA control plane without the explicit need for zonal spread, then the shoot will be scheduled to a non-multi-zonal seed, and the control plane will be spread at the node level

Which issue(s) this PR fixes:
Fixes partially #2817

Special notes for your reviewer:

Release note:

Introduce feature gate `HAControlPlanes` in alpha state for gardenlet and gardener-scheduler. :warning: This comes with a change to the certs used, which will cause a restart of the etcds.

@gardener-prow
Copy link
Contributor

gardener-prow bot commented Apr 5, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gardener-prow gardener-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/control-plane Control plane related area/high-availability High availability related kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 5, 2022
@rfranzke
Copy link
Member

rfranzke commented Apr 5, 2022

/assign

Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit tests are missing

pkg/operation/botanist/component/etcd/etcd.go Outdated Show resolved Hide resolved
pkg/operation/botanist/component/etcd/etcd.go Outdated Show resolved Hide resolved
pkg/operation/botanist/component/etcd/etcd.go Outdated Show resolved Hide resolved
pkg/operation/botanist/etcd.go Outdated Show resolved Hide resolved
@rfranzke rfranzke changed the title [WIP] Introduce HA shoot control planes in alpha state [WIP] Upgrade to etcd-druid@v0.9 and introduce HA shoot control planes in alpha state Apr 12, 2022
@rfranzke
Copy link
Member

Can you please introduce a dedicated CA (with respective client certificates) for the peer communication (do not reuse the ca-etcd CA)?

@gardener-prow gardener-prow bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 2, 2022
@gardener-prow gardener-prow bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 5, 2022
@shreyas-s-rao
Copy link
Contributor Author

@rfranzke Thanks for your review comments. I have now rebased the PR and made the suggested changes. PTAL

@rfranzke rfranzke self-requested a review May 5, 2022 06:56
docs/deployment/feature_gates.md Outdated Show resolved Hide resolved
docs/deployment/feature_gates.md Outdated Show resolved Hide resolved
pkg/features/features.go Outdated Show resolved Hide resolved
pkg/operation/botanist/component/etcd/etcd.go Outdated Show resolved Hide resolved
pkg/operation/botanist/component/etcd/etcd.go Outdated Show resolved Hide resolved
pkg/operation/botanist/component/etcd/etcd.go Outdated Show resolved Hide resolved
pkg/operation/botanist/component/etcd/etcd.go Outdated Show resolved Hide resolved
pkg/scheduler/controller/shoot/reconciler.go Outdated Show resolved Hide resolved
@gardener-prow gardener-prow bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 6, 2022
@timuthy timuthy force-pushed the feature/ha-control-plane-mvp branch from 2c83aff to 1112b9c Compare June 9, 2022 09:32
@gardener-prow gardener-prow bot added cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. and removed cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. labels Jun 9, 2022
@timuthy
Copy link
Member

timuthy commented Jun 9, 2022

The e2e test fails due to #6081.
Apart from that the PR is ready for another round of review. I tested typical scenarios like reconciliation with an existing cluster, creating/hibernating/deleting a single node etcd shoot as well as the same for a multi-node etcd shoot ✅
As discussed, CA rotation doesn't work yet. I'm working on it and will file a separate PR.

@timebertt
Copy link
Member

I think, the test actually failed because of #6092.
However, I cannot be sure as the shoot state logs are missing in the test (fixed by #6088).
Let's give it another shot, now that #6089 is also in.
/test pull-gardener-e2e-kind

@unmarshall
Copy link
Contributor

For multi zonal control plane we have changed the scheduling constraints to either use topologyKey=kubernetes.io/hostname or topologyKey=topology.kubernetes.io/zone for Kube API Server and etcd based on the value of the annotation with possible values as single-zone and multi-zone.

There is another control plane component garden-resource-manager which today always runs with replicas=3 and with podAntiAffinity scheduling constraint which is hard coded to using topologyKey=kubernetes.io/hostname (See here)

Would it make sense to also have GRM be deployed across zones if the the value of annotation: alpha.control-plane.shoot.gardener.cloud/high-availability is multi-zone?

@timuthy
Copy link
Member

timuthy commented Jun 10, 2022

This makes sense @unmarshall. As discussed, let's include this and other improvements in new PRs because:
a) The improvements are not absolutely necessary for this PR.
b) The PR is opened for quite a long time and it'd be good to have it merged before the next Gardener version is released.
c) There were already several rounds of reviews and comments, so I'd propose to not yet open other points.

@rfranzke rfranzke self-requested a review June 10, 2022 09:23
docs/deployment/feature_gates.md Outdated Show resolved Hide resolved
pkg/apis/core/v1beta1/constants/types_constants.go Outdated Show resolved Hide resolved
pkg/features/features.go Outdated Show resolved Hide resolved
pkg/scheduler/controller/shoot/reconciler.go Outdated Show resolved Hide resolved
pkg/operation/botanist/component/etcd/etcd.go Show resolved Hide resolved
pkg/operation/botanist/component/etcd/etcd.go Outdated Show resolved Hide resolved
pkg/operation/botanist/component/etcd/etcd.go Outdated Show resolved Hide resolved
pkg/operation/botanist/component/etcd/etcd.go Outdated Show resolved Hide resolved
@gardener-prow gardener-prow bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 10, 2022
@timuthy
Copy link
Member

timuthy commented Jun 10, 2022

Thanks for you input @rfranzke. I hope that I could address all of your comments. PTAL.

@timuthy timuthy force-pushed the feature/ha-control-plane-mvp branch from 84b8fc1 to 9a39dd8 Compare June 10, 2022 16:28
Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @timuthy, let's get this in now :)

/lgtm
/approve

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jun 13, 2022
@gardener-prow
Copy link
Contributor

gardener-prow bot commented Jun 13, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rfranzke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2022
@rfranzke
Copy link
Member

/unhold

@gardener-prow gardener-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2022
@gardener-prow gardener-prow bot merged commit f38fd6d into gardener:master Jun 13, 2022
@timuthy timuthy deleted the feature/ha-control-plane-mvp branch June 13, 2022 09:01
krgostev pushed a commit to krgostev/gardener that referenced this pull request Jul 5, 2022
* Introduce feature gate `HAControlPlanes` for gardenlet

Signed-off-by: Shreyas Rao <shreyas.sriganesh.rao@sap.com>

* Add support for multi-node, multi-zonal etcds

* Add multi-AZ kube-apiservers

* Schedule multi-AZ shoots only to supported seeds.
Introduce feature gate `HAControlPlanes` for Gardener Scheduler.

Signed-off-by: Shreyas Rao <shreyas.sriganesh.rao@sap.com>

* Adapt tests

Signed-off-by: Shreyas Rao <shreyas.sriganesh.rao@sap.com>

* Fix etcd labels

* Address review comments

* add scheduler tests

* add HAControlPlanes option validation

* update feature gate release version

* address review comments

Co-authored-by: Tim Usner <tim.usner@sap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane Control plane related area/high-availability High availability related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants