
Modularization & Status Enhancements #188

Merged
merged 7 commits into gardener:master from the feature.etcd-status-updater branch on Jun 15, 2021

Conversation

timuthy
Member

@timuthy timuthy commented Jun 2, 2021

How to categorize this PR?

/area control-plane
/kind enhancement

What this PR does / why we need it:
This PR adds the following status checks for the etcd resource:

  • Conditions:
    • Ready check: Checks whether the resource has enough Ready members in status.members to fulfill the quorum.
    • AllMembersReady check: Checks whether all members in status.members are Ready.
  • Members:
    • Ready check: Treats the LastUpdateTime as a heartbeat and checks whether it lies within the expected time range (configurable via --etcd-member-threshold).

Which issue(s) this PR fixes:
Fixes #151

Special notes for your reviewer:
The PR does not handle the case in which an etcd member is lost and replaced, and the corresponding entry in status.members therefore has to be removed. This is left to the component that removes the lost member from the etcd cluster.

Release note:

Various `condition` and etcd `member` checks have been added to Etcd-Druid. The results of those checks will be reflected in the `etcd.status` sub-resource.
- Conditions:
  - Ready check: Checks if resource has enough `Ready` members in `status.members` to fulfill the quorum.
  - AllMembersReady check: Checks if all members in `status.members` are `Ready`.
- Members:
  - Ready check: Treats the `LastUpdateTime` as a heartbeat and checks if it is within the expected time range (configurable via `--etcd-member-threshold`).
A re-sync mechanism has been added for the Custodian controller. The new flag `--custodian-sync-period` (default: `30s`) controls the duration after which the Custodian controller re-enqueues `etcd` resources for reconciliation. This can be considered a health-check interval.

@timuthy timuthy requested a review from a team as a code owner June 2, 2021 09:43
@timuthy timuthy changed the title from "Feature.etcd status updater" to "Modularization & Status Enhancements" Jun 2, 2021
@gardener-robot gardener-robot added area/control-plane Control plane related kind/enhancement Enhancement, improvement, extension needs/review Needs review size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else labels Jun 2, 2021
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 2, 2021
Collaborator

@amshuman-kr amshuman-kr left a comment


Thanks a lot for the PR @timuthy! Some very minor comments/questions below.

Besides that, I see that you have addressed the transition to the Unknown state for the members, but not the further transition from Unknown to NotReady, and also not the separate transition of a not-ready pod to the NotReady status. Would you address that in a separate PR, or do you think those cases should be addressed differently?

pkg/health/condition/check_ready.go (outdated; resolved)
pkg/health/condition/check_ready.go (outdated; resolved)
pkg/health/condition/check_ready.go (outdated; resolved)
pkg/health/condition/condition_suite_test.go (outdated; resolved)
pkg/health/status/check.go (resolved)
@timuthy
Member Author

timuthy commented Jun 2, 2021

Besides that, I see that you have addressed the transition to the Unknown state for the members, but not the further transition from Unknown to NotReady, and also not the separate transition of a not-ready pod to the NotReady status. Would you address that in a separate PR, or do you think those cases should be addressed differently?

Thanks for mentioning it @amshuman-kr. I will update this PR as written in the proposal and as discussed via Slack.

@timuthy
Member Author

timuthy commented Jun 2, 2021

/status author-action

@gardener-robot gardener-robot added the status/author-action Issue requires issue author action label Jun 2, 2021
@gardener-robot

@timuthy The pull request was assigned to you under author-action. Please unassign yourself when you are done. Thank you.

@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 4, 2021
@timuthy
Member Author

timuthy commented Jun 4, 2021

Besides that, I see that you have addressed the transition to the Unknown state for the members, but not the further transition from Unknown to NotReady, and also not the separate transition of a not-ready pod to the NotReady status

Added via 134958d

@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 4, 2021
Collaborator

@amshuman-kr amshuman-kr left a comment


Thanks for the changes @timuthy! They look good. Do you think adding status.initialReplicas and using max(status.initialReplicas, len(status.Members)) makes sense? The maintenance of status.initialReplicas could be postponed to later issues in which the custodian is enhanced with other reconciliation actions, like triggering bootstrapping.

WDYT?

pkg/health/etcdmember/check_ready.go (outdated; resolved)
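The suggested computation could look roughly like this (illustrative only; effectiveClusterSize is a hypothetical helper name, and the field semantics are taken from the comment above). The point of the max is that members which were lost before ever reporting status still count towards the expected cluster size.

```go
package main

import "fmt"

// effectiveClusterSize sketches the suggested max(status.initialReplicas,
// len(status.members)): members lost before ever reporting their status
// still count towards the expected cluster size.
func effectiveClusterSize(initialReplicas, reportedMembers int) int {
	if initialReplicas > reportedMembers {
		return initialReplicas
	}
	return reportedMembers
}

func main() {
	// 3 replicas were requested initially, but only 2 members have reported so far.
	fmt.Println(effectiveClusterSize(3, 2)) // 3
}
```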
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jun 10, 2021
Remove `status.podRef` which was introduced by an earlier, unreleased commit
and use `status.name` instead.
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 14, 2021
@timuthy timuthy removed the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jun 14, 2021
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 14, 2021
@timuthy
Member Author

timuthy commented Jun 14, 2021

Thanks for the discussion @amshuman-kr. I added the missing pieces in the last 3 commits (separated for easier review). We can squash-merge the PR once we are ready.

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jun 15, 2021
Collaborator

@amshuman-kr amshuman-kr left a comment


Thanks a lot for the changes @timuthy! The changes look good. Just a couple of comments below (the first one is merely FYI).

Comment on lines +139 to +142
// Bootstrap is a special case which is handled by the etcd controller.
if !inBootstrap(etcd) && len(members) != 0 {
etcd.Status.ClusterSize = pointer.Int32Ptr(int32(len(members)))
}
Collaborator

Strictly speaking, this is incorrect. Only the initial bootstrapping is triggered by the etcd controller. Later bootstrapping due to quorum failure is to be done by the custodian controller. But this change could be done in #158.

Member Author

I added this comment as it reflects the current state.

Tbh, I'm not yet convinced that the Custodian controller should perform self-healing in the form of re-bootstrapping, because the Etcd controller can in the meantime still reconcile etcd resources, which can lead to races. In order to avoid a race, you need to introduce some kind of lock, which is also not too nice, IMO.

Maybe I didn't check all cases in detail, but at the moment I think an alternative would be easier to implement:

  1. Custodian sees that the cluster is not healthy and self-healing is required.
  2. Custodian triggers a reconciliation for the affected etcd resource.
  3. Etcd controller kicks in and takes the necessary actions to bring the cluster up again (=> desired state). In the meantime, any changes to the very same etcd resource cannot lead to a race.

We don't have to decide this in this PR, of course, but I still wanted to offer an alternative approach to think about.
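The alternative flow described in the steps above can be sketched in a deliberately minimal form (all names here are hypothetical): the custodian only detects the problem and delegates, while the etcd controller owns every corrective action, so no lock between the two controllers is needed.

```go
package main

import "fmt"

// cluster is a trivial stand-in for the observed etcd cluster health.
type cluster struct{ healthy bool }

// custodianStep sketches step 1 and 2: the custodian checks health and, if
// the cluster is unhealthy, requests a reconciliation of the etcd resource.
// Step 3 (the actual repair) would then run entirely in the etcd controller.
func custodianStep(c cluster, trigger func()) {
	if !c.healthy {
		trigger() // enqueue the affected etcd resource for the etcd controller
	}
}

func main() {
	reconciled := false
	custodianStep(cluster{healthy: false}, func() { reconciled = true })
	fmt.Println(reconciled) // true: the etcd controller would now restore the cluster
}
```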

Collaborator

@amshuman-kr amshuman-kr Jun 15, 2021


In order to avoid a race, you need to introduce some kind of lock which is also not too nice, IMO.

Even if only the etcd controller is responsible for bootstrapping non-quorate clusters, some sort of race avoidance is still required, because the Etcd spec might get updated while bootstrapping is still on-going. In that case, it should either wait until the on-going bootstrapping has completed before applying the latest spec changes, or abort the current bootstrapping (which might unnecessarily prolong the downtime).

  1. Custodian triggers a reconciliation for the affected etcd resource.

This cannot be the regular etcd controller reconciliation, which might apply Etcd resource spec changes along with triggering bootstrapping. As part of the contract of a gardener extension, etcd-druid must not apply Etcd resource spec changes unless the reconcile operation annotation is present.

On the other hand, it makes no sense for etcd-druid to wait for the next shoot reconciliation to fix a non-quorate etcd cluster.

But let's take this discussion out of this PR.

pkg/health/condition/check_ready.go (resolved)
Collaborator

@amshuman-kr amshuman-kr left a comment


Thanks @timuthy for the changes. /lgtm

@timuthy timuthy merged commit f02ff02 into gardener:master Jun 15, 2021
@timuthy timuthy deleted the feature.etcd-status-updater branch June 15, 2021 15:12
Successfully merging this pull request may close these issues.

[Feature] Modularize and enhance status management according to multi-node ETCD proposal