handling kube-apiserver disruption #832

deads2k · 2021-07-12T18:19:28Z

We should gracefully handle 60s of kube-apiserver communication disruption. This PR adds information about what that involves to CONVENTIONS.md.

@dhellmann @romfreiman

deads2k · 2021-07-12T21:15:49Z

/override ci/prow/markdownlint

not applicable to this doc.

openshift-ci · 2021-07-12T21:15:51Z

@deads2k: Overrode contexts on behalf of deads2k: ci/prow/markdownlint


dhellmann · 2021-07-13T15:31:52Z

> /override ci/prow/markdownlint
>
> not applicable to this doc.

The linter was complaining about things like whitespace, and I do think we want to apply those rules to all of the docs in this repo. The step that checks the enhancements titles is already smart enough to ignore this file.

dhellmann

The new text looks good, thanks for adding these details.

I have one question about a point that isn't clear inline.

It also would be good to track down someone familiar with controller-runtime who can resolve the uncertainty around those 2 points.

dhellmann · 2021-07-13T15:36:49Z

CONVENTIONS.md

+   They are already open to unauthenticated, so the delegated authorization check presents a reliability risk without
+   a security benefit.
+   This is now the default in the delegated authorizer in k8s.io/apiserver based servers, I'm unsure of controller-runtime.
+2. Binaries should terminate in-cluster client certificate connection.


I'm not sure what "terminate" means in this context. Is it that the binaries should handle the client certificate validation themselves?

> I'm not sure what "terminate" means in this context. Is it that the binaries should handle the client certificate validation themselves?

Yes. They terminate the TLS connection so that the binary can negotiate mTLS with the client. I will clarify in the doc.

romfreiman · 2021-07-13T19:19:00Z

openshift/origin#26215

JoelSpeed

I added some notes about controller-runtime that you may want to include. I can expand on the details if you need more.

JoelSpeed · 2021-07-14T10:15:33Z

CONVENTIONS.md

+1. /healthz, /readyz, /livez should not require authorization.
+   They are already open to unauthenticated, so the delegated authorization check presents a reliability risk without
+   a security benefit.
+   This is now the default in the delegated authorizer in k8s.io/apiserver based servers, I'm unsure of controller-runtime.


controller-runtime does not require, and has never required, auth for its healthz and readyz endpoints.

JoelSpeed · 2021-07-14T10:23:16Z

CONVENTIONS.md

+   The canonical case here is the metrics scraper.
+   In 4.9, the metrics scraper will support using in-cluster client-certificates to increase reliability of scraping
+   in cases of kube-apiserver disruption.
+   This is now the default in the delegated authenticator in k8s.io/apiserver based servers, I'm unsure of controller-runtime.


controller-runtime doesn't support TLS for its metrics endpoint. Typically you have to deploy kube-rbac-proxy in front of the metrics endpoint to allow a TLS connection for metrics. kube-rbac-proxy talks to the API server a lot, which will cause issues and will need some thought to work around.

We have had some discussion within controller-runtime before about making endpoints such as the metrics endpoint extensible, e.g. to allow a user to implement their own TLS server or add middleware, but I don't think anything ever came of it.

> Controller runtime doesn't support TLS for its metrics endpoint, typically you have to deploy kube-rbac-proxy in front of your metrics endpoint to allow a TLS connection for metrics. Kube-RBAC-Proxy talks to the API server a lot, this will cause issues and will need some thought to workaround.

That's a significant shortcoming that was resolved in kube several years ago. Someone motivated to use controller-runtime probably ought to try to get it updated or perhaps switch to using the upstream project (k8s.io/client-go, etc).

The kube-rbac-proxy will need to support it in order to be adopted.

JoelSpeed · 2021-07-14T10:26:21Z

CONVENTIONS.md

+   The default in library-go has been upgraded to handle this case in 4.9: https://github.com/openshift/library-go/blob/4b9033d00d37b88393f837a88ff541a56fd13621/pkg/config/leaderelection/leaderelection.go#L84
+   In essence, the kube-apiserver downtime tolerance is `floor(renewDeadline/retryPeriod)*retryPeriod-retryPeriod`.
+   Recommended defaults are LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s.
+   These are the configurable values in k8s.io/client-go based leases, I'm unsure of controller-runtime.


These are all configurable when you construct your manager in controller-runtime. If a controller already uses controller-runtime with leader election, it will most likely have flags supplying these values already:
https://github.com/kubernetes-sigs/controller-runtime/blob/8b55f85c90c3b1df1af58ddb7fe50096fdc7aa99/pkg/manager/manager.go#L177-L186

…rations According to [1], to successfully handle kube-apiserver disruption we need to update the default durations for MAO leader election operations. New values are LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s. [1] openshift/enhancements#832

deads2k · 2021-07-16T20:59:44Z

updated for comments.

dhellmann · 2021-07-21T21:46:39Z

I'm happy with this draft.

It looks like the linter isn't so happy.

@JoelSpeed do you have any other comments, or is this good to merge when the linter issues are fixed?

dhellmann · 2021-07-21T21:46:53Z

/priority important-soon

JoelSpeed · 2021-07-22T14:05:10Z

All good as far as I can tell
/lgtm

dhellmann · 2021-07-22T14:28:36Z

/approve

dhellmann · 2021-07-22T14:28:53Z

/approve cancel

Oops, forgot that CI is still broken.

dhellmann · 2021-07-26T18:56:21Z

/lgtm
/approve

openshift-ci · 2021-07-26T18:56:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann

Needs approval from an approver in each of these files:

~~OWNERS~~ [dhellmann]

…perations According to [1], to successfully handle kube-apiserver disruption we need to update the default durations for CCCMO leader election operations. New values should be LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s. [1] openshift/enhancements#832

openshift-ci bot requested review from hardys and ironcladlou July 12, 2021 18:19

dhellmann reviewed Jul 13, 2021

dhellmann self-assigned this Jul 13, 2021

JoelSpeed reviewed Jul 14, 2021

Fedosin mentioned this pull request Jul 16, 2021

Bug 1980930: Update the default durations for MAO leader election operations openshift/machine-api-operator#890

Merged

deads2k force-pushed the lease branch from 01c79eb to d3db905 Compare July 16, 2021 20:59

openshift-ci bot added the priority/important-soon label Jul 21, 2021

openshift-ci bot assigned JoelSpeed Jul 22, 2021

openshift-ci bot added the lgtm label Jul 22, 2021

openshift-ci bot added the approved label Jul 22, 2021

openshift-ci bot removed the approved label Jul 22, 2021

dhellmann reviewed Jul 22, 2021

openshift-ci bot removed the lgtm label Jul 26, 2021

handling kube-apiserver disruption

84e894e

deads2k force-pushed the lease branch from 33f0f7b to 84e894e Compare July 26, 2021 18:34

openshift-ci bot added the lgtm label Jul 26, 2021

openshift-ci bot added the approved label Jul 26, 2021

openshift-merge-robot merged commit 4e41d62 into openshift:master Jul 26, 2021

Fedosin mentioned this pull request Sep 1, 2021

Bug 2000191: Update the default durations for CCCMO leader election operations openshift/cluster-cloud-controller-manager-operator#115

Closed
