Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handling kube-apiserver disruption #832

Merged
merged 1 commit into from
Jul 26, 2021

Conversation

deads2k
Copy link
Contributor

@deads2k deads2k commented Jul 12, 2021

We should gracefully handle 60s of kube-apiserver communication disruption. This adds information about what that involves to conventions.md

@dhellmann @romfreiman

@deads2k
Copy link
Contributor Author

deads2k commented Jul 12, 2021

/override ci/prow/markdownlint

not applicable to this doc.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 12, 2021

@deads2k: Overrode contexts on behalf of deads2k: ci/prow/markdownlint

In response to this:

/override ci/prow/markdownlint

not applicable to this doc.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dhellmann
Copy link
Contributor

/override ci/prow/markdownlint

not applicable to this doc.

The linter was complaining about things like whitespace, and I do think we want to apply those rules to all of the docs in this repo. The step that checks the enhancements titles is already smart enough to ignore this file.

Copy link
Contributor

@dhellmann dhellmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new text looks good, thanks for adding these details.

I have one question about a point that isn't clear inline.

It also would be good to track down someone familiar with controller-runtime who can resolve the uncertainty around those 2 points.

CONVENTIONS.md Outdated
They are already open to unauthenticated, so the delegated authorization check presents a reliability risk without
a security benefit.
This is now the default in the delegated authorizer in k8s.io/apiserver based servers, I'm unsure of controller-runtime.
2. Binaries should terminate in-cluster client certificate connection.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what "terminate" means in this context. Is it that the binaries should handle the client certificate validation themselves?

Copy link
Contributor Author

@deads2k deads2k Jul 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what "terminate" means in this context. Is it that the binaries should handle the client certificate validation themselves?

Yes. they terminate the TLS connection so that the binary can negotiate mtls with the client. I will clarify in the doc.

@romfreiman
Copy link

openshift/origin#26215

@dhellmann dhellmann self-assigned this Jul 13, 2021
Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some notes about controller runtime that you may want to include, I can expand on details if you need more

CONVENTIONS.md Outdated
1. /healthz, /readyz, /livez should not require authorization.
They are already open to unauthenticated, so the delegated authorization check presents a reliability risk without
a security benefit.
This is now the default in the delegated authorizer in k8s.io/apiserver based servers, I'm unsure of controller-runtime.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Controller Runtime doesn't/has never required auth for healthz and readyz endpoints

CONVENTIONS.md Outdated
The canonical case here is the metrics scraper.
In 4.9, the metrics scraper will support using in-cluster client-certificates to increase reliability of scraping
in cases of kube-apiserver disruption.
This is now the default in the delegated authenticator in k8s.io/apiserver based servers, I'm unsure of controller-runtime.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Controller runtime doesn't support TLS for its metrics endpoint, typically you have to deploy kube-rbac-proxy in front of your metrics endpoint to allow a TLS connection for metrics. Kube-RBAC-Proxy talks to the API server a lot, this will cause issues and will need some thought to workaround.

We have had some discussion within CR before about allowing the endpoints, such as metrics endpoints to be extendable, eg to allow a user to implement their own TLS server, or add middlewares, but I don't think anything ever came of it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Controller runtime doesn't support TLS for its metrics endpoint, typically you have to deploy kube-rbac-proxy in front of your metrics endpoint to allow a TLS connection for metrics. Kube-RBAC-Proxy talks to the API server a lot, this will cause issues and will need some thought to workaround.

That's a significant shortcoming that was resolved in kube several years ago. Someone motivated to use controller-runtime probably ought to try to get it updated or perhaps switch to using the upstream project (k8s.io/client-go, etc).

The kube-rbac-proxy will need to support it in order to be adopted.

CONVENTIONS.md Outdated
The default in library-go has been upgaded to handle this case in 4.9: https://github.com/openshift/library-go/blob/4b9033d00d37b88393f837a88ff541a56fd13621/pkg/config/leaderelection/leaderelection.go#L84
In essence, the kube-apiserver downtime tolerance is `floor(renewDeadline/retryPeriod)*retryPeriod-retryPeriod`.
Recommended defaults are LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s.
These are the configurable values in k8s.io/client-go based leases, I'm unsure of controller-runtime.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all configurable when you construct your manager in controller runtime, if a controller is using CR already with leader election, they will most likely have flags supplying these values already
https://github.com/kubernetes-sigs/controller-runtime/blob/8b55f85c90c3b1df1af58ddb7fe50096fdc7aa99/pkg/manager/manager.go#L177-L186

Fedosin added a commit to Fedosin/machine-api-operator that referenced this pull request Jul 16, 2021
…rations

According to [1] to successfully handle kube-apiserver disruption
we need to updated default durations for MAO leader election
operations.

New values are LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s.

[1] openshift/enhancements#832
Fedosin added a commit to Fedosin/machine-api-operator that referenced this pull request Jul 16, 2021
…rations

According to [1] to successfully handle kube-apiserver disruption
we need to updated default durations for MAO leader election
operations.

New values are LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s.

[1] openshift/enhancements#832
@deads2k
Copy link
Contributor Author

deads2k commented Jul 16, 2021

updated for comments.

Fedosin added a commit to Fedosin/machine-api-operator that referenced this pull request Jul 21, 2021
…rations

According to [1] to successfully handle kube-apiserver disruption
we need to updated default durations for MAO leader election
operations.

New values are LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s.

[1] openshift/enhancements#832
@dhellmann
Copy link
Contributor

I'm happy with this draft.

It looks like the linter isn't so happy.

@JoelSpeed do you have any other comments, or is this good to merge when the linter issues are fixed?

@dhellmann
Copy link
Contributor

/priority important-soon

@openshift-ci openshift-ci bot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 21, 2021
@JoelSpeed
Copy link
Contributor

All good as far as I can tell
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2021
@dhellmann
Copy link
Contributor

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 22, 2021
@dhellmann
Copy link
Contributor

/approve cancel

Oops, forgot that CI is still broken.

@openshift-ci openshift-ci bot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 22, 2021
CONVENTIONS.md Outdated Show resolved Hide resolved
CONVENTIONS.md Outdated Show resolved Hide resolved
CONVENTIONS.md Outdated Show resolved Hide resolved
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 26, 2021
@dhellmann
Copy link
Contributor

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 26, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 26, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 26, 2021
@openshift-merge-robot openshift-merge-robot merged commit 4e41d62 into openshift:master Jul 26, 2021
Fedosin added a commit to Fedosin/machine-api-operator that referenced this pull request Jul 27, 2021
…rations

According to [1] to successfully handle kube-apiserver disruption
we need to updated default durations for MAO leader election
operations.

New values are LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s.

[1] openshift/enhancements#832
Fedosin added a commit to Fedosin/machine-api-operator that referenced this pull request Jul 29, 2021
…rations

According to [1] to successfully handle kube-apiserver disruption
we need to updated default durations for MAO leader election
operations.

New values are LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s.

[1] openshift/enhancements#832
Fedosin added a commit to Fedosin/cluster-cloud-controller-manager-operator that referenced this pull request Sep 1, 2021
…perations

According to [1] to successfully handle kube-apiserver disruption
we need to update default durations for CCCMO leader election operations.

New values should be LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s.

[1] openshift/enhancements#832
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants