-
Notifications
You must be signed in to change notification settings - Fork 463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
handling kube-apiserver disruption #832
Conversation
/override ci/prow/markdownlint not applicable to this doc. |
@deads2k: Overrode contexts on behalf of deads2k: ci/prow/markdownlint In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
The linter was complaining about things like whitespace, and I do think we want to apply those rules to all of the docs in this repo. The step that checks the enhancements titles is already smart enough to ignore this file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new text looks good, thanks for adding these details.
I have one question about a point that isn't clear inline.
It also would be good to track down someone familiar with controller-runtime who can resolve the uncertainty around those 2 points.
CONVENTIONS.md
Outdated
They are already open to unauthenticated, so the delegated authorization check presents a reliability risk without | ||
a security benefit. | ||
This is now the default in the delegated authorizer in k8s.io/apiserver based servers, I'm unsure of controller-runtime. | ||
2. Binaries should terminate in-cluster client certificate connection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what "terminate" means in this context. Is it that the binaries should handle the client certificate validation themselves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what "terminate" means in this context. Is it that the binaries should handle the client certificate validation themselves?
Yes. they terminate the TLS connection so that the binary can negotiate mtls with the client. I will clarify in the doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some notes about controller runtime that you may want to include, I can expand on details if you need more
CONVENTIONS.md
Outdated
1. /healthz, /readyz, /livez should not require authorization. | ||
They are already open to unauthenticated, so the delegated authorization check presents a reliability risk without | ||
a security benefit. | ||
This is now the default in the delegated authorizer in k8s.io/apiserver based servers, I'm unsure of controller-runtime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Controller Runtime doesn't/has never required auth for healthz and readyz endpoints
CONVENTIONS.md
Outdated
The canonical case here is the metrics scraper. | ||
In 4.9, the metrics scraper will support using in-cluster client-certificates to increase reliability of scraping | ||
in cases of kube-apiserver disruption. | ||
This is now the default in the delegated authenticator in k8s.io/apiserver based servers, I'm unsure of controller-runtime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Controller runtime doesn't support TLS for its metrics endpoint, typically you have to deploy kube-rbac-proxy in front of your metrics endpoint to allow a TLS connection for metrics. Kube-RBAC-Proxy talks to the API server a lot, this will cause issues and will need some thought to workaround.
We have had some discussion within CR before about allowing the endpoints, such as metrics endpoints to be extendable, eg to allow a user to implement their own TLS server, or add middlewares, but I don't think anything ever came of it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Controller runtime doesn't support TLS for its metrics endpoint, typically you have to deploy kube-rbac-proxy in front of your metrics endpoint to allow a TLS connection for metrics. Kube-RBAC-Proxy talks to the API server a lot, this will cause issues and will need some thought to workaround.
That's a significant shortcoming that was resolved in kube several years ago. Someone motivated to use controller-runtime probably ought to try to get it updated or perhaps switch to using the upstream project (k8s.io/client-go, etc).
The kube-rbac-proxy will need to support it in order to be adopted.
CONVENTIONS.md
Outdated
The default in library-go has been upgaded to handle this case in 4.9: https://github.com/openshift/library-go/blob/4b9033d00d37b88393f837a88ff541a56fd13621/pkg/config/leaderelection/leaderelection.go#L84 | ||
In essence, the kube-apiserver downtime tolerance is `floor(renewDeadline/retryPeriod)*retryPeriod-retryPeriod`. | ||
Recommended defaults are LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s. | ||
These are the configurable values in k8s.io/client-go based leases, I'm unsure of controller-runtime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all configurable when you construct your manager in controller runtime, if a controller is using CR already with leader election, they will most likely have flags supplying these values already
https://github.com/kubernetes-sigs/controller-runtime/blob/8b55f85c90c3b1df1af58ddb7fe50096fdc7aa99/pkg/manager/manager.go#L177-L186
…rations According to [1] to successfully handle kube-apiserver disruption we need to updated default durations for MAO leader election operations. New values are LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s. [1] openshift/enhancements#832
…rations According to [1] to successfully handle kube-apiserver disruption we need to updated default durations for MAO leader election operations. New values are LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s. [1] openshift/enhancements#832
updated for comments. |
…rations According to [1] to successfully handle kube-apiserver disruption we need to updated default durations for MAO leader election operations. New values are LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s. [1] openshift/enhancements#832
I'm happy with this draft. It looks like the linter isn't so happy. @JoelSpeed do you have any other comments, or is this good to merge when the linter issues are fixed? |
/priority important-soon |
All good as far as I can tell |
/approve |
/approve cancel Oops, forgot that CI is still broken. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dhellmann The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…rations According to [1] to successfully handle kube-apiserver disruption we need to updated default durations for MAO leader election operations. New values are LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s. [1] openshift/enhancements#832
…rations According to [1] to successfully handle kube-apiserver disruption we need to updated default durations for MAO leader election operations. New values are LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s. [1] openshift/enhancements#832
…perations According to [1] to successfully handle kube-apiserver disruption we need to update default durations for CCCMO leader election operations. New values should be LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s. [1] openshift/enhancements#832
We should gracefully handle 60s of kube-apiserver communication disruption. This adds information about what that involves to conventions.md
@dhellmann @romfreiman