[KEP 3836]: Bump to beta for 1.29 #4258
Conversation
Force-pushed from 7fa75d9 to 3ef92ad.
FYI: I didn't have time to get to two of the graduation criteria for moving this KEP to beta. Specifically, it's missing: enhancements/keps/sig-network/3836-kube-proxy-improved-ingress-connectivity-reliability/README.md line 258 (at 3ef92ad) and enhancements/keps/sig-network/3836-kube-proxy-improved-ingress-connectivity-reliability/README.md line 261 (at 3ef92ad).
```diff
@@ -7,8 +7,8 @@ reviewers: ['@thockin', '@danwinship', "@aojea"]
 approvers: ['@thockin']
 creation-date: "2023-02-03"
 status: implementable
-stage: alpha
-latest-milestone: "v1.28"
+stage: beta
```
A few comments about the KEP itself:

> As such any vendor doing this would get a heads-up during Alpha.

People are not using Alpha. Can we run any survey of providers to check who will be hit by it?

@thockin - thoughts?
> Implement change in Kube-proxy which will react to changes on the Node object and once the taint ToBeDeletedByClusterAutoscaler is placed on the Node object: start failing its healthz state.

I started wondering about this one - namely, whether we shouldn't do that in two steps:

(a) introduce livez that [reacts to the existence of the taint], but doesn't touch healthz
(b) graduate to stable and give people time and a chance to use something stable
(c) [potentially] only then change healthz itself

@thockin - thoughts?
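For illustration, the taint-reaction behavior being debated above can be sketched as follows. This is a hypothetical, stdlib-only sketch, not the actual kube-proxy code: in the real implementation a node informer would flip the flag when the `ToBeDeletedByClusterAutoscaler` taint appears, and the question is whether the flip should affect `/healthz` directly or a new `livez` endpoint first.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// healthzServer is a minimal stand-in for kube-proxy's health endpoint.
// nodeTainted would be set by a (hypothetical) Node watcher once the
// ToBeDeletedByClusterAutoscaler taint is observed on the Node object.
type healthzServer struct {
	nodeTainted atomic.Bool
}

// statusCode returns 200 while the node is healthy and 503 once the
// deletion taint has been observed, so load balancers drain the node.
func (s *healthzServer) statusCode() int {
	if s.nodeTainted.Load() {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

func (s *healthzServer) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(s.statusCode())
}
```

The two-step proposal above would serve this logic on a new endpoint first, leaving the existing `/healthz` handler untouched until the behavior is stable.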
> The metric: proxy_healthz_503_count mentioned in [Monitoring requirements]

I don't see that metric in the codebase - was this added?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Has this happened? Can you update the KEP with the tests and findings?

Please fill in the remaining questions in the Troubleshooting section.
> Can we make any survey of providers to check who will be hit by it?

Do we have a general procedure for that? I guess we could create a Google Form and send it out to dev@kubernetes.io, but that would still feel inconclusive.

> I don't see that metric in the codebase - was this added?

Yes, here: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/metrics/metrics.go#L189-L209

The name changed during implementation, however, because we decided to use the HTTP status code as a label. I will update the KEP with this.
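The design change described here (one counter partitioned by a status-code label, rather than a dedicated 503 counter) can be sketched without the Prometheus client library. The type and method names below are hypothetical; see the linked metrics.go for the real CounterVec definition.

```go
package main

import (
	"strconv"
	"sync"
)

// statusCounter is a dependency-free stand-in for a Prometheus CounterVec:
// a single healthz-responses counter keyed by an HTTP status-code label,
// instead of a separate proxy_healthz_503_count metric.
type statusCounter struct {
	mu     sync.Mutex
	counts map[string]uint64 // label value ("200", "503", ...) -> count
}

func newStatusCounter() *statusCounter {
	return &statusCounter{counts: make(map[string]uint64)}
}

// observe records one healthz response with the given HTTP status code.
func (c *statusCounter) observe(code int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[strconv.Itoa(code)]++
}

// get returns the count recorded for a given status code.
func (c *statusCounter) get(code int) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[strconv.Itoa(code)]
}
```

The label-based shape lets operators query any status code (200, 503, ...) from one metric family instead of adding a new metric per code.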
> Has this happened?

No.

> Please fill in the remaining questions in the Troubleshooting section.

OK, will update that now.
> People are not using Alpha.
> Can we make any survey of providers to check who will be hit by it?

I went back and re-read the whole KEP and scanned the code in question. I agree that the signal during alpha is ~0. That said, I can't see any way this explodes (other than logic bugs, obviously) - but I also know that this is super subtle stuff.

What would we do differently with more information? As much as I know the subtlety here, I doubt anyone would look at this plan and spot the explosion path. If this explodes, it's going to be something nobody can see until it's lit up. :(

I think the mitigations in place are as low-risk as we can get with this, short of flipping the polarity entirely, which would almost guarantee low adoption.
> Implement change in Kube-proxy which will react to changes on the Node object and once the taint ToBeDeletedByClusterAutoscaler is placed on the Node object: start failing its healthz state.
>
> I started wondering about this one. Namely if we shouldn't do that in two steps...
> (a) introduce livez that [reacts to the existence of taint], but doesn't touch healthz
> (b) graduate to stable and give people time and chance to use something stable
> (c) [potentially] only then change healthz itself

This is the "flip the polarity" path, but it would not be livez. More like "add /healthz-v2; get people to use it; deprecate the old /healthz". That is the safer path, for sure. If we go that way, I don't think we will ever actually get rid of or fix the old healthz, and all of the existing clusters in the world will get no benefit. Providers will be reluctant to adopt the new path because then they have to support both the old AND the new, and who wants to do that?

The lack of in-place updates sort of kills the value of this KEP, IMO, and the risk is not proportional to the reaction. That said, this is not such a high-urgency thing that we should take on dramatic risk. If you think I am wrong about the risk level, we can try to go through a new URL endpoint and see if we can drive adoption. That means figuring out, for example, whether cloud providers support updates to health-check configs.
Thanks Tim - I guess I'm convinced by your explanation above. Let's just update the troubleshooting section as I requested and proceed with the plan.
Force-pushed from 3ef92ad to 10c7e47.
Force-pushed from 10c7e47 to 3d2cb40.
/approve
/lgtm
Assuming no directional change.
/hold to confirm

keps/sig-network/3836-kube-proxy-improved-ingress-connectivity-reliability/README.md
```diff
@@ -424,8 +421,12 @@ Not any different than today.

 ###### What are other known failure modes?

 No other are known.
```
The one described in Risks/Mitigations:

> Vendors of Kubernetes which deploy Kube-proxy and specify a livenessProbe targeting /healthz are expected to start seeing a CrashLooping Kube-proxy when the Node gets tainted with ToBeDeletedByClusterAutoscaler. This is because: if we modify /healthz to fail when this taint gets added on the Node, then the livenessProbe will fail, causing the Kubelet to restart the Pod until the Node is deleted. As far as we can tell, no vendor sets a livenessProbe, nor does kubeadm, so the risk is low.

is actually relevant, and we should probably mention it here [with bullet points from the template].
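To make the failure mode concrete, this is the kind of probe configuration that would trigger it. A hypothetical manifest fragment, not taken from any vendor's actual deployment:

```yaml
# Hypothetical kube-proxy container spec fragment. If a vendor ships this,
# the beta behavior turns a node drain into a CrashLooping kube-proxy:
# /healthz returns 503 once ToBeDeletedByClusterAutoscaler is set, the
# livenessProbe fails, and the kubelet restarts the container until the
# Node is deleted.
livenessProbe:
  httpGet:
    path: /healthz
    port: 10256        # kube-proxy's default healthz port
  periodSeconds: 10
  failureThreshold: 3
```

A readinessProbe against /healthz would be benign by comparison, which is why the risk assessment hinges specifically on livenessProbe usage.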
Updated
Force-pushed from 3d2cb40 to 6539f3b.
Regarding the 2 criteria: docs and tests can be written after KEP freeze, if you feel comfortable with the idea of getting to them then?
Thanks! /lgtm

/lgtm Thanks!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexanderConstantinescu, thockin, wojtek-t

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/sig network
/assign @thockin
/assign @wojtek-t