[KEP 3836]: Bump to beta for 1.29 #4258

alexanderConstantinescu · 2023-10-02T09:08:50Z

One-line PR description: 1.29 milestone bump for KEP - 3836

Issue link: Kube-proxy improved ingress connectivity reliability #3836

Other comments: none

/sig network
/assign @thockin
/assign @wojtek-t

alexanderConstantinescu · 2023-10-02T09:27:06Z

FYI: I didn't have time to get to two of the graduation criterias for this KEP to beta. Specifically it's missing:

enhancements/keps/sig-network/3836-kube-proxy-improved-ingress-connectivity-reliability/README.md

Line 258 in 3ef92ad

    
           - E2E tests coded before any feature implementation is made which highlights the

and

enhancements/keps/sig-network/3836-kube-proxy-improved-ingress-connectivity-reliability/README.md

Line 261 in 3ef92ad

- Document written at https://kubernetes.io/docs/concepts/

keps/prod-readiness/sig-network/3836.yaml

wojtek-t · 2023-10-02T12:02:51Z

keps/sig-network/3836-kube-proxy-improved-ingress-connectivity-reliability/kep.yaml

@@ -7,8 +7,8 @@ reviewers: ['@thockin', '@danwinship', "@aojea"]
 approvers: ['@thockin']
 creation-date: "2023-02-03"
 status: implementable
-stage: alpha
-latest-milestone: "v1.28"
+stage: beta


Few comments about the KEP itself:

As such any vendor doing this would get a heads-up during Alpha.

People are not using Alpha.
Can we make any survey of providers to check who will be hit by it?
@thockin - thoughts?

Implement change in Kube-proxy which will react to changes on the Node object and once the taint ToBeDeletedByClusterAutoscaler is placed on the Node object: start failing it's healthz state.

I started wondering about this one. Namely if we shouldn't do that in two steps...
(a) introduce livez that [reacts to the existence of taint], but doesn't touch healthz
(b) graduate to stable and give people time and chance to use something stable
(c) [potentially] only then change healthz itself

@thockin - thoguhts?

The metric: proxy_healthz_503_count mentioned in [Monitoring requirements]

I don't see that metric in the codebase - was this added?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

has this happened? Can you update the KEP with the test and findings?

Please fill in the remaining questions in Throubleshooting section.

Can we make any survey of providers to check who will be hit by it?

Do we have a general procedure for that? I guess we could create a Google Form and send that out to dev@kubernetes.io, but that would still feel inconclusive.

I don't see that metric in the codebase - was this added?

Yes, here: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/metrics/metrics.go#L189-L209

The name changed during the implementation however, because we decided to have the HTTP status code be used as a label. I will update the KEP with this

has this happened?

No

Please fill in the remaining questions in Throubleshooting section.

OK, will update that now

People are not using Alpha.
Can we make any survey of providers to check who will be hit by it?

I went back and re-read the whole KEP and scanned the code in question. I agree that signal during alpha is ~0. That said, I can't reason any way this explodes (other than logic bugs, obviously) - but I also know that this is super subtle stuff.

What would we do differently with more information, and as much I know the subtlety, I doubt anyone would look at this plan and see the explosion-path. If this explodes, it's going to be something nobody can see until it's lit up. :(

I think the mitigations in place are as low-risk as we can get with this, short of flipping the polarity entirely, which will almost guarantee low adoption.

Implement change in Kube-proxy which will react to changes on the Node object and once the taint ToBeDeletedByClusterAutoscaler is placed on the Node object: start failing it's healthz state.

I started wondering about this one. Namely if we shouldn't do that in two steps...
(a) introduce livez that [reacts to the existence of taint], but doesn't touch healthz
(b) graduate to stable and give people time and chance to use something stable
(c) [potentially] only then change healthz itself

This is the "flip the polarity" path, but it would not be livez. More like "add /healthz-v2; get people to use it; deprecate old /healthz". That is the safer path, for sure. If we go that way, I don't think we will ever actually get rid of or fix the old healthz, and all of the existing clusters in the world will get no benefits. Providers will be reluctant to adopt the new path because then they have to support the old AND the new, and who wants to do that?

The lack of in-place updates sort of kills the value of this KEP, IMO, and the risk is not proportional to the reaction. That said, this is not such a high-urgency thing that we should take on dramatic risk. If you think I am wrong on the risk level, then we can try to go through a new URL endpoint and see if we can drive adoption. That means figuring ouot, for example, if cloud-providers support updates to healthcheck configs.

Thanks Tim - I guess I convinced by your explanation above. Let's just update the troubleshooting section as I requested and let's proceed with the plan

thockin

/approve
/lgtm

Assuming no directional change.

/hold

to confirm

keps/sig-network/3836-kube-proxy-improved-ingress-connectivity-reliability/README.md

wojtek-t · 2023-10-04T09:00:02Z

keps/sig-network/3836-kube-proxy-improved-ingress-connectivity-reliability/README.md

@@ -424,8 +421,12 @@ Not any different than today.

 ###### What are other known failure modes?

+No other are known.


The one described in Risks/Mitigations:

Vendors of Kubernetes which deploy Kube-proxy and specify a livenessProbe targeting /healthz are expected to start seeing a CrashLooping Kube-proxy when the Node gets tainted with ToBeDeletedByClusterAutoscaler. This is because: if we modify /healthz to fail when this taint gets added on the Node, then the livenessProbe will fail, causing the Kubelet to restart the Pod until the Node is deleted. As far as we can tell, no vendor set livenessProbe, nor does kubeadm, so the risk is low.

is actually relevant and we should probably mention it here [with bullet points from the template]

wojtek-t · 2023-10-04T09:01:55Z

keps/sig-network/3836-kube-proxy-improved-ingress-connectivity-reliability/kep.yaml

@@ -7,8 +7,8 @@ reviewers: ['@thockin', '@danwinship', "@aojea"]
 approvers: ['@thockin']
 creation-date: "2023-02-03"
 status: implementable
-stage: alpha
-latest-milestone: "v1.28"
+stage: beta


Thanks Tim - I guess I convinced by your explanation above. Let's just update the troubleshooting section as I requested and let's proceed with the plan

thockin · 2023-10-04T17:32:44Z

Regarding the 2 criteria - docs and tests can be written after KEP freeze, if you feel comfortable with the idea of getting to it?

thockin · 2023-10-04T17:34:07Z

Thanks!

/lgtm
/approve

wojtek-t · 2023-10-04T18:43:55Z

/lgtm
/approve PRR

Thanks!

k8s-ci-robot · 2023-10-04T18:44:08Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexanderConstantinescu, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [wojtek-t]
~~keps/sig-network/OWNERS~~ [thockin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot assigned thockin Oct 2, 2023

k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Oct 2, 2023

k8s-ci-robot assigned wojtek-t Oct 2, 2023

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 2, 2023

k8s-ci-robot requested review from dcbw and thockin October 2, 2023 09:09

k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 2, 2023

alexanderConstantinescu force-pushed the 3836-v129 branch from 7fa75d9 to 3ef92ad Compare October 2, 2023 09:23

wojtek-t reviewed Oct 2, 2023

View reviewed changes

alexanderConstantinescu force-pushed the 3836-v129 branch from 3ef92ad to 10c7e47 Compare October 2, 2023 13:05

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 2, 2023

alexanderConstantinescu force-pushed the 3836-v129 branch from 10c7e47 to 3d2cb40 Compare October 2, 2023 13:08

thockin reviewed Oct 2, 2023

View reviewed changes

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 2, 2023

npolshakova mentioned this pull request Oct 3, 2023

Kube-proxy improved ingress connectivity reliability #3836

Open

12 tasks

wojtek-t reviewed Oct 4, 2023

View reviewed changes

[KEP 3836]: 1.29 update

6539f3b

alexanderConstantinescu force-pushed the 3836-v129 branch from 3d2cb40 to 6539f3b Compare October 4, 2023 13:08

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2023

thockin removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 4, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 4, 2023

k8s-ci-robot merged commit 25ada4e into kubernetes:master Oct 4, 2023

k8s-ci-robot added this to the v1.29 milestone Oct 4, 2023

alexanderConstantinescu mentioned this pull request Jan 22, 2024

KEP 3836 - promote to beta kubernetes/kubernetes#122914

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[KEP 3836]: Bump to beta for 1.29 #4258

[KEP 3836]: Bump to beta for 1.29 #4258

alexanderConstantinescu commented Oct 2, 2023

alexanderConstantinescu commented Oct 2, 2023

wojtek-t Oct 2, 2023

alexanderConstantinescu Oct 2, 2023

thockin Oct 2, 2023

wojtek-t Oct 4, 2023

thockin left a comment

wojtek-t Oct 4, 2023

alexanderConstantinescu Oct 4, 2023

wojtek-t Oct 4, 2023

thockin commented Oct 4, 2023

thockin commented Oct 4, 2023

wojtek-t commented Oct 4, 2023

k8s-ci-robot commented Oct 4, 2023

		@@ -424,8 +421,12 @@ Not any different than today.

		###### What are other known failure modes?

		No other are known.

[KEP 3836]: Bump to beta for 1.29 #4258

[KEP 3836]: Bump to beta for 1.29 #4258

Conversation

alexanderConstantinescu commented Oct 2, 2023

alexanderConstantinescu commented Oct 2, 2023

wojtek-t Oct 2, 2023

Choose a reason for hiding this comment

alexanderConstantinescu Oct 2, 2023

Choose a reason for hiding this comment

thockin Oct 2, 2023

Choose a reason for hiding this comment

wojtek-t Oct 4, 2023

Choose a reason for hiding this comment

thockin left a comment

Choose a reason for hiding this comment

wojtek-t Oct 4, 2023

Choose a reason for hiding this comment

alexanderConstantinescu Oct 4, 2023

Choose a reason for hiding this comment

wojtek-t Oct 4, 2023

Choose a reason for hiding this comment

thockin commented Oct 4, 2023

thockin commented Oct 4, 2023

wojtek-t commented Oct 4, 2023

k8s-ci-robot commented Oct 4, 2023