Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KEP 3836]: Bump to beta for 1.29 #4258

Merged
merged 1 commit into from
Oct 4, 2023

Conversation

alexanderConstantinescu
Copy link
Member

  • One-line PR description: 1.29 milestone bump for KEP - 3836
  • Other comments: none

/sig network
/assign @thockin
/assign @wojtek-t

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Oct 2, 2023
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 2, 2023
@k8s-ci-robot k8s-ci-robot requested review from dcbw and thockin October 2, 2023 09:09
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 2, 2023
@alexanderConstantinescu
Copy link
Member Author

FYI: I didn't have time to get to two of the graduation criterias for this KEP to beta. Specifically it's missing:

- E2E tests coded before any feature implementation is made which highlights the

and
- Document written at https://kubernetes.io/docs/concepts/

keps/prod-readiness/sig-network/3836.yaml Show resolved Hide resolved
@@ -7,8 +7,8 @@ reviewers: ['@thockin', '@danwinship', "@aojea"]
approvers: ['@thockin']
creation-date: "2023-02-03"
status: implementable
stage: alpha
latest-milestone: "v1.28"
stage: beta
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments about the KEP itself:

As such any vendor doing this would get a heads-up during Alpha.

People are not using Alpha.
Can we make any survey of providers to check who will be hit by it?
@thockin - thoughts?

Implement change in Kube-proxy which will react to changes on the Node object and once the taint ToBeDeletedByClusterAutoscaler is placed on the Node object: start failing it's healthz state.

I started wondering about this one. Namely if we shouldn't do that in two steps...
(a) introduce livez that [reacts to the existence of taint], but doesn't touch healthz
(b) graduate to stable and give people time and chance to use something stable
(c) [potentially] only then change healthz itself

@thockin - thoguhts?

The metric: proxy_healthz_503_count mentioned in [Monitoring requirements]

I don't see that metric in the codebase - was this added?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

has this happened? Can you update the KEP with the test and findings?

Please fill in the remaining questions in Throubleshooting section.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make any survey of providers to check who will be hit by it?

Do we have a general procedure for that? I guess we could create a Google Form and send that out to dev@kubernetes.io, but that would still feel inconclusive.

I don't see that metric in the codebase - was this added?

Yes, here: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/metrics/metrics.go#L189-L209

The name changed during the implementation however, because we decided to have the HTTP status code be used as a label. I will update the KEP with this

has this happened?

No

Please fill in the remaining questions in Throubleshooting section.

OK, will update that now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People are not using Alpha.
Can we make any survey of providers to check who will be hit by it?

I went back and re-read the whole KEP and scanned the code in question. I agree that signal during alpha is ~0. That said, I can't reason any way this explodes (other than logic bugs, obviously) - but I also know that this is super subtle stuff.

What would we do differently with more information, and as much I know the subtlety, I doubt anyone would look at this plan and see the explosion-path. If this explodes, it's going to be something nobody can see until it's lit up. :(

I think the mitigations in place are as low-risk as we can get with this, short of flipping the polarity entirely, which will almost guarantee low adoption.

Implement change in Kube-proxy which will react to changes on the Node object and once the taint ToBeDeletedByClusterAutoscaler is placed on the Node object: start failing it's healthz state.

I started wondering about this one. Namely if we shouldn't do that in two steps...
(a) introduce livez that [reacts to the existence of taint], but doesn't touch healthz
(b) graduate to stable and give people time and chance to use something stable
(c) [potentially] only then change healthz itself

This is the "flip the polarity" path, but it would not be livez. More like "add /healthz-v2; get people to use it; deprecate old /healthz". That is the safer path, for sure. If we go that way, I don't think we will ever actually get rid of or fix the old healthz, and all of the existing clusters in the world will get no benefits. Providers will be reluctant to adopt the new path because then they have to support the old AND the new, and who wants to do that?

The lack of in-place updates sort of kills the value of this KEP, IMO, and the risk is not proportional to the reaction. That said, this is not such a high-urgency thing that we should take on dramatic risk. If you think I am wrong on the risk level, then we can try to go through a new URL endpoint and see if we can drive adoption. That means figuring ouot, for example, if cloud-providers support updates to healthcheck configs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Tim - I guess I convinced by your explanation above. Let's just update the troubleshooting section as I requested and let's proceed with the plan

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 2, 2023
Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
/lgtm

Assuming no directional change.

/hold

to confirm

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 2, 2023
@@ -424,8 +421,12 @@ Not any different than today.

###### What are other known failure modes?

No other are known.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one described in Risks/Mitigations:

Vendors of Kubernetes which deploy Kube-proxy and specify a livenessProbe targeting /healthz are expected to start seeing a CrashLooping Kube-proxy when the Node gets tainted with ToBeDeletedByClusterAutoscaler. This is because: if we modify /healthz to fail when this taint gets added on the Node, then the livenessProbe will fail, causing the Kubelet to restart the Pod until the Node is deleted. As far as we can tell, no vendor set livenessProbe, nor does kubeadm, so the risk is low.

is actually relevant and we should probably mention it here [with bullet points from the template]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -7,8 +7,8 @@ reviewers: ['@thockin', '@danwinship', "@aojea"]
approvers: ['@thockin']
creation-date: "2023-02-03"
status: implementable
stage: alpha
latest-milestone: "v1.28"
stage: beta
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Tim - I guess I convinced by your explanation above. Let's just update the troubleshooting section as I requested and let's proceed with the plan

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2023
@thockin
Copy link
Member

thockin commented Oct 4, 2023

Regarding the 2 criteria - docs and tests can be written after KEP freeze, if you feel comfortable with the idea of getting to it?

@thockin thockin removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 4, 2023
@thockin
Copy link
Member

thockin commented Oct 4, 2023

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2023
@wojtek-t
Copy link
Member

wojtek-t commented Oct 4, 2023

/lgtm
/approve PRR

Thanks!

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexanderConstantinescu, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 4, 2023
@k8s-ci-robot k8s-ci-robot merged commit 25ada4e into kubernetes:master Oct 4, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/network Categorizes an issue or PR as relevant to SIG Network. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants