Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-4603: initial KEP for Tune CrashLoopBackoff #4604

Merged

Conversation

lauralorenz
Copy link
Contributor

@lauralorenz lauralorenz commented Apr 30, 2024

  • One-line PR description: Adding new KEP
  • Other comments: Reorganized entirely in order to make this mergeable as provisional ahead of 1.32 changes.

Signed-off-by: Laura Lorenz <lauralorenz@google.com>
@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 30, 2024
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 30, 2024
@lauralorenz lauralorenz mentioned this pull request Apr 30, 2024
5 tasks
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 3, 2024
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Copy link
Member

@tallclair tallclair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good explanation of the changes and the reasoning behind them. I like that this is addressing major use cases without a ton of new complexity.

I think it's important that we get something out that we can start testing, so I wouldn't consider any of my suggestions on the specific backoff rates blocking.

keps/sig-node/4603-tune-crashloopbackoff/README.md Outdated Show resolved Hide resolved
keps/sig-node/4603-tune-crashloopbackoff/README.md Outdated Show resolved Hide resolved
keps/sig-node/4603-tune-crashloopbackoff/kep.yaml Outdated Show resolved Hide resolved
keps/sig-node/4603-tune-crashloopbackoff/README.md Outdated Show resolved Hide resolved
#### Subsidize running time in backoff delay
FIXME: Subsidize latest succesful pod running time/readinessProbe/livenessProbe
into the CrashLoopBackOff backoff, potentially restarting the backoff counter as
a result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea. If I'm interpretting it correctly, this would mean a container who's next backoff delay is 1m, that runs for 30s before crashing, would only need to wait 30s before restarting? That sounds like a desirable property, although in practice I suspect most containers entering crashloop are restarting in the first few seconds (or milliseconds).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would help in the case where a container was running for some short time before crashing, and the user wants it to restart immediately. In the rapid restart case, this could let the container restart immediately.

…lternatives

Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to add PRR document in https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-node. I'll handle the PRR approval for this.

Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: lauralorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: lauralorenz <lauralorenz@google.com>
@lauralorenz
Copy link
Contributor Author

I've updated this PR to be valid as to merge now in a provisional state, without the PRR, and with several UNRESOLVED blocks, so I can separate out the updates to the 1.32 variation so that can be handled in a separate, clean PR for discussion.

Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Copy link
Member

@tallclair tallclair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a little trouble identifying what specific changes are being proposed for the next release here. I think consolidating all the proposed changes into the proposal section will help with that. Or maybe I'm just confused because you're planning to add this in a separate PR?

Also, for every unresolved section, can you add a note with what stage it needs to be resolved by?

nitty-gritty.
-->

This design seeks to incorporate a two-pronged approach:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the proposal section should be updated to state concisely what you are proposing for v1.32, and move the unresolved portion to the end of the proposal section, and explicitly state that it's out of scope for the initial alpha. Or, move it to alternatives considered, or future extensions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on @tallclair above. Also can move some of them to beta phase.

and anticipate the change in load and node stability as a result of upgrading to
these changes.

Note that proposal will NOT change:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Merge this into the non-goals section

churn of Pods through CrashLoopBackoff phases, the central API server could
become unresponsive, effectively taking down an entire cluster.

The same risk exists for the <<[UNRESOLVED]>> per Node <<[/UNRESOLVED]>>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe just put this whole paragraph in unresolved?

keps/sig-node/4603-tune-crashloopbackoff/README.md Outdated Show resolved Hide resolved
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Copy link
Member

@tallclair tallclair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

LGTM for provisional state, for the non-unresolved sections.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 16, 2024
@bart0sh
Copy link
Contributor

bart0sh commented Sep 19, 2024

/assign @SergeyKanzhelev
for approval

@lauralorenz
Copy link
Contributor Author

lauralorenz commented Sep 20, 2024

FYI to connect the dots for posterity, 1.32 PR (draft right now) is at lauralorenz#21

UPDATE: PR into k/k is at #4893

Signed-off-by: Laura Lorenz <lauralorenz@google.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 1, 2024
@tallclair
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 1, 2024
@dchen1107
Copy link
Member

There are the following unresolved topics:

  • Rapid Restart Implementation
  • Node-Level Opt-in
  • Benchmarking and Stress Testing
  • Interaction with Job API
  • Kubelet Overhead Analysis

@lauralorenz and @tallclair indicates they are going to be covered in the following KEP PRs for 1.32, which makes this implementable.

/approve at the provisional state

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, lauralorenz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 1, 2024
@k8s-ci-robot k8s-ci-robot merged commit ed5d0b1 into kubernetes:master Oct 1, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Oct 1, 2024
@lauralorenz
Copy link
Contributor Author

Thanks all! PR into k/k for 1.32 is at #4893

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

9 participants