-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-4603: initial KEP for Tune CrashLoopBackoff #4604
KEP-4603: initial KEP for Tune CrashLoopBackoff #4604
Conversation
lauralorenz
commented
Apr 30, 2024
•
edited
Loading
edited
- One-line PR description: Adding new KEP
- Issue link: Tune CrashLoopBackOff #4603
- Other comments: Reorganized entirely in order to make this mergeable as provisional ahead of 1.32 changes.
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Skipping CI for Draft Pull Request. |
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good explanation of the changes and the reasoning behind them. I like that this is addressing major use cases without a ton of new complexity.
I think it's important that we get something out that we can start testing, so I wouldn't consider any of my suggestions on the specific backoff rates blocking.
#### Subsidize running time in backoff delay | ||
FIXME: Subsidize latest succesful pod running time/readinessProbe/livenessProbe | ||
into the CrashLoopBackOff backoff, potentially restarting the backoff counter as | ||
a result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea. If I'm interpretting it correctly, this would mean a container who's next backoff delay is 1m, that runs for 30s before crashing, would only need to wait 30s before restarting? That sounds like a desirable property, although in practice I suspect most containers entering crashloop are restarting in the first few seconds (or milliseconds).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would help in the case where a container was running for some short time before crashing, and the user wants it to restart immediately. In the rapid restart case, this could let the container restart immediately.
…lternatives Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to add PRR document in https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-node. I'll handle the PRR approval for this.
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: lauralorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: lauralorenz <lauralorenz@google.com>
I've updated this PR to be valid as to merge now in a provisional state, without the PRR, and with several UNRESOLVED blocks, so I can separate out the updates to the 1.32 variation so that can be handled in a separate, clean PR for discussion. |
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a little trouble identifying what specific changes are being proposed for the next release here. I think consolidating all the proposed changes into the proposal section will help with that. Or maybe I'm just confused because you're planning to add this in a separate PR?
Also, for every unresolved section, can you add a note with what stage it needs to be resolved by?
nitty-gritty. | ||
--> | ||
|
||
This design seeks to incorporate a two-pronged approach: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like the proposal section should be updated to state concisely what you are proposing for v1.32, and move the unresolved portion to the end of the proposal section, and explicitly state that it's out of scope for the initial alpha. Or, move it to alternatives considered, or future extensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on @tallclair above. Also can move some of them to beta phase.
and anticipate the change in load and node stability as a result of upgrading to | ||
these changes. | ||
|
||
Note that proposal will NOT change: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Merge this into the non-goals section
churn of Pods through CrashLoopBackoff phases, the central API server could | ||
become unresponsive, effectively taking down an entire cluster. | ||
|
||
The same risk exists for the <<[UNRESOLVED]>> per Node <<[/UNRESOLVED]>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe just put this whole paragraph in unresolved?
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
LGTM for provisional state, for the non-unresolved sections.
/assign @SergeyKanzhelev |
FYI to connect the dots for posterity, 1.32 PR (draft right now) is at lauralorenz#21 UPDATE: PR into k/k is at #4893 |
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
/lgtm |
There are the following unresolved topics:
@lauralorenz and @tallclair indicates they are going to be covered in the following KEP PRs for 1.32, which makes this implementable. /approve at the provisional state |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dchen1107, lauralorenz The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks all! PR into k/k for 1.32 is at #4893 |