diff --git a/keps/sig-node/1972-kubelet-exec-probe-timeouts/README.md b/keps/sig-node/1972-kubelet-exec-probe-timeouts/README.md
new file mode 100644
index 000000000000..febf0e955320
--- /dev/null
+++ b/keps/sig-node/1972-kubelet-exec-probe-timeouts/README.md
@@ -0,0 +1,108 @@
+# KEP-1972: Kubelet Exec Probe Timeouts
+
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Test Plan](#test-plan)
+  - [Graduation Criteria](#graduation-criteria)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+
+
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [X] (R) KEP approvers have approved the KEP status as `implementable`
+- [X] (R) Design details are appropriately documented
+- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
+- [X] (R) Graduation criteria is in place
+- [ ] (R) Production readiness review completed
+- [ ] Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+Kubelet today does not respect exec probe timeouts. This is considered a bug we should fix since
+the timeout value is supported in the Container Probe API. Because exec probe timeouts
+were never respected by kubelet, a new feature gate `ExecProbeTimeouts` will be introduced
+so users have an easy way to revert if the newly introduced probe timeout results
+in unexpected behavior.
+
+## Motivation
+
+Kubelet not respecting the probe timeout is a bug and should be fixed.
+
+### Goals
+
+* fix exec probe timeouts in kubelet
+
+### Non-Goals
+
+* ensuring exec processes that timed out have been killed by kubelet.
+
+## Proposal
+
+### Risks and Mitigations
+
+* existing workloads on Kubernetes that relied on this bug may unexpectedly see their probes time out
+
+## Design Details
+
+Changes to kubelet:
+* Ensure kubelet handles timeout errors and registers them as failing probes.
+* Add feature gate `ExecProbeTimeouts` that is GA and on by default.
+* If the feature gate `ExecProbeTimeouts` is disabled and an exec probe timeout is reached, add warning logs to inform users that exec probes are timing out.
+* Re-enable existing exec liveness probe e2e test.
+* Add new exec readiness probe e2e test.
+
+### Test Plan
+
+E2E tests:
+* re-enable [existing exec liveness probe e2e test](https://github.com/kubernetes/kubernetes/blob/ea1458550077bdf3b26ac34551a3591d280fe1f5/test/e2e/common/container_probe.go#L210-L227) that is currently being skipped
+* add new exec readiness probe e2e test.
+
+### Graduation Criteria
+
+This is a bug fix so the feature gate will be GA and on by default from the start.
+
+### Upgrade / Downgrade Strategy
+
+N/A
+
+### Version Skew Strategy
+
+N/A
+
+## Implementation History
+
+* 2020-09-08 - the KEP was merged as implementable for v1.20
+
+## Drawbacks
+
+* Existing workloads may depend on the fact that exec probe timeouts were never respected. Introducing
+the timeout now may result in unexpected behavior for some workloads.
+
+## Alternatives
+
+Some alternatives that were considered:
+1. Increasing the default timeout for exec probes
+2. Continuing to ignore the exec probe timeout
+
diff --git a/keps/sig-node/1972-kubelet-exec-probe-timeouts/kep.yaml b/keps/sig-node/1972-kubelet-exec-probe-timeouts/kep.yaml
new file mode 100644
index 000000000000..da956cba5d52
--- /dev/null
+++ b/keps/sig-node/1972-kubelet-exec-probe-timeouts/kep.yaml
@@ -0,0 +1,36 @@
+title: Kubelet Exec Probe Timeouts
+kep-number: 1972
+authors:
+  - "@andrewsykim"
+  - "@SergeyKanzhelev"
+owning-sig: sig-node
+participating-sigs:
+status: implementable
+creation-date: 2020-09-08
+reviewers:
+  - "@dchen1107"
+  - "@derekwaynecarr"
+approvers:
+  - "@dchen1107"
+  - "@derekwaynecarr"
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: stable
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.20"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  stable: "v1.20"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+  - name: ExecProbeTimeouts
+    components:
+      - kubelet
+disable-supported: true
+