Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-1972: kubelet exec probe timeouts #1973

Merged
merged 1 commit into from
Oct 2, 2020

Conversation

andrewsykim
Copy link
Member

Signed-off-by: Andrew Sy Kim kim.andrewsy@gmail.com

Kubelet today does not respect exec probe timeouts. This is an unexpected behavior
since the timeout value is supported in the Container Probe API. Because exec probe timeouts
were never respected by kubelet, a new feature gate ExecProbeTimeouts will be introduced
so users have an easy way to revert back if the newly introduced probe timeout results
in unexpected behavior.

KEP issue: #1972

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 8, 2020
@andrewsykim
Copy link
Member Author

/assign @SergeyKanzhelev

For initial review

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 8, 2020
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Sep 8, 2020
Comment on lines 44 to 47
the timeout value is supported in the Container Probe API. Because exec probe timeouts
were never respected by kubelet, a new feature gate `ExecProbeTimeouts` will be introduced
so users have an easy way to revert back if the newly introduced probe timeout results
in unexpected behavior.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit: the feature gate is for ops. For developers addressing the issue would be an increase of a timeout. So "users" and the action these users do in this context can be clarified.

Perhaps:

Suggested change
the timeout value is supported in the Container Probe API. Because exec probe timeouts
were never respected by kubelet, a new feature gate `ExecProbeTimeouts` will be introduced
so users have an easy way to revert back if the newly introduced probe timeout results
in unexpected behavior.
the timeout value is supported in the Container Probe API. Because exec probe timeouts
were never respected by kubelet, a new feature gate `ExecProbeTimeouts` will be introduced.
With this feature nodes can be configured to preserve the current behavior while the proper
timeouts are configured for the probes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated


### Goals

* fix exec probe timeouts in kubelet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps "treat exec probe timeouts as probe failures in kubelet"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
@SergeyKanzhelev
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 8, 2020
@dims
Copy link
Member

dims commented Sep 9, 2020

@andrewsykim are we missing some discussion on TODO(s) for different runtimes / dockershim?

@andrewsykim
Copy link
Member Author

@dims I don't think so, I think the two TODOs that came out from this discussion were:

  1. we should have CRI errors to handle things like timeout, this is going to be addressed in KEP: Custom errors in the CRI #1654
  2. deprecating dockershim -- should also be addressed in a separate issue / KEP.

Both of those I imagine will be addressed in separate KEPs, this is one is to only fix exec probe timeouts

@andrewsykim
Copy link
Member Author

/assign @dchen1107 @derekwaynecarr

@dims
Copy link
Member

dims commented Sep 9, 2020

@andrewsykim sounds good. thanks!

@SergeyKanzhelev
Copy link
Member

Another suggestion: should we increase scope of the KEP and also respect timeout in preStop exec hook?

kubernetes/kubernetes#92432

If so, the name of the flag may need to change to RespectExecTimeout to signify it's not only about probes. @andrewsykim WDYT?

@andrewsykim
Copy link
Member Author

Another suggestion: should we increase scope of the KEP and also respect timeout in preStop exec hook?

I agree that we should fix this, but I think grace termination period and probe timeouts are separate concerns here. I know some of the efforts around side car containers were also trying to address this so I wouldn't want to overlap there too much. I might be missing context on this one though.

@SergeyKanzhelev
Copy link
Member

Another suggestion: should we increase scope of the KEP and also respect timeout in preStop exec hook?

I agree that we should fix this, but I think grace termination period and probe timeouts are separate concerns here. I know some of the efforts around side car containers were also trying to address this so I wouldn't want to overlap there too much. I might be missing context on this one though.

thinking more about it - I'd say even adding logic to respect the gracePeriodOverride would likely not be a breaking change so no flag is needed. So this KEP should be gtg

### Test Plan

E2E tests:
* re-enable [existing exec liveness probe e2e test](https://github.com/kubernetes/kubernetes/blob/ea1458550077bdf3b26ac34551a3591d280fe1f5/test/e2e/common/container_probe.go#L210-L227) that is currently being skipped
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow :/

@derekwaynecarr
Copy link
Member

thanks for the write-up. agree with approach.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 2, 2020
@dchen1107
Copy link
Member

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, dchen1107, derekwaynecarr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants