Behavior change in daemonset pod eviction #5240

jdn5126 · 2022-10-07T21:23:30Z

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
v1.21.3

What k8s version are you using (kubectl version)?:

kubectl version Output

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.7-eks-4721010", GitCommit:"b77d9473a02fbfa834afa67d677fd12d690b195f", GitTreeState:"clean", BuildDate:"2022-06-27T22:22:16Z", GoVersion:"go1.17.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.13-eks-15b7512", GitCommit:"94138dfbea757d7aaf3b205419578ef186dd5efb", GitTreeState:"clean", BuildDate:"2022-08-31T19:15:48Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
AWS EKS

What did you expect to happen?:
Daemonset pods, especially those marked system-node-critical, to either not be evicted by default or to be evicted only after all other pods are evicted.

What happened instead?:
Daemonset pods were evicted before other non-critical and non-daemonset pods.

How to reproduce it (as minimally and precisely as possible):

Deploy a daemonset pod without the cluster-autoscaler.kubernetes.io/enable-ds-eviction annotation, and run cluster-autoscaler without --daemonset-eviction-for-occupied-nodes=false.
Deploy other pods with a high termination grace period on the same node.
Schedule eviction of the node.
Observe that the daemonset pod is evicted before the other pods.

Anything else we need to know?:
I understand that cluster-autoscaler.kubernetes.io/enable-ds-eviction and --daemonset-eviction-for-occupied-nodes were added to control daemonset pod eviction, but the default behavior on occupied nodes changed from "do not evict daemonset pods" to "evict daemonset pods unless the annotation is present". This change in default behavior can cause regressions and surprises in existing clusters.
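For anyone who can edit the daemonset, a minimal sketch of the per-pod opt-out follows. It builds a merge patch that sets the annotation named above to "false". It assumes the annotation is read from the pods, so it is placed on the daemonset's pod template rather than on the daemonset object; the helper name `ds_eviction_patch` is hypothetical.

```python
import json

# Annotation key quoted in this issue. The value "false" is meant to opt the
# pods out of eviction; "true" opts them in.
DS_EVICTION_ANNOTATION = "cluster-autoscaler.kubernetes.io/enable-ds-eviction"


def ds_eviction_patch(enabled: bool) -> dict:
    """Build a merge patch that sets the annotation on a daemonset's pod template.

    Hypothetical helper for illustration; it only builds the patch body and
    does not talk to a cluster.
    """
    value = "true" if enabled else "false"
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {DS_EVICTION_ANNOTATION: value}
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the patch body as JSON so it can be reviewed before being applied.
    print(json.dumps(ds_eviction_patch(enabled=False), indent=2))
```

The printed JSON could be applied with something like `kubectl patch daemonset <name> --type merge -p '<json>'`, where `<name>` is a placeholder. This does not help when the daemonset is managed by the cloud provider and local changes are overwritten.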

If the default behavior cannot be changed back, it seems like daemonset pods, or more specifically system-node-critical pods, should not be evicted until all other pods are evicted. See issue #4337 for a similar discussion.
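To make the requested ordering concrete, here is a rough sketch of it. It is not cluster-autoscaler code: the `Pod` type, `eviction_rank`, and `eviction_order` are hypothetical names, and the three-tier ranking is only one possible reading of the proposal (regular pods first, daemonset pods next, system-node-critical daemonset pods last).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Pod:
    """Simplified stand-in for a pod; not the Kubernetes API type."""
    name: str
    is_daemonset: bool = False
    priority_class: str = ""


def eviction_rank(pod: Pod) -> int:
    """Lower rank means evicted earlier."""
    if pod.is_daemonset and pod.priority_class == "system-node-critical":
        return 2  # critical daemonset pods go last
    if pod.is_daemonset:
        return 1  # other daemonset pods go after regular workloads
    return 0      # regular workloads go first


def eviction_order(pods: Iterable[Pod]) -> List[Pod]:
    """Return pods in the proposed eviction order.

    sorted() is stable, so pods with the same rank keep their input order.
    """
    return sorted(pods, key=eviction_rank)


if __name__ == "__main__":
    pods = [
        Pod("cni", is_daemonset=True, priority_class="system-node-critical"),
        Pod("web"),
        Pod("log-agent", is_daemonset=True),
        Pod("worker"),
    ]
    for pod in eviction_order(pods):
        print(pod.name)
```

A real implementation would also have to wait for the earlier tiers to finish terminating before starting the next one, since pods with a long termination grace period are exactly the case described in the reproduction steps.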

jdn5126 · 2022-11-15T19:58:06Z

Any thoughts on this?

Kuzbekov · 2023-01-15T23:39:09Z

We faced the same situation in GKE (1.24.6-gke.1500) with the managed local DNS service. Our services take a short time to stop, but the local DNS pod usually stops before some workloads do, which leads to failed requests.
The problem is that we can't change the relevant autoscaler settings, and we can't change the daemonset's annotations because the cloud provider could overwrite them.

dims · 2023-03-17T14:20:14Z

xref: kubernetes-sigs/aws-fsx-csi-driver#253

k8s-triage-robot · 2023-06-15T14:59:50Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2023-07-15T15:22:31Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2024-01-19T22:05:51Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-01-19T22:05:57Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jdn5126 added the kind/bug label Oct 7, 2022

jbartosik added the area/cluster-autoscaler label Oct 10, 2022

jdn5126 mentioned this issue Oct 28, 2022

[CNI]: Teardown pod networking resources without IPAMD when possible aws/amazon-vpc-cni-k8s#2125

Closed

jdn5126 mentioned this issue Nov 17, 2022

[CNI]: Teardown pod network when IPAMD connection fails aws/amazon-vpc-cni-k8s#2145

Merged

dims mentioned this issue Mar 17, 2023

[AWS-FSX] Pod is hanging at terminating when evicting pods for scaling down kubernetes-sigs/aws-fsx-csi-driver#253

Closed

jacobwolfaws mentioned this issue Mar 20, 2023

Only evict ds pods after other pods are evicted #5607

Closed

jacobwolfaws mentioned this issue Apr 11, 2023

Only evict ds pods after other pods are evicted #5674

Closed

k8s-ci-robot added the lifecycle/stale label Jun 15, 2023

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 15, 2023

k8s-ci-robot closed this as not planned Jan 19, 2024

jfcoz mentioned this issue Mar 28, 2024

fix daemonset eviction #6666

Closed
