Preemption picks wrong victim node with higher priority pod on it after #128307 #129136

Open
mochizuki875 opened this issue Dec 10, 2024 · 12 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@mochizuki875
Member

What happened?

related: #128307

After #128307 was merged, the preemption logic picks the wrong victim node, one that has a higher-priority pod on it.
In the following situation, the high pod on worker1, rather than the mid pod on worker2, is evicted when the very-high pod(Priority=10000) attempts to schedule.

  • worker1
    • high pod(Priority=1000)
    • low pod(Priority=0, preempting it will violate PDB)
  • worker2
    • mid pod(Priority=100, preempting it will violate PDB)

What did you expect to happen?

According to the comment here, the mid pod on worker2 should be picked as the victim.

// pickOneNodeForPreemption chooses one node among the given nodes.
// It assumes pods in each map entry are ordered by decreasing priority.
// If the scoreFuns is not empty, It picks a node based on score scoreFuns returns.
// If the scoreFuns is empty,
// It picks a node based on the following criteria:
// 1. A node with minimum number of PDB violations.
// 2. A node with minimum highest priority victim is picked.
// 3. Ties are broken by sum of priorities of all victims.
// 4. If there are still ties, node with the minimum number of victims is picked.
// 5. If there are still ties, node with the latest start time of all highest priority victims is picked.
// 6. If there are still ties, the first such node is picked (sort of randomly).
// The 'minNodes1' and 'minNodes2' are being reused here to save the memory
// allocation and garbage collection time.
func pickOneNodeForPreemption(logger klog.Logger, nodesToVictims map[string]*extenderv1.Victims, scoreFuncs []func(node string) int64) string {
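
To make the first two criteria concrete, here is a minimal sketch of how I read them (hypothetical types and data, not the actual scheduler code): count the PDB-violating victims on each candidate node first, and only on a tie compare the nodes' highest victim priorities.

package main

import (
    "fmt"
    "math"
)

// victim is a hypothetical, simplified stand-in for the scheduler's victim data.
type victim struct {
    name        string
    priority    int32
    violatesPDB bool
}

// pickNode illustrates criteria 1 and 2 only: fewer PDB-violating victims wins;
// on a tie, the node whose highest victim priority is lower wins.
func pickNode(nodesToVictims map[string][]victim) string {
    best := ""
    bestViolations := math.MaxInt
    bestHighest := int32(math.MinInt32)
    for node, victims := range nodesToVictims {
        violations := 0
        highest := int32(math.MinInt32)
        for _, v := range victims {
            if v.violatesPDB {
                violations++
            }
            if v.priority > highest {
                highest = v.priority
            }
        }
        if best == "" || violations < bestViolations ||
            (violations == bestViolations && highest < bestHighest) {
            best, bestViolations, bestHighest = node, violations, highest
        }
    }
    return best
}

func main() {
    // Victim sets as I read the scenario: both nodes have one PDB violation,
    // so the tie-break on the highest victim priority should pick worker2 (mid).
    nodes := map[string][]victim{
        "kind-v1.31.3-worker":  {{"high", 1000, false}, {"low", 0, true}},
        "kind-v1.31.3-worker2": {{"mid", 100, true}},
    }
    fmt.Println(pickNode(nodes)) // kind-v1.31.3-worker2
}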

How can we reproduce it (as minimally and precisely as possible)?

We can reproduce it by the following steps using kind.

Preparation

Use kind.

$ kind version
kind v0.25.0 go1.22.9 linux/amd64

Build a node image for v1.31.3.

$ kind build node-image v1.31.3 --image ndest/node:1.31.3-build

Prepare the kind cluster config and create the cluster.

kind-config-1.31.3.yaml

kind: Cluster
apiVersion: "kind.x-k8s.io/v1alpha4"
name: kind-v1.31.3
nodes:
- role: control-plane
  image: ndest/node:1.31.3-build
- role: worker
  image: ndest/node:1.31.3-build
- role: worker
  image: ndest/node:1.31.3-build
$ kind create cluster --config kind-config-1.31.3.yaml

$ kubectl get no
NAME                         STATUS   ROLES           AGE   VERSION
kind-v1.31.3-control-plane   Ready    control-plane   33s   v1.31.3
kind-v1.31.3-worker          Ready    <none>          18s   v1.31.3
kind-v1.31.3-worker2         Ready    <none>          18s   v1.31.3

Add a PriorityClass to DaemonSet/kindnet to prevent it from being preempted.

$ kubectl -n kube-system patch ds kindnet --patch '{"spec": {"template": {"spec": {"priorityClassName": "system-node-critical"}}}}'

Modify maxPods in the kubelet config on worker and worker2 so that preemption can be triggered.

# worker
$ docker exec -it kind-v1.31.3-worker /bin/bash
root@kind-v1:/# echo "maxPods: 4" >> /var/lib/kubelet/config.yaml
root@kind-v1:/# systemctl restart kubelet
root@kind-v1:/# exit

# worker2
$ docker exec -it kind-v1.31.3-worker2 /bin/bash
root@kind-v1:/# echo "maxPods: 3" >> /var/lib/kubelet/config.yaml
root@kind-v1:/# systemctl restart kubelet
root@kind-v1:/# exit

Each worker already runs kindnet and kube-proxy, so now we can schedule 2 more pods on worker and 1 more pod on worker2.

$ k get no -A -o='custom-columns=NAME:.metadata.name,MAXPOD:.status.capacity.pods,VERSION:.status.nodeInfo.kubeletVersion'
NAME                         MAXPOD   VERSION
kind-v1.31.3-control-plane   110      v1.31.3
kind-v1.31.3-worker          4        v1.31.3
kind-v1.31.3-worker2         3        v1.31.3

$ k get po -A -owide | grep -w kind-v1.31.3-worker | wc -l
2
$ k get po -A -owide | grep -w kind-v1.31.3-worker2 | wc -l
2

Create the PriorityClasses, PodDisruptionBudgets, and Pods.
After applying these manifests, we end up in the following situation.

high-priority.yaml

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
globalDefault: false
description: "This priority class should be used for high priority pods."

high.yaml

apiVersion: v1
kind: Pod
metadata:
  name: high
  labels:
    app: high
spec:
  containers:
  - name: nginx
    image: nginx
  priorityClassName: high-priority
  nodeSelector:
    kubernetes.io/hostname: kind-v1.31.3-worker

mid-priority.yaml

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: mid-priority
value: 100
globalDefault: false
description: "This priority class should be used for mid priority pods."

mid.yaml

apiVersion: v1
kind: Pod
metadata:
  name: mid
  labels:
    app: mid
spec:
  containers:
  - name: nginx
    image: nginx
  priorityClassName: mid-priority
  nodeSelector:
    kubernetes.io/hostname: kind-v1.31.3-worker2

low-priority.yaml

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 0
globalDefault: false
description: "This priority class should be used for low priority pods."

low.yaml

apiVersion: v1
kind: Pod
metadata:
  name: low
  labels:
    app: low
spec:
  containers:
  - name: nginx
    image: nginx
  priorityClassName: low-priority
  nodeSelector:
    kubernetes.io/hostname: kind-v1.31.3-worker

mid-pdb.yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mid-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: mid

low-pdb.yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: low-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: low

very-high-priority.yaml

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: very-high-priority
value: 10000
globalDefault: false
description: "This priority class should be used for very high priority pods."

worker and worker2 have reached their maxPods limits, and evicting the low or mid pod would violate its PDB.

$ k get po -o='custom-columns=NAME:.metadata.name,STATUS:.status.phase,PRIORITY:.spec.priority,NODE:.spec.nodeName'
NAME   STATUS    PRIORITY   NODE
high   Running   1000       kind-v1.31.3-worker
low    Running   0          kind-v1.31.3-worker
mid    Running   100        kind-v1.31.3-worker2

$ k get priorityclasses
NAME                      VALUE        GLOBAL-DEFAULT   AGE
high-priority             1000         false            26s
low-priority              0            false            26s
mid-priority              100          false            26s
system-cluster-critical   2000000000   false            17m
system-node-critical      2000001000   false            17m
very-high-priority        10000        false            26s

$ k get pdb
NAME      MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
low-pdb   1               N/A               0                     47s
mid-pdb   1               N/A               0                     47s

$ k get po -A -owide | grep -w kind-v1.31.3-worker | wc -l
4
$ k get po -A -owide | grep -w kind-v1.31.3-worker2 | wc -l
3

Now, attempt to schedule the very-high pod.

very-high.yaml

apiVersion: v1
kind: Pod
metadata:
  name: very-high
  labels:
    app: very-high
spec:
  containers:
  - name: nginx
    image: nginx
  priorityClassName: very-high-priority
$ k apply -f very-high.yaml

We can see that the high pod on worker is evicted instead of the mid pod on worker2.

$ k get po -owide -w
NAME   READY   STATUS    RESTARTS   AGE   IP           NODE                   NOMINATED NODE   READINESS GATES
high   1/1     Running   0          6m    10.244.1.2   kind-v1.31.3-worker    <none>           <none>
low    1/1     Running   0          6m    10.244.1.3   kind-v1.31.3-worker    <none>           <none>
mid    1/1     Running   0          6m    10.244.2.2   kind-v1.31.3-worker2   <none>           <none>


very-high   0/1     Pending   0          0s    <none>       <none>                 <none>           <none>
high        1/1     Running   0          6m9s   10.244.1.2   kind-v1.31.3-worker    <none>           <none>
high        1/1     Terminating   0          6m9s   10.244.1.2   kind-v1.31.3-worker    <none>           <none>
very-high   0/1     Pending       0          0s     <none>       <none>                 kind-v1.31.3-worker   <none>
high        1/1     Terminating   0          6m9s   10.244.1.2   kind-v1.31.3-worker    <none>                <none>
high        0/1     Completed     0          6m9s   10.244.1.2   kind-v1.31.3-worker    <none>                <none>
high        0/1     Completed     0          6m10s   10.244.1.2   kind-v1.31.3-worker    <none>                <none>
high        0/1     Completed     0          6m10s   10.244.1.2   kind-v1.31.3-worker    <none>                <none>
very-high   0/1     Pending       0          2s      <none>       kind-v1.31.3-worker    kind-v1.31.3-worker   <none>
very-high   0/1     ContainerCreating   0          2s      <none>       kind-v1.31.3-worker    <none>                <none>
very-high   1/1     Running             0          3s      10.244.1.4   kind-v1.31.3-worker    <none>                <none>

$ k get po -owide
NAME        READY   STATUS    RESTARTS   AGE     IP           NODE                   NOMINATED NODE   READINESS GATES
low         1/1     Running   0          7m23s   10.244.1.3   kind-v1.31.3-worker    <none>           <none>
mid         1/1     Running   0          7m23s   10.244.2.2   kind-v1.31.3-worker2   <none>           <none>
very-high   1/1     Running   0          74s     10.244.1.4   kind-v1.31.3-worker    <none>           <none>

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.31.3

Cloud provider

none

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

$ kind version
kind v0.25.0 go1.22.9 linux/amd64

Related plugins (CNI, CSI, ...) and versions (if applicable)

@mochizuki875 mochizuki875 added the kind/bug Categorizes issue or PR as related to a bug. label Dec 10, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 10, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mochizuki875
Member Author

/sig scheduling
/kind bug

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 10, 2024
@mochizuki875
Member Author

I've investigated the codebase and tried to find the cause and fix it.
#129137

@mochizuki875
Member Author

/assign

@AxeZhan
Member

AxeZhan commented Dec 10, 2024

Since evicting the high pod(Priority=1000) will not violate a PDB but evicting the mid pod does, evicting the high pod seems right to me?

@mochizuki875
Member Author

mochizuki875 commented Dec 11, 2024

@AxeZhan
Thank you for your comment.

I also thought so at first.
However, the code comment says:

  1. A node with minimum number of PDB violations.
  2. A node with minimum highest priority victim is picked.

So in this case,

  • step1: Each node has the same number of PDB violations.
    • worker1 has 1 PDB violation
    • worker2 has 1 PDB violation
  • step2: The priority of mid on worker2 is lower than that of high on worker1.
    • worker1 has high pod(Priority=1000)
    • worker2 has mid pod(Priority=100)

So it seems that mid (on worker2) should be picked.

In addition, the original PR #128307 and the current test code also seem to intend the same result.

@AxeZhan
Member

AxeZhan commented Dec 11, 2024

In addition, the original PR #128307 and the current test code also seem to intend the same result.

In this test case, we have two preempt options:

  1. evict p1 and p2 on node 1.
  2. evict p3 on node2

In such a scenario, both options violate one PDB.

However, in the details you posted, only the high pod is evicted, which means no PDB is violated, and I think this is reasonable.

A node with minimum number of PDB violations.
A node with minimum highest priority victim is picked.

I think in this comment, what they mean by "node" is actually "victim."

In other words, on a certain node, we need to evict some pods (e.g., pod1 and pod2), but there are still some pods that don't need to be evicted (e.g., pod3). So, during the calculation process, we will ignore the pods that don't need to be evicted (like pod3).
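
To illustrate with the scenario in this issue, here is a rough sketch (hypothetical types and data, not the actual scheduler code) of how the two candidate solutions compare once the pods that don't need to be evicted are ignored:

package main

import "fmt"

// candidate is a hypothetical stand-in for one preemption solution:
// only the pods that actually have to be evicted on that node.
type candidate struct {
    node          string
    victims       []string
    pdbViolations int // victims whose eviction would violate a PDB
}

func main() {
    // In this issue, low stays on worker (it is not needed as a victim),
    // so evicting only high on worker violates no PDB at all.
    candidates := []candidate{
        {node: "kind-v1.31.3-worker", victims: []string{"high"}, pdbViolations: 0},
        {node: "kind-v1.31.3-worker2", victims: []string{"mid"}, pdbViolations: 1},
    }

    // Criterion 1: the solution with the fewest PDB violations wins outright,
    // so the highest-victim-priority tie-break (criterion 2) is never reached.
    best := candidates[0]
    for _, c := range candidates[1:] {
        if c.pdbViolations < best.pdbViolations {
            best = c
        }
    }
    fmt.Printf("picked %s, evicting %v\n", best.node, best.victims)
}

So the two solutions are not tied at criterion 1, and worker wins before victim priorities are ever compared.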

also /cc @kerthcet @Huang-Wei

@Huang-Wei
Member

However in the details you posted, only the high pod is evicted, which leads us to no PDB violated, and I think this is reasonable.

@AxeZhan is right. Only when two preemption solutions yield the same number of PDB violations are they treated as a tie. Otherwise, the solution with fewer PDB violations wins.

@mochizuki875
Member Author

@AxeZhan @Huang-Wei
Thank you!

I think in this comment, what they mean by "node" is actually "victim."

I had misunderstood this point.
So if the two nodes each have one victim as below, does avoiding a PDB violation take precedence over priority, so that high will be evicted?

  • worker1
    • high pod(Priority=1000)
  • worker2
    • low pod(Priority=0, preempting it will violate PDB)

@AxeZhan
Member

AxeZhan commented Dec 11, 2024

So if the two nodes each have one victim as below, does avoiding a PDB violation take precedence over priority, so that high will be evicted?

Yes, I think so. We must prioritize honoring PDBs: worker1's only victim (high) violates no PDB, while evicting low on worker2 violates one, so criterion 1 already picks worker1 and high is evicted despite its higher priority.

@kerthcet
Member

Seems no problem here :)

@mochizuki875
Member Author

Thanks to you all, I now have a clear understanding.
So I've closed #129137.

By the way, should we rethink the wording of the comment discussed in #129136 (comment)?

In particular, I think it is hard to see that "1. A node with minimum number of PDB violations." actually means "A node whose victims have the minimum number of PDB violations is picked."
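
For example, rewording the first two criteria along these lines might be clearer (just a suggestion):

// 1. A node whose victims have the minimum number of PDB violations is picked.
// 2. A node whose highest-priority victim has the lowest priority is picked.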
