Preemption picks wrong victim node with higher priority pod on it after #128307 (#129136)
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.
/sig scheduling
I've investigated the codebase and tried to find the cause and fix it.
/assign
Since evicting
@AxeZhan I also thought so at first.
So in this case,
So it seems correct that
In addition, the original PR #128307 and the current test code also seem to intend the same result.
In this test case, we have two preemption options:
In such a scenario, both options violate one PDB. However, in the details you posted, only the
I think in this comment, what they mean by "node" is actually "victim." In other words, on a certain node, we need to evict some pods (e.g., pod1 and pod2), but there are still some pods that don't need to be evicted (e.g., pod3). So, during the calculation process, we will ignore the pods that don't need to be evicted (like pod3). Also /cc @kerthcet @Huang-Wei
@AxeZhan is right. Only when two preemption solutions yield the same number of PDB violations are they treated as a tie. Otherwise, the solution with fewer PDB violations wins.
@AxeZhan @Huang-Wei
I've misunderstood this point.
Yes, I think so. We must prioritize honoring PDBs.
Seems there is no problem here :)
Thanks to you, I gained a clear understanding. By the way, should we rethink the comment related to #129136 (comment)? In particular, I think it is difficult to understand that
What happened?
related: #128307
After #128307 was merged, the preemption logic picks the wrong victim node, one that has a higher-priority pod on it.
In the following situation, the `high` pod on `worker1`, not the `mid` pod on `worker2`, is evicted when the `very-high` pod (Priority=10000) attempts to schedule.

worker1:
- `high` pod (Priority=1000)
- `low` pod (Priority=0, preempting it will violate PDB)

worker2:
- `mid` pod (Priority=100, preempting it will violate PDB)

What did you expect to happen?
According to here, the `mid` pod on `worker2` seems to be the one that should be picked as a victim.

kubernetes/pkg/scheduler/framework/preemption/preemption.go
Lines 411 to 424 in 1148e5e
How can we reproduce it (as minimally and precisely as possible)?
We can reproduce it with the following steps using kind.
Preparation
Use kind.
Build a node image of v1.31.3.
Prepare a kind cluster config and create the cluster.
kind-config-1.31.3.yaml
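The contents of this config are collapsed in the original report. As a rough sketch only (the node image tag and the exact layout are assumptions, not the reporter's actual file), a kind config with one control plane and the two workers could look like this:

```yaml
# kind-config-1.31.3.yaml (sketch, not the reporter's actual file):
# one control plane plus the two nodes referred to as worker/worker2.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.31.3   # assumption: tag of the locally built v1.31.3 node image
- role: worker
  image: kindest/node:v1.31.3
- role: worker
  image: kindest/node:v1.31.3
```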
Add a PriorityClass to DaemonSet/kindnet to prevent it from being preempted.
$ kubectl -n kube-system patch ds kindnet --patch '{"spec": {"template": {"spec": {"priorityClassName": "system-node-critical"}}}}'
Modify `maxPods` in the `kubelet` config on `worker` and `worker2` to trigger preemption (a sketch of one way to do this follows below). Now we can schedule 2 pods on `worker` and 1 pod on `worker2`.
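The report does not show how `maxPods` was changed; one way (an assumption, not necessarily what the reporter did) is to adjust the `KubeletConfiguration` at `/var/lib/kubelet/config.yaml` inside each worker node and restart the kubelet. The value below is illustrative only:

```yaml
# /var/lib/kubelet/config.yaml excerpt (sketch): lower the pod capacity per node.
# The exact number is an assumption and depends on how many system pods
# (kindnet, kube-proxy, ...) run on each node; use a smaller value on worker2.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 4
```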
Create the `PriorityClass`, `PodDisruptionBudget`, and `Pod` objects from the manifests below.
high-priority.yaml
high.yaml
mid-priority.yaml
mid.yaml
low-priority.yaml
low.yaml
mid-pdb.yaml
low-pdb.yaml
very-high-priority.yaml
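The manifest contents are collapsed in the report. As a sketch of what the `mid` set plausibly looks like (the priority value comes from the description above; names, labels, selectors, and the container image are assumptions):

```yaml
# mid-priority.yaml (sketch): PriorityClass with the value quoted in the report
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: mid-priority
value: 100
globalDefault: false
---
# mid-pdb.yaml (sketch): a PDB that makes evicting the single mid pod a violation
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mid-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: mid          # assumption: label linking the PDB to the pod below
---
# mid.yaml (sketch): the pod that ends up on worker2 and uses the PriorityClass
apiVersion: v1
kind: Pod
metadata:
  name: mid
  labels:
    app: mid
spec:
  priorityClassName: mid-priority
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9   # assumption: any small long-running image works
```

The `high`, `low`, and `very-high` manifests would follow the same pattern with the priority values from the description (1000, 0, and 10000); per the report, only `mid` and `low` are covered by PDBs.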
Applying these manifests, we can see the following situation: `worker` and `worker2` have reached their `maxPods` limit, and evicting the `low` or `mid` pod violates a PDB.

Now, attempt to schedule the `very-high` pod.

very-high.yaml
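Again only a sketch (metadata and image are assumptions); the preemptor pod simply references the Priority=10000 class:

```yaml
# very-high.yaml (sketch): the preemptor that triggers the preemption decision
apiVersion: v1
kind: Pod
metadata:
  name: very-high
spec:
  priorityClassName: very-high-priority   # assumption: name defined in very-high-priority.yaml
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
```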
We can see that the `high` pod on `worker` is evicted instead of the `mid` pod on `worker2`.

Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)