Avoid unnecessary requeue operations in coscheduling #700

Huang-Wei · 2024-01-30T00:48:15Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

The current coscheduling code aggressively activates a pod's PodGroup siblings in Permit(). For example, if a deployment doesn't reach minMember, all its pods stay in the pending state. Once it is scaled up to reach minMember, it triggers ActivateSiblings() N times. This causes a large number of unnecessary re-queue operations, which makes CPU usage spike and may be related to the potential starvation reported in #682.
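
To make the cost concrete, here is a rough, self-contained sketch (not the plugin's actual code) that counts sibling re-queue operations in a toy model. `simulateActivations` and its parameters are hypothetical names. The model assumes that pods enter the Permit-like step one at a time, that a pod waiting there counts as assigned, and that each activation re-queues every sibling.

```go
package main

import "fmt"

// simulateActivations returns how many sibling re-queue operations happen
// while `pods` pods of one PodGroup pass through a Permit-like step one by
// one. This is a toy model, not the scheduler-plugins implementation.
//
// onlyFirst=false models every waiting pod activating all of its siblings.
// onlyFirst=true models activating siblings only when assigned == 0.
func simulateActivations(pods, minMember int, onlyFirst bool) int {
	requeues := 0
	assigned := 0
	for i := 0; i < pods; i++ {
		if assigned+1 >= minMember {
			break // quorum reached; no further activation is needed
		}
		if !onlyFirst || assigned == 0 {
			requeues += pods - 1 // every sibling is moved back to the active queue
		}
		assigned++ // assumption: a pod waiting in Permit counts as assigned
	}
	return requeues
}

func main() {
	fmt.Println(simulateActivations(10, 10, false)) // 81
	fmt.Println(simulateActivations(10, 10, true))  // 9
}
```

In this model the unguarded variant grows quadratically with the group size, while the guarded variant stays linear.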

Which issue(s) this PR fixes:

Fixes #682

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Performance fix to eliminate unnecessary re-queue actions in coscheduling plugin

netlify · 2024-01-30T00:48:21Z

✅ Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | `e35ce33` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-scheduler-plugins/deploys/65ba8326cfb907000898b2af |

k8s-ci-robot · 2024-01-30T00:48:28Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Huang-Wei]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Huang-Wei · 2024-01-30T00:49:42Z

cc @denkensk

zwpaper · 2024-01-30T17:02:52Z

This seems like a valuable issue and fix to me 👍

/lgtm

/hold for @denkensk

kerthcet

One question about the implementation in the comment.

kerthcet · 2024-01-31T03:25:24Z

pkg/coscheduling/core/core.go

@@ -48,12 +48,22 @@ const (
 	PodGroupNotFound Status = "PodGroup not found"
 	Success          Status = "Success"
 	Wait             Status = "Wait"
+
+	preFilterStateKey = "PreFilterCoscheduling"


This doesn't seem like a proper name. How about permitStateKey?

Yes, good catch...

Updated. PTAL.

kerthcet · 2024-01-31T04:01:01Z

pkg/coscheduling/core/core.go

@@ -209,6 +226,19 @@ func (pgMgr *PodGroupManager) Permit(ctx context.Context, pod *corev1.Pod) Statu
 	if int32(assigned)+1 >= pg.Spec.MinMember {
 		return Success
 	}
+
+	if assigned == 0 {


Based on my understanding, when assigned == 0, we should and only should trigger the activating, then in ActivateSiblings, once Activate is true, we should start the trigger, but now we're doing the opposite. And what's more, shouldn't we set Activate to false in the follow-up?

when assigned == 0, we should and only should trigger the activating, then in ActivateSiblings, once Activate is true, we should start the trigger, but now we're doing the opposite.

That's the current behavior. Why do you think it's the opposite?

shouldn't we set Activate to false in the follow-up?

Nope, because every scheduling cycle comes with its own CycleState instance.

Makes sense.
/lgtm
/hold cancel
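
The point about per-cycle state in the thread above can be sketched with a toy stand-in. The `cycleState` type below is not the scheduler framework's real CycleState, and `permitState`, its `activate` field, the helper functions, and the key's string value are all hypothetical. The sketch only illustrates why a flag written during one scheduling cycle does not need to be reset: the next cycle starts from a fresh instance.

```go
package main

import "fmt"

// cycleState is a toy stand-in for per-cycle storage; one instance is
// created per scheduling cycle and discarded afterwards.
type cycleState struct {
	data map[string]any
}

func newCycleState() *cycleState { return &cycleState{data: map[string]any{}} }

func (c *cycleState) write(key string, v any) { c.data[key] = v }

func (c *cycleState) read(key string) (any, bool) {
	v, ok := c.data[key]
	return v, ok
}

// permitStateKey's value here is made up for illustration.
const permitStateKey = "example/permit-state"

// permitState is a hypothetical payload carrying the activation flag.
type permitState struct{ activate bool }

// permit records whether siblings should be activated in this cycle.
func permit(cs *cycleState, assigned int) {
	cs.write(permitStateKey, &permitState{activate: assigned == 0})
}

// shouldActivate reads the flag back; a missing entry means "do not activate".
func shouldActivate(cs *cycleState) bool {
	v, ok := cs.read(permitStateKey)
	if !ok {
		return false
	}
	s, ok := v.(*permitState)
	return ok && s.activate
}

func main() {
	first := newCycleState()
	permit(first, 0)
	fmt.Println(shouldActivate(first)) // true

	// A new cycle gets a new state, so the earlier flag is not visible.
	second := newCycleState()
	fmt.Println(shouldActivate(second)) // false
}
```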

Huang-Wei · 2024-04-15T05:53:52Z

/cherrypick release-1.28

k8s-infra-cherrypick-robot · 2024-04-15T05:53:54Z

@Huang-Wei: once the present PR merges, I will cherry-pick it on top of release-1.28 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.28

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-infra-cherrypick-robot · 2024-04-15T06:37:44Z

@Huang-Wei: new pull request created: #718

In response to this:

/cherrypick release-1.28

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Avoid unnecessary requeue operations in coscheduling

d16270f

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 30, 2024

k8s-ci-robot requested review from Tal-or and zwpaper January 30, 2024 00:48

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 30, 2024

Huang-Wei mentioned this pull request Jan 30, 2024

[DO NOT MERGE] experimental fix for coscheduling issue #684

Closed

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 30, 2024

k8s-ci-robot assigned zwpaper Jan 30, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2024

kerthcet reviewed Jan 31, 2024

fixup: rename preFilterState to permitState

e35ce33

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 31, 2024

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2024

k8s-ci-robot assigned kerthcet Apr 15, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 15, 2024

k8s-ci-robot merged commit 10a2173 into kubernetes-sigs:master Apr 15, 2024
10 checks passed

k8s-infra-cherrypick-robot mentioned this pull request Apr 15, 2024

[release-1.28] Avoid unnecessary requeue operations in coscheduling #718

Merged

Huang-Wei deleted the coscheduling-perf-fix branch April 15, 2024 07:05

Huang-Wei mentioned this pull request Apr 15, 2024

Release 0.28 #715

Closed
