Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid unnecessary requeue operations in coscheduling #700

Merged
merged 2 commits into from
Apr 15, 2024

Conversation

Huang-Wei
Copy link
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

The current coscheduling code aggressively actives a pod's PodGroup siblings in Permit(). For example, if a deployment doesn't reach the minMember, all its pods are in pending state. Once it's scaled up to reach the minMember, it'd trigger ActivateSiblings() N times - this causes a ton of unnecessary re-queue operations which cause the CPU usage to spike, and may relates with potential starvation reported in #682.

Which issue(s) this PR fixes:

Fixes #682

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Performance fix to eliminate unnecessary re-queue actions in coscheduling plugin

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 30, 2024
Copy link

netlify bot commented Jan 30, 2024

Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.

Name Link
🔨 Latest commit e35ce33
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-scheduler-plugins/deploys/65ba8326cfb907000898b2af

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 30, 2024
@Huang-Wei
Copy link
Contributor Author

cc @denkensk

@zwpaper
Copy link
Member

zwpaper commented Jan 30, 2024

it seems to be a valuable issue and fix to me👍

/lgtm

/hold for @denkensk

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 30, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2024
Copy link
Contributor

@kerthcet kerthcet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question about the implementation in the comment.

@@ -48,12 +48,22 @@ const (
PodGroupNotFound Status = "PodGroup not found"
Success Status = "Success"
Wait Status = "Wait"

preFilterStateKey = "PreFilterCoscheduling"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems not a proper name, or permitStateKey?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. PTAL.

@@ -209,6 +226,19 @@ func (pgMgr *PodGroupManager) Permit(ctx context.Context, pod *corev1.Pod) Statu
if int32(assigned)+1 >= pg.Spec.MinMember {
return Success
}

if assigned == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my understanding, when assigned == 0, we should and only should trigger the activating, then in ActivateSiblings, once Activate is Ture, we should start the trigger but now we're the opposite. And what's more, shouldn't we set the Activate to False in the follow up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when assigned == 0, we should and only should trigger the activating, then in ActivateSiblings, once Activate is Ture, we should start the trigger but now we're the opposite.

That's the current case. Why do you think it's the opposite.

shouldn't we set the Activate to False in the follow up.

Nope b/c every scheduling cycle comes with its own CycleState instance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense.
/lgtm
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 31, 2024
@Huang-Wei
Copy link
Contributor Author

/cherrypick release-1.28

@k8s-infra-cherrypick-robot

@Huang-Wei: once the present PR merges, I will cherry-pick it on top of release-1.28 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.28

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 15, 2024
@k8s-ci-robot k8s-ci-robot merged commit 10a2173 into kubernetes-sigs:master Apr 15, 2024
10 checks passed
@k8s-infra-cherrypick-robot

@Huang-Wei: new pull request created: #718

In response to this:

/cherrypick release-1.28

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Huang-Wei Huang-Wei deleted the coscheduling-perf-fix branch April 15, 2024 07:05
@Huang-Wei Huang-Wei mentioned this pull request Apr 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Group of unschedulable pods causes scheduler (coscheduling) to be blocked
5 participants