Group of unschedulable pods causes scheduler (coscheduling) to be blocked #682

Closed · fg91 opened this issue Dec 4, 2023 · 17 comments · Fixed by #700
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

fg91 commented Dec 4, 2023

Area

  • Scheduler
  • Controller
  • Helm Chart
  • Documents

Other components

No response

What happened?

We use the coscheduling plugin for ML training workloads.

Due to quota and resource restrictions, it regularly happens that the required nodes cannot be scaled up and a group of pending pods remains in the cluster.

Once that happens, sometimes we observe that the scheduler fails to schedule even pods for which resources do exist. We don't even see events on the respective pods.

This behaviour coincides with the scheduler emitting millions of log lines within a short period of time saying:

"To-activate pod does not exist in unschedulablePods or backoffQ"

[Screenshots: scheduler logs flooded with the message above]

At the same time, we observe a high increase in CPU usage:

[Screenshot: CPU usage spike]

To make the scheduler schedule pods again, sometimes it's enough to restart the scheduler. Sometimes, however, we need to manually delete the pods/pod groups referenced in those millions of log lines.


To us it seems that the issue isn't just starvation but that the coscheduler generally stops doing anything meaningful.

What did you expect to happen?

We would expect the coscheduler to continue to schedule pods (especially those not part of a pod group) for which resources do exist.

How can we reproduce it (as minimally and precisely as possible)?

Unfortunately we haven't been able to create a minimal working example.

We tried debugging this issue by running the scheduler locally with a visual debugger attached. We did hit the line that logs the error in the debugger but don't know how to proceed from there to debug this further. We would be very keen to get hints.

Anything else we need to know?

No response

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:44:59Z", GoVersion:"go1.19", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.10-gke.2700", GitCommit:"df2448d7739c44e41ee999b9065ce506cab86344", GitTreeState:"clean", BuildDate:"2023-06-22T09:25:37Z", GoVersion:"go1.19.9 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

Scheduler Plugins version

`registry.k8s.io/scheduler-plugins/controller:v0.26.7`
fg91 added the kind/bug label Dec 4, 2023
Huang-Wei (Contributor) commented:

May I know which coscheduler version you're running?

"To-activate pod does not exist in unschedulablePods or backoffQ"

May I know the PodGroup's minMember setting? And at the time you saw this log, how many pods were associated with its PodGroup?

And, did you use minResources for the PodGroup?

AxeZhan commented Dec 5, 2023

/cc

fg91 (Author) commented Dec 5, 2023

First of all, thanks a lot for the quick reply @Huang-Wei, much appreciated 🙏

May I know which coscheduler version you're running?

We currently use 0.26.7 for both the controller and the scheduler:

registry.k8s.io/scheduler-plugins/kube-scheduler:v0.26.7

"To-activate pod does not exist in unschedulablePods or backoffQ"

May I know the PodGroup's minMember setting?
And, did you use minResources for the PodGroup?

Unfortunately, the state of the pod group from the log screenshots above was not persisted. That being said, this is the pod group of a currently running distributed training of the same kind:

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  creationTimestamp: "2023-12-04T15:18:19Z"
  generation: 1
  name: fd91ed718eb3747998fa-n0-0
  namespace: development
  ownerReferences:
  - apiVersion: kubeflow.org/v1
    blockOwnerDeletion: true
    controller: true
    kind: PyTorchJob
    name: fd91ed718eb3747998fa-n0-0
    uid: 4eb7fa19-7291-4d6b-b25a-3b60197529e3
  resourceVersion: "322041525"
  uid: 10600ea0-ed89-45bf-8ed8-c77b7670eb8e
spec:
  minMember: 16  # Was the same
  minResources:  # Was also set, exact resources might have been different
    cpu: "176"
    ephemeral-storage: 32Gi
    memory: 640Gi
    nvidia.com/gpu: "16"
status:
  occupiedBy: development/fd91ed718eb3747998fa-n0-0
  phase: Running
  running: 16  # Must have been 0

Our pod groups are created by the kubeflow training operator (see here and here). Our training jobs of this kind need exactly 16 replicas in order to run, meaning that the operator immediately creates 16 pods with the label scheduling.x-k8s.io/pod-group: fd91ed718eb3747998fa-n0-0 (sketched below). Does this answer your question?
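
For illustration, a minimal sketch of how one of these worker pods references the PodGroup through that label; the pod name, container, and image are hypothetical, while the label key/value and scheduler name are the ones used in this setup:

apiVersion: v1
kind: Pod
metadata:
  name: fd91ed718eb3747998fa-n0-0-worker-0  # hypothetical worker pod name
  namespace: development
  labels:
    # This label ties the pod to the PodGroup shown above.
    scheduling.x-k8s.io/pod-group: fd91ed718eb3747998fa-n0-0
spec:
  schedulerName: scheduler-plugins-scheduler  # scheduler profile used in our cluster
  containers:
  - name: pytorch
    image: registry.example.com/trainer:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: "1"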

[...] and at the time you saw this log, how many pods were associated with its PodGroup?

In situations where we observe the above-mentioned logs, our GKE cluster typically can't scale up nodes because we have reached our GCE quota limits or GCE is out of resources in the respective zone for the GPU type required by this particular job. So running must have been 0.

Huang-Wei (Contributor) commented:

After looking into the code, I think I know the root cause of the excessive log entries of "To-activate pod does not exist in unschedulablePods or backoffQ". It's because of suboptimal code that over-requeues a PodGroup's pods - it should be more efficient and avoid unnecessary requeue operations. It's certainly related to the CPU spike you observed, and quite likely related to the unschedulable symptom as well. I just need more time to dig into it and, with some luck, create a minimal example to repro this issue.

Could you also pass me the Coscheduling plugin's config arguments, and the spec of PodGroup?

Once that happens, sometimes we observe that the scheduler fails to schedule even pods for which resources do exist.
To us it seems that the issue isn't just starvation but that the coscheduler generally stops doing anything meaningful.

I think it might still be because of starvation. Due to the over-requeuing, under the hood each pod of the PodGroup requeues the other 15 pods once minResources is met, so it may end up in an endless loop. A simple verification step is to specify an explicit podGroupBackoffSeconds argument (like 60 or 90 secs; check #559 for more details) to see if the issue still surfaces. Could you help verify?
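
For illustration, a minimal sketch of where such an argument could live in the scheduler configuration; the structure mirrors a standard KubeSchedulerConfiguration with the Coscheduling plugin enabled, and the 60-second value is only an example:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: scheduler-plugins-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: Coscheduling
  pluginConfig:
  - name: Coscheduling
    args:
      # Back the whole PodGroup off after a failed scheduling attempt
      # instead of immediately re-activating its sibling pods (see #559).
      podGroupBackoffSeconds: 60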

Huang-Wei pinned this issue Dec 7, 2023
fg91 (Author) commented Dec 8, 2023

We already added a podGroupBackoffSeconds to the config a few weeks ago and believe it improved the situation: in most cases the scheduler now handles the starvation issue.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
# Compose all plugins in one profile
- schedulerName: scheduler-plugins-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: Coscheduling
      - name: CapacityScheduling
      - name: NodeResourceTopologyMatch
      - name: NodeResourcesAllocatable
      disabled:
      - name: PrioritySort
  pluginConfig:
  - name: Coscheduling
    args:
      podGroupBackoffSeconds: 60

But to us it seems this bug is not related to starvation as we don't see the backoff between the log lines.

Could you also pass me the [...] spec of PodGroup?

Are you looking for something other than the spec shown here? If so, could you please explain where to find it?

We have tried to reproduce such incidents in our staging cluster but unfortunately haven't succeeded in triggering this on purpose.
In case you have an idea you want to try, we'd be more than happy to build an image from a branch, deploy it, and observe the behaviour.

Huang-Wei (Contributor) commented:

@fg91 I created an experimental fix, #684. It's based on the latest code of release-1.26; you can apply this patch and build an image to see if it helps. Testing in my local env shows it eliminates unnecessary requeue operations.

PS: you can switch to this branch by running git pr 684, where pr is a git alias that should be added to ~/.gitconfig:

[alias]
  pr = "!f() { git fetch -fu ${2:-$(git remote |grep ^upstream || echo origin)} refs/pull/$1/head:pr/$1 && git checkout pr/$1; }; f"

fg91 (Author) commented Dec 11, 2023

Thanks a lot again @Huang-Wei for looking into this issue so quickly 🙏
I deployed an image with your fix in our prod cluster and will observe over the next days whether we see fewer of these log lines and CPU spikes.
Complete hangs have, luckily/unfortunately, only happened every few weeks since we set podGroupBackoffSeconds, so we might have to observe this for a while.

fg91 (Author) commented Dec 14, 2023

Reporting first observations:

The log line

"To-activate pod does not exist in unschedulablePods or backoffQ"

was logged only 8 times in the last 12 hours so your PR seems to have fixed the large volume of this log line 👏

We also didn't observe any hanging of the scheduler (but as I said, since this happened only every 2-3 weeks on average, it's probably too early to say for sure it doesn't happen any more).


We still saw strong spikes in cpu utilization, much higher than what is requested by default in the helm chart. The scheduler continued to work though.

[Screenshot: CPU utilization spikes]

In the last 12 hours, we also saw a lot of occurrences of the following logs:

312,697 log entries:

{
  "insertId": "...",
  "jsonPayload": {
    "pid": "1",
    "message": "can not retrieve list of objects using index : Index with name namespace does not exist"
  },
  "resource": {
    "type": "k8s_container",
  },
  "severity": "WARNING",
  },
  "sourceLocation": {
    "file": "listers.go",
    "line": "63"
  },
  ...
}

21,775 log entries:

{
  "jsonPayload": {
    "pid": "1",
    "message": "Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"fa0ba30c0bba246c4b5d-n1-0-dn8-0.17a05a1306f9aaf7\", GenerateName:\"\", Namespace:\"development\", SelfLink:\"\", UID:\"a50f968a-c2aa-4a1b-837d-bbb720638520\", ResourceVersion:\"16159314\", Generation:0, CreationTimestamp:time.Date(2023, time.December, 13, 9, 28, 16, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:\"kube-scheduler\", Operation:\"Update\", APIVersion:\"events.k8s.io/v1\", Time:time.Date(2023, time.December, 13, 9, 28, 16, 0, time.Local), FieldsType:\"FieldsV1\", FieldsV1:(*v1.FieldsV1)(0xc007585b18), Subresource:\"\"}}}, EventTime:time.Date(2023, time.December, 13, 9, 28, 16, 913947000, time.Local), Series:(*v1.EventSeries)(0xc00a134d20), ReportingController:\"scheduler-plugins-scheduler\", ReportingInstance:\"scheduler-plugins-scheduler-scheduler-plugins-scheduler-7969dd7c64-dgnms\", Action:\"Scheduling\", Reason:\"FailedScheduling\", Regarding:v1.ObjectReference{Kind:\"Pod\", Namespace:\"development\", Name:\"fa0ba30c0bba246c4b5d-n1-0-dn8-0\", UID:\"911c41d1-0645-46db-9ac1-f98219817962\", APIVersion:\"v1\", ResourceVersion:\"337782296\", FieldPath:\"\"}, Related:(*v1.ObjectReference)(nil), Note:\"0/212 nodes are available: 137 node(s) had untolerated taint {...}, 19 Insufficient cpu, 19 Insufficient memory, 20 Insufficient nvidia.com/gpu, 22 node(s) had untolerated taint {cloud.google.com/gke-spot: true}, 28 node(s) had untolerated taint {...}, 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/212 nodes are available: 192 Preemption is not helpful for scheduling, 20 No preemption victims found for incoming pod., can not find pod group, 0/212 nodes are available: 1 No victims found on node ... for preemptor pod fa0ba30c0bba246c4b5d-n1-0-dn8-0, 1 No victims found on node ... for preemptor pod fa0ba30c0bba246c4b5d-n1-0-dn8-0, 1 No victims found on node ... for preemptor pod fa0ba30c0bba246c4b5d-n1-0-dn8-0, 1 No victims found on node ... for preemp ...\", Type:\"Warning\", DeprecatedSource:v1.EventSource{Component:\"\", Host:\"\"}, DeprecatedFirstTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeprecatedLastTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeprecatedCount:0}': 'Event \"fa0ba30c0bba246c4b5d-n1-0-dn8-0.17a05a1306f9aaf7\" is invalid: series.count: Invalid value: \"\": should be at least 2' (will not retry!)"
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
        ...
    }
  },
  "timestamp": "2023-12-13T09:28:35.846143132Z",
  "severity": "ERROR",
  "labels": {
    ...
  },
  "sourceLocation": {
    "file": "event_broadcaster.go",
    "line": "253"
  },
  ...
}

Are these expected?

Thanks 🙏

Huang-Wei (Contributor) commented:

was logged only 8 times in the last 12 hours

Good to hear that. Yes, this log is still possible - a gang of pods may already have been moved to the activeQ by a relevant system event by the time the coscheduling plugin's activateSiblings function is triggered. But its frequency should be reduced a lot.

We still saw strong spikes in cpu utilization, much higher than what is requested by default in the helm chart.

I'm not sure if it's relevant atm.

In the last 12 hours, we also saw a lot of occurrences of the following logs

The first message is expected; #654 introduced the fix in v0.27.

The second message seems to be complaining about an API incompatibility of Series:(*v1.EventSeries). Upgrading your API server to the same version as the scheduler or a higher one should resolve it.

fg91 (Author) commented Jan 11, 2024

Hey @Huang-Wei,

we haven't observed any further hangs of the scheduler in our training cluster, and the two weeks leading up to the winter break were very busy.
I'm optimistic that #684 helped 🙏

Huang-Wei (Contributor) commented:

@fg91 good to hear that!

I will create an equivalent of fix #684 for master soon. BTW: are you going to stay on v0.26 for a while, or plan to upgrade to v0.27 soon?

Huang-Wei added this to the v1.28 milestone Jan 11, 2024
fg91 (Author) commented Jan 12, 2024

@fg91 good to hear that!

I will create an equivalent of fix #684 for master soon. BTW: are you going to stay on v0.26 for a while, or plan to upgrade to v0.27 soon?

Thank you 🙏
Generally, we try to keep up with the newest versions, so we don't plan to stick with 0.26 (unless you recommend we do). Or we'll jump directly to 1.28 once it's out, including the fix equivalent to #684.

willgleich commented Jan 29, 2024

@Huang-Wei is there a fix for this that is pre-built anywhere or in any release yet?

Huang-Wei (Contributor) commented:

Not yet. I plan to release it with v0.28.X; the ETA is the end of February.

maratrixx commented:

Is there any progress? thx

Huang-Wei (Contributor) commented:

Is there any progress? thx

It's on my radar. It was postponed a bit due to my personal bandwidth. I will get a new release cut by the end of this week.

Huang-Wei unpinned this issue Apr 9, 2024
fg91 (Author) commented Apr 9, 2024

It's on my radar. It was postponed a bit due to my personal bandwidth. I will get a new release cut by the end of this week.

Happy to report that we didn't see any further issues after deploying your experimental fix 👏
