Group of unschedulable pods causes scheduler (coscheduling) to be blocked #682

Closed · fg91 opened this issue Dec 4, 2023 · 17 comments · Fixed by #700
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

fg91 commented Dec 4, 2023

Area

  • Scheduler
  • Controller
  • Helm Chart
  • Documents

Other components

No response

What happened?

We use the coscheduling plugin for ML training workloads.

Due to quota and resource restrictions, it regularly happens that the required nodes cannot be scaled up and a group of pending pods remains in the cluster.

Once that happens, sometimes we observe that the scheduler fails to schedule even pods for which resources do exist. We don't even see events on the respective pods.

This behaviour coincides with the scheduler emitting millions of log lines within a short period of time saying:

"To-activate pod does not exist in unschedulablePods or backoffQ"

[Screenshots: scheduler logs flooded with the message above]

At the same time, we observe a high increase in CPU usage:

[Screenshot: CPU usage spike]

To make the scheduler schedule pods again, sometimes it's enough to restart the scheduler. Sometimes, however, we need to manually delete the pods/pod groups referenced in those millions of log lines.


To us it seems that the issue isn't just starvation but that the coscheduler generally stops doing anything meaningful.

What did you expect to happen?

We would expect the coscheduler to continue to schedule pods (especially those not part of a pod group) for which resources do exist.

How can we reproduce it (as minimally and precisely as possible)?

Unfortunately we haven't been able to create a minimal working example.

We tried debugging this issue by running the scheduler locally with a visual debugger attached. We did hit the line that logs the error in the debugger but don't know how to proceed from there to debug this further. We would be very keen to get hints.

Anything else we need to know?

No response

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:44:59Z", GoVersion:"go1.19", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.10-gke.2700", GitCommit:"df2448d7739c44e41ee999b9065ce506cab86344", GitTreeState:"clean", BuildDate:"2023-06-22T09:25:37Z", GoVersion:"go1.19.9 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

Scheduler Plugins version

`registry.k8s.io/scheduler-plugins/controller:v0.26.7`
fg91 added the kind/bug label Dec 4, 2023
Huang-Wei (Contributor) commented:

May I know which coscheduler version you're running?

"To-activate pod does not exist in unschedulablePods or backoffQ"

May I know the PodGroup's minMember setting? And at the time you saw this log, how many pods were associated with its PodGroup?

And, did you use minResources for the PodGroup?

AxeZhan commented Dec 5, 2023

/cc

fg91 (Author) commented Dec 5, 2023

First of all, thanks a lot for the quick reply @Huang-Wei, much appreciated 🙏

May I know which coscheduler version you're running?

We currently use 0.26.7 for both the controller and the scheduler:

registry.k8s.io/scheduler-plugins/kube-scheduler:v0.26.7

"To-activate pod does not exist in unschedulablePods or backoffQ"

May I know the PodGroup's minMember setting?
And, did you use minResources for the PodGroup?

Unfortunately, the state of the pod group from the log screenshots above was not persisted. That being said, this is the pod group of a currently running distributed training of the same kind:

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  creationTimestamp: "2023-12-04T15:18:19Z"
  generation: 1
  name: fd91ed718eb3747998fa-n0-0
  namespace: development
  ownerReferences:
  - apiVersion: kubeflow.org/v1
    blockOwnerDeletion: true
    controller: true
    kind: PyTorchJob
    name: fd91ed718eb3747998fa-n0-0
    uid: 4eb7fa19-7291-4d6b-b25a-3b60197529e3
  resourceVersion: "322041525"
  uid: 10600ea0-ed89-45bf-8ed8-c77b7670eb8e
spec:
  minMember: 16  # Was the same
  minResources:  # Was also set, exact resources might have been different
    cpu: "176"
    ephemeral-storage: 32Gi
    memory: 640Gi
    nvidia.com/gpu: "16"
status:
  occupiedBy: development/fd91ed718eb3747998fa-n0-0
  phase: Running
  running: 16  # Must have been 0

Our pod groups are created by the kubeflow training operator (see here and here). Our training jobs of this kind need exactly 16 replicas in order to run, meaning that the operator immediately creates 16 pods with the label scheduling.x-k8s.io/pod-group: fd91ed718eb3747998fa-n0-0 (sketched below). Does this answer your question?
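
For illustration, a minimal sketch of how one of these worker pods references the PodGroup through that label; the pod name, container, and image are hypothetical, while the label key/value and scheduler name are the ones used in this setup:

apiVersion: v1
kind: Pod
metadata:
  name: fd91ed718eb3747998fa-n0-0-worker-0  # hypothetical worker pod name
  namespace: development
  labels:
    # This label ties the pod to the PodGroup shown above.
    scheduling.x-k8s.io/pod-group: fd91ed718eb3747998fa-n0-0
spec:
  schedulerName: scheduler-plugins-scheduler  # scheduler profile used in our cluster
  containers:
  - name: pytorch
    image: registry.example.com/trainer:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: "1"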

[...] and at the time you saw this log, how many pods were associated with its PodGroup?

In situations where we observe the above-mentioned logs, our GKE cluster typically can't scale up nodes because we have reached our GCE quota limits or GCE is out of resources in the respective zone for the GPU type required by this particular job. So running must have been 0.

Huang-Wei (Contributor) commented:

After looking into the code, I think I know the root cause of the excessive log entries of "To-activate pod does not exist in unschedulablePods or backoffQ". It's because of suboptimal code that over-requeues a PodGroup's pods - it should be more efficient and avoid unnecessary requeue operations. It's certainly related to the CPU spike you observed, and quite likely related to the unschedulable symptom as well. I just need more time to dig into it and, with some luck, create a minimal example to repro this issue.

Could you also pass me the Coscheduling plugin's config arguments, and the spec of PodGroup?

Once that happens, sometimes we observe that the scheduler fails to schedule even pods for which resources do exist.
To us it seems that the issue isn't just starvation but that the coscheduler generally stops doing anything meaningful.

I think it might still be because of starvation. Due to the over-requeuing, under the hood each pod of the PodGroup requeues the other 15 pods once minResources is met, so it may end up in an endless loop. A simple verification step is to specify an explicit podGroupBackoffSeconds argument (like 60 or 90 secs; check #559 for more details) to see if the issue still surfaces. Could you help verify?
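
For illustration, a minimal sketch of where such an argument could live in the scheduler configuration; the structure mirrors a standard KubeSchedulerConfiguration with the Coscheduling plugin enabled, and the 60-second value is only an example:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: scheduler-plugins-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: Coscheduling
  pluginConfig:
  - name: Coscheduling
    args:
      # Back the whole PodGroup off after a failed scheduling attempt
      # instead of immediately re-activating its sibling pods (see #559).
      podGroupBackoffSeconds: 60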

Huang-Wei pinned this issue Dec 7, 2023
fg91 (Author) commented Dec 8, 2023

We already added a podGroupBackoffSeconds to the config a few weeks ago and believe it improved the situation: in most cases the scheduler now handles the starvation issue.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
# Compose all plugins in one profile
- schedulerName: scheduler-plugins-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: Coscheduling
      - name: CapacityScheduling
      - name: NodeResourceTopologyMatch
      - name: NodeResourcesAllocatable
      disabled:
      - name: PrioritySort
  pluginConfig:
  - name: Coscheduling
    args:
      podGroupBackoffSeconds: 60

But to us it seems this bug is not related to starvation as we don't see the backoff between the log lines.

Could you also pass me the [...] spec of PodGroup?

Are you looking for something other than the spec shown here? If so, could you please explain where to find it?

We have tried to reproduce such incidents in our staging cluster but unfortunately haven't succeeded in triggering this on purpose.
In case you have an idea you want to try, we'd be more than happy to build an image from a branch, deploy it, and observe the behaviour.

Huang-Wei (Contributor) commented:

@fg91 I created an experimental fix, #684. It's based on the latest code of release-1.26; you can apply this patch and build an image to see if it helps. Testing in my local env shows it eliminates unnecessary requeue operations.

PS: you can switch to this branch by running git pr 684, where pr is a git alias that should be added to ~/.gitconfig:

[alias]
  pr = "!f() { git fetch -fu ${2:-$(git remote |grep ^upstream || echo origin)} refs/pull/$1/head:pr/$1 && git checkout pr/$1; }; f"

fg91 (Author) commented Dec 11, 2023

Thanks a lot again @Huang-Wei for looking into this issue so quickly 🙏
I deployed an image with your fix in our prod cluster and will observe over the next days whether we see fewer of these log lines and CPU spikes.
Complete hangs have, luckily/unfortunately, only happened every few weeks since we set podGroupBackoffSeconds, so we might have to observe this for a while.

fg91 (Author) commented Dec 14, 2023

Reporting first observations:

The log line

"To-activate pod does not exist in unschedulablePods or backoffQ"

was logged only 8 times in the last 12 hours so your PR seems to have fixed the large volume of this log line 👏

We also didn't observe any hanging of the scheduler (but as I said, since this happened only every 2-3 weeks on average, it's probably too early to say for sure it doesn't happen any more).


We still saw strong spikes in cpu utilization, much higher than what is requested by default in the helm chart. The scheduler continued to work though.

[Screenshot: CPU utilization spikes]

In the last 12 hours, we also saw a lot of occurrences of the following logs:

312,697 log entries:

{
  "insertId": "...",
  "jsonPayload": {
    "pid": "1",
    "message": "can not retrieve list of objects using index : Index with name namespace does not exist"
  },
  "resource": {
    "type": "k8s_container",
  },
  "severity": "WARNING",
  },
  "sourceLocation": {
    "file": "listers.go",
    "line": "63"
  },
  ...
}

21,775 log entries:

{
  "jsonPayload": {
    "pid": "1",
    "message": "Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"fa0ba30c0bba246c4b5d-n1-0-dn8-0.17a05a1306f9aaf7\", GenerateName:\"\", Namespace:\"development\", SelfLink:\"\", UID:\"a50f968a-c2aa-4a1b-837d-bbb720638520\", ResourceVersion:\"16159314\", Generation:0, CreationTimestamp:time.Date(2023, time.December, 13, 9, 28, 16, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:\"kube-scheduler\", Operation:\"Update\", APIVersion:\"events.k8s.io/v1\", Time:time.Date(2023, time.December, 13, 9, 28, 16, 0, time.Local), FieldsType:\"FieldsV1\", FieldsV1:(*v1.FieldsV1)(0xc007585b18), Subresource:\"\"}}}, EventTime:time.Date(2023, time.December, 13, 9, 28, 16, 913947000, time.Local), Series:(*v1.EventSeries)(0xc00a134d20), ReportingController:\"scheduler-plugins-scheduler\", ReportingInstance:\"scheduler-plugins-scheduler-scheduler-plugins-scheduler-7969dd7c64-dgnms\", Action:\"Scheduling\", Reason:\"FailedScheduling\", Regarding:v1.ObjectReference{Kind:\"Pod\", Namespace:\"development\", Name:\"fa0ba30c0bba246c4b5d-n1-0-dn8-0\", UID:\"911c41d1-0645-46db-9ac1-f98219817962\", APIVersion:\"v1\", ResourceVersion:\"337782296\", FieldPath:\"\"}, Related:(*v1.ObjectReference)(nil), Note:\"0/212 nodes are available: 137 node(s) had untolerated taint {...}, 19 Insufficient cpu, 19 Insufficient memory, 20 Insufficient nvidia.com/gpu, 22 node(s) had untolerated taint {cloud.google.com/gke-spot: true}, 28 node(s) had untolerated taint {...}, 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/212 nodes are available: 192 Preemption is not helpful for scheduling, 20 No preemption victims found for incoming pod., can not find pod group, 0/212 nodes are available: 1 No victims found on node ... for preemptor pod fa0ba30c0bba246c4b5d-n1-0-dn8-0, 1 No victims found on node ... for preemptor pod fa0ba30c0bba246c4b5d-n1-0-dn8-0, 1 No victims found on node ... for preemptor pod fa0ba30c0bba246c4b5d-n1-0-dn8-0, 1 No victims found on node ... for preemp ...\", Type:\"Warning\", DeprecatedSource:v1.EventSource{Component:\"\", Host:\"\"}, DeprecatedFirstTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeprecatedLastTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeprecatedCount:0}': 'Event \"fa0ba30c0bba246c4b5d-n1-0-dn8-0.17a05a1306f9aaf7\" is invalid: series.count: Invalid value: \"\": should be at least 2' (will not retry!)"
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
        ...
    }
  },
  "timestamp": "2023-12-13T09:28:35.846143132Z",
  "severity": "ERROR",
  "labels": {
    ...
  },
  "sourceLocation": {
    "file": "event_broadcaster.go",
    "line": "253"
  },
  ...
}

Are these expected?

Thanks 🙏

Huang-Wei (Contributor) commented:

was logged only 8 times in the last 12 hours

Good to hear that. Yes, this log is still possible - a gang of pods may already have been moved to the activeQ by a relevant system event by the time the coscheduling plugin's activateSiblings function is triggered. But its frequency should be reduced a lot.

We still saw strong spikes in cpu utilization, much higher than what is requested by default in the helm chart.

I'm not sure if it's relevant atm.

In the last 12 hours, we also saw a lot of occurrences of the following logs

The first message is expected; #654 introduced the fix in v0.27.

The second message seems to be complaining about an API incompatibility of Series:(*v1.EventSeries). Upgrading your API server to the same version as the scheduler or a higher one should resolve it.

fg91 (Author) commented Jan 11, 2024

Hey @Huang-Wei,

we haven't observed any further hangs of the scheduler in our training cluster, and the two weeks leading up to the winter break were very busy.
I'm optimistic that #684 helped 🙏

Huang-Wei (Contributor) commented:

@fg91 good to hear that!

I will create an equivalent of fix #684 for master soon. BTW: are you going to stay on v0.26 for a while, or plan to upgrade to v0.27 soon?

Huang-Wei added this to the v1.28 milestone Jan 11, 2024
fg91 (Author) commented Jan 12, 2024

@fg91 good to hear that!

I will create an equivalent of fix #684 for master soon. BTW: are you going to stay on v0.26 for a while, or plan to upgrade to v0.27 soon?

Thank you 🙏
Generally, we try to keep up with the newest versions, so we don't plan to stick with 0.26 (unless you recommend we do). Or we'll jump directly to 1.28 once it's out, including the fix equivalent to #684.

willgleich commented Jan 29, 2024

@Huang-Wei is there a fix for this that is pre-built anywhere or in any release yet?

Huang-Wei (Contributor) commented:

Not yet. I plan to release it with v0.28.X; the ETA is the end of February.

maratrixx commented:

Is there any progress? thx

Huang-Wei (Contributor) commented:

Is there any progress? thx

It's on my radar. It was postponed a bit due to my personal bandwidth. I will get a new release cut by the end of this week.

Huang-Wei unpinned this issue Apr 9, 2024
fg91 (Author) commented Apr 9, 2024

It's on my radar. It was postponed a bit due to my personal bandwidth. I will get a new release cut by the end of this week.

Happy to report that we didn't see any further issues after deploying your experimental fix 👏
