Group of unschedulable pods causes scheduler (coscheduling) to be blocked #682
Comments
May I know which coscheduler version you're running?
May I know the PodGroup's minMember? And, did you use minResources?
/cc
First of all, thanks a lot for the quick reply @Huang-Wei, much appreciated 🙏
We currently use 0.26.7 for both the controller and the scheduler:
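Roughly, that corresponds to pinning both images in the helm chart values like this (the value keys shown are only an illustration and may differ between chart versions):

```yaml
# Illustrative helm values only; the actual key names depend on the chart version.
scheduler:
  image: registry.k8s.io/scheduler-plugins/kube-scheduler:v0.26.7
controller:
  image: registry.k8s.io/scheduler-plugins/controller:v0.26.7
```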
Unfortunately the state of the pod group from the logging screenshots above isn't persisted. That being said, this is the pod group of a currently running distributed training of the "same kind":

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  creationTimestamp: "2023-12-04T15:18:19Z"
  generation: 1
  name: fd91ed718eb3747998fa-n0-0
  namespace: development
  ownerReferences:
  - apiVersion: kubeflow.org/v1
    blockOwnerDeletion: true
    controller: true
    kind: PyTorchJob
    name: fd91ed718eb3747998fa-n0-0
    uid: 4eb7fa19-7291-4d6b-b25a-3b60197529e3
  resourceVersion: "322041525"
  uid: 10600ea0-ed89-45bf-8ed8-c77b7670eb8e
spec:
  minMember: 16 # Was the same
  minResources: # Was also set, exact resources might have been different
    cpu: "176"
    ephemeral-storage: 32Gi
    memory: 640Gi
    nvidia.com/gpu: "16"
status:
  occupiedBy: development/fd91ed718eb3747998fa-n0-0
  phase: Running
  running: 16 # Must have been 0
```

Our pod groups are created by the kubeflow training operator (see here and here). Our training jobs of this kind need exactly 16 replicas in order to run, meaning that there are immediately 16 pods carrying the pod group label.
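For illustration, each of those pods carries a pod group label that the coscheduling plugin uses to associate the pod with the PodGroup above. A minimal sketch, assuming the label key used by recent coscheduling releases (the pod name and image are made up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fd91ed718eb3747998fa-n0-0-worker-0   # hypothetical pod name
  namespace: development
  labels:
    # Label the coscheduling plugin reads to map this pod to its PodGroup
    scheduling.x-k8s.io/pod-group: fd91ed718eb3747998fa-n0-0
spec:
  schedulerName: scheduler-plugins-scheduler
  containers:
  - name: pytorch
    image: example.com/training-image:latest   # placeholder image
```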
In situations where we observe the above-mentioned logs, our GKE cluster typically can't scale up nodes because we have reached our GCE quota limits or GCE is out of resources in the respective zone for the GPU kind required by this particular job.
After looking into the code, I think I know the root cause of the excessive log entries of "To-activate pod does not exist in unschedulablePods or backoffQ". It's because of suboptimal code that over-requeues a PodGroup's pods - it would be more efficient to avoid the unnecessary requeue operations. It's certainly related to the CPU spike you observed, and also quite likely related to the unschedulable symptom. I just need more time to dig into it and, if lucky, create a minimal example to reproduce this issue. Could you also pass me the Coscheduling plugin's config arguments and the spec of the PodGroup?
I think it might still be because of starvation. Due to the over-requeuing, under the hood each pod of the PodGroup requeues the other 15 pods once minResources is met, so it may end up in an endless loop. A simple verification step is to specify an explicit podGroupBackoffSeconds.
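For reference, a minimal sketch of that verification step in the scheduler configuration, assuming the Coscheduling plugin's args accept podGroupBackoffSeconds as in recent releases (the value is arbitrary):

```yaml
# Fragment of a profile's pluginConfig; merge into your KubeSchedulerConfiguration.
pluginConfig:
- name: Coscheduling
  args:
    # Back off the whole PodGroup for this many seconds after a failed scheduling attempt
    podGroupBackoffSeconds: 60
```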
We already included a podGroupBackoffSeconds in our scheduler configuration:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
# Compose all plugins in one profile
- schedulerName: scheduler-plugins-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: Coscheduling
      - name: CapacityScheduling
      - name: NodeResourceTopologyMatch
      - name: NodeResourcesAllocatable
      disabled:
      - name: PrioritySort
  pluginConfig:
  - args:
      podGroupBackoffSeconds: 60
```

But to us it seems this bug is not related to starvation, as we don't see the backoff between the log lines.
Are you looking for something other than the spec shown here? If so, could you please explain where to find it? We have tried to reproduce such incidents in our staging cluster but unfortunately haven't succeeded in triggering this on purpose.
@fg91 I created an experimental fix: #684. It's based on the latest code. PS: you can switch to this branch by running
Thanks a lot again @Huang-Wei for looking into this issue so quickly 🙏
Reporting first observations: The log line
was logged only 8 times in the last 12 hours, so your PR seems to have fixed the large volume of this log line 👏 We also didn't observe any hanging of the scheduler (but as I said, since this happened only every 2-3 weeks on average, it's probably too early to say for sure that it doesn't happen anymore). We still saw strong spikes in CPU utilization, much higher than what is requested by default in the helm chart; the scheduler continued to work, though.

In the last 12 hours, we also saw a lot of occurrences of the following logs. 312,697 log entries:
21,775 log entries:
Are these expected? Thanks 🙏
Good to hear that. Yes, this log is still possible: a gang of pods may already have been moved to the activeQ by a relevant system event by the time the co-scheduling plugin's activateSiblings function is triggered. But its frequency should be reduced a lot.
I'm not sure if it's relevant atm.
The first message is as expected; #654 introduced the fix in v0.27. The second message seems to be complaining about an API incompatibility of
Hey @Huang-Wei, we haven't observed any further hangs of the scheduler in our training cluster, and the two weeks leading up to the winter break were very busy.
Thank you 🙏
@Huang-Wei is there a fix for this that is pre-built anywhere or in any release yet?
Not yet. I plan to release it with v0.28.X and the ETA is end of Feb.
Is there any progress? thx
It's on my radar. Postponed a bit due to my personal bandwidth. I will get a new release cut by the end of this week.
Happy to report that we didn't see any further issues after deploying your experimental fix 👏
Area
Other components
No response
What happened?
We use the coscheduling plugin for ML training workloads.
Due to quota and resource restrictions, it regularly happens that the required nodes cannot be scaled up and a group of pending pods remains in the cluster.
Once that happens, we sometimes observe that the scheduler even fails to schedule pods for which resources do exist. We don't even see events on the respective pods.
This behaviour coincides with the scheduler emitting millions of log lines within a short period of time saying:
At the same time, we observe a high increase in CPU usage:
To make the scheduler schedule pods again, sometimes it's enough to restart the scheduler. Sometimes, however, we need to manually delete the pods/pod groups referenced in those millions of log lines.
To us it seems as if the issue isn't just starvation but generally the coscheduler not doing anything meaningful anymore.
What did you expect to happen?
We would expect the coscheduler to continue to schedule pods (especially those not part of a pod group) for which resources do exist.
How can we reproduce it (as minimally and precisely as possible)?
Unfortunately we haven't been able to create a minimal working example.
We tried debugging this issue by running the scheduler locally with a visual debugger attached. We did hit the line logging the error with the debugger but don't know how to proceed from here to debug this further. We would be very keen to get hints.
Anything else we need to know?
No response
Kubernetes version
Scheduler Plugins version