Node outOfSync state causes deadlock and blocks pod scheduling #2999

Monokaix · 2023-07-26T08:26:05Z

What happened:
After a pod is scheduled to a node, if the node's allocatable resources decrease for some reason (for example, the GPU device reports an exception), the node is set to the outOfSync state and is not added to the session. New pods cannot be scheduled to that node until the allocatable resources it reports return to normal, even if the node has other idle resources. If the pending pod is the one that reports GPU resources, this causes a deadlock: the pod can only be scheduled once the node leaves the outOfSync state, but the node can only leave the outOfSync state once that pod has been scheduled and reports the GPU resources correctly.
What you expected to happen:
The pod used to report GPU resources should be scheduled even when the node is in the outOfSync state.
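The deadlock described under "What happened" can be modelled with a small sketch. This is not Volcano's actual code: the `Resources` and `Node` types, the `OutOfSync` check (used resources exceeding allocatable), and the `SchedulableNodes` helper are all hypothetical simplifications, and only GPU counts are modelled.

```go
package main

import "fmt"

// Resources is a simplified resource vector; only GPUs are modelled (assumption).
type Resources struct {
	GPU int
}

// Node is a hypothetical, stripped-down stand-in for the scheduler's view of a node.
type Node struct {
	Name        string
	Allocatable Resources // what the node currently reports
	Used        Resources // what already-scheduled pods hold
}

// OutOfSync reports whether pods on the node hold more than the node advertises.
// This condition is an assumption made for illustration.
func (n Node) OutOfSync() bool {
	return n.Used.GPU > n.Allocatable.GPU
}

// SchedulableNodes mimics building a scheduling session.
// When skipOutOfSync is true, out-of-sync nodes are dropped entirely,
// so no pod (including the device-plugin pod) can land on them.
func SchedulableNodes(nodes []Node, skipOutOfSync bool) []Node {
	var out []Node
	for _, n := range nodes {
		if skipOutOfSync && n.OutOfSync() {
			continue
		}
		out = append(out, n)
	}
	return out
}

func main() {
	// A GPU pod still holds 1 GPU, but the device plugin was removed,
	// so the node's allocatable GPU count has dropped to 0.
	node := Node{Name: "gpu-node", Allocatable: Resources{GPU: 0}, Used: Resources{GPU: 1}}

	// Skipping out-of-sync nodes: the device-plugin pod has nowhere to go.
	fmt.Println(len(SchedulableNodes([]Node{node}, true)))

	// Keeping the node in the session: the device-plugin pod can be placed
	// and can restore the allocatable GPU count.
	fmt.Println(len(SchedulableNodes([]Node{node}, false)))
}
```

In this model the first call returns no nodes, which is the deadlock: the only pod that could restore the allocatable GPU count is the one that cannot be scheduled. The second call keeps the node visible, which matches the expected behaviour.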
How to reproduce it (as minimally and precisely as possible):

Run a device-plugin DaemonSet to report GPU resources.
Run a pod that uses GPU resources.
Uninstall the device-plugin DaemonSet and wait until the node's allocatable GPU resources drop to zero.
Re-deploy the device-plugin DaemonSet; one DaemonSet pod cannot be scheduled.

Anything else we need to know?:

Environment:

Volcano Version: latest
Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others:

Monokaix added the kind/bug label Jul 26, 2023

Monokaix mentioned this issue Jul 26, 2023

remove node out of sync state #2998

Merged

volcano-sh-bot closed this as completed in #2998 Jul 28, 2023

william-wang added this to the v1.8 milestone Jun 12, 2024
