node outOfSync state cause deadlock and block pod scheduled #2999

Closed
Monokaix opened this issue Jul 26, 2023 · 0 comments · Fixed by #2998
Labels
kind/bug Categorizes issue or PR as related to a bug.

@Monokaix
Member

What happened:
After a pod is scheduled to a node, if the node's allocatable resources decrease for some reason (for example, a GPU device reports an exception), the node is set to the outOfSync state and is not added to the session. New pods cannot be scheduled to that node until the allocatable resources it reports return to normal, even if the node still has other idle resources. If the pending pod is the one that reports GPU resources, this becomes a deadlock: the pod can only be scheduled once the node leaves the outOfSync state, but the node can only leave the outOfSync state once that pod is scheduled and reports GPU resources correctly.
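The circular dependency can be illustrated with a minimal Go sketch. It is a simplified model and not Volcano's actual code: the `Node`, `Resources`, `OutOfSync`, `Fits`, and `Schedulable` names are hypothetical, and the out-of-sync rule (used exceeds reported allocatable) is an assumption drawn from the description above.

```go
package main

import "fmt"

// Resources maps a resource name (e.g. "cpu", "nvidia.com/gpu") to a quantity.
type Resources map[string]int64

// Node is a hypothetical, simplified stand-in for a scheduler's node record.
type Node struct {
	Name        string
	Allocatable Resources // what the node currently reports
	Used        Resources // what already-placed pods consume
}

// OutOfSync reports whether any used quantity exceeds the reported allocatable.
// Assumption: this is the condition that marks a node as out of sync.
func (n Node) OutOfSync() bool {
	for name, used := range n.Used {
		if used > n.Allocatable[name] {
			return true
		}
	}
	return false
}

// Fits reports whether req fits into the node's remaining idle resources.
func (n Node) Fits(req Resources) bool {
	for name, want := range req {
		if n.Used[name]+want > n.Allocatable[name] {
			return false
		}
	}
	return true
}

// Schedulable models the reported behaviour: an out-of-sync node is left out
// of the session entirely, so idle resources on it are never considered.
func Schedulable(n Node, req Resources) bool {
	if n.OutOfSync() {
		return false
	}
	return n.Fits(req)
}

func main() {
	node := Node{
		Name:        "gpu-node",
		Allocatable: Resources{"cpu": 8, "nvidia.com/gpu": 0}, // device plugin removed
		Used:        Resources{"cpu": 2, "nvidia.com/gpu": 1}, // GPU pod still running
	}
	plugin := Resources{"cpu": 1} // the device-plugin pod itself needs only CPU

	fmt.Println("out of sync:", node.OutOfSync())
	fmt.Println("plugin pod fits idle resources:", node.Fits(plugin))
	fmt.Println("plugin pod schedulable:", Schedulable(node, plugin))
}
```

In this model the device-plugin pod fits the node's idle CPU but is still rejected, and since only that pod can restore the GPU allocatable value, the node never leaves the out-of-sync state.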
What you expected to happen:
The pod used to report GPU resources should be scheduled even though the node is in the outOfSync state.
How to reproduce it (as minimally and precisely as possible):

  1. Run a device-plugin daemonset to report GPU resources.
  2. Run a pod that uses GPU resources.
  3. Uninstall the device-plugin daemonset and wait until the node's allocatable GPU resources drop to zero.
  4. Re-deploy the device-plugin daemonset. One of the daemonset pods cannot be scheduled.

Anything else we need to know?:

Environment:

  • Volcano Version: latest
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@Monokaix Monokaix added the kind/bug Categorizes issue or PR as related to a bug. label Jul 26, 2023
@william-wang william-wang added this to the v1.8 milestone Jun 12, 2024