This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

hivedscheduler is prone to misconfig due to daemon Pods, such as weave net and nginx proxy #4331

Closed
hzy46 opened this issue Mar 26, 2020 · 5 comments

Comments

@hzy46
Contributor

hzy46 commented Mar 26, 2020

After PAI is deployed, run kubectl describe node <worker-node>:

image

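The weave net and nginx proxy Pods have CPU requests, which reduce the resources hivedscheduler can allocate on the node. This will cause a mis-configuration in hivedscheduler.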

@yqwang-ms changed the title from "Weave net and nginx proxy has CPU request, which will reduce hivedscheduler's allocatable resource." to "hivedscheduler is prone to misconfig due to daemon Pods, such as weave net and nginx proxy" on Mar 26, 2020
@yqwang-ms
Member

yqwang-ms commented Mar 26, 2020

Operators need to pay attention when configuring Hived to avoid this misconfiguration:

  1. Using kubectl describe nodes to check if these K80 nodes have nearly the same (Allocatable Resources - All Daemon Pods Requests, such as Pods for Device Plugin, Network Plugin, etc), especially for gpu, cpu, memory. If not, please fix it. Assume the aligned minimal resources are: 4 gpus, 23 cpus, and 219GB memory.

This is mentioned in the doc:
https://github.com/microsoft/hivedscheduler/blob/master/doc/user-manual.md#config-quickstart
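
The check described above (Allocatable Resources minus All Daemon Pods Requests, per node) could be sketched roughly as follows. This is only an illustration, not part of hivedscheduler: the function names are hypothetical, the input is assumed to be the `items` lists from `kubectl get nodes -o json` and `kubectl get pods --all-namespaces -o json`, only a subset of Kubernetes quantity suffixes is parsed, and the default rule "owned by a DaemonSet" is an assumption about what counts as a daemon Pod. Pods that run on every node without a DaemonSet owner would need a custom `is_daemon` predicate.

```python
from decimal import Decimal

# Multipliers for a subset of Kubernetes quantity suffixes (assumption: enough
# for typical cpu/memory values; exponent forms such as "1E" are not handled).
_SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
}


def parse_quantity(q):
    """Parse a quantity string such as '100m', '24' or '200Mi' into a Decimal."""
    # Try two-letter suffixes first so that 'Mi' is not mistaken for a bare suffix.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if q.endswith(suffix):
            return Decimal(q[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(q)


def owned_by_daemonset(pod):
    """Default (assumed) definition of a daemon Pod: it has a DaemonSet owner."""
    owners = pod["metadata"].get("ownerReferences", [])
    return any(o.get("kind") == "DaemonSet" for o in owners)


def daemon_requests(pods, node_name, is_daemon=owned_by_daemonset):
    """Sum the container requests of daemon Pods scheduled on node_name.

    Init containers are ignored in this sketch.
    """
    totals = {}
    for pod in pods:
        if pod["spec"].get("nodeName") != node_name or not is_daemon(pod):
            continue
        for container in pod["spec"]["containers"]:
            requests = container.get("resources", {}).get("requests", {})
            for name, qty in requests.items():
                totals[name] = totals.get(name, Decimal(0)) + parse_quantity(qty)
    return totals


def usable_resources(node, pods, is_daemon=owned_by_daemonset):
    """Return allocatable minus daemon Pod requests for one node."""
    allocatable = {
        name: parse_quantity(qty)
        for name, qty in node["status"]["allocatable"].items()
    }
    used = daemon_requests(pods, node["metadata"]["name"], is_daemon)
    return {name: value - used.get(name, Decimal(0))
            for name, value in allocatable.items()}
```

Running `usable_resources` for every node of the same type and comparing the results shows whether they are aligned, and the smallest value per resource is a candidate for the number to configure in Hived.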

@hzy46
Contributor Author

hzy46 commented Mar 26, 2020

> Operators need to pay attention when configuring Hived to avoid this misconfiguration:
>
>   1. Using kubectl describe nodes to check if these K80 nodes have nearly the same (Allocatable Resources - All Daemon Pods Requests, such as Pods for Device Plugin, Network Plugin, etc), especially for gpu, cpu, memory. If not, please fix it. Assume the aligned minimal resources are: 4 gpus, 23 cpus, and 219GB memory.
>
> This is mentioned in the doc:
> https://github.com/microsoft/hivedscheduler/blob/master/doc/user-manual.md#config-quickstart

Could hivedscheduler do a pre-check to avoid mis-configuration? Configuring hivedscheduler manually is error-prone (at least for me).

@yqwang-ms
Member

Yep. But it is not just a pre-check: we need to keep checking, in case this "Allocatable Resources - All Daemon Pods Requests, such as Pods for Device Plugin, Network Plugin, etc" changes while Hived is running.
Having the operator ensure this is currently the easiest approach.
We will consider how to automate it later.
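
As a rough illustration of what such a recurring check might look like (a sketch only, not what hivedscheduler or any later fix actually does), the function below compares each node's usable resources against the minimum that the Hived config assumes. The function name, the resource keys, and the node values are hypothetical; the required values reuse the example from the doc (4 gpus, 23 cpus, 219GB memory).

```python
def find_underprovisioned(usable_by_node, required):
    """Return {node: {resource: (usable, required)}} for every shortfall.

    usable_by_node maps a node name to its usable resources, i.e.
    allocatable minus daemon Pod requests, in the same units as `required`.
    """
    problems = {}
    for node, usable in usable_by_node.items():
        short = {
            resource: (usable.get(resource, 0), needed)
            for resource, needed in required.items()
            if usable.get(resource, 0) < needed
        }
        if short:
            problems[node] = short
    return problems


# Minimum per node assumed by the Hived config (example values from the doc).
required = {"gpu": 4, "cpu": 23, "memory_gb": 219}

# Hypothetical snapshot: node-b lost one CPU to an extra daemon Pod request.
usable_by_node = {
    "node-a": {"gpu": 4, "cpu": 23.9, "memory_gb": 220},
    "node-b": {"gpu": 4, "cpu": 22.9, "memory_gb": 220},
}

problems = find_underprovisioned(usable_by_node, required)
# problems == {"node-b": {"cpu": (22.9, 23)}}
```

Running this both before Hived starts and periodically afterwards would cover the pre-check as well as changes made while Hived is running.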

@hzy46
Contributor Author

hzy46 commented Mar 26, 2020

> Yep. But it is not just a pre-check: we need to keep checking, in case this "Allocatable Resources - All Daemon Pods Requests, such as Pods for Device Plugin, Network Plugin, etc" changes while Hived is running.
> Having the operator ensure this is currently the easiest approach.
> We will consider how to automate it later.

Yes, exactly.

@yqwang-ms
Member

Resolved by #4855
