This repository has been archived by the owner on Jun 6, 2024. It is now read-only.
support different types of computing hardware #5138
Comments
Detailed Work Items for this issue:
If all P0 items are done, we can support different hardware types in the default scheduler.
Test cases for rest-server:
1. Default Scheduler: Test that the resource requirement is correctly specified in the pod definition.
2. Hived Scheduler: Test that the environment variables are set in the pod spec.
Motivation
OpenPAI currently supports the most widely used computing devices: Nvidia GPU, AMD GPU and CPU. It also has the potential to support other types of devices, e.g. AI computing chips (NPU).
Goal
Decouple OpenPAI services and specific hardware types. One OpenPAI service container can support a list of hardware types.
Requirements
For every type of computing device, the vendor should guarantee:
MVP with default scheduler
By assuming that there is only one type of computing device in a cluster, we could build a minimal viable solution with the default scheduler by
ComputeDevice (default is nvidia.com/gpu) in deployment and record it in configmap
ComputeDevice in quick start
nvidia.com/gpu to ComputeDevice in rest server
pai/src/rest-server/src/models/v2/job/k8s.js, lines 483 to 487 in 2fb370a
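The rest-server item above can be pictured as replacing a hard-coded resource key with a configurable one. The following is only a minimal sketch of that idea, not the code in k8s.js: the function name `buildResourceLimits`, the shape of its input, and the `amd.com/gpu` example value are assumptions made for illustration.

```javascript
// Hypothetical helper, not the actual rest-server code.
// The default mirrors the ComputeDevice default named in the issue.
const DEFAULT_COMPUTE_DEVICE = 'nvidia.com/gpu';

function buildResourceLimits(resourcePerInstance, computeDevice = DEFAULT_COMPUTE_DEVICE) {
  const { cpu, memoryMB, gpu } = resourcePerInstance;
  return {
    limits: {
      cpu: cpu,
      memory: `${memoryMB}Mi`,
      // The key is the configured device resource name
      // instead of a hard-coded 'nvidia.com/gpu'.
      [computeDevice]: gpu,
    },
  };
}

// Usage: a cluster configured with a different device resource name (assumed value).
const resources = buildResourceLimits({ cpu: 4, memoryMB: 8192, gpu: 1 }, 'amd.com/gpu');
console.log(JSON.stringify(resources));
```

With a helper of this shape, the default scheduler test case (the resource requirement is correctly specified in the pod definition) reduces to checking the keys of `limits` for a given `ComputeDevice` value.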
Besides the necessary work, we (the pai-dev team and device vendors) could provide better support by
devices subfolders. The basic idea is to quickly locate device-related code and isolate the code for different devices (e.g. different device vendors should avoid editing the same file). If a component must support diverse types of computing device, there will be a devices folder in it. PAI services should take these files into consideration at build time, and one container will support a list of different machine models. Other components, such as the deploy script, should check these files at runtime.
nvidia-smi and prometheus exporter
Perfect support with HiveD
By enabling HiveD, we could get better support
Some extra work is needed to achieve this
layout.yaml (replace master.csv / worker.csv by layout.yaml #5151)
NVIDIA_VISIBLE_DEVICES and PAI_AMD_VISIBLE_DEVICES.
pai/src/rest-server/src/models/v2/job/k8s.js, lines 656 to 676 in 2fb370a
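One way to keep the device-specific environment variables out of the shared pod-spec code is a small lookup table keyed by device type. The sketch below is an assumption-laden illustration, not the logic in k8s.js: only the two variable names come from this issue, while the resource-name keys, the function `visibleDevicesEnv`, and the plain comma-separated value are hypothetical.

```javascript
// Hypothetical mapping from device resource name to the environment variable
// that restricts visible devices. Only the variable names come from the issue.
const VISIBLE_DEVICES_ENV = new Map([
  ['nvidia.com/gpu', 'NVIDIA_VISIBLE_DEVICES'],
  ['amd.com/gpu', 'PAI_AMD_VISIBLE_DEVICES'],
]);

// Build one env entry for a container spec from the device indices
// assigned to the pod (assumed here to be given as an array of numbers).
function visibleDevicesEnv(computeDevice, deviceIndices) {
  const name = VISIBLE_DEVICES_ENV.get(computeDevice);
  if (name === undefined) {
    throw new Error(`No visible-devices variable registered for ${computeDevice}`);
  }
  return { name, value: deviceIndices.join(',') };
}

console.log(visibleDevicesEnv('nvidia.com/gpu', [0, 1]));
```

Under this design a vendor adding a new device would register one table entry rather than edit the pod-spec builder, and the Hived scheduler test case can assert on the returned `name` and `value`.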
Some optional work items include
layout.yaml and HiveD skus
sku-(cpu,gpu,mem) converting simply, predictably and decoupled with devices (CPU/GPU/Memory information to SKU definition API #5148).