# System Architecture

The figure above provides an overview of OpenPAI. OpenPAI runs on Kubernetes and assumes that Kubernetes has already been deployed to the cluster by a third-party tool such as Azure Kubernetes Service (AKS) or Kubespray. OpenPAI provides paictl, a command-line tool that helps users deploy OpenPAI services to the k8s cluster.

One key design goal of OpenPAI is to facilitate the sharing and reproduction of AI innovations. To this end, OpenPAI introduces a marketplace, where people can share their workloads and data within a private group or publicly.

The workloads and data in the marketplace are described by the OpenPAI protocol, a specification that describes the hardware and software requirements of a workload or dataset. These requirements include GPU/CPU/memory resources, Docker images, data/code locations, the training method (gang scheduling or elastic), the job completion policy, and so on. The OpenPAI protocol facilitates platform interoperability and job portability: a job described by the protocol can run on different clusters managed by OpenPAI, as long as the clusters meet the specification. The protocol also enables great flexibility: any AI workload, be it TensorFlow, PyTorch, or a proprietary deep learning workload, can be described by it.
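For illustration, a minimal job description in the OpenPAI protocol might look like the sketch below. The overall shape follows the protocol's published schema, but the job name, image URI, commands, and resource numbers are invented placeholders, not values from this document:

```yaml
protocolVersion: 2
name: pytorch_mnist_example          # hypothetical job name
type: job
prerequisites:
  - name: default_image
    type: dockerimage
    uri: openpai/standard:python_3.6-pytorch_1.2.0-gpu   # placeholder image
taskRoles:
  worker:
    instances: 2
    dockerImage: default_image
    resourcePerInstance:             # hardware requirements per instance
      cpu: 4
      memoryMB: 8192
      gpu: 1
    commands:
      - python train.py --epochs 10
    completion:
      minFailedInstances: 1          # one failed instance fails the job
extras:
  gangAllocation: true               # all-or-nothing (gang) scheduling
```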

The job orchestrator and the OpenPAI runtime are the two key components that understand and execute workloads specified by the OpenPAI protocol. The job orchestrator is implemented by leveraging FrameworkController, a general-purpose k8s controller that orchestrates k8s Pods to support all kinds of AI workloads. The OpenPAI runtime provides runtime support to the workload and implements the OpenPAI runtime parameters/variables that are necessary to support the OpenPAI protocol. The OpenPAI runtime also ships with failure-analysis rules that detect typical runtime failure patterns, and OpenPAI may take action when a pattern is detected. For example, if OpenPAI finds that a job failed due to a Python syntax error, it will not retry the job, regardless of the retry behavior specified by the user, to prevent unnecessary retries and the corresponding waste of cluster resources. The failure rules can be updated on the fly by cluster operators: whenever new failure patterns are discovered, an operator can build them into the OpenPAI runtime.
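As a rough sketch of what the orchestrator works with, a FrameworkController Framework object combines a pod template with retry and completion policies. The outline below is adapted from FrameworkController's public examples; field names and values are illustrative and may differ across versions:

```yaml
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
  name: pytorch-job-example          # hypothetical name
spec:
  executionType: Start
  retryPolicy:
    fancyRetryPolicy: true           # only retry transient failures
    maxRetryCount: 2
  taskRoles:
    - name: worker
      taskNumber: 2
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 1        # one failed task fails the attempt
        minSucceededTaskCount: 2     # all tasks must succeed
      task:
        retryPolicy:
          fancyRetryPolicy: true
          maxRetryCount: 0
        pod:                         # standard k8s pod template
          spec:
            containers:
              - name: worker
                image: openpai/standard:python_3.6-pytorch_1.2.0-gpu  # placeholder
                command: ["python", "train.py"]
```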

OpenPAI provides comprehensive monitoring tools to users and cluster admins for job and cluster monitoring. OpenPAI also monitors the status of key OpenPAI components in the cluster and can send alerts (e.g., by email) when predefined conditions are triggered.
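OpenPAI's monitoring stack is built on the Prometheus ecosystem. The snippet below is a generic Prometheus alerting rule of the kind a cluster operator might define; the metric name `pai_service_up` is a hypothetical stand-in for whatever the cluster actually exports:

```yaml
groups:
  - name: pai-service-health
    rules:
      - alert: PaiServiceDown
        # pai_service_up is a hypothetical metric, used here for illustration
        expr: pai_service_up == 0
        for: 5m                      # condition must hold for 5 minutes
        labels:
          severity: critical
        annotations:
          summary: "OpenPAI service {{ $labels.service }} is down"
```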

OpenPAI is a modular platform designed to enable various innovations. Through the standard k8s scheduling API, OpenPAI introduces HiveD, an optional but recommended scheduler designed for deep learning workloads in a multi-tenant GPU cluster. HiveD provides several advantages over the standard k8s scheduler. For example, it introduces the notion of a "virtual cluster", which allows a team of users to run workloads as if they had reserved a private, dedicated (smaller) GPU cluster. A HiveD virtual cluster reserves GPU resources not only in terms of quota (i.e., the number of GPUs) but also in terms of topology: for example, a virtual cluster can reserve a GPU node, or a rack of GPU nodes within the same InfiniBand domain, instead of a set of GPUs randomly scattered across the cluster. This is important for preserving the training speed of jobs within the virtual cluster. With HiveD, OpenPAI also provides better topology-aware gang scheduling with no resource starvation. HiveD additionally supports multi-priority jobs and job preemption.
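To make the virtual-cluster idea concrete, the sketch below shows the general shape of a HiveD configuration, adapted from HiveD's public examples: the physical cluster is described as a hierarchy of cells (GPU → node → rack), and each virtual cluster reserves whole cells rather than loose GPUs. All names and numbers here are illustrative:

```yaml
physicalCluster:
  skuTypes:
    V100:                        # one GPU plus its share of CPU/memory
      gpu: 1
      cpu: 6
      memory: 64Gi
  cellTypes:
    V100-NODE:
      childCellType: V100
      childCellNumber: 8         # an 8-GPU node
      isNodeLevel: true
    V100-RACK:
      childCellType: V100-NODE
      childCellNumber: 4         # 4 nodes per rack (same InfiniBand domain)
  physicalCells:
    - cellType: V100-RACK
      cellChildren:
        - cellAddress: node1
        - cellAddress: node2
        - cellAddress: node3
        - cellAddress: node4
virtualClusters:
  team-a:
    virtualCells:
      - cellType: V100-RACK.V100-NODE
        cellNumber: 2            # reserve two whole nodes, topology included
```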