This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[Document] update content about resource SKU and CPU VC #4887

Merged
merged 7 commits into from
Sep 9, 2020
9 changes: 4 additions & 5 deletions docs/manual/cluster-admin/README.md
Expand Up @@ -13,8 +13,7 @@ This manual is for cluster administrators to learn the installation and uninstal
5. [How to Set Up Storage](./how-to-set-up-storage.md)
6. [How to Set Up Virtual Clusters](./how-to-set-up-virtual-clusters.md)
7. [How to Add and Remove Nodes](./how-to-add-and-remove-nodes.md)
8. [How to use CPU Nodes](./how-to-use-cpu-nodes.md)
9. [How to Customize Cluster by Plugins](./how-to-customize-cluster-by-plugins.md)
10. [Troubleshooting](./troubleshooting.md)
11. [How to Uninstall OpenPAI](./how-to-uninstall-openpai.md)
12. [Upgrade Guide](./upgrade-guide.md)
8. [How to Customize Cluster by Plugins](./how-to-customize-cluster-by-plugins.md)
9. [Troubleshooting](./troubleshooting.md)
10. [How to Uninstall OpenPAI](./how-to-uninstall-openpai.md)
11. [Upgrade Guide](./upgrade-guide.md)
31 changes: 12 additions & 19 deletions docs/manual/cluster-admin/how-to-add-and-remove-nodes.md
@@ -1,6 +1,6 @@
# How to Add and Remove Nodes

OpenPAI doesn't support changing master nodes, thus, only the solution of adding/removing worker nodes is provided.
OpenPAI doesn't support changing master nodes, so this guide only covers adding and removing worker nodes. You can add GPU or CPU workers to the cluster.

## How to Add Nodes

Expand All @@ -12,9 +12,9 @@ To add worker nodes, please check if the nodes meet the following requirements:
- Assign each node a **static IP address**, and make sure nodes can communicate with each other.
- The nodes can access the internet, in particular the Docker Hub registry service or a mirror of it, because the deployment process pulls Docker images.
- The SSH service is enabled, and the nodes share the same username/password as the current master/worker machines, with sudo privileges.
- **Have GPU and GPU driver is installed.** You may use [a command](./installation-faqs-and-troubleshooting.md#how-to-check-whether-the-gpu-driver-is-installed) to check it. Refer to [the installation guidance](./installation-faqs-and-troubleshooting.md#how-to-install-gpu-driver) in FAQs if the driver is not successfully installed. If you are wondering which version of GPU driver you should use, please also refer to [FAQs](./installation-faqs-and-troubleshooting.md#which-version-of-nvidia-driver-should-i-install).
- (For CPU workers, you can ignore this requirement) **A GPU is present and the GPU driver is installed.** You may use [a command](./installation-faqs-and-troubleshooting.md#how-to-check-whether-the-gpu-driver-is-installed) to check it. Refer to [the installation guidance](./installation-faqs-and-troubleshooting.md#how-to-install-gpu-driver) in FAQs if the driver is not successfully installed. If you are wondering which version of the GPU driver you should use, please also refer to [FAQs](./installation-faqs-and-troubleshooting.md#which-version-of-nvidia-driver-should-i-install).
- **Docker is installed.** You may use command `docker --version` to check it. Refer to [docker's installation guidance](https://docs.docker.com/engine/install/ubuntu/) if it is not successfully installed.
- **[nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime) or other device runtime is installed. And be configured as the default runtime of docker. Please configure it in [docker-config-file](https://docs.docker.com/config/daemon/#configure-the-docker-daemon), because kubespray will overwrite systemd's env.**
- (For CPU workers, you can ignore this requirement) **[nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime) or another device runtime is installed and configured as the default runtime of Docker. Please configure it in the [docker-config-file](https://docs.docker.com/config/daemon/#configure-the-docker-daemon), because kubespray will overwrite systemd's env.**
  - You may use the command `sudo docker run nvidia/cuda:10.0-base nvidia-smi` to check it. This command should output information about the available GPUs if the runtime is set up properly.
  - Refer to [the installation guidance](./installation-faqs-and-troubleshooting.md#how-to-install-nvidia-container-runtime) if it is not successfully set up.
- OpenPAI reserves memory and CPU for running its services, so make sure there are enough resources left to run machine learning jobs. Check hardware requirements for details.
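The default-runtime requirement above can also be checked by reading the Docker daemon configuration file. The following is a minimal sketch, not part of OpenPAI: the helper name `default_runtime_is` is hypothetical, and it assumes the runtime is registered under the name `nvidia`, which is a common convention but may differ on your machines. It only inspects the `default-runtime` and `runtimes` keys of the daemon configuration.

```python
import json


def default_runtime_is(daemon_json_text: str, expected: str = "nvidia") -> bool:
    """Return True if `expected` is both registered under `runtimes`
    and selected as `default-runtime` in the Docker daemon config.

    The runtime name "nvidia" is an assumption; adjust it to your setup.
    """
    config = json.loads(daemon_json_text)
    runtimes = config.get("runtimes", {})
    return config.get("default-runtime") == expected and expected in runtimes


# Example content of a daemon config file (commonly /etc/docker/daemon.json).
example = """
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
  }
}
"""

print(default_runtime_is(example))  # True
print(default_runtime_is("{}"))     # False
```

On a real worker you would pass the text of the actual configuration file instead of `example`. The `sudo docker run ... nvidia-smi` command listed above remains the end-to-end check.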
Expand Down Expand Up @@ -86,6 +86,7 @@ all:
origin4:

############# Example start ###################
#### For CPU workers, please don't add them here.
a:
b:
############## Example end ####################
Expand Down Expand Up @@ -143,12 +144,9 @@ machine-list:
- Stop the services, push the latest configuration, and then start the services:

```bash
./paictl.py service stop -n rest-server
./paictl.py service stop -n hivedscheduler
./paictl.py config push -p ~/pai-deploy/cluster-cfg -m service
./paictl.py service start -n cluster-configuration
./paictl.py service start -n hivedscheduler
./paictl.py service start -n rest-server
./paictl.py service stop -n cluster-configuration hivedscheduler rest-server
./paictl.py config push -p <config-folder> -m service
./paictl.py service start -n cluster-configuration hivedscheduler rest-server
```

If you have configured any PV/PVC storage, please confirm the added worker node meets the PV's requirements. See [Confirm Worker Nodes Environment](./how-to-set-up-storage.md#confirm-environment-on-worker-nodes) for details.
Expand All @@ -163,17 +161,12 @@ First, modify `host.yml` accordingly, then go into `~/pai-deploy/kubespray/`, ru
ansible-playbook -i inventory/mycluster/hosts.yml upgrade-cluster.yml --become --become-user=root --limit=a,b -e "@inventory/mycluster/openpai.yml"
```

Modify the `layout.yaml` and `services-configuration.yaml`, then push them to the cluster:
Modify the `layout.yaml` and `services-configuration.yaml`.

```bash
./paictl config push -p ~/pai-deploy/cluster-cfg -m service
```

Restart services:
Stop the services, push the latest configuration, and then start the services:

```bash
./paictl.py service stop -n rest-server hivedscheduler
./paictl.py service start -n cluster-configuration
./paictl.py service start -n hivedscheduler rest-server
./paictl.py service stop -n cluster-configuration hivedscheduler rest-server
./paictl.py config push -p <config-folder> -m service
./paictl.py service start -n cluster-configuration hivedscheduler rest-server
```

69 changes: 62 additions & 7 deletions docs/manual/cluster-admin/how-to-set-up-virtual-clusters.md
Expand Up @@ -2,12 +2,14 @@

## What is Hived Scheduler and How to Configure it

HiveD is a standalone component of OpenPAI, designed to be a Kubernetes Scheduler Extender for Multi-Tenant GPU clusters. A multi-tenant GPU cluster assumes multiple tenants (teams) share the same GPU pool in a single physical cluster (PC) and provides some resource guarantees to each tenant. HiveD models each tenant as a virtual cluster (VC), so that one tenant can use its own VC as if it is a private cluster, while it can also use other VCs' free resource at lower priority.
As a standalone component of OpenPAI, [HiveD](https://github.com/microsoft/hivedscheduler) is a Kubernetes Scheduler for Deep Learning.

Before we start, please read [this doc](https://github.com/microsoft/hivedscheduler/blob/master/doc/user-manual.md) to learn how to write hived scheduler configuration.

## Set Up Virtual Clusters

### Configuration for GPU Virtual Cluster

In [`services-configuration.yaml`](./basic-management-operations.md#pai-service-management-and-paictl), there is a section for hived scheduler, for example:

```yaml
Expand Down Expand Up @@ -82,17 +84,70 @@ hivedscheduler:
...
```

After modification, use the following commands to apply the settings:
### Configuration for CPU-only Virtual Cluster

Currently, we recommend setting up a pure-CPU virtual cluster rather than mixing CPU nodes with GPU nodes in one virtual cluster. For the CPU SKU, omit the `gpu` field or use `gpu: 0` in `skuTypes`. Here is an example:

```
hivedscheduler:
  config: |
    physicalCluster:
      skuTypes:
        DT:
          gpu: 1
          cpu: 5
          memory: 56334Mi
        CPU:
          cpu: 1
          memory: 10240Mi
      cellTypes:
        DT-NODE:
          childCellType: DT
          childCellNumber: 4
          isNodeLevel: true
        DT-NODE-POOL:
          childCellType: DT-NODE
          childCellNumber: 3
        CPU-NODE:
          childCellType: CPU
          childCellNumber: 8
          isNodeLevel: true
        CPU-NODE-POOL:
          childCellType: CPU-NODE
          childCellNumber: 1
      physicalCells:
      - cellType: DT-NODE-POOL
        cellChildren:
        - cellAddress: worker1
        - cellAddress: worker2
        - cellAddress: worker3
      - cellType: CPU-NODE-POOL
        cellChildren:
        - cellAddress: cpu-worker1
    virtualClusters:
      default:
        virtualCells:
        - cellType: DT-NODE-POOL.DT-NODE
          cellNumber: 3
      cpu:
        virtualCells:
        - cellType: CPU-NODE-POOL.CPU-NODE
          cellNumber: 1
```

Explanation of the above example: suppose we have a node named `cpu-worker1` in Kubernetes with 80 GiB of memory and 8 allocatable CPUs (use `kubectl describe node cpu-worker1` to confirm the allocatable resources). In `skuTypes`, we can then define a `CPU` SKU with 1 CPU and 10240 MiB (80 GiB / 8) of memory. You can reserve some memory or CPUs if you want. `CPU-NODE` and `CPU-NODE-POOL` are set correspondingly in `cellTypes`. The resulting configuration has one `default` VC and one `cpu` VC; the `cpu` VC contains one CPU node.
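The arithmetic in this explanation can be sketched in a few lines of Python. This is an illustration only, not HiveD code: the helper names `sku_size` and `leaf_count` and the `reserved_memory_mib` parameter are hypothetical, and the `CELL_TYPES` dictionary simply mirrors the `cellTypes` section of the example above.

```python
def sku_size(allocatable_cpus, allocatable_memory_mib,
             cpus_per_sku=1, reserved_memory_mib=0):
    """Split a node's allocatable resources into equally sized SKUs.

    Returns (number of SKUs on the node, memory per SKU in MiB).
    """
    sku_count = allocatable_cpus // cpus_per_sku
    if sku_count == 0:
        raise ValueError("node has fewer CPUs than one SKU needs")
    memory_per_sku = (allocatable_memory_mib - reserved_memory_mib) // sku_count
    return sku_count, memory_per_sku


# Mirrors the `cellTypes` section of the example configuration.
CELL_TYPES = {
    "DT-NODE": {"childCellType": "DT", "childCellNumber": 4},
    "DT-NODE-POOL": {"childCellType": "DT-NODE", "childCellNumber": 3},
    "CPU-NODE": {"childCellType": "CPU", "childCellNumber": 8},
    "CPU-NODE-POOL": {"childCellType": "CPU-NODE", "childCellNumber": 1},
}


def leaf_count(cell_types, cell_type):
    """Count how many leaf SKUs a cell of the given type contains."""
    spec = cell_types.get(cell_type)
    if spec is None:  # not listed in cellTypes, so it is a leaf SKU
        return 1
    return spec["childCellNumber"] * leaf_count(cell_types, spec["childCellType"])


# cpu-worker1: 8 allocatable CPUs and 80 GiB (81920 MiB) of memory.
print(sku_size(8, 80 * 1024))                   # (8, 10240)
print(leaf_count(CELL_TYPES, "CPU-NODE-POOL"))  # 8
print(leaf_count(CELL_TYPES, "DT-NODE-POOL"))   # 12
```

The first result matches `childCellNumber: 8` for `CPU-NODE` and `memory: 10240Mi` for the `CPU` SKU. Passing a non-zero `reserved_memory_mib` is one way to model the optional reservation mentioned above.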

### Apply Configuration in Cluster

After modification of the configuration, use the following commands to apply the settings:

```bash
./paictl.py service stop -n rest-server
./paictl.py service stop -n hivedscheduler
./paictl.py service stop -n rest-server hivedscheduler
./paictl.py config push -p <config-folder> -m service
./paictl.py service start -n hivedscheduler
./paictl.py service start -n rest-server
./paictl.py service start -n hivedscheduler rest-server
```
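If you apply configuration changes often, the stop / push / start sequence can be generated rather than typed by hand. The sketch below only builds the argument lists for the `paictl.py` commands shown above; the helper name `apply_config_commands` is hypothetical, and the config folder path in the example is a placeholder.

```python
def apply_config_commands(services, config_folder):
    """Build the stop / push / start command sequence as argv lists.

    `services` is the list of service names to restart, in the same
    order as on the command line shown in this document.
    """
    services = list(services)
    return [
        ["./paictl.py", "service", "stop", "-n", *services],
        ["./paictl.py", "config", "push", "-p", config_folder, "-m", "service"],
        ["./paictl.py", "service", "start", "-n", *services],
    ]


# Placeholder path; replace it with your own config folder.
for argv in apply_config_commands(["rest-server", "hivedscheduler"],
                                  "/path/to/cluster-cfg"):
    print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` from the directory that contains `paictl.py`, so that the sequence stops at the first failing step.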

You can now test the `default` VC and `new` VC, with any admin accounts in OpenPAI. [Next section](#how-to-grant-vc-to-users) will introduce how to grant VC access to non-admin users.
You can now test these new VCs with any admin account in OpenPAI. The [next section](#how-to-grant-vc-to-users) explains how to grant VC access to non-admin users.

## How to Grant VC to Users

68 changes: 0 additions & 68 deletions docs/manual/cluster-admin/how-to-use-cpu-nodes.md

This file was deleted.

Expand Up @@ -4,7 +4,7 @@

#### How to include CPU-only worker nodes?

In current release, the support for CPU nodes is limited. Please refer to [How to Use CPU Nodes](./how-to-use-cpu-nodes.md) for details.
Currently, the installation script has limited support for CPU-only workers. If you have both GPU workers and CPU workers, first set up PAI with the GPU workers only. After PAI is successfully installed, you can attach the CPU workers and set up a CPU-only virtual cluster. Please refer to [How to Add and Remove Nodes](./how-to-add-and-remove-nodes.md) for details. If you only have CPU workers, there is no official installation support yet; please submit an issue to request the feature.

#### Which version of NVIDIA driver should I install?

2 changes: 1 addition & 1 deletion docs/manual/cluster-admin/installation-guide.md
Expand Up @@ -50,7 +50,7 @@ To be detailed, please check the following requirements before installation:

#### Tips to Use CPU-only Worker

Currently, the support for CPU-only worker is limited. If you have both GPU workers and CPU workers, please first set up PAI with GPU workers only. After PAI is successfully installed, you can attach CPU workers to it and set up a CPU-only virtual cluster. Please refer to [How to use CPU Nodes](./how-to-use-cpu-nodes.md) for details. If you only have CPU workers, we haven't had an official installation support yet. Please submit an issue for feature request.
Currently, the installation script has limited support for CPU-only workers. If you have both GPU workers and CPU workers, first set up PAI with the GPU workers only. After PAI is successfully installed, you can attach the CPU workers and set up a CPU-only virtual cluster. Please refer to [How to Add and Remove Nodes](./how-to-add-and-remove-nodes.md) for details. If you only have CPU workers, there is no official installation support yet; please submit an issue to request the feature.

#### Tips for Network-related Issues

Binary file removed docs/manual/cluster-user/imgs/input-command.gif
Binary file not shown.
Binary file added docs/manual/cluster-user/imgs/input-command.png
Binary file removed docs/manual/cluster-user/imgs/input-docker.gif
Binary file not shown.
Binary file added docs/manual/cluster-user/imgs/input-docker.png
Binary file removed docs/manual/cluster-user/imgs/input-resource.gif
Binary file not shown.
8 changes: 3 additions & 5 deletions docs/manual/cluster-user/quick-start.md
Expand Up @@ -51,15 +51,13 @@ python train_image_classifier.py --dataset_name=cifar10 --dataset_dir=/tmp/data

Note: Please **Do Not** use `#` for comments or use `\` for line continuation in the command box. These symbols may break the syntax and will be supported in the future.

<img src="./imgs/input-command.gif" width="90%" height="90%" />
<img src="./imgs/input-command.png" width="90%" height="90%" />

**Step 4.** Specify the resources you need. By default only GPU number could be set. Toggle the `custom` button if you need to customize CPU number and memory. Here we use a customized setting: 1 GPU, 1 CPU, and 6500 MB memory.

<img src="./imgs/input-resource.gif" width="90%" height="90%" />
**Step 4.** Specify the resources you need. OpenPAI uses a **resource SKU** to quantify the resources of one instance. For example, here 1 `DT` SKU means 1 GPU, 5 CPUs, and 53914 MB memory. If you specify one `DT` SKU, you get a container with 1 GPU, 5 CPUs, and 53914 MB memory; if you specify two `DT` SKUs, you get a container with 2 GPUs, 10 CPUs, and 107828 MB memory.
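The scaling rule in Step 4 is plain multiplication, which the following sketch makes explicit. It is illustrative only: the `Sku` class and its `times` method are hypothetical names, and the `DT` values are the ones quoted in this step (your cluster's SKU may be sized differently).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sku:
    """Resources represented by one SKU unit."""
    gpu: int
    cpu: int
    memory_mb: int

    def times(self, count: int) -> "Sku":
        """Resources of a container that requests `count` SKUs."""
        return Sku(self.gpu * count, self.cpu * count, self.memory_mb * count)


# The `DT` SKU as described in this step.
DT = Sku(gpu=1, cpu=5, memory_mb=53914)

print(DT.times(1))  # Sku(gpu=1, cpu=5, memory_mb=53914)
print(DT.times(2))  # Sku(gpu=2, cpu=10, memory_mb=107828)
```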

**Step 5.** Specify the docker image. You can either use the listed docker images or take advantage of your own one. Here we select `TensorFlow 1.15.0 + Python 3.6 with GPU, CUDA 10.0`, which is a pre-built image. We will introduce more about docker images in [Docker Images and Job Examples](./docker-images-and-job-examples.md).

<img src="./imgs/input-docker.gif" width="90%" height="90%" />
<img src="./imgs/input-docker.png" width="90%" height="90%" />

**Step 6.** Click **Submit** to submit the job.

1 change: 0 additions & 1 deletion mkdocs.yml
Expand Up @@ -19,7 +19,6 @@ nav:
- How to Set Up Storage: manual/cluster-admin/how-to-set-up-storage.md
- How to Set Up Virtual Clusters: manual/cluster-admin/how-to-set-up-virtual-clusters.md
- How to Add and Remove Nodes: manual/cluster-admin/how-to-add-and-remove-nodes.md
- How to use CPU Nodes: manual/cluster-admin/how-to-use-cpu-nodes.md
- How to Customize Cluster by Plugins: manual/cluster-admin/how-to-customize-cluster-by-plugins.md
- Troubleshooting: manual/cluster-admin/troubleshooting.md
- How to Uninstall OpenPAI: manual/cluster-admin/how-to-uninstall-openpai.md