
[Document] update content about resource SKU and CPU VC (#4887)
* fix

* fix

* fix

* fix

* fix

* fix

* fix
hzy46 authored Sep 9, 2020
1 parent 069cab5 commit b3d7404
Showing 13 changed files with 83 additions and 107 deletions.
9 changes: 4 additions & 5 deletions docs/manual/cluster-admin/README.md
@@ -13,8 +13,7 @@ This manual is for cluster administrators to learn the installation and uninstal
5. [How to Set Up Storage](./how-to-set-up-storage.md)
6. [How to Set Up Virtual Clusters](./how-to-set-up-virtual-clusters.md)
7. [How to Add and Remove Nodes](./how-to-add-and-remove-nodes.md)
8. [How to use CPU Nodes](./how-to-use-cpu-nodes.md)
9. [How to Customize Cluster by Plugins](./how-to-customize-cluster-by-plugins.md)
10. [Troubleshooting](./troubleshooting.md)
11. [How to Uninstall OpenPAI](./how-to-uninstall-openpai.md)
12. [Upgrade Guide](./upgrade-guide.md)
8. [How to Customize Cluster by Plugins](./how-to-customize-cluster-by-plugins.md)
9. [Troubleshooting](./troubleshooting.md)
10. [How to Uninstall OpenPAI](./how-to-uninstall-openpai.md)
11. [Upgrade Guide](./upgrade-guide.md)
31 changes: 12 additions & 19 deletions docs/manual/cluster-admin/how-to-add-and-remove-nodes.md
@@ -1,6 +1,6 @@
# How to Add and Remove Nodes

OpenPAI doesn't support changing master nodes, thus, only the solution of adding/removing worker nodes is provided.
OpenPAI doesn't support changing master nodes; only adding and removing worker nodes is supported. You can add either GPU or CPU workers to the cluster.

## How to Add Nodes

@@ -12,9 +12,9 @@ To add worker nodes, please check if the nodes meet the following requirements:
- Assign each node a **static IP address**, and make sure nodes can communicate with each other.
- The nodes can access the internet, especially the Docker Hub registry service or its mirror, since the deployment process will pull Docker images.
- SSH service is enabled, shares the same username/password with the current master/worker machines, and has sudo privilege.
- **Have GPU and GPU driver is installed.** You may use [a command](./installation-faqs-and-troubleshooting.md#how-to-check-whether-the-gpu-driver-is-installed) to check it. Refer to [the installation guidance](./installation-faqs-and-troubleshooting.md#how-to-install-gpu-driver) in FAQs if the driver is not successfully installed. If you are wondering which version of GPU driver you should use, please also refer to [FAQs](./installation-faqs-and-troubleshooting.md#which-version-of-nvidia-driver-should-i-install).
- (For CPU workers, you can ignore this requirement) **A GPU is present and the GPU driver is installed.** You may use [a command](./installation-faqs-and-troubleshooting.md#how-to-check-whether-the-gpu-driver-is-installed) to check it. Refer to [the installation guidance](./installation-faqs-and-troubleshooting.md#how-to-install-gpu-driver) in FAQs if the driver is not successfully installed. If you are wondering which version of GPU driver you should use, please also refer to [FAQs](./installation-faqs-and-troubleshooting.md#which-version-of-nvidia-driver-should-i-install).
- **Docker is installed.** You may use command `docker --version` to check it. Refer to [docker's installation guidance](https://docs.docker.com/engine/install/ubuntu/) if it is not successfully installed.
- **[nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime) or other device runtime is installed. And be configured as the default runtime of docker. Please configure it in [docker-config-file](https://docs.docker.com/config/daemon/#configure-the-docker-daemon), because kubespray will overwrite systemd's env.**
- (For CPU workers, you can ignore this requirement) **[nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime) or another device runtime is installed and configured as the default runtime of Docker. Please configure it in the [docker config file](https://docs.docker.com/config/daemon/#configure-the-docker-daemon), because kubespray will overwrite systemd's environment.** A configuration sketch follows this list.
- You may use command `sudo docker run nvidia/cuda:10.0-base nvidia-smi` to check it. This command should output information about the available GPUs if everything is set up properly.
- Refer to [the installation guidance](./installation-faqs-and-troubleshooting.md#how-to-install-nvidia-container-runtime) if it is not successfully set up.
- OpenPAI reserves memory and CPU for service running, so make sure there are enough resources to run machine learning jobs. Check hardware requirements for details.
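
For reference, here is a minimal sketch of registering nvidia-container-runtime as Docker's default runtime via the daemon config. The binary path is the upstream default install location; adjust it for your environment, and merge with any existing `daemon.json` rather than overwriting it:

```bash
# Sketch: make nvidia-container-runtime Docker's default runtime.
# Assumes the runtime binary is at the upstream default path.
# Warning: this overwrites /etc/docker/daemon.json; merge by hand if you
# already have settings there.
sudo tee /etc/docker/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker
```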
@@ -86,6 +86,7 @@ all:
origin4:

############# Example start ###################
#### For CPU workers, please don't add them here.
a:
b:
############## Example end ####################
@@ -143,12 +144,9 @@ machine-list:
- Stop the services, push the latest configuration, and then start them:

```bash
./paictl.py service stop -n rest-server
./paictl.py service stop -n hivedscheduler
./paictl.py config push -p ~/pai-deploy/cluster-cfg -m service
./paictl.py service start -n cluster-configuration
./paictl.py service start -n hivedscheduler
./paictl.py service start -n rest-server
./paictl.py service stop -n cluster-configuration hivedscheduler rest-server
./paictl.py config push -p <config-folder> -m service
./paictl.py service start -n cluster-configuration hivedscheduler rest-server
```

If you have configured any PV/PVC storage, please confirm the added worker node meets the PV's requirements. See [Confirm Worker Nodes Environment](./how-to-set-up-storage.md#confirm-environment-on-worker-nodes) for details.
@@ -163,17 +161,12 @@ First, modify `host.yml` accordingly, then go into `~/pai-deploy/kubespray/`, run:
ansible-playbook -i inventory/mycluster/hosts.yml upgrade-cluster.yml --become --become-user=root --limit=a,b -e "@inventory/mycluster/openpai.yml"
```

Modify the `layout.yaml` and `services-configuration.yaml`, then push them to the cluster:
Modify the `layout.yaml` and `services-configuration.yaml`.

```bash
./paictl config push -p ~/pai-deploy/cluster-cfg -m service
```

Restart services:
Stop the services, push the latest configuration, and then start them:

```bash
./paictl.py service stop -n rest-server hivedscheduler
./paictl.py service start -n cluster-configuration
./paictl.py service start -n hivedscheduler rest-server
./paictl.py service stop -n cluster-configuration hivedscheduler rest-server
./paictl.py config push -p <config-folder> -m service
./paictl.py service start -n cluster-configuration hivedscheduler rest-server
```

69 changes: 62 additions & 7 deletions docs/manual/cluster-admin/how-to-set-up-virtual-clusters.md
@@ -2,12 +2,14 @@

## What is Hived Scheduler and How to Configure it

HiveD is a standalone component of OpenPAI, designed to be a Kubernetes Scheduler Extender for Multi-Tenant GPU clusters. A multi-tenant GPU cluster assumes multiple tenants (teams) share the same GPU pool in a single physical cluster (PC) and provides some resource guarantees to each tenant. HiveD models each tenant as a virtual cluster (VC), so that one tenant can use its own VC as if it is a private cluster, while it can also use other VCs' free resource at lower priority.
As a standalone component of OpenPAI, [HiveD](https://github.com/microsoft/hivedscheduler) is a Kubernetes Scheduler for Deep Learning.

Before we start, please read [this doc](https://github.com/microsoft/hivedscheduler/blob/master/doc/user-manual.md) to learn how to write the HiveD scheduler configuration.

## Set Up Virtual Clusters

### Configuration for GPU Virtual Cluster

In [`services-configuration.yaml`](./basic-management-operations.md#pai-service-management-and-paictl), there is a section for hived scheduler, for example:

```yaml
@@ -82,17 +84,70 @@ hivedscheduler:
...
```

After modification, use the following commands to apply the settings:
### Configuration for CPU-only Virtual Cluster

Currently we recommend setting up pure-CPU virtual clusters and not mixing CPU nodes with GPU nodes in one virtual cluster. Please omit the `gpu` field or use `gpu: 0` in `skuTypes` for such a VC. Here is an example:

```yaml
hivedscheduler:
  config: |
    physicalCluster:
      skuTypes:
        DT:
          gpu: 1
          cpu: 5
          memory: 56334Mi
        CPU:
          cpu: 1
          memory: 10240Mi
      cellTypes:
        DT-NODE:
          childCellType: DT
          childCellNumber: 4
          isNodeLevel: true
        DT-NODE-POOL:
          childCellType: DT-NODE
          childCellNumber: 3
        CPU-NODE:
          childCellType: CPU
          childCellNumber: 8
          isNodeLevel: true
        CPU-NODE-POOL:
          childCellType: CPU-NODE
          childCellNumber: 1
      physicalCells:
        - cellType: DT-NODE-POOL
          cellChildren:
            - cellAddress: worker1
            - cellAddress: worker2
            - cellAddress: worker3
        - cellType: CPU-NODE-POOL
          cellChildren:
            - cellAddress: cpu-worker1
      virtualClusters:
        default:
          virtualCells:
            - cellType: DT-NODE-POOL.DT-NODE
              cellNumber: 3
        cpu:
          virtualCells:
            - cellType: CPU-NODE-POOL.CPU-NODE
              cellNumber: 1
```

Explanation of the above example: suppose we have a node named `cpu-worker1` in Kubernetes. It has 80GiB memory and 8 allocatable CPUs (use `kubectl describe node cpu-worker1` to confirm the allocatable resources). Then, in `skuTypes`, we define a `CPU` SKU, which has 1 CPU and 10240 MiB (80GiB / 8) memory. You can reserve some memory or CPUs if you want. `CPU-NODE` and `CPU-NODE-POOL` are set correspondingly in `cellTypes`. Finally, this setting results in one `default` VC and one `cpu` VC. The `cpu` VC contains one CPU node.
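
To make the SKU arithmetic concrete, here is a sketch of checking a node's allocatable resources with `kubectl`; the output values below are hypothetical:

```bash
# Sketch: read the node's allocatable resources and size the SKU from them.
kubectl describe node cpu-worker1 | grep -A 5 'Allocatable:'
# Hypothetical (abbreviated) output:
# Allocatable:
#   cpu:     8
#   memory:  83886080Ki
# One CPU SKU = allocatable memory / allocatable CPUs
#             = 80 GiB / 8 = 10 GiB = 10240 MiB  ->  memory: 10240Mi
```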

### Apply Configuration in Cluster

After modifying the configuration, use the following commands to apply the settings:

```bash
./paictl.py service stop -n rest-server
./paictl.py service stop -n hivedscheduler
./paictl.py service stop -n rest-server hivedscheduler
./paictl.py config push -p <config-folder> -m service
./paictl.py service start -n hivedscheduler
./paictl.py service start -n rest-server
./paictl.py service start -n hivedscheduler rest-server
```
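
Before testing, you can optionally confirm that the services are back up; a sketch, assuming `kubectl` access to the cluster:

```bash
# Check that the scheduler and REST server pods are running again.
kubectl get pods --all-namespaces | grep -E 'hivedscheduler|rest-server'
```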

You can now test the `default` VC and `new` VC, with any admin accounts in OpenPAI. [Next section](#how-to-grant-vc-to-users) will introduce how to grant VC access to non-admin users.
You can now test these new VCs with any admin account in OpenPAI. [Next section](#how-to-grant-vc-to-users) will introduce how to grant VC access to non-admin users.

## How to Grant VC to Users

68 changes: 0 additions & 68 deletions docs/manual/cluster-admin/how-to-use-cpu-nodes.md

This file was deleted.

@@ -4,7 +4,7 @@

#### How to include CPU-only worker nodes?

In current release, the support for CPU nodes is limited. Please refer to [How to Use CPU Nodes](./how-to-use-cpu-nodes.md) for details.
Currently, support for CPU-only workers is limited in the installation script. If you have both GPU workers and CPU workers, please first set up PAI with the GPU workers only. After PAI is successfully installed, you can attach CPU workers to it and set up a CPU-only virtual cluster. Please refer to [How to add and remove nodes](./how-to-add-and-remove-nodes.md) for details. If you only have CPU workers, we don't have official installation support yet; please submit an issue to request this feature.

#### Which version of NVIDIA driver should I install?

2 changes: 1 addition & 1 deletion docs/manual/cluster-admin/installation-guide.md
@@ -50,7 +50,7 @@ To be detailed, please check the following requirements before installation:

#### Tips to Use CPU-only Worker

Currently, the support for CPU-only worker is limited. If you have both GPU workers and CPU workers, please first set up PAI with GPU workers only. After PAI is successfully installed, you can attach CPU workers to it and set up a CPU-only virtual cluster. Please refer to [How to use CPU Nodes](./how-to-use-cpu-nodes.md) for details. If you only have CPU workers, we haven't had an official installation support yet. Please submit an issue for feature request.
Currently, support for CPU-only workers is limited in the installation script. If you have both GPU workers and CPU workers, please first set up PAI with the GPU workers only. After PAI is successfully installed, you can attach CPU workers to it and set up a CPU-only virtual cluster. Please refer to [How to add and remove nodes](./how-to-add-and-remove-nodes.md) for details. If you only have CPU workers, we don't have official installation support yet; please submit an issue to request this feature.

#### Tips for Network-related Issues

Binary file removed docs/manual/cluster-user/imgs/input-command.gif
Binary file not shown.
Binary file added docs/manual/cluster-user/imgs/input-command.png
Binary file removed docs/manual/cluster-user/imgs/input-docker.gif
Binary file not shown.
Binary file added docs/manual/cluster-user/imgs/input-docker.png
Binary file removed docs/manual/cluster-user/imgs/input-resource.gif
Binary file not shown.
8 changes: 3 additions & 5 deletions docs/manual/cluster-user/quick-start.md
@@ -51,15 +51,13 @@ python train_image_classifier.py --dataset_name=cifar10 --dataset_dir=/tmp/data

Note: Please **Do Not** use `#` for comments or `\` for line continuation in the command box. These symbols may break the syntax; they will be supported in the future.

<img src="./imgs/input-command.gif" width="90%" height="90%" />
<img src="./imgs/input-command.png" width="90%" height="90%" />

**Step 4.** Specify the resources you need. By default only GPU number could be set. Toggle the `custom` button if you need to customize CPU number and memory. Here we use a customized setting: 1 GPU, 1 CPU, and 6500 MB memory.

<img src="./imgs/input-resource.gif" width="90%" height="90%" />
**Step 4.** Specify the resources you need. OpenPAI uses a **resource SKU** to quantify the resources in one instance. For example, here 1 `DT` SKU means 1 GPU, 5 CPUs, and 53914 MB memory. If you specify one `DT` SKU, you will get a container with 1 GPU, 5 CPUs, and 53914 MB memory. If you specify two `DT` SKUs, you will get a container with 2 GPUs, 10 CPUs, and 107828 MB memory.
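
The same choice can be made in a job protocol YAML instead of the web UI; below is a minimal sketch of the hived-related fields, assuming a task role named `taskrole` and the `DT` SKU from above (the remaining job fields are omitted here):

```yaml
taskRoles:
  taskrole:
    instances: 1
    # ...dockerImage, commands, etc. omitted in this sketch...
extras:
  hivedScheduler:
    taskRoles:
      taskrole:
        skuNum: 2        # two DT SKUs -> 2 GPUs, 10 CPUs, 107828 MB memory
        skuType: DT
```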

**Step 5.** Specify the docker image. You can either use the listed docker images or take advantage of your own one. Here we select `TensorFlow 1.15.0 + Python 3.6 with GPU, CUDA 10.0`, which is a pre-built image. We will introduce more about docker images in [Docker Images and Job Examples](./docker-images-and-job-examples.md).

<img src="./imgs/input-docker.gif" width="90%" height="90%" />
<img src="./imgs/input-docker.png" width="90%" height="90%" />

**Step 6.** Click **Submit** to submit the job.

1 change: 0 additions & 1 deletion mkdocs.yml
@@ -19,7 +19,6 @@ nav:
- How to Set Up Storage: manual/cluster-admin/how-to-set-up-storage.md
- How to Set Up Virtual Clusters: manual/cluster-admin/how-to-set-up-virtual-clusters.md
- How to Add and Remove Nodes: manual/cluster-admin/how-to-add-and-remove-nodes.md
- How to use CPU Nodes: manual/cluster-admin/how-to-use-cpu-nodes.md
- How to Customize Cluster by Plugins: manual/cluster-admin/how-to-customize-cluster-by-plugins.md
- Troubleshooting: manual/cluster-admin/troubleshooting.md
- How to Uninstall OpenPAI: manual/cluster-admin/how-to-uninstall-openpai.md
