This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Add document for v1.0 #4725

Merged
merged 1 commit into from
Jul 16, 2020
23 changes: 23 additions & 0 deletions .readthedocs.yml
@@ -0,0 +1,23 @@
# .readthedocs.yml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Build documentation in the docs/ directory with Sphinx
#sphinx:
#  configuration: docs/conf.py

# Build documentation with MkDocs
mkdocs:
  configuration: mkdocs.yml

# Optionally build your docs in additional formats such as PDF and ePub
formats: all

# Optionally set the version of Python and requirements required to build your docs
python:
  version: 3.7
  install:
    - requirements: docs/requirements.txt
17 changes: 17 additions & 0 deletions docs/index.md
@@ -0,0 +1,17 @@
# OpenPAI Handbook

[![Build Status](https://travis-ci.org/microsoft/pai.svg?branch=master)](https://travis-ci.org/microsoft/pai)
[![Join the chat at https://gitter.im/Microsoft/pai](https://badges.gitter.im/Microsoft/pai.svg)](https://gitter.im/Microsoft/pai?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
[![Version](https://img.shields.io/github/release/Microsoft/pai.svg)](https://github.com/Microsoft/pai/releases/latest)

OpenPAI is an open source platform that provides complete AI model training and resource management capabilities. It is easy to extend and supports on-premises, cloud, and hybrid environments at various scales.

This handbook is based on OpenPAI >= v1.0.0, and it contains two parts: [User Manual](./manual/cluster-user/README.md) and [Admin Manual](./manual/cluster-admin/README.md).

To learn how to submit jobs, debug jobs, manage data, and use the Marketplace and VSCode extension on OpenPAI, please follow our [User Manual](./manual/cluster-user/README.md).

To set up a new cluster or learn how to manage a cluster on OpenPAI, please follow the [Admin Manual](./manual/cluster-admin/README.md).

For a general introduction to OpenPAI, please refer to the [GitHub README](https://github.com/microsoft/pai/blob/master/README.md).

For any issue/bug/feature request, please submit it to [GitHub](https://github.com/microsoft/pai).
20 changes: 20 additions & 0 deletions docs/manual/cluster-admin/README.md
@@ -0,0 +1,20 @@
# OpenPAI Manual for Cluster Administrators

OpenPAI is an open source platform that provides complete AI model training and resource management capabilities. It is easy to extend and supports on-premises, cloud, and hybrid environments at various scales.

This manual helps cluster administrators learn how to install and uninstall OpenPAI, perform basic management operations, manage storage, troubleshoot problems, and more. It is based on OpenPAI >= v1.0.0.

## Table of Contents

1. [Installation Guide](./installation-guide.md)
2. [Installation FAQs and Troubleshooting](./installation-faqs-and-troubleshooting.md)
3. [Basic Management Operations](./basic-management-operations.md)
4. [How to Manage Users and Groups](./how-to-manage-users-and-groups.md)
5. [How to Set Up Storage](./how-to-set-up-storage.md)
6. [How to Set Up Virtual Clusters](./how-to-set-up-virtual-clusters.md)
7. [How to Add and Remove Nodes](./how-to-add-and-remove-nodes.md)
8. [How to Use CPU Nodes](./how-to-use-cpu-nodes.md)
9. [How to Customize Cluster by Plugins](./how-to-customize-cluster-by-plugins.md)
10. [Troubleshooting](./troubleshooting.md)
11. [How to Uninstall OpenPAI](./how-to-uninstall-openpai.md)
12. [Upgrade Guide](./upgrade-guide.md)
154 changes: 154 additions & 0 deletions docs/manual/cluster-admin/basic-management-operations.md
@@ -0,0 +1,154 @@
# Basic Management Operations

## Management on Webportal

The webportal provides some basic administration functions. If you log in as an administrator, you will find several administration buttons in the left sidebar, as shown in the following image.

<img src="./imgs/administration.png" width="100%" height="100%" />

Most of these functions are easy to understand. We will go through them quickly in this section.

### Hardware Utilization Page

The hardware page shows the CPU, GPU, memory, disk, and network utilization of each node in your cluster. Utilization levels are indicated by color. If you hover over one of the colored circles, the exact utilization percentage is shown.

<img src="./imgs/hardware.png" width="100%" height="100%" />

### Services Page

The services page shows the OpenPAI services deployed in Kubernetes. Each service is a daemon set, a deployment, or a stateful set.

<img src="./imgs/services.png" width="100%" height="100%" />


### User Management

The user management page lets you create, modify, and delete users. There are two types of users: non-admin users and admin users. You can choose which type to create. This page is only shown when OpenPAI is deployed in basic authentication mode, which is the default mode. If your cluster uses [AAD](./how-to-manage-users-and-groups.md#users-and-groups-in-aad-mode) to manage users, this page won't be available to you.

<img src="./imgs/user-management.png" width="100%" height="100%" />


### Abnormal Jobs

On the homepage, there is an `abnormal jobs` section for administrators. A job is treated as abnormal if it has been running for more than 5 days or if its GPU usage is lower than 10%. You can choose to stop abnormal jobs if you wish.

<img src="./imgs/abnormal-jobs.png" width="100%" height="100%" />

### Access Kubernetes Dashboard

There is a shortcut to the Kubernetes (k8s) dashboard on the webportal. However, for security reasons, it requires special authentication.

<img src="./imgs/k8s-dashboard.png" width="100%" height="100%" />

To use it, you should first set up `https` access for OpenPAI (using `http://<ip>` won't work). Then, on the dev box machine, follow the steps below:

**Step 1.** Save the following YAML text as `admin-user.yaml`:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: admin-user
  namespace: kube-system
```
**Step 2.** Run `kubectl apply -f admin-user.yaml`

**Step 3.** Run `kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep admin-user | awk '{print $1}')`. It will print the token, which can be used to log in to the k8s dashboard.
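
The `describe secret` output contains several fields besides the token. The following is a minimal sketch of a filter that prints only the token value. It assumes the output has a line whose first field is `token:`, and the function name `extract_token` is hypothetical (it is not part of `kubectl` or OpenPAI).

```bash
# Hypothetical helper: read `kubectl describe secret` output on stdin
# and print only the value that follows the "token:" label.
# Assumes the token appears on a single line of the form "token:   <value>".
extract_token() {
  awk '$1 == "token:" { print $2 }'
}
```

To use it, pipe the Step 3 command into the function, for example `kubectl -n kube-system describe secret <secret-name> | extract_token`.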

## PAI Service Management and Paictl

Generally speaking, PAI services are daemon sets, deployments, or stateful sets created by the PAI system and running on Kubernetes. You can find them on the [k8s dashboard](#access-kubernetes-dashboard) and [services page](#services-page). For example, `webportal` is a PAI service that provides the front-end interface, and `rest-server` is another one that provides the back-end APIs. These services are all configurable. If you have followed the [installation guide](./installation-guide.md), you can find two files, `layout.yaml` and `services-configuration.yaml`, in the folder `~/pai-deploy/cluster-cfg` on the dev box machine. These two files are the default service configuration.

`paictl` is a CLI tool that helps you manage cluster configuration and PAI services. To use it, we recommend using our dev box Docker image to avoid environment-related problems. First, go to the dev box machine and launch the dev box container:

```bash
sudo docker run -itd \
    -e COLUMNS=$COLUMNS -e LINES=$LINES -e TERM=$TERM \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ${HOME}/pai-deploy/cluster-cfg:/cluster-configuration \
    -v ${HOME}/pai-deploy/kube:/root/.kube \
    --pid=host \
    --privileged=true \
    --net=host \
    --name=dev-box \
    openpai/dev-box:<openpai version tag>
```

You should replace `<openpai version tag>` with your current OpenPAI version, e.g. `v1.0.0`. In the command, we mount `${HOME}/pai-deploy/kube` to `/root/.kube` in the container, so the container has the correct config file to access Kubernetes. We also mount `${HOME}/pai-deploy/cluster-cfg`, the configuration created by the installation, to `/cluster-configuration` in the container.
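
Because the container depends on both mounted folders, it can help to confirm they are populated before launching it. The sketch below is a hypothetical pre-flight check and is not part of OpenPAI. It assumes the two configuration files named in this document live in `cluster-cfg`, and that the Kubernetes config file is named `config` (the default file name `kubectl` looks for under `~/.kube`); adjust the paths if your deployment differs.

```bash
# Hypothetical pre-flight check; not part of OpenPAI.
# Usage: check_dev_box_mounts [deploy-dir]   (defaults to ~/pai-deploy)
# Succeeds only if every expected file exists; reports missing ones on stderr.
check_dev_box_mounts() (
  deploy_dir="${1:-${HOME}/pai-deploy}"
  missing=0
  for path in \
    "${deploy_dir}/cluster-cfg/layout.yaml" \
    "${deploy_dir}/cluster-cfg/services-configuration.yaml" \
    "${deploy_dir}/kube/config"; do   # "config" file name is an assumption
    if [ ! -f "${path}" ]; then
      echo "missing: ${path}" >&2
      missing=1
    fi
  done
  [ "${missing}" -eq 0 ]
)
```

Running `check_dev_box_mounts` before the `docker run` command above gives an early, readable error instead of a container that starts but cannot reach the cluster.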

To use `paictl`, go into the container by:

```bash
sudo docker exec -it dev-box bash
```

Then, go to the folder `/pai` and try to retrieve your cluster id:

```bash
cd /pai
./paictl.py config get-id
```

If the command prints your cluster id, the `paictl` tool is working correctly.

Here are some basic usage examples of `paictl`:

```bash
# get cluster id
./paictl.py config get-id
# pull service config to a certain folder
# the configuration contains two files: layout.yaml and services-configuration.yaml
# if <config-folder> already has these files, they will be overwritten
./paictl.py config pull -o <config-folder>
# push service config to the cluster
# only pushed config is effective
./paictl.py config push -p <config-folder> -m service
# stop all PAI services
./paictl.py service stop
# start all PAI services
./paictl.py service start
# stop several PAI services
./paictl.py service stop -n <service-name-1> <service-name-2>
# start several PAI services
./paictl.py service start -n <service-name-1> <service-name-2>
```
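
Since `config pull` replaces any existing `layout.yaml` and `services-configuration.yaml` in the target folder, you may want to keep a copy of local edits first. The following is a minimal sketch of such a backup step. The function name `backup_config_folder` and the `.bak.<timestamp>` naming are hypothetical and are not part of `paictl`.

```bash
# Hypothetical helper, not part of paictl: copy the two known config files
# into a timestamped sibling folder before a pull replaces them.
# Prints the backup folder path, or nothing if there was nothing to copy.
backup_config_folder() (
  config_dir="$1"
  backup_dir="${config_dir%/}.bak.$(date +%Y%m%d%H%M%S)"
  copied=0
  for name in layout.yaml services-configuration.yaml; do
    if [ -f "${config_dir}/${name}" ]; then
      mkdir -p "${backup_dir}" || exit 1
      cp -p "${config_dir}/${name}" "${backup_dir}/" || exit 1
      copied=1
    fi
  done
  if [ "${copied}" -eq 1 ]; then
    echo "${backup_dir}"
  fi
)
```

A possible usage is `backup_config_folder <config-folder> && ./paictl.py config pull -o <config-folder>`, which only pulls if the backup step did not fail.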

If you want to change the configuration of some services, follow these steps in order: `service stop`, `config push`, and `service start`.

For example, if you want to customize webportal, you should modify the `webportal` section in `services-configuration.yaml`. Then use the following commands to push the configuration and restart webportal:

```bash
./paictl.py service stop -n webportal
./paictl.py config push -p <config-folder> -m service
./paictl.py service start -n webportal
```
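
The stop, push, and start sequence can be wrapped in a small function so that a failed step stops the sequence. This is only a sketch: `apply_service_config` and the `PAICTL` override variable are hypothetical names, and only the `paictl.py` subcommands and flags already shown in this document are used.

```bash
# Hypothetical wrapper around the three documented steps; not part of paictl.
# Set PAICTL to override the command, e.g. PAICTL=echo to preview the calls.
# Usage: apply_service_config <config-folder> <service-name> [<service-name> ...]
apply_service_config() (
  if [ "$#" -lt 2 ]; then
    echo "usage: apply_service_config <config-folder> <service-name>..." >&2
    exit 2
  fi
  paictl="${PAICTL:-./paictl.py}"
  config_dir="$1"
  shift
  # Each step runs only if the previous one succeeded, so services are not
  # started again with a configuration that was never pushed.
  ${paictl} service stop -n "$@" &&
    ${paictl} config push -p "${config_dir}" -m service &&
    ${paictl} service start -n "$@"
)
```

For example, `PAICTL=echo apply_service_config <config-folder> webportal` prints the three commands without running them, and dropping the `PAICTL=echo` prefix runs them from the `/pai` folder.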

Another example is to restart the whole cluster:

```bash
# restart cluster
./paictl.py service stop
./paictl.py service start
```

You can use `exit` to leave the dev-box container and `sudo docker exec -it dev-box bash` to re-enter it. If you no longer need it, use `sudo docker stop dev-box` and `sudo docker rm dev-box` to delete the container.
30 changes: 30 additions & 0 deletions docs/manual/cluster-admin/configuration-for-china.md
@@ -0,0 +1,30 @@
If you are a user in China, please use the following `config` file in the [create configurations step](./installation-guide.md#create-configurations):

###### `config` example

```yaml
user: <ssh user>
password: <ssh password>
branch_name: pai-1.0.y
docker_image_tag: v1.0.1

gcr_image_repo: "gcr.azk8s.cn"
kube_image_repo: "gcr.azk8s.cn/google-containers"
kubeadm_download_url: "https://shaiictestblob01.blob.core.chinacloudapi.cn/share-all/kubeadm"
hyperkube_download_url: "https://shaiictestblob01.blob.core.chinacloudapi.cn/share-all/hyperkube"

openpai_kubespray_extra_var:
  pod_infra_image_repo: "gcr.azk8s.cn/google_containers/pause-{{ image_arch }}"
  dnsautoscaler_image_repo: "gcr.azk8s.cn/google_containers/cluster-proportional-autoscaler-{{ image_arch }}"
  tiller_image_repo: "gcr.azk8s.cn/kubernetes-helm/tiller"
  registry_proxy_image_repo: "gcr.azk8s.cn/google_containers/kube-registry-proxy"
  metrics_server_image_repo: "gcr.azk8s.cn/google_containers/metrics-server-amd64"
  addon_resizer_image_repo: "gcr.azk8s.cn/google_containers/addon-resizer"
  dashboard_image_repo: "gcr.azk8s.cn/google_containers/kubernetes-dashboard-{{ image_arch }}"
```
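In this file, replace `user` and `password` with the SSH user and password of your master and worker machines. Replace `branch_name` and `docker_image_tag` with the OpenPAI version you want to install; for example, to install `v1.1.0`, set `branch_name` and `docker_image_tag` to `pai-1.1.y` and `v1.1.0`, respectively. In addition, if you are deploying in Azure China, add the line `openpai_kube_network_plugin: weave`, because Azure does not currently support the default calico plugin.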
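
The two version fields follow a visible pattern in the examples here (`v1.0.1` with `pai-1.0.y`, `v1.1.0` with `pai-1.1.y`). The sketch below derives the branch name from the image tag under that pattern. The function name `branch_for_tag` is hypothetical, and it is an assumption that every release follows the same naming, so check the repository's branch list before relying on it.

```bash
# Hypothetical helper: derive branch_name from docker_image_tag,
# following the pattern in the examples (v1.1.0 -> pai-1.1.y).
# Assumes tags look like v<major>.<minor>.<patch>; other formats are rejected.
branch_for_tag() {
  case "$1" in
    v[0-9]*.[0-9]*.[0-9]*)
      minor="${1#v}"        # strip the leading "v":  1.1.0
      minor="${minor%.*}"   # drop the patch part:    1.1
      echo "pai-${minor}.y"
      ;;
    *)
      echo "unexpected tag format: $1" >&2
      return 1
      ;;
  esac
}
```

For instance, `branch_for_tag v1.1.0` prints `pai-1.1.y`, which is the value to put in `branch_name` alongside `docker_image_tag: v1.1.0`.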

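With this `config` file, the required `kubeadm` and `hyperkube` files are downloaded from addresses provided by our partner [上海仪电创新院](https://www.shaiic.com/). In addition, `gcr.azk8s.cn` is used as a mirror of `gcr.io`. If your network cannot reach `gcr.azk8s.cn`, you can look for another `gcr.io` mirror and modify the `config` file accordingly.

Apart from this `config` file, all other steps are the same as in the [Installation Guide](./installation-guide.md).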