This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Merge branch 'master' of https://github.com/microsoft/pai into zgw_https_eng
vvfreesoul committed Nov 13, 2020
2 parents aa4e71c + 5f12ee5 commit 540c75c
Showing 93 changed files with 1,872 additions and 823 deletions.
File renamed without changes.
100 changes: 60 additions & 40 deletions contrib/aks-engine/readme.md
@@ -1,56 +1,76 @@
#### Install Necessary Package.
# Cluster Autoscaler on AKS Engine

- [ Install Azure CLI ](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest)
- [ Install AKS-Engine ](https://github.com/Azure/aks-engine/blob/master/docs/tutorials/quickstart.md#install-the-aks-engine-binary)
[AKS Engine](https://github.com/Azure/aks-engine) is a tool to help you provision a self-managed Kubernetes cluster on Azure,
while [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) is another tool that automatically adjusts the size of the Kubernetes cluster.
The Cluster Autoscaler on Azure dynamically scales Kubernetes worker nodes.

#### Create Resource Group
This contrib helps you deploy an OpenPAI cluster on Azure using AKS Engine and run Cluster Autoscaler as a deployment in your cluster.

- Solution A [ Azure Portal ](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-portal#create-resource-groups) (Recommended)
- Solution B [ Azure CLI ](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-cli#create-resource-groups)

Remember the following parameters
## Preparations on Azure

- subscription id: ```${subscriptionId}```
- resource groupname: ```${resourcegroup}```
- location: ```${location}```
1. Install Dependencies

#### Create Service Principle
1. Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest)
2. Install [AKS Engine](https://github.com/Azure/aks-engine/blob/master/docs/tutorials/quickstart.md#install-the-aks-engine-binary)

```bash
az ad sp create-for-rbac --skip-assignment --name ${service-principal-name}
```
2. Create resource group

If the command success, the output will like the following example.
There are two ways to create a resource group in your subscription:
* It's recommended to use [Azure Portal](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-portal#create-resource-groups)
* You can also use [Azure CLI](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-cli#create-resource-groups)

```json
{
"appId": "559513bd-0c19-4c1a-87cd-851a26afd5fc",
"displayName": "${service-principal-name}",
"name": "http://${service-principal-name}",
"password": "e763725a-5eee-40e8-a466-dc88d980f415",
"tenant": "72f988bf-86f1-41af-91ab-2d7cd011db48"
}
```
Remember the following parameters.
Remember the following parameters which will be used later:
* subscription id `${subscriptionId}`
* resource groupname `${resourcegroup}`
* location `${location}`

- ```appId```: ```${appId}```
- ```password```: ```${password}```
- ```displayName```: ```${spName}```
- ```tenant```: ```${tenant}```


[The doc about this steps](https://docs.microsoft.com/en-us/azure/aks/kubernetes-service-principal#manually-create-a-service-principal)
3. Create Service Principal

#### Ask your subscription's admin to add the new service principal as the owner of the new resource group.
Run the following command:

Content as the title. Important and don't forget it.
```sh
az ad sp create-for-rbac --skip-assignment --name ${service-principal-name}
```

#### Write Configuration
If the command succeeds, you will see output like the following:

[Configuration example](config.yml)
```json
{
"appId": "87432405-56b6-4d76-923b-39d1d75d19f7",
"displayName": "${service-principal-name}",
"name": "http://${service-principal-name}",
"password": "ff5b1601-1298-460d-a94f-fcc8b5ef96f0",
"tenant": "72e9b8a0-54c8-4742-8da6-1f5d1418c3c5"
}
```

#### Start Cluster
Remember the following parameters which will be used later:
* appId `${appId}`
* password `${password}`
* displayName `${spName}`
* tenant `${tenant}`

```
python3 azure.py -c config.yml
```
For more details on how to create service principal, please refer to [manually-create-a-service-principal document](https://docs.microsoft.com/en-us/azure/aks/kubernetes-service-principal#manually-create-a-service-principal).

4. Add the service principal as the owner of the resource group.
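
Step 4 above comes without a command. One possible way to do it from the CLI is sketched below, assuming the account running it is allowed to create role assignments on the resource group (otherwise the subscription admin has to run it). The helper names `resource_group_scope` and `assign_owner` are hypothetical and exist only for this illustration; the arguments correspond to the `${appId}`, `${subscriptionId}` and `${resourcegroup}` values noted in the earlier steps.

```sh
# Sketch only: helper names are hypothetical, not part of this repository.

# Build the Azure resource ID of a resource group.
# $1: subscription id, $2: resource group name
resource_group_scope() {
  printf '/subscriptions/%s/resourceGroups/%s\n' "$1" "$2"
}

# Grant the Owner role on the resource group to the service principal.
# $1: appId of the service principal, $2: subscription id, $3: resource group name
assign_owner() {
  az role assignment create \
    --assignee "$1" \
    --role Owner \
    --scope "$(resource_group_scope "$2" "$3")"
}

# Example call (requires a prior `az login`):
# assign_owner "${appId}" "${subscriptionId}" "${resourcegroup}"
```

Scoping the assignment to the resource group rather than the whole subscription keeps the service principal's permissions limited to the resources this deployment creates.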


## OpenPAI Deployment

1. Prepare the [configuration file](./config.yaml) and replace the variables with the parameters from the previous steps.
To use Cluster Autoscaler, specify the following lines in `openpai_worker_vmss`:

```yaml
openpai_worker_vmss:
...
ca_enable: true
min_vm_count: 1
max_vm_count: 10
```

2. Deploy Kubernetes cluster with AKS Engine, and deploy OpenPAI:

```sh
python3 azure.py -c config.yaml
```
17 changes: 9 additions & 8 deletions contrib/kubespray/quick-start-kubespray.sh
@@ -40,14 +40,10 @@ then
exit 1
fi

# environment set up
/bin/bash script/environment.sh -c ${CLUSTER_CONFIG} || exit $?

/bin/bash script/configuration.sh -m ${MASTER_LIST} -w ${WORKER_LIST} -c ${CLUSTER_CONFIG} || exit $?

echo "Ping Test"

ansible all -i ${HOME}/pai-deploy/cluster-cfg/hosts.yml -m ping || exit $?

# check requirements
/bin/bash requirement.sh -m ${MASTER_LIST} -w ${WORKER_LIST} -c ${CLUSTER_CONFIG}
ret_code_check=$?
if [ $ret_code_check -ne 0 ]; then
@@ -60,8 +56,13 @@ if [ $ret_code_check -ne 0 ]; then
fi
fi

/bin/bash preinstall.sh -c ${CLUSTER_CONFIG} || exit $?
# prepare cluster-cfg folder
/bin/bash script/configuration-kubespray.sh -m ${MASTER_LIST} -w ${WORKER_LIST} -c ${CLUSTER_CONFIG} || exit $?

/bin/bash script/kubernetes-boot.sh || exit $?
echo "Ping Test"
ansible all -i ${HOME}/pai-deploy/cluster-cfg/hosts.yml -m ping || exit $?

/bin/bash preinstall.sh -c ${CLUSTER_CONFIG} || exit $?

# setup k8s cluster
/bin/bash script/kubernetes-boot.sh || exit $?
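
Most steps in the script above repeat the `|| exit $?` suffix so that the first failing step aborts the run with that step's exit code. A minimal sketch of the same fail-fast pattern wrapped in a helper is shown below; `run_step` is a hypothetical name and is not part of the commit, and the commented usage lines reuse the step commands from the diff.

```sh
# Sketch only: run_step is a hypothetical helper, not part of this commit.
# $1: human-readable description, remaining arguments: the command to run
run_step() {
  description=$1
  shift
  echo "==> ${description}"
  "$@" || {
    rc=$?
    echo "FAILED (exit ${rc}): ${description}" >&2
    return "${rc}"
  }
}

# Possible usage, in the order the updated script runs its steps:
# run_step "environment set up" /bin/bash script/environment.sh -c "${CLUSTER_CONFIG}" || exit $?
# run_step "prepare cluster-cfg folder" /bin/bash script/configuration-kubespray.sh \
#   -m "${MASTER_LIST}" -w "${WORKER_LIST}" -c "${CLUSTER_CONFIG}" || exit $?
# run_step "setup k8s cluster" /bin/bash script/kubernetes-boot.sh || exit $?
```

The helper returns the failing command's exit code instead of exiting itself, so the caller still decides whether to abort.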
5 changes: 4 additions & 1 deletion contrib/kubespray/quick-start-service.sh
@@ -40,4 +40,7 @@ then
exit 1
fi

/bin/bash script/service-boot.sh -c ${CLUSTER_CONFIG}
# prepare quick-start-config folder
/bin/bash script/configuration-services.sh -m ${MASTER_LIST} -w ${WORKER_LIST} -c ${CLUSTER_CONFIG} || exit $?

/bin/bash script/service-boot.sh -c ${CLUSTER_CONFIG}
22 changes: 14 additions & 8 deletions contrib/kubespray/quick-start/services-configuration.yaml.template
@@ -183,16 +183,18 @@ authentication:
# - receiver: pai-email-admin-user-and-stop-job
# match:
# alertname: PAIJobGpuPercentLowerThan0_3For1h
# customized-receivers:
# customized-receivers: # receivers are combination of several actions
# - name: "pai-email-admin-user-and-stop-job"
# actions:
# - email-admin
# - email-user
# - stop-jobs
# - tag-jobs
# tags:
# - 'stopped-by-alert-manager'

# # the email template for `email-admin` and `email-user` can be chosen from ['general-template', 'kill-low-efficiency-job-alert']
# # if no template specified, 'general-template' will be used.
# email-admin:
# email-user:
# template: 'kill-low-efficiency-job-alert'
# stop-jobs: # no parameters required for stop-jobs action
# tag-jobs:
# tags:
# - 'stopped-by-alert-manager'

# uncomment following if you want to customize prometheus
# prometheus:
@@ -238,6 +240,10 @@ authentication:
# uncomment following section if you want to customize the port of log-manager
# log-manager:
# port: 9103
# admin_name: "admin"
# admin_password: "admin"
# jwt_secret: "jwt_secret"
# token_expired_second: 120


# uncomment following section if you want to customize the port of storage-manager
25 changes: 25 additions & 0 deletions contrib/kubespray/script/configuration-kubespray.sh
@@ -0,0 +1,25 @@
#!/bin/bash

while getopts "w:m:c:" opt; do
case $opt in
w)
WORKER_LIST=$OPTARG
;;
m)
MASTER_LIST=$OPTARG
;;
c)
CLUSTER_CONFIG=$OPTARG
;;
\?)
echo "Invalid option: -$OPTARG"
exit 1
;;
esac
done

mkdir -p ${HOME}/pai-deploy/cluster-cfg
python3 ${HOME}/pai-deploy/pai/contrib/kubespray/script/k8s-generator.py -m ${MASTER_LIST} -w ${WORKER_LIST} -c ${CLUSTER_CONFIG} -o ${HOME}/pai-deploy/cluster-cfg || exit $?

cp ${HOME}/pai-deploy/cluster-cfg/openpai.yml ${HOME}/pai-deploy/kubespray/inventory/pai/
cp ${HOME}/pai-deploy/cluster-cfg/hosts.yml ${HOME}/pai-deploy/kubespray/inventory/pai/
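
The script above parses `-w`, `-m` and `-c` but does not check that all three were actually supplied before calling `k8s-generator.py`. A minimal sketch of the same parsing with an early check follows. It makes two assumptions that differ from the committed script: the logic is wrapped in a hypothetical function `parse_args`, and the option string starts with `:` (getopts silent mode), which is what makes `$OPTARG` hold the offending option letter in the error branches.

```bash
#!/bin/bash
# Sketch only: parse_args is a hypothetical wrapper, not part of this commit.
parse_args() {
  local OPTIND=1 opt
  WORKER_LIST="" MASTER_LIST="" CLUSTER_CONFIG=""
  # The leading ':' enables silent mode: getopts prints nothing itself and
  # sets $OPTARG to the offending option letter in the '\?' and ':' branches.
  while getopts ":w:m:c:" opt; do
    case $opt in
      w) WORKER_LIST=$OPTARG ;;
      m) MASTER_LIST=$OPTARG ;;
      c) CLUSTER_CONFIG=$OPTARG ;;
      \?) echo "Invalid option: -$OPTARG" >&2; return 1 ;;
      :) echo "Option -$OPTARG requires an argument" >&2; return 1 ;;
    esac
  done
  # Fail early instead of passing empty values on to later commands.
  if [ -z "$WORKER_LIST" ] || [ -z "$MASTER_LIST" ] || [ -z "$CLUSTER_CONFIG" ]; then
    echo "Usage: $0 -m <master list> -w <worker list> -c <cluster config>" >&2
    return 1
  fi
}

# Possible usage at the top of a script:
# parse_args "$@" || exit 1
```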
@@ -18,13 +18,6 @@ while getopts "w:m:c:" opt; do
esac
done

cd ${HOME}/pai-deploy/pai/contrib/kubespray
mkdir -p ${HOME}/pai-deploy/cluster-cfg
python3 ${HOME}/pai-deploy/pai/contrib/kubespray/script/k8s-generator.py -m ${MASTER_LIST} -w ${WORKER_LIST} -c ${CLUSTER_CONFIG} -o ${HOME}/pai-deploy/cluster-cfg || exit $?

cp ${HOME}/pai-deploy/cluster-cfg/openpai.yml ${HOME}/pai-deploy/kubespray/inventory/pai/
cp ${HOME}/pai-deploy/cluster-cfg/hosts.yml ${HOME}/pai-deploy/kubespray/inventory/pai/

mkdir -p ${HOME}/pai-deploy/quick-start-config/
cp ${WORKER_LIST} ${HOME}/pai-deploy/quick-start-config/worker.csv
cp ${MASTER_LIST} ${HOME}/pai-deploy/quick-start-config/master.csv
23 changes: 7 additions & 16 deletions contrib/kubespray/script/environment.sh
@@ -16,14 +16,10 @@ OPENPAI_BRANCH_NAME=`cat ${CLUSTER_CONFIG} | grep branch_name | tr -d "[:space:]

echo "Create working folder in ${HOME}/pai-deploy"
mkdir -p ${HOME}/pai-deploy/
cd ${HOME}/pai-deploy

echo "Clone kubespray source code from github"
git clone https://github.com/kubernetes-sigs/kubespray.git

echo "Checkout to the Release Branch"
cd kubespray
git checkout release-2.11
echo "Clone kubespray source code from github to ${HOME}/pai-deploy"
sudo rm -rf ${HOME}/pai-deploy/kubespray
git clone -b release-2.11 https://github.com/kubernetes-sigs/kubespray.git ${HOME}/pai-deploy/kubespray

echo "Copy inventory folder, and save it "
cp -rfp ${HOME}/pai-deploy/kubespray/inventory/sample ${HOME}/pai-deploy/kubespray/inventory/pai
@@ -44,13 +40,8 @@ echo "Install sshpass"
sudo apt-get -y install sshpass

echo "Install kubespray's requirements and ansible is included"
cd ${HOME}/pai-deploy/kubespray
sudo pip3 install -r requirements.txt

echo "Clone OpenPAI source code from github"
cd ${HOME}/pai-deploy
git clone https://github.com/microsoft/pai.git
cd pai
sudo pip3 install -r ${HOME}/pai-deploy/kubespray/requirements.txt

echo "switch to the branch ${OPENPAI_BRANCH_NAME}"
git checkout ${OPENPAI_BRANCH_NAME}
echo "Clone OpenPAI source code from github to ${HOME}/pai-deploy"
sudo rm -rf ${HOME}/pai-deploy/pai
git clone -b ${OPENPAI_BRANCH_NAME} https://github.com/microsoft/pai.git ${HOME}/pai-deploy/pai
15 changes: 3 additions & 12 deletions contrib/kubespray/script/service-boot.sh
@@ -46,16 +46,6 @@ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py
pip3 install kubernetes==11.0.0b2 jinja2
cd /root
git clone https://github.com/microsoft/pai.git
cd pai
echo "branch name: ${OPENPAI_BRANCH_NAME}"
git checkout ${OPENPAI_BRANCH_NAME}
git pull
echo "starting nvidia device plugin to detect nvidia gpu resource"
svn cat https://github.com/NVIDIA/k8s-device-plugin.git/tags/1.0.0-beta4/nvidia-device-plugin.yml \
| kubectl apply --overwrite=true -f - || exit $?
@@ -66,6 +56,7 @@ svn cat https://github.com/RadeonOpenCompute/k8s-device-plugin.git/trunk/k8s-ds-
| kubectl apply --overwrite=true -f - || exit $?
sleep 5
git clone -b ${OPENPAI_BRANCH_NAME} https://github.com/microsoft/pai.git /root/pai
python3 /root/pai/contrib/kubespray/script/openpai-generator.py -m /quick-start-config/master.csv -w /quick-start-config/worker.csv -c /quick-start-config/config.yml -o /cluster-configuration || exit $?
kubectl delete ds nvidia-device-plugin-daemonset -n kube-system || exit $?
@@ -79,10 +70,10 @@ pip3 install kubernetes
kubectl create namespace pai-storage
# 1. Push cluster config to cluster
echo -e "pai\n" | python paictl.py config push -p /cluster-configuration -m service
echo -e "pai\n" | python /root/pai/paictl.py config push -p /cluster-configuration -m service
# 2. Start OpenPAI service
echo -e "pai\n" | python paictl.py service start
echo -e "pai\n" | python /root/pai/paictl.py service start
EOF_DEV_BOX

if [ $? -ne 0 ]; then
21 changes: 14 additions & 7 deletions deployment/quick-start/services-configuration.yaml.template
@@ -88,15 +88,18 @@ rest-server:
# - receiver: pai-email-admin-user-and-stop-job
# match:
# alertname: PAIJobGpuPercentLowerThan0_3For1h
# customized-receivers:
# customized-receivers: # receivers are combination of several actions
# - name: "pai-email-admin-user-and-stop-job"
# actions:
# - email-admin
# - email-user
# - stop-jobs
# - tag-jobs
# tags:
# - 'stopped-by-alert-manager'
# # the email template for `email-admin` and `email-user` can be chosen from ['general-template', 'kill-low-efficiency-job-alert']
# # if no template specified, 'general-template' will be used.
# email-admin:
# email-user:
# template: 'kill-low-efficiency-job-alert'
# stop-jobs: # no parameters required for stop-jobs action
# tag-jobs:
# tags:
# - 'stopped-by-alert-manager'

# uncomment following if you want to customize prometheus
# prometheus:
Expand Down Expand Up @@ -128,6 +131,10 @@ rest-server:
# uncomment following section if you want to customize the port of log-manager
# log-manager:
# port: 9103
# admin_name: "admin"
# admin_password: "admin"
# jwt_secret: "jwt_secret"
# token_expired_second: 120

# uncomment following section if you want to customize the port of storage-manager
# storage-manager: