This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

add cluster-utilization report doc (#5331)
suiguoxin authored Mar 2, 2021
1 parent 1d62a1f commit 23d41e3
Showing 5 changed files with 65 additions and 3 deletions.
@@ -225,7 +225,7 @@ authentication:
# smtp-auth-username: alert-sender@example.com
# smtp-auth-password: password-for-alert-sender
# cluster-utilization: # cluster-utilization is a k8s CronJob which reports the GPU utilization of the cluster
-# # for schedule syntex, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
+# # for schedule syntax, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
# schedule: "0 0 * * *" # daily report at UTC 00:00
# customized-routes:
# routes:
@@ -85,7 +85,7 @@ rest-server:
# smtp-auth-username: alert-sender@example.com
# smtp-auth-password: password-for-alert-sender
# cluster-utilization: # cluster-utilization is a k8s CronJob which reports the GPU utilization of the cluster
-# # for schedule syntex, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
+# # for schedule syntax, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
# schedule: "0 0 * * *" # daily report at UTC 00:00
# customized-routes:
# routes:
31 changes: 31 additions & 0 deletions docs/manual/cluster-admin/how-to-use-alert-system.md
@@ -248,3 +248,34 @@ Remember to re-build and push the docker image, and restart the `alert-manager`
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```

## Cluster GPU Utilization Report

We provide a feature that regularly sends a cluster GPU utilization report to admin users.

The report includes statistics for:
- Cluster GPU utilization
- User GPU utilization
- Job GPU utilization

To enable this feature, configure the `alert-manager` field in `services-configuration.yml`.
Two fields are required: `pai-bearer-token` and `cluster-utilization` -> `schedule`.
For the syntax of `schedule`, refer to [Cron Schedule Syntax](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax).
For example, `"0 0 * * *"` sends the report daily at UTC 00:00.
Also make sure that the [`email-admin`](#Existing-Actions-and-Matching-Rules) action is enabled.

```yaml
alert-manager:
  pai-bearer-token: 'your-application-token-for-pai-rest-server'
  cluster-utilization: # cluster-utilization is a k8s CronJob which reports the GPU utilization of the cluster
    # for schedule syntax, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
    schedule: "0 0 * * *" # daily report at UTC 00:00
```
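
Before pushing the configuration, it can help to sanity-check the `schedule` string. The following is a minimal sketch, not part of `paictl`: the helper name `has_five_cron_fields` is hypothetical, and it only checks that the expression has the five fields (minute, hour, day of month, month, day of week) that the cron schedule syntax expects. It does not validate value ranges; Kubernetes does that when the CronJob is created. The weekly expression `"0 8 * * 1"` (08:00 every Monday) is an illustrative alternative to the daily example, assuming it is interpreted in the same time zone as the example above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of paictl): succeeds only if the schedule
# string has exactly five whitespace-separated fields.
has_five_cron_fields() {
  local fields
  # `read -a` splits on whitespace without glob-expanding the `*` characters.
  read -r -a fields <<< "$1"
  [ "${#fields[@]}" -eq 5 ]
}

has_five_cron_fields "0 0 * * *" && echo "ok: daily at 00:00"
has_five_cron_fields "0 8 * * 1" && echo "ok: Mondays at 08:00"
has_five_cron_fields "0 0 * *"   || echo "rejected: only four fields"
```
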

To make your configuration take effect, restart the `alert-manager` service with the following commands in the dev-box container:

```bash
./paictl.py service stop -n alert-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```
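
If you restart the service often, the three commands above can be wrapped so that a failed step stops the sequence. This is only a sketch under stated assumptions: the function name `restart_alert_manager` is hypothetical, and it assumes `paictl.py` exits with a non-zero status when a step fails, which this document does not confirm.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the three commands above. The optional first
# argument is the path to paictl.py (defaults to ./paictl.py).
restart_alert_manager() {
  local paictl="${1:-./paictl.py}"
  # `&&` stops the chain at the first command that exits non-zero,
  # so the service is not started against a configuration that failed to push.
  "$paictl" service stop -n alert-manager &&
    "$paictl" config push -p /cluster-configuration -m service &&
    "$paictl" service start -n alert-manager
}
```

Run `restart_alert_manager` from the directory that contains `paictl.py` inside the dev-box container, or pass the script path as the first argument.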
31 changes: 31 additions & 0 deletions docs_zh_CN/manual/cluster-admin/how-to-use-alert-system.md
@@ -232,3 +232,34 @@ alert-manager:
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```

## Cluster GPU Utilization Report

We provide a feature that regularly sends a cluster GPU utilization report to admin users.

The report includes statistics for:
- Cluster GPU utilization
- User GPU utilization
- Job GPU utilization

To enable this feature, configure the `alert-manager` field in `services-configuration.yml`.
`pai-bearer-token` and `cluster-utilization` -> `schedule` are required fields for this feature.
For the syntax of the `schedule` field, refer to [Cron Schedule Syntax](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax).
For example, `"0 0 * * *"` sends the report daily at UTC 00:00.
Also make sure that the [`email-admin`](#Existing-Actions-and-Matching-Rules) action is enabled.

```yaml
alert-manager:
  pai-bearer-token: 'your-application-token-for-pai-rest-server'
  cluster-utilization: # cluster-utilization is a k8s CronJob which reports the GPU utilization of the cluster
    # for schedule syntax, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
    schedule: "0 0 * * *" # daily report at UTC 00:00
```

To make the configuration take effect, restart the `alert-manager` service with the following commands in the dev-box container:

```bash
./paictl.py service stop -n alert-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```
2 changes: 1 addition & 1 deletion examples/cluster-configuration/services-configuration.yaml
@@ -118,7 +118,7 @@ rest-server:
# smtp-auth-username: alert-sender@example.com
# smtp-auth-password: password-for-alert-sender
# cluster-utilization: # cluster-utilization is a k8s CronJob which reports the GPU utilization of the cluster
-# # for schedule syntex, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
+# # for schedule syntax, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
# schedule: "0 0 * * *" # daily report at UTC 00:00
# customized-routes:
# routes: