Commit: Alert Severity (#5055)

* define alert severity

* show severity in email

* introduce alert severity in doc and examples

* show severity in web-portal
suiguoxin authored Nov 9, 2020
1 parent acbec7b commit 6cb7f8d
Showing 14 changed files with 72 additions and 5 deletions.
@@ -206,6 +206,8 @@ authentication:
# - alert: PAIJobGpuPercentLowerThan0_3For1h
# expr: avg(task_gpu_percent{virtual_cluster=~"default"}) by (job_name) < 0.3
# for: 1h
# labels:
# severity: warn
# annotations:
# summary: "{{$labels.job_name}} has a job gpu percent lower than 30% for 1 hour"
# description: Monitor job level gpu utilization in certain virtual clusters.
2 changes: 2 additions & 0 deletions deployment/quick-start/services-configuration.yaml.template
@@ -110,6 +110,8 @@ rest-server:
# - alert: PAIJobGpuPercentLowerThan0_3For1h
# expr: avg(task_gpu_percent{virtual_cluster=~"default"}) by (job_name) < 0.3
# for: 1h
# labels:
# severity: warn
# annotations:
# summary: "{{$labels.job_name}} has a job gpu percent lower than 30% for 1 hour"
# description: Monitor job level gpu utilization in certain virtual clusters.
7 changes: 6 additions & 1 deletion docs/manual/cluster-admin/how-to-use-alert-system.md
@@ -33,7 +33,8 @@ To view existing alert rules based on the metrics, you can go to `http(s)://<you

### How to Add Customized Alerts

You can define customized alerts in the `prometheus` field in [`services-configuration.yml`](./basic-management-operations.md#pai-service-management-and-paictl). For example, we can add a customized alert `PAIJobGpuPercentLowerThan0_3For1h` by adding:
You can define customized alerts in the `prometheus` field in [`services-configuration.yml`](./basic-management-operations.md#pai-service-management-and-paictl).
For example, we can add a customized alert `PAIJobGpuPercentLowerThan0_3For1h` by adding:

``` yaml
prometheus:
@@ -44,12 +45,16 @@
- alert: PAIJobGpuPercentLowerThan0_3For1h
expr: avg(task_gpu_percent{virtual_cluster=~"default"}) by (job_name) < 0.3
for: 1h
labels:
severity: warn
annotations:
summary: "{{$labels.job_name}} has a job gpu percent lower than 30% for 1 hour"
description: Monitor job level gpu utilization in certain virtual clusters.
```

The `PAIJobGpuPercentLowerThan0_3For1h` alert fires when a job in the `default` virtual cluster has a task-level average GPU utilization below `30%` for more than `1 hour`.
The alert severity can be set to `info`, `warn`, `error`, or `fatal` by adding a `severity` label; here we use `warn`.
The expression uses the metric `task_gpu_percent`, which describes GPU utilization at the task level.
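
Once an alert carries a `severity` label, downstream tooling can branch on it. As a minimal sketch (the receiver names below are hypothetical, not part of the default OpenPAI configuration), an `alertmanager` route could escalate only `fatal` alerts:

``` yaml
route:
  group_by: [alertname, alertstate, severity]
  receiver: pai-email-admin        # assumed default receiver
  routes:
    - match:
        severity: fatal            # escalate only the most severe alerts
      receiver: pai-admin-pager    # hypothetical receiver for urgent notifications
```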

Remember to push service config to the cluster and restart the `prometheus` service after your modification with the following commands [in the dev-box container](./basic-management-operations.md#pai-service-management-and-paictl):
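
A typical sequence, assuming the standard `paictl` workflow (replace `<config-folder>` with your local configuration directory), looks like:

``` bash
./paictl.py service stop -n prometheus                 # stop the running prometheus service
./paictl.py config push -p <config-folder> -m service  # push the updated service configuration
./paictl.py service start -n prometheus                # restart prometheus with the new rules
```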
2 changes: 2 additions & 0 deletions docs_zh_CN/manual/cluster-admin/how-to-use-alert-system.md
@@ -44,6 +44,8 @@ prometheus:
- alert: PAIJobGpuPercentLowerThan0_3For1h
expr: avg(task_gpu_percent{virtual_cluster=~"default"}) by (job_name) < 0.3
for: 1h
labels:
severity: warn
annotations:
summary: "{{$labels.job_name}} has a job gpu percent lower than 30% for 1 hour"
description: Monitor job level gpu utilization in certain virtual clusters.
2 changes: 2 additions & 0 deletions examples/cluster-configuration/services-configuration.yaml
@@ -144,6 +144,8 @@ rest-server:
# - alert: PAIJobGpuPercentLowerThan0_3For1h
# expr: avg(task_gpu_percent{virtual_cluster=~"default"}) by (job_name) < 0.3
# for: 1h
# labels:
# severity: warn
# annotations:
# summary: "{{$labels.job_name}} has a job gpu percent lower than 30% for 1 hour"
# description: Monitor job level gpu utilization in certain virtual clusters.
@@ -30,7 +30,7 @@ data:
group_wait: 30s
group_interval: 5m
repeat_interval: {{ cluster_cfg["alert-manager"]["repeat-interval"] }}
group_by: [alertname, alertstate]
group_by: [alertname, alertstate, severity]

routes:
- receiver: pai-cordon-nodes
@@ -1,4 +1,4 @@
<%= cluster_id %>:
<%= cluster_id %> <%= groupLabels.severity %>:
<% if (alerts.filter( element => element.status =="firing").length > 0) { %>
[FIRING: <%= alerts.filter( element => element.status =="firing").length %> ]
<% } %>
2 changes: 2 additions & 0 deletions src/prometheus/deploy/alerting/customized.rules.template
@@ -17,4 +17,6 @@

# Rule Syntax Reference: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

# select alert severity from `info`, `warn`, `error` or `fatal`

{{ cluster_cfg['prometheus']['customized-alerts'] }}
16 changes: 16 additions & 0 deletions src/prometheus/deploy/alerting/gpu.rules
@@ -17,47 +17,63 @@

# Rule Syntax Reference: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

# select alert severity from `info`, `warn`, `error` or `fatal`

groups:
- name: gpu_related
rules:
- alert: NvidiaSmiLatencyTooLarge
expr: histogram_quantile(0.95, sum(rate(cmd_nvidia_smi_latency_seconds_bucket[5m])) by (le, instance)) > 40
for: 5m
labels:
severity: warn
annotations:
summary: "95th nvidia-smi call latency is larger than 40s in {{$labels.instance}}, should check the gpu status manually"

- alert: NvidiaSmiDoubleEccError
expr: nvidiasmi_ecc_error_count{type="double"} > 0
for: 5m
labels:
severity: fatal
annotations:
summary: "nvidia card from {{$labels.instance}} minor number {{$labels.minor_number}} has {{$labels.type}} ecc error, count {{$value}}"

- alert: NvidiaMemoryLeak
expr: nvidiasmi_memory_leak_count > 0
for: 5m
labels:
severity: error
annotations:
summary: "found nvidia memory leak from {{$labels.instance}} minor number {{$labels.minor_number}}"

- alert: NvidiaZombieProcess
expr: zombie_process_count{command="nvidia-smi"} > 0
for: 5m
labels:
severity: warn
annotations:
summary: "found nvidia zombie process in {{$labels.instance}}"

- alert: GpuUsedByExternalProcess
expr: gpu_used_by_external_process_count > 0
for: 5m
labels:
severity: warn
annotations:
summary: "found nvidia used by external process in {{$labels.instance}}"

- alert: GpuUsedByZombieContainer
expr: gpu_used_by_zombie_container_count > 0
for: 5m
labels:
severity: warn
annotations:
summary: "found nvidia used by zombie container in {{$labels.instance}}"

- alert: NodeGpuCountChanged
expr: changes(node:gpu_utilization:count[5m]) > 0
labels:
severity: fatal
annotations:
summary: "found gpu count changes in {{$labels.instance}}"

8 changes: 7 additions & 1 deletion src/prometheus/deploy/alerting/jobs.rules
@@ -17,16 +17,22 @@

# Rule Syntax Reference: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

# select alert severity from `info`, `warn`, `error` or `fatal`

groups:
- name: pai-jobs
rules:
- alert: PaiJobsZombie
expr: zombie_container_count > 0
for: 1h # only when it exceeds 1 hour
labels:
severity: info
annotations:
summary: "zombie container in {{$labels.instance}}detected"
summary: "zombie container in {{$labels.instance}} detected"
- alert: PaiJobPending
expr: pai_job_pod_count{pod_bound="true", phase="pending"} > 0
for: 30m
labels:
severity: warn
annotations:
summary: "Job {{$labels.job_name}}in pending status detected"
6 changes: 6 additions & 0 deletions src/prometheus/deploy/alerting/k8s.rules
@@ -17,17 +17,23 @@

# Rule Syntax Reference: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

# select alert severity from `info`, `warn`, `error` or `fatal`

groups:
- name: k8s_component
rules:
- alert: k8sApiServerNotOk
expr: k8s_api_server_count{error!="ok"} > 0
for: 10m
labels:
severity: fatal
annotations:
summary: "api server in {{$labels.host_ip}} is {{$labels.error}}"

- alert: k8sDockerDaemonNotOk
expr: docker_daemon_count{error!="ok"} > 0
for: 5m
labels:
severity: fatal
annotations:
summary: "docker daemon in {{$labels.ip}} is {{$labels.error}}"
16 changes: 16 additions & 0 deletions src/prometheus/deploy/alerting/node.rules
@@ -17,47 +17,63 @@

# Rule Syntax Reference: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

# select alert severity from `info`, `warn`, `error` or `fatal`

groups:
- name: node-rules
rules:
- alert: NodeFilesystemUsage
expr: node_filesystem_avail_bytes{mountpoint=~"/host-root.*", device=~"/dev.*"} / node_filesystem_size_bytes * 100 <= 20
for: 5m
labels:
severity: warn
annotations:
summary: "Free space in {{$labels.device}} from {{$labels.instance}} is less than 20% (current value is: {{ $value }})"

- alert: NodeMemoryUsage
expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 95
for: 5m
labels:
severity: warn
annotations:
summary: "Memory usage in {{$labels.instance}} is above 95% (current value is: {{ $value }})"

- alert: NodeCPUUsage
expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 98
for: 5m
labels:
severity: warn
annotations:
summary: "CPU usage in {{$labels.instance}} is above 98% (current value is: {{ $value }})"

- alert: NodeDiskPressure
expr: pai_node_count{disk_pressure="true"} > 0
for: 10m
labels:
severity: error
annotations:
summary: "{{$labels.name}} is under disk pressure"

- alert: NodeOutOfDisk
expr: pai_node_count{out_of_disk="true"} > 0
for: 10m
labels:
severity: error
annotations:
summary: "{{$labels.name}} is out of disk"

- alert: NodeNotReady
expr: pai_node_count{ready!="true"} > 0
for: 10m
labels:
severity: error
annotations:
summary: "{{$labels.name}} is not ready"

- alert: AzureAgentConsumeTooMuchMem
expr: process_mem_usage_byte{cmd=~".*om[is]agent.*"} > 1073741824 # 1G
for: 5m
labels:
severity: warn
annotations:
summary: "{{$labels.cmd}} with pid {{$labels.pid}} in {{$labels.instance}} consume more than 1G of memory"
8 changes: 8 additions & 0 deletions src/prometheus/deploy/alerting/pai-services.rules
@@ -17,12 +17,17 @@

# Rule Syntax Reference: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

# select alert severity from `info`, `warn`, `error` or `fatal`

groups:
- name: pai-services
rules:
- alert: PaiServicePodNotRunning
expr: pai_pod_count{phase!="running"} > 0
for: 10m
labels:
type: pai_service
severity: error
annotations:
summary: "{{$labels.name}} in {{$labels.host_ip}} not running detected"

@@ -31,6 +36,7 @@ groups:
for: 10m
labels:
type: pai_service
severity: error
annotations:
summary: "{{$labels.name}} in {{$labels.host_ip}} not ready detected"

@@ -39,6 +45,7 @@
for: 5m
labels:
type: pai_service
severity: error
annotations:
summary: "{{$labels.pai_service_name}} in {{$labels.instance}} not up detected"

@@ -47,5 +54,6 @@
for: 5m
labels:
type: pai_service
severity: warn
annotations:
summary: "{{$labels.name}} in {{$labels.instance}} hangs detected"
2 changes: 1 addition & 1 deletion src/webportal/src/app/layout/components/alerts.jsx
@@ -140,7 +140,7 @@ export const NotificationButton = () => {
<div className={classNames.itemCell} data-is-focusable={true}>
{'Issue time: ' + new Date(item.startsAt).toLocaleString()}
<br />
{'Summary: ' + item.annotations.summary}
{item.labels.severity + ':' + item.annotations.summary}
</div>
);
}}