Update grafana pd dashboard (#3343) #3742

Merged (2 commits) on Jun 19, 2020
75 changes: 45 additions & 30 deletions grafana-pd-dashboard.md
@@ -14,38 +14,39 @@ aliases: ['/docs-cn/v3.1/reference/key-monitoring-metrics/pd-dashboard/']

The PD Dashboard metrics are described below:

## Cluster

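- PD role: the role of the current PD instance
- Storage capacity: the total storage capacity available to the TiDB cluster
- Current storage size: the storage space currently used by the TiDB cluster
- Current storage usage: the storage usage ratio of the TiDB cluster
- Normal stores: the number of nodes in a normal state
- Number of Regions: the total number of Regions in the cluster
- PD scheduler config: the list of PD scheduling configurations
- Region label isolation level: the number of Regions at each label isolation level
- Label distribution: the distribution of labels across the TiKV nodes in the cluster
- Abnormal stores: the number of nodes in an abnormal state; this should normally be 0
- pd_cluster_metadata: records the cluster ID, the timestamp, and the generated ID
- Region health: the status of all Regions in the cluster. Normally, the number of pending or down peers should be below 100, and the number of miss peers should not stay above 0. If there are too many empty Regions, enable Region Merge promptly
- Current peer count: the total number of peers in the cluster
- Region health: the status of each Region. Normally, the number of pending peers should be below 100, and the number of miss peers should not stay above 0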
![PD Dashboard - Header](/media/pd-dashboard-header-v4.png)

![PD Dashboard - Cluster metrics](/media/pd-dashboard-cluster-v2.png)
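
As a minimal sketch of how some of the Cluster panels above relate, assuming the panel values have already been read out of Grafana or Prometheus, the Python below reads Current storage usage as Current storage size divided by Storage capacity and applies the Region health rules of thumb (pending or down peers below 100, miss peers not persistently above 0). The function names and the idea of passing a window of recent miss-peer samples are hypothetical, for illustration only.

```python
from typing import List, Sequence


def storage_usage(current_size_bytes: float, capacity_bytes: float) -> float:
    """Current storage usage read as Current storage size / Storage capacity."""
    if capacity_bytes <= 0:
        raise ValueError("storage capacity must be positive")
    return current_size_bytes / capacity_bytes


def region_health_warnings(pending_peers: int, down_peers: int,
                           recent_miss_peers: Sequence[int]) -> List[str]:
    """Apply the Region health rules of thumb (hypothetical helper)."""
    warnings = []
    if pending_peers >= 100:
        warnings.append("pending peers >= 100")
    if down_peers >= 100:
        warnings.append("down peers >= 100")
    # "Should not stay above 0": flag only if every recent sample is above 0.
    if recent_miss_peers and all(n > 0 for n in recent_miss_peers):
        warnings.append("miss peers persistently > 0")
    return warnings


print(storage_usage(300, 1200))                    # 0.25
print(region_health_warnings(12, 0, [3, 0, 1]))    # []
print(region_health_warnings(150, 0, [2, 2, 2]))
# ['pending peers >= 100', 'miss peers persistently > 0']
```

A single miss-peer spike is tolerated here because the text only asks that the count not *stay* above 0; the window length is left to the caller.
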
## Cluster

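- PD scheduler config: the list of PD scheduling configurations
- Cluster ID: the unique identifier of the cluster
- Current TSO: the physical timestamp part of the TSO currently being allocated
- Current ID allocation: the maximum ID that can currently be allocated
- Region label isolation level: the number of Regions at each label isolation level
- Label distribution: the distribution of labels across the TiKV nodes in the cluster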
![PD Dashboard - Cluster metrics](/media/pd-dashboard-cluster-v4.png)
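
The Current TSO panel shows only the physical part of a TSO. As a sketch, assuming the layout PD uses for TSOs (physical time in milliseconds in the high bits, an 18-bit logical counter in the low bits), the helpers below pack a TSO and recover its physical part as a UTC datetime. The function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Tuple

# Assumption: a TSO is packed as (physical_ms << 18) | logical.
LOGICAL_BITS = 18
LOGICAL_MASK = (1 << LOGICAL_BITS) - 1


def compose_ts(physical_ms: int, logical: int) -> int:
    """Pack a physical timestamp (ms since the Unix epoch) and a logical counter."""
    if not 0 <= logical <= LOGICAL_MASK:
        raise ValueError("logical part must fit in 18 bits")
    return (physical_ms << LOGICAL_BITS) | logical


def split_ts(ts: int) -> Tuple[int, int]:
    """Return (physical_ms, logical) for a packed TSO."""
    return ts >> LOGICAL_BITS, ts & LOGICAL_MASK


def physical_as_datetime(ts: int) -> datetime:
    """Convert the physical part of a TSO to a UTC datetime."""
    physical_ms, _ = split_ts(ts)
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)


ts = compose_ts(1592524800000, 5)
print(split_ts(ts))              # (1592524800000, 5)
print(physical_as_datetime(ts))  # 2020-06-19 00:00:00+00:00
```
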

## Operator

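- Schedule operator create: the number of newly created operators of each type
- Schedule operator check: the number of operators checked; the check mainly determines whether the current step has finished and, if so, executes the next step
- Schedule operator create: the number of newly created operators of each type; the unit opm means the number created per minute
- Schedule operator check: the number of times operators have been checked; the check mainly determines whether the current step has finished and, if so, executes the next step
- Schedule operator finish: the number of operators that have finished scheduling
- Schedule operator timeout: the number of operators that have timed out
- Schedule operator replaced or canceled: the number of operators that have been canceled or replaced
- Schedule operators count by state: the number of operators in each state
- 99% Operator finish duration: the time within which 99% of finished operators completed (P99)
- 50% Operator finish duration: the time within which 50% of finished operators completed (P50)
- 99% Operator step duration: the time within which 99% of finished operator steps completed (P99)
- 50% Operator step duration: the time within which 50% of finished operator steps completed (P50)
- Operator finish duration: the maximum time taken by finished operators
- Operator step duration: the maximum time taken by finished operator steps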
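
The Schedule operator check description implies an operator made of ordered steps, where each check advances to the next step once the current one is done. The sketch below models only that idea and the opm unit; the `Operator` class, its step probes, and `per_minute` are hypothetical and are not PD's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Operator:
    """Hypothetical operator: ordered steps, each exposing a 'done?' probe."""
    steps: List[Callable[[], bool]]
    current: int = 0

    def finished(self) -> bool:
        return self.current >= len(self.steps)

    def check(self) -> bool:
        """One check: if the current step is done, move on to the next step.

        Returns True when the operator advanced.
        """
        if self.finished():
            return False
        if self.steps[self.current]():
            self.current += 1
            return True
        return False


def per_minute(count: int, window_seconds: float) -> float:
    """Convert a count observed over a window into opm (operators per minute)."""
    return count * 60.0 / window_seconds


op = Operator(steps=[lambda: True, lambda: False])
print(op.check(), op.check(), op.current)  # True False 1
print(per_minute(30, 120))                 # 15.0
```

Because a check that finds the step still running does not advance the operator, the number of checks can exceed the number of steps, which is why the check panel counts times rather than operators.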

![PD Dashboard - Operator metrics](/media/pd-dashboard-operator-v2.png)
![PD Dashboard - Operator metrics](/media/pd-dashboard-operator-v4.png)

## Statistics - Balance

@@ -61,20 +62,32 @@ aliases: ['/docs-cn/v3.1/reference/key-monitoring-metrics/pd-dashboard/']
- Store leader count: the number of leaders on each TiKV instance
- Store Region count: the number of Regions on each TiKV instance

![PD Dashboard - Balance metrics](/media/pd-dashboard-balance-v2.png)
![PD Dashboard - Balance metrics](/media/pd-dashboard-balance-v4.png)

## Statistics - hot write

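- Hot Region's leader distribution: the number of leaders on each TiKV instance that have become write hotspots
- Total written bytes on hot leader Regions: the total write traffic of all write-hotspot leaders on each TiKV instance
- Hot write Region's peer distribution: the number of peers on each TiKV instance that have become write hotspots
- Total written bytes on hot peer Regions: the write traffic of all write-hotspot peers on each TiKV instance
- Store Write rate bytes: the total write traffic of each TiKV instance
- Store Write rate keys: the total number of keys written on each TiKV instance
- Hot cache write entry number: the number of peers on each TiKV instance that have entered the hotspot statistics module
- Selector events: the number of selector events in hotspot scheduling
- Direction of hotspot move leader: the direction of leader moves in hotspot scheduling; a positive number means moved in, a negative number means moved out
- Direction of hotspot move peer: the direction of peer moves in hotspot scheduling; a positive number means moved in, a negative number means moved out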
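
The sign convention for the two Direction panels can be illustrated by folding individual moves into a signed count per store. This is a sketch of the convention only: the source does not say how the panels aggregate moves over time, and `move_direction` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def move_direction(moves: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Fold (source_store_id, target_store_id) moves into a signed count per store.

    Positive: more leaders/peers moved into the store; negative: more moved out.
    """
    direction: Dict[int, int] = defaultdict(int)
    for source, target in moves:
        direction[source] -= 1
        direction[target] += 1
    return dict(direction)


print(move_direction([(1, 4), (1, 5), (2, 4)]))
# {1: -2, 4: 2, 5: 1, 2: -1}
```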

![PD Dashboard - Hot write metrics](/media/pd-dashboard-hotwrite-v4.png)

## Statistics - hotspot
## Statistics - hot read

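- Hot write Region's leader distribution: the number of write-hotspot leaders on each TiKV instance
- Hot write Region's peer distribution: the number of write-hotspot peers on each TiKV instance
- Hot write Region's leader written bytes: the bytes written by hotspot leaders on each TiKV instance
- Hot write Region's peer written bytes: the bytes written by hotspot peers on each TiKV instance
- Hot read Region's leader distribution: the number of read-hotspot leaders on each TiKV instance
- Hot read Region's peer distribution: the number of read-hotspot peers on each TiKV instance
- Hot read Region's leader read bytes: the bytes read by hotspot leaders on each TiKV instance
- Hot read Region's peer read bytes: the bytes read by hotspot peers on each TiKV instance
- Hot Region's leader distribution: the number of leaders on each TiKV instance that have become read hotspots
- Total read bytes on hot leader Regions: the total read traffic of all read-hotspot leaders on each TiKV instance
- Store read rate bytes: the total read traffic of each TiKV instance
- Store read rate keys: the total number of keys read on each TiKV instance
- Hot cache read entry number: the number of peers on each TiKV instance that have entered the hotspot statistics module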

![PD Dashboard - Hotspot metrics](/media/pd-dashboard-hotspot.png)
![PD Dashboard - Hot read metrics](/media/pd-dashboard-hotread-v4.png)

## Scheduler

@@ -85,15 +98,15 @@ aliases: ['/docs-cn/v3.1/reference/key-monitoring-metrics/pd-dashboard/']
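- Balance Region event: the number of balance Region events
- Balance leader scheduler: the status of the balance-leader scheduler
- Balance Region scheduler: the status of the balance-region scheduler
- Namespace checker: the status of the namespace checker
- Replica checker: the status of the replica checker
- Rule checker: the status of the rule checker
- Region merge checker: the status of the merge checker
- Filter target: the number of times a Store failed a filter when being considered as a scheduling target
- Filter source: the number of times a Store failed a filter when being considered as a scheduling source
- Balance Direction: the number of times a Store was selected as a scheduling target or source
- Store Limit: the scheduling rate-limit status of each Store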
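
Filter target and Filter source count stores that are rejected by a filter while the scheduler looks for a target or source. The sketch below shows that counting pattern for targets under stated assumptions: the store records, the two filter names, and counting only the first failing filter per store are all hypothetical, not PD's actual filters.

```python
from collections import Counter
from typing import Callable, Dict, List

Store = Dict[str, object]          # hypothetical store record
Filter = Callable[[Store], bool]   # returns True when the store passes


def candidate_targets(stores: List[Store], filters: Dict[str, Filter],
                      failures: Counter) -> List[Store]:
    """Keep stores that pass every filter; count the first filter each rejected store fails."""
    candidates = []
    for store in stores:
        for name, passes in filters.items():
            if not passes(store):
                failures[(store["id"], name)] += 1
                break
        else:
            candidates.append(store)
    return candidates


stores = [
    {"id": 1, "up": True, "usage": 0.5},
    {"id": 2, "up": False, "usage": 0.1},
    {"id": 3, "up": True, "usage": 0.95},
]
filters = {"store-up": lambda s: s["up"], "low-usage": lambda s: s["usage"] < 0.9}
failures: Counter = Counter()
print([s["id"] for s in candidate_targets(stores, filters, failures)])  # [1]
print(dict(failures))  # {(2, 'store-up'): 1, (3, 'low-usage'): 1}
```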

![PD Dashboard - Scheduler metrics](/media/pd-dashboard-scheduler-v2.png)
![PD Dashboard - Scheduler metrics](/media/pd-dashboard-scheduler-v4.png)

## gRPC

@@ -117,20 +130,22 @@ aliases: ['/docs-cn/v3.1/reference/key-monitoring-metrics/pd-dashboard/']

## TiDB

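- PD Server TSO handle time and Client recv time: the total time from when PD starts handling a TSO request until the client receives the TSO
- Handle requests count: the number of requests from TiDB
- Handle requests duration: the time spent on each request; at the 99th percentile it should be under 100ms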
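
The 100ms rule of thumb is a statement about the 99th-percentile request duration. As a sketch, the check below uses a simple nearest-rank percentile over raw samples; this is a swapped-in technique for illustration, and the dashboard's own P99 is likely estimated differently (for example from Prometheus histogram buckets). The helper names are hypothetical.

```python
import math
from typing import Sequence


def nearest_rank_percentile(samples: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= q <= 100:
        raise ValueError("q must be in 1..100")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def p99_within_budget(durations_ms: Sequence[float], budget_ms: float = 100.0) -> bool:
    """True when the 99th-percentile duration is under the budget (default 100 ms)."""
    return nearest_rank_percentile(durations_ms, 99) < budget_ms


print(p99_within_budget([5.0] * 99 + [250.0]))       # True
print(p99_within_budget([5.0] * 98 + [250.0] * 2))   # False
```

With 100 samples, one slow request still passes because it falls in the top 1%, while two slow requests push the P99 over the budget.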

![PD Dashboard - TiDB metrics](/media/pd-dashboard-tidb-v2.png)
![PD Dashboard - TiDB metrics](/media/pd-dashboard-tidb-v4.png)

## Heartbeat

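- Heartbeat region event QPS: the QPS of Region heartbeat processing, including cache updates and persistence
- Region heartbeat report: the number of heartbeats TiKV sends to PD
- Region heartbeat report error: the number of abnormal heartbeats TiKV sends to PD
- Region heartbeat report active: the number of normal heartbeats TiKV sends to PD
- Region schedule push: the number of scheduling commands PD sends to TiKV
- 99% Region heartbeat latency: the heartbeat latency at the 99th percentile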

![PD Dashboard - Heartbeat metrics](/media/pd-dashboard-heartbeat-v2.png)
![PD Dashboard - Heartbeat metrics](/media/pd-dashboard-heartbeat-v4.png)

## Region storage

Binary file added media/pd-dashboard-balance-v4.png
Binary file added media/pd-dashboard-cluster-v4.png
Binary file added media/pd-dashboard-header-v4.png
Binary file added media/pd-dashboard-heartbeat-v4.png
Binary file added media/pd-dashboard-hotread-v4.png
Binary file added media/pd-dashboard-hotwrite-v4.png
Binary file added media/pd-dashboard-operator-v4.png
Binary file added media/pd-dashboard-scheduler-v4.png
Binary file added media/pd-dashboard-tidb-v4.png