Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create monitor-deploy.md and tidb-monitoring-api.md #3196

Merged
merged 17 commits into from
May 25, 2020
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions TOC.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,7 @@
+ 监控与告警设置
+ [监控框架概述](/tidb-monitoring-framework.md) @李宋高
+ [监控 API](/tidb-monitoring-api.md) @李宋高
+ [监控部署](/monitor-deploy.md) @李宋高
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
+ [监控部署](/monitor-deploy.md) @李宋高
+ [监控部署](/deploy-monitoring-services.md) @李宋高

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiaoyangwang5 If you commit this suggestion, remember to update the corresponding link in #3186 as well, thanks~

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

+ [测试验证](/post-installation-check.md) @李仲舒
+ 性能测试报告及重现指南
+ [如何用 Sysbench 测试 TiDB](/benchmark/benchmark-tidb-using-sysbench.md) @周跃跃
Expand Down
273 changes: 273 additions & 0 deletions monitor-deploy.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,273 @@
---
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

建议将文件名改为 deploy-monitoring-services.md,更清晰一些~

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

title: 集群监控部署
category: how-to
aliases: ['/docs-cn/stable/how-to/monitor/monitor-deploy/']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • 对于新建的文件,不需要加 aliases 这一行,aliases 用于将用户从旧链接引导至新页面。

  • 如果参考下一条 comment 删除了 monitor-a-tidb-cluster.md,则需要添加以下 aliases:

    aliases: ['/docs-cn/dev/monitor-a-tidb-cluster/']

---

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个 PR 中的内容与现有的文档内容有重叠,请查看确认 https://pingcap.com/docs-cn/dev/monitor-a-tidb-cluster/ 中是否有还需保留的内容,如果 PR 可以取代原 monitor-a-tidb-cluster.md,请在该 PR 中删除原文件。

# TiDB 集群监控部署

## 部署 Prometheus 和 Grafana

假设 TiDB 的拓扑结构如下:

| 节点 | 主机 IP | 服务 |
| :-- | :-- | :-------------- |
| Node1 | 192.168.199.113| PD1, TiDB, node_export, Prometheus, Grafana |
| Node2 | 192.168.199.114| PD2, node_export |
| Node3 | 192.168.199.115| PD3, node_export |
| Node4 | 192.168.199.116| TiKV1, node_export |
| Node5 | 192.168.199.117| TiKV2, node_export |
| Node6 | 192.168.199.118| TiKV3, node_export |

### 第 1 步:下载二进制包

下载二进制包:

{{< copyable "shell-regular" >}}

```bash
wget https://download.pingcap.org/prometheus-2.8.1.linux-amd64.tar.gz
wget https://download.pingcap.org/node_exporter-0.17.0.linux-amd64.tar.gz
wget https://download.pingcap.org/grafana-6.1.6.linux-amd64.tar.gz
```

解压二进制包:

{{< copyable "shell-regular" >}}

```bash
tar -xzf prometheus-2.8.1.linux-amd64.tar.gz
tar -xzf node_exporter-0.17.0.linux-amd64.tar.gz
tar -xzf grafana-6.1.6.linux-amd64.tar.gz
```

### 第 2 步:在 Node1,Node2,Node3,Node4 上启动 `node_exporter`

{{< copyable "shell-regular" >}}

```bash
cd node_exporter-0.17.0.linux-amd64
```

启动 node_exporter 服务:

{{< copyable "shell-regular" >}}

```bash
./node_exporter --web.listen-address=":9100" \
--log.level="info" &
```

### 第 3 步:在 Node1 上启动 Prometheus

编辑 Prometheus 的配置文件:

{{< copyable "shell-regular" >}}

```bash
cd prometheus-2.8.1.linux-amd64 &&
vi prometheus.yml
```

```ini
...

global:
scrape_interval: 15s
evaluation_interval: 15s
# scrape_timeout 设置为全局默认值 (10s)
external_labels:
cluster: 'test-cluster'
monitor: "prometheus"

scrape_configs:
- job_name: 'overwritten-nodes'
honor_labels: true # 不要覆盖 job 和实例的 label
static_configs:
- targets:
- '192.168.199.113:9100'
- '192.168.199.114:9100'
- '192.168.199.115:9100'
- '192.168.199.116:9100'
- '192.168.199.117:9100'
- '192.168.199.118:9100'

- job_name: 'tidb'
honor_labels: true # 不要覆盖 job 和实例的 label
static_configs:
- targets:
- '192.168.199.113:10080'

- job_name: 'pd'
honor_labels: true # 不要覆盖 job 和实例的 label
static_configs:
- targets:
- '192.168.199.113:2379'
- '192.168.199.114:2379'
- '192.168.199.115:2379'

- job_name: 'tikv'
honor_labels: true # 不要覆盖 job 和实例的 label
static_configs:
- targets:
- '192.168.199.116:20180'
- '192.168.199.117:20180'
- '192.168.199.118:20180'

...
```

启动 Prometheus 服务:

{{< copyable "shell-regular" >}}

```bash
./prometheus \
--config.file="./prometheus.yml" \
--web.listen-address=":9090" \
--web.external-url="http://192.168.199.113:9090/" \
--web.enable-admin-api \
--log.level="info" \
--storage.tsdb.path="./data.metrics" \
--storage.tsdb.retention="15d" &
```

### 第 4 步:在 Node1 上启动 Grafana

编辑 Grafana 的配置文件:

```bash
cd grafana-6.1.6 &&
vi conf/grafana.ini
```

```ini
...

[paths]
data = ./data
logs = ./data/log
plugins = ./data/plugins
[server]
http_port = 3000
domain = 192.168.199.113
[database]
[session]
[analytics]
check_for_updates = true
[security]
admin_user = admin
admin_password = admin
[snapshots]
[users]
[auth.anonymous]
[auth.basic]
[auth.ldap]
[smtp]
[emails]
[log]
mode = file
[log.console]
[log.file]
level = info
format = text
[log.syslog]
[event_publisher]
[dashboards.json]
enabled = false
path = ./data/dashboards
[metrics]
[grafana_net]
url = https://grafana.net

...

```

启动 Grafana 服务:

{{< copyable "shell-regular" >}}

```bash
./bin/grafana-server \
--config="./conf/grafana.ini" &
```

## 配置 Grafana

本小节介绍如何配置 Grafana。

### 第 1 步:添加 Prometheus 数据源

1. 登录 Grafana 界面。

- 默认地址:`http://localhost:3000`
- 默认账户:admin
- 默认密码:admin

> **注意:**
>
> **Change Password** 步骤可以选择 **Skip**。

2. 点击 Grafana 侧边栏菜单 **Configuration** 中的 **Data Source**。

3. 点击 **Add data source**。

4. 指定数据源的相关信息:

- 在 **Name** 处,为数据源指定一个名称。
- 在 **Type** 处,选择 **Prometheus**。
- 在 **URL** 处,指定 Prometheus 的 IP 地址。
- 根据需求指定其它字段。

5. 点击 **Add** 保存新的数据源。

### 第 2 步:导入 Grafana 面板

执行以下步骤,为 PD Server、TiKV Server 和 TiDB Server 分别导入 Grafana 面板:

1. 点击侧边栏的 Grafana 图标。

2. 在侧边栏菜单中,依次点击 **Dashboards** > **Import** 打开 **Import Dashboard** 窗口。

3. 点击 **Upload .json File** 上传对应的 JSON 文件(下载 [TiDB Grafana 配置文件](https://github.com/pingcap/tidb-ansible/tree/master/scripts))。

> **注意:**
>
> TiKV、PD 和 TiDB 面板对应的 JSON 文件分别为 `tikv_summary.json`,`tikv_details.json`,`tikv_trouble_shooting.json`,`pd.json`,`tidb.json`,`tidb_summary.json`。

4. 点击 **Load**。

5. 选择一个 Prometheus 数据源。

6. 点击 **Import**,Prometheus 面板即导入成功。

## 查看组件 metrics

在顶部菜单中,点击 **New dashboard**,选择要查看的面板。

![view dashboard](/media/view-dashboard.png)

可查看以下集群组件信息:

+ **TiDB Server:**
+ query 处理时间,可以看到延迟和吞吐
+ ddl 过程监控
+ TiKV client 相关的监控
+ PD client 相关的监控

+ **PD Server:**
+ 命令执行的总次数
+ 某个命令执行失败的总次数
+ 某个命令执行成功的耗时统计
+ 某个命令执行失败的耗时统计
+ 某个命令执行完成并返回结果的耗时统计

+ **TiKV Server:**
+ GC 监控
+ 执行 KV 命令的总次数
+ Scheduler 执行命令的耗时统计
+ Raft propose 命令的总次数
+ Raft 执行命令的耗时统计
+ Raft 执行命令失败的总次数
+ Raft 处理 ready 状态的总次数