add 3 monitoring documents (pingcap#2791)
* add 2 monitoring documents

* Update TOC.md

* align pingcap#3195

* fix ci

* address comments from lilian

Co-authored-by: Lilian Lee <lilin@pingcap.com>

* remove extra information

Co-authored-by: Lilian Lee <lilin@pingcap.com>
TomShawn and lilin90 authored Jun 17, 2020
1 parent bf258db commit 81d016b
Showing 10 changed files with 126 additions and 92 deletions.
5 changes: 3 additions & 2 deletions TOC.md
@@ -73,8 +73,9 @@
+ [Maintain TiDB Using TiUP](/maintain-tidb-using-tiup.md)
+ [Maintain TiDB Using Ansible](/maintain-tidb-using-ansible.md)
+ Monitor and Alert
+ [Monitoring Framework](/tidb-monitoring-framework.md)
+ [Monitor a TiDB Cluster](/monitor-a-tidb-cluster.md)
+ [Monitoring Framework Overview](/tidb-monitoring-framework.md)
+ [Monitoring API](/tidb-monitoring-api.md)
+ [Deploy Monitoring Services](/deploy-monitoring-services.md)
+ [TiDB Cluster Alert Rules](/alert-rules.md)
+ [TiFlash Alert Rules](/tiflash/tiflash-alert-rules.md)
+ Troubleshoot
101 changes: 15 additions & 86 deletions monitor-a-tidb-cluster.md → deploy-monitoring-services.md
@@ -1,88 +1,17 @@
---
title: Monitor a TiDB Cluster
summary: Learn how to monitor the state of a TiDB cluster.
title: Deploy Monitoring Services for the TiDB Cluster
summary: Learn how to deploy monitoring services for the TiDB cluster.
category: how-to
aliases: ['/docs/dev/how-to/monitor/monitor-a-cluster/']
aliases: ['/docs/dev/how-to/monitor/monitor-a-cluster/','/docs/dev/monitor-a-tidb-cluster/']
---

# Monitor a TiDB Cluster
# Deploy Monitoring Services for the TiDB Cluster

You can use the following two types of interfaces to monitor the TiDB cluster state:
This document is intended for users who want to manually deploy TiDB monitoring and alert services.

- [The state interface](#use-the-state-interface): this interface uses the HTTP interface to get the component information.
- [The metrics interface](#use-the-metrics-interface): this interface uses Prometheus to record the detailed information of the various operations in components, and uses Grafana to view these metrics.
If you deploy the TiDB cluster using TiUP, the monitoring and alert services are automatically deployed, and no manual deployment is needed.

## Use the state interface

The state interface monitors the basic information of a specific component in the TiDB cluster. It can also act as the monitor interface for Keepalive messages. In addition, the state interface for the Placement Driver (PD) can get the details of the entire TiKV cluster.

### TiDB server

- TiDB API address: `http://${host}:${port}`
- Default port: `10080`
- Details about API names: see [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md)

The following example uses `http://${host}:${port}/status` to get the current state of the TiDB server and to determine whether the server is alive. The result is returned in JSON format.

```bash
curl http://127.0.0.1:10080/status
{
connections: 0, # The current number of clients connected to the TiDB server.
version: "5.7.25-TiDB-v3.0.0-beta-250-g778c3f4a5", # The TiDB version number.
git_hash: "778c3f4a5a716880bcd1d71b257c8165685f0d70" # The Git Hash of the current TiDB code.
}
```

### PD server

- PD API address: `http://${host}:${port}/pd/api/v1/${api_name}`
- Default port: `2379`
- Details about API names: see [PD API doc](https://download.pingcap.com/pd-api-v1.html)

The PD interface provides the state of all the TiKV servers and the information about load balancing. See the following example for the information about a single-node TiKV cluster:

```bash
curl http://127.0.0.1:2379/pd/api/v1/stores
{
"count": 1, # The number of TiKV nodes.
"stores": [ # The list of TiKV nodes.
# The details about the single TiKV node.
{
"store": {
"id": 1,
"address": "127.0.0.1:20160",
"version": "3.0.0-beta",
"state_name": "Up"
},
"status": {
"capacity": "20 GiB", # The total capacity.
"available": "16 GiB", # The available capacity.
"leader_count": 17,
"leader_weight": 1,
"leader_score": 17,
"leader_size": 17,
"region_count": 17,
"region_weight": 1,
"region_score": 17,
"region_size": 17,
"start_ts": "2019-03-21T14:09:32+08:00", # The starting timestamp.
"last_heartbeat_ts": "2019-03-21T14:14:22.961171958+08:00", # The timestamp of the last heartbeat.
"uptime": "4m50.961171958s"
}
}
    ]
}
```
## Use the metrics interface

The metrics interface monitors the state and performance of the entire TiDB cluster.

- If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system (Prometheus and Grafana) is deployed at the same time.
- If you use other deployment ways, [deploy Prometheus and Grafana](#deploy-prometheus-and-grafana) before using this interface.

After Prometheus and Grafana are successfully deployed, [configure Grafana](#configure-grafana).
### Deploy Prometheus and Grafana
## Deploy Prometheus and Grafana

Assume that the TiDB cluster topology is as follows:

@@ -95,7 +24,7 @@ Assume that the TiDB cluster topology is as follows:
| Node5 | 192.168.199.117| TiKV2, node_export |
| Node6 | 192.168.199.118| TiKV3, node_export |

#### Step 1: Download the binary package
### Step 1: Download the binary package

{{< copyable "shell-regular" >}}

@@ -115,7 +44,7 @@ tar -xzf node_exporter-0.17.0.linux-amd64.tar.gz
tar -xzf grafana-6.1.6.linux-amd64.tar.gz
```

#### Step 2: Start `node_exporter` on Node1, Node2, Node3, and Node4
### Step 2: Start `node_exporter` on Node1, Node2, Node3, and Node4

{{< copyable "shell-regular" >}}

@@ -127,7 +56,7 @@ $ ./node_exporter --web.listen-address=":9100" \
--log.level="info" &
```
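
Backgrounding with `&` as above is fine for a trial run, but the exporter then dies with the shell session. One common alternative is a systemd unit; the following is a minimal sketch, with the unit path and install directory assumed rather than taken from this commit:

```ini
# /etc/systemd/system/node_exporter.service (hypothetical path)
[Unit]
Description=node_exporter
After=network.target

[Service]
User=tidb
ExecStart=/home/tidb/node_exporter-0.17.0.linux-amd64/node_exporter \
    --web.listen-address=":9100" --log.level="info"
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After placing the file, `systemctl daemon-reload` followed by `systemctl enable --now node_exporter` starts the exporter and keeps it running across reboots.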

#### Step 3: Start Prometheus on Node1
### Step 3: Start Prometheus on Node1

Edit the Prometheus configuration file:

@@ -200,7 +129,7 @@ $ ./prometheus \
--storage.tsdb.retention="15d" &
```
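
The `prometheus.yml` edit itself is collapsed in this diff. A minimal sketch of what a scrape configuration for the topology above might look like follows; the job names, the default metrics ports, and the Node1 address are assumptions, so substitute your own values:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'tidb'
    honor_labels: true
    static_configs:
      - targets: ['192.168.199.113:10080']   # TiDB status port (assumed Node1 address)
  - job_name: 'pd'
    honor_labels: true
    static_configs:
      - targets: ['192.168.199.113:2379']    # PD client port
  - job_name: 'tikv'
    honor_labels: true
    static_configs:
      - targets: ['192.168.199.117:20180', '192.168.199.118:20180']   # TiKV status port
  - job_name: 'node_exporter'
    honor_labels: true
    static_configs:
      - targets: ['192.168.199.117:9100', '192.168.199.118:9100']
```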

#### Step 4: Start Grafana on Node1
### Step 4: Start Grafana on Node1

Edit the Grafana configuration file:

@@ -259,11 +188,11 @@ $ ./bin/grafana-server \
--config="./conf/grafana.ini" &
```
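
As with Prometheus, the `grafana.ini` edit is collapsed in this diff; a sketch of the fields most commonly adjusted for this kind of deployment, with every value an assumption:

```ini
; conf/grafana.ini (hypothetical excerpt)
[paths]
data = ./data
logs = ./data/log

[server]
http_port = 3000
domain = 192.168.199.113   ; assumed Node1 address
```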

### Configure Grafana
## Configure Grafana

This section describes how to configure Grafana.

#### Step 1: Add a Prometheus data source
### Step 1: Add a Prometheus data source

1. Log in to the Grafana Web interface.

@@ -288,7 +217,7 @@ This section describes how to configure Grafana.

5. Click **Add** to save the new data source.
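
Where clicking through the UI is impractical (for example, scripted setups), Grafana 5.0 and later can also load a data source from a provisioning file at startup. A minimal sketch, with the file path and all field values as assumptions:

```yaml
# conf/provisioning/datasources/tidb-cluster.yml (hypothetical path)
apiVersion: 1
datasources:
  - name: tidb-cluster
    type: prometheus
    access: proxy
    url: http://192.168.199.113:9090   # assumed Prometheus address on Node1
    isDefault: true
```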

#### Step 2: Import a Grafana dashboard
### Step 2: Import a Grafana dashboard

To import a Grafana dashboard for the PD server, the TiKV server, and the TiDB server, take the following steps respectively:

@@ -308,7 +237,7 @@ To import a Grafana dashboard for the PD server, the TiKV server, and the TiDB s

6. Click **Import**. A Prometheus dashboard is imported.

### View component metrics
## View component metrics

Click **New dashboard** in the top menu and choose the dashboard you want to view.

2 changes: 1 addition & 1 deletion grafana-overview-dashboard.md
@@ -75,4 +75,4 @@ System Info | IO Util | the disk usage ratio, 100% at a maximum; generally you n

## Interface of the Overview dashboard

![Overview Dashboard](/media/overview.png)
![overview](/media/grafana-monitor-overview.png)
Binary file added media/grafana-monitor-overview.png
Binary file added media/grafana-monitored-groups.png
Binary file removed media/overview.png
2 changes: 1 addition & 1 deletion mysql-compatibility.md
@@ -76,7 +76,7 @@ mysql> select _tidb_rowid, id from t;

### Performance schema

TiDB uses a combination of Prometheus and Grafana to store and query the performance monitoring metrics. Some performance schema tables return empty results in TiDB.
TiDB uses a combination of [Prometheus and Grafana](/tidb-monitoring-api.md) to store and query the performance monitoring metrics. Some performance schema tables return empty results in TiDB.

### Query Execution Plan

2 changes: 1 addition & 1 deletion production-deployment-from-binary-tarball.md
@@ -205,4 +205,4 @@ Follow the steps below to start PD, TiKV, and TiDB:
> - To tune TiKV, see [Performance Tuning for TiKV](/tune-tikv-performance.md).
> - If you use `nohup` to start the cluster in the production environment, write the startup commands in a script and then run the script. If not, the `nohup` process might abort because it receives exceptions when the Shell command exits. For more information, see [The TiDB/TiKV/PD process aborts unexpectedly](/troubleshoot-tidb-cluster.md#the-tidbtikvpd-process-aborts-unexpectedly).
For the deployment and use of TiDB monitoring services, see [Monitor a TiDB Cluster](/monitor-a-tidb-cluster.md).
For the deployment and use of TiDB monitoring services, see [Deploy Monitoring Services for the TiDB Cluster](/deploy-monitoring-services.md) and [TiDB Monitoring API](/tidb-monitoring-api.md).
81 changes: 81 additions & 0 deletions tidb-monitoring-api.md
@@ -0,0 +1,81 @@
---
title: TiDB Monitoring API
summary: Learn the API of TiDB monitoring services.
category: how-to
---

# TiDB Monitoring API

You can use the following two types of interfaces to monitor the TiDB cluster state:

- [The state interface](#use-the-state-interface): this interface uses the HTTP interface to get the component information.
- [The metrics interface](#use-the-metrics-interface): this interface uses Prometheus to record the detailed information of the various operations in components, and uses Grafana to view these metrics.

## Use the state interface

The state interface monitors the basic information of a specific component in the TiDB cluster. It can also act as the monitor interface for Keepalive messages. In addition, the state interface for the Placement Driver (PD) can get the details of the entire TiKV cluster.

### TiDB server

- TiDB API address: `http://${host}:${port}`
- Default port: `10080`

The following example uses `http://${host}:${port}/status` to get the current state of the TiDB server and to determine whether the server is alive. The result is returned in JSON format.

```bash
curl http://127.0.0.1:10080/status
{
connections: 0, # The current number of clients connected to the TiDB server.
version: "5.7.25-TiDB-v3.0.0-beta-250-g778c3f4a5", # The TiDB version number.
git_hash: "778c3f4a5a716880bcd1d71b257c8165685f0d70" # The Git Hash of the current TiDB code.
}
```
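
The status endpoint lends itself to scripted health checks. The following is a small sketch of parsing the response in a shell script; it uses a captured sample instead of a live `curl` call so it runs anywhere, and the `sed`-based field extraction is an illustration, not part of TiDB:

```shell
# Sample /status response, captured from the example above (strictly quoted
# here so that standard JSON tools could also consume it).
response='{"connections": 0, "version": "5.7.25-TiDB-v3.0.0-beta-250-g778c3f4a5", "git_hash": "778c3f4a5a716880bcd1d71b257c8165685f0d70"}'

# Pull the version field out with sed (a jq-free approach).
version=$(printf '%s' "$response" | sed -n 's/.*"version": *"\([^"]*\)".*/\1/p')
echo "TiDB version: $version"
```

In a real check you would feed `curl http://127.0.0.1:10080/status` into the same pipeline, or simply test the HTTP status code, since any `200` response means the tidb-server is alive.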

### PD server

- PD API address: `http://${host}:${port}/pd/api/v1/${api_name}`
- Default port: `2379`
- Details about API names: see [PD API doc](https://download.pingcap.com/pd-api-v1.html)

The PD interface provides the state of all the TiKV servers and the information about load balancing. See the following example for the information about a single-node TiKV cluster:

```bash
curl http://127.0.0.1:2379/pd/api/v1/stores
{
"count": 1, # The number of TiKV nodes.
"stores": [ # The list of TiKV nodes.
# The details about the single TiKV node.
{
"store": {
"id": 1,
"address": "127.0.0.1:20160",
"version": "3.0.0-beta",
"state_name": "Up"
},
"status": {
"capacity": "20 GiB", # The total capacity.
"available": "16 GiB", # The available capacity.
"leader_count": 17,
"leader_weight": 1,
"leader_score": 17,
"leader_size": 17,
"region_count": 17,
"region_weight": 1,
"region_score": 17,
"region_size": 17,
"start_ts": "2019-03-21T14:09:32+08:00", # The starting timestamp.
"last_heartbeat_ts": "2019-03-21T14:14:22.961171958+08:00", # The timestamp of the last heartbeat.
"uptime": "4m50.961171958s"
}
}
    ]
}
```
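
A response like this can feed a simple availability check. The following is a hedged sketch that counts stores whose `state_name` is not `Up`; it runs against a trimmed two-store sample (invented for illustration) rather than a live PD:

```shell
# Trimmed sample of a /pd/api/v1/stores response; one store is healthy,
# one is not.
stores_json='{"count": 2, "stores": [{"store": {"id": 1, "state_name": "Up"}}, {"store": {"id": 4, "state_name": "Down"}}]}'

# Extract one state_name per line, then count the lines that are not "Up".
not_up=$(printf '%s\n' "$stores_json" \
  | grep -o '"state_name": *"[^"]*"' \
  | grep -vc '"Up"')
echo "stores not Up: $not_up"
```

Substituting `curl http://127.0.0.1:2379/pd/api/v1/stores` for the sample gives a one-liner suitable for a cron-driven alert.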
## Use the metrics interface

The metrics interface monitors the state and performance of the entire TiDB cluster.

- If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system (Prometheus and Grafana) is deployed at the same time.
- If you use other deployment ways, [deploy Prometheus and Grafana](/deploy-monitoring-services.md) before using this interface.

After Prometheus and Grafana are successfully deployed, [configure Grafana](/deploy-monitoring-services.md#configure-grafana).
25 changes: 24 additions & 1 deletion tidb-monitoring-framework.md
@@ -27,4 +27,27 @@ The diagram is as follows:

Grafana is an open source project for analyzing and visualizing metrics. TiDB uses Grafana to display the performance metrics as follows:

![screenshot](/media/grafana-screenshot.png)
![Grafana monitored_groups](/media/grafana-monitored-groups.png)

- {TiDB_Cluster_name}-Backup-Restore: Monitoring metrics related to backup and restore.
- {TiDB_Cluster_name}-Binlog: Monitoring metrics related to TiDB Binlog.
- {TiDB_Cluster_name}-Blackbox_exporter: Monitoring metrics related to network probes.
- {TiDB_Cluster_name}-Disk-Performance: Monitoring metrics related to disk performance.
- {TiDB_Cluster_name}-Kafka-Overview: Monitoring metrics related to Kafka.
- {TiDB_Cluster_name}-Lightning: Monitoring metrics related to TiDB Lightning.
- {TiDB_Cluster_name}-Node_exporter: Monitoring metrics related to the operating system.
- {TiDB_Cluster_name}-Overview: Monitoring overview related to important components.
- {TiDB_Cluster_name}-PD: Monitoring metrics related to the PD server.
- {TiDB_Cluster_name}-Performance-Read: Monitoring metrics related to read performance.
- {TiDB_Cluster_name}-Performance-Write: Monitoring metrics related to write performance.
- {TiDB_Cluster_name}-TiDB: Detailed monitoring metrics related to the TiDB server.
- {TiDB_Cluster_name}-TiDB-Summary: Monitoring overview related to TiDB.
- {TiDB_Cluster_name}-TiFlash-Proxy-Summary: Monitoring overview of the proxy server that is used to replicate data to TiFlash.
- {TiDB_Cluster_name}-TiFlash-Summary: Monitoring overview related to TiFlash.
- {TiDB_Cluster_name}-TiKV-Details: Detailed monitoring metrics related to the TiKV server.
- {TiDB_Cluster_name}-TiKV-Summary: Monitoring overview related to the TiKV server.
- {TiDB_Cluster_name}-TiKV-Trouble-Shooting: Monitoring metrics related to the TiKV error diagnostics.

Each group has multiple panel labels of monitoring metrics, and each panel contains detailed information of multiple monitoring metrics. For example, the **Overview** monitoring group has five panel labels, and each label corresponds to a monitoring panel. See the following UI:

![Grafana Overview](/media/grafana-monitor-overview.png)
