add 3 monitoring documents (pingcap#2791)
* add 2 monitoring documents

* Update TOC.md

* align pingcap#3195

* fix ci

* address comments from lilian

Co-authored-by: Lilian Lee <lilin@pingcap.com>

* remove extra information

Co-authored-by: Lilian Lee <lilin@pingcap.com>
TomShawn and lilin90 authored Jun 17, 2020
1 parent bf258db commit 81d016b
Showing 10 changed files with 126 additions and 92 deletions.
5 changes: 3 additions & 2 deletions TOC.md
@@ -73,8 +73,9 @@
+ [Maintain TiDB Using TiUP](/maintain-tidb-using-tiup.md)
+ [Maintain TiDB Using Ansible](/maintain-tidb-using-ansible.md)
+ Monitor and Alert
+ [Monitoring Framework](/tidb-monitoring-framework.md)
+ [Monitor a TiDB Cluster](/monitor-a-tidb-cluster.md)
+ [Monitoring Framework Overview](/tidb-monitoring-framework.md)
+ [Monitoring API](/tidb-monitoring-api.md)
+ [Deploy Monitoring Services](/deploy-monitoring-services.md)
+ [TiDB Cluster Alert Rules](/alert-rules.md)
+ [TiFlash Alert Rules](/tiflash/tiflash-alert-rules.md)
+ Troubleshoot
101 changes: 15 additions & 86 deletions monitor-a-tidb-cluster.md → deploy-monitoring-services.md
@@ -1,88 +1,17 @@
---
title: Monitor a TiDB Cluster
summary: Learn how to monitor the state of a TiDB cluster.
title: Deploy Monitoring Services for the TiDB Cluster
summary: Learn how to deploy monitoring services for the TiDB cluster.
category: how-to
aliases: ['/docs/dev/how-to/monitor/monitor-a-cluster/']
aliases: ['/docs/dev/how-to/monitor/monitor-a-cluster/','/docs/dev/monitor-a-tidb-cluster/']
---

# Monitor a TiDB Cluster
# Deploy Monitoring Services for the TiDB Cluster

You can use the following two types of interfaces to monitor the TiDB cluster state:
This document is intended for users who want to manually deploy TiDB monitoring and alert services.

- [The state interface](#use-the-state-interface): this interface uses the HTTP interface to get the component information.
- [The metrics interface](#use-the-metrics-interface): this interface uses Prometheus to record the detailed information of the various operations in components, and uses Grafana to view these metrics.
If you deploy the TiDB cluster using TiUP, the monitoring and alert services are automatically deployed, and no manual deployment is needed.

## Use the state interface

The state interface monitors the basic information of a specific component in the TiDB cluster. It can also act as the monitor interface for Keepalive messages. In addition, the state interface for the Placement Driver (PD) can get the details of the entire TiKV cluster.

### TiDB server

- TiDB API address: `http://${host}:${port}`
- Default port: `10080`
- Details about API names: see [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md)

The following example uses `http://${host}:${port}/status` to get the current state of the TiDB server and to determine whether the server is alive. The result is returned in JSON format.

```bash
curl http://127.0.0.1:10080/status
{
connections: 0, # The current number of clients connected to the TiDB server.
version: "5.7.25-TiDB-v3.0.0-beta-250-g778c3f4a5", # The TiDB version number.
git_hash: "778c3f4a5a716880bcd1d71b257c8165685f0d70" # The Git Hash of the current TiDB code.
}
```

### PD server

- PD API address: `http://${host}:${port}/pd/api/v1/${api_name}`
- Default port: `2379`
- Details about API names: see [PD API doc](https://download.pingcap.com/pd-api-v1.html)

The PD interface provides the state of all the TiKV servers and the information about load balancing. See the following example for the information about a single-node TiKV cluster:

```bash
curl http://127.0.0.1:2379/pd/api/v1/stores
{
"count": 1, # The number of TiKV nodes.
"stores": [ # The list of TiKV nodes.
# The details about the single TiKV node.
{
"store": {
"id": 1,
"address": "127.0.0.1:20160",
"version": "3.0.0-beta",
"state_name": "Up"
},
"status": {
"capacity": "20 GiB", # The total capacity.
"available": "16 GiB", # The available capacity.
"leader_count": 17,
"leader_weight": 1,
"leader_score": 17,
"leader_size": 17,
"region_count": 17,
"region_weight": 1,
"region_score": 17,
"region_size": 17,
"start_ts": "2019-03-21T14:09:32+08:00", # The starting timestamp.
"last_heartbeat_ts": "2019-03-21T14:14:22.961171958+08:00", # The timestamp of the last heartbeat.
"uptime": "4m50.961171958s"
}
}
    ]
}
```
## Use the metrics interface

The metrics interface monitors the state and performance of the entire TiDB cluster.

- If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system (Prometheus and Grafana) is deployed at the same time.
- If you use other deployment ways, [deploy Prometheus and Grafana](#deploy-prometheus-and-grafana) before using this interface.

After Prometheus and Grafana are successfully deployed, [configure Grafana](#configure-grafana).
### Deploy Prometheus and Grafana
## Deploy Prometheus and Grafana

Assume that the TiDB cluster topology is as follows:

@@ -95,7 +24,7 @@ Assume that the TiDB cluster topology is as follows:
| Node5 | 192.168.199.117| TiKV2, node_export |
| Node6 | 192.168.199.118| TiKV3, node_export |

#### Step 1: Download the binary package
### Step 1: Download the binary package

{{< copyable "shell-regular" >}}

@@ -115,7 +44,7 @@ tar -xzf node_exporter-0.17.0.linux-amd64.tar.gz
tar -xzf grafana-6.1.6.linux-amd64.tar.gz
```

#### Step 2: Start `node_exporter` on Node1, Node2, Node3, and Node4
### Step 2: Start `node_exporter` on Node1, Node2, Node3, and Node4

{{< copyable "shell-regular" >}}

@@ -127,7 +56,7 @@ $ ./node_exporter --web.listen-address=":9100" \
--log.level="info" &
```
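
Backgrounding with `&` as above is fine for a trial run, but the exporter then dies with the shell session. One common alternative is a systemd unit; the following is a minimal sketch, with the unit path and install directory assumed rather than taken from this commit:

```ini
# /etc/systemd/system/node_exporter.service (hypothetical path)
[Unit]
Description=node_exporter
After=network.target

[Service]
User=tidb
ExecStart=/home/tidb/node_exporter-0.17.0.linux-amd64/node_exporter \
    --web.listen-address=":9100" --log.level="info"
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After placing the file, `systemctl daemon-reload` followed by `systemctl enable --now node_exporter` starts the exporter and keeps it running across reboots.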

#### Step 3: Start Prometheus on Node1
### Step 3: Start Prometheus on Node1

Edit the Prometheus configuration file:

@@ -200,7 +129,7 @@ $ ./prometheus \
--storage.tsdb.retention="15d" &
```
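
The `prometheus.yml` edit itself is collapsed in this diff. A minimal sketch of what a scrape configuration for the topology above might look like follows; the job names, the default metrics ports, and the Node1 address are assumptions, so substitute your own values:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'tidb'
    honor_labels: true
    static_configs:
      - targets: ['192.168.199.113:10080']   # TiDB status port (assumed Node1 address)
  - job_name: 'pd'
    honor_labels: true
    static_configs:
      - targets: ['192.168.199.113:2379']    # PD client port
  - job_name: 'tikv'
    honor_labels: true
    static_configs:
      - targets: ['192.168.199.117:20180', '192.168.199.118:20180']   # TiKV status port
  - job_name: 'node_exporter'
    honor_labels: true
    static_configs:
      - targets: ['192.168.199.117:9100', '192.168.199.118:9100']
```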

#### Step 4: Start Grafana on Node1
### Step 4: Start Grafana on Node1

Edit the Grafana configuration file:

@@ -259,11 +188,11 @@ $ ./bin/grafana-server \
--config="./conf/grafana.ini" &
```
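
As with Prometheus, the `grafana.ini` edit is collapsed in this diff; a sketch of the fields most commonly adjusted for this kind of deployment, with every value an assumption:

```ini
; conf/grafana.ini (hypothetical excerpt)
[paths]
data = ./data
logs = ./data/log

[server]
http_port = 3000
domain = 192.168.199.113   ; assumed Node1 address
```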

### Configure Grafana
## Configure Grafana

This section describes how to configure Grafana.

#### Step 1: Add a Prometheus data source
### Step 1: Add a Prometheus data source

1. Log in to the Grafana Web interface.

@@ -288,7 +217,7 @@ This section describes how to configure Grafana.

5. Click **Add** to save the new data source.
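
Where clicking through the UI is impractical (for example, scripted setups), Grafana 5.0 and later can also load a data source from a provisioning file at startup. A minimal sketch, with the file path and all field values as assumptions:

```yaml
# conf/provisioning/datasources/tidb-cluster.yml (hypothetical path)
apiVersion: 1
datasources:
  - name: tidb-cluster
    type: prometheus
    access: proxy
    url: http://192.168.199.113:9090   # assumed Prometheus address on Node1
    isDefault: true
```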

#### Step 2: Import a Grafana dashboard
### Step 2: Import a Grafana dashboard

To import a Grafana dashboard for the PD server, the TiKV server, and the TiDB server, take the following steps respectively:

@@ -308,7 +237,7 @@ To import a Grafana dashboard for the PD server, the TiKV server, and the TiDB s

6. Click **Import**. A Prometheus dashboard is imported.

### View component metrics
## View component metrics

Click **New dashboard** in the top menu and choose the dashboard you want to view.

2 changes: 1 addition & 1 deletion grafana-overview-dashboard.md
@@ -75,4 +75,4 @@ System Info | IO Util | the disk usage ratio, 100% at a maximum; generally you n

## Interface of the Overview dashboard

![Overview Dashboard](/media/overview.png)
![overview](/media/grafana-monitor-overview.png)
Binary file added media/grafana-monitor-overview.png
Binary file added media/grafana-monitored-groups.png
Binary file removed media/overview.png
2 changes: 1 addition & 1 deletion mysql-compatibility.md
@@ -76,7 +76,7 @@ mysql> select _tidb_rowid, id from t;

### Performance schema

TiDB uses a combination of Prometheus and Grafana to store and query the performance monitoring metrics. Some performance schema tables return empty results in TiDB.
TiDB uses a combination of [Prometheus and Grafana](/tidb-monitoring-api.md) to store and query the performance monitoring metrics. Some performance schema tables return empty results in TiDB.

### Query Execution Plan

2 changes: 1 addition & 1 deletion production-deployment-from-binary-tarball.md
@@ -205,4 +205,4 @@ Follow the steps below to start PD, TiKV, and TiDB:
> - To tune TiKV, see [Performance Tuning for TiKV](/tune-tikv-performance.md).
> - If you use `nohup` to start the cluster in the production environment, write the startup commands in a script and then run the script. If not, the `nohup` process might abort because it receives exceptions when the Shell command exits. For more information, see [The TiDB/TiKV/PD process aborts unexpectedly](/troubleshoot-tidb-cluster.md#the-tidbtikvpd-process-aborts-unexpectedly).
For the deployment and use of TiDB monitoring services, see [Monitor a TiDB Cluster](/monitor-a-tidb-cluster.md).
For the deployment and use of TiDB monitoring services, see [Deploy Monitoring Services for the TiDB Cluster](/deploy-monitoring-services.md) and [TiDB Monitoring API](/tidb-monitoring-api.md).
81 changes: 81 additions & 0 deletions tidb-monitoring-api.md
@@ -0,0 +1,81 @@
---
title: TiDB Monitoring API
summary: Learn the API of TiDB monitoring services.
category: how-to
---

# TiDB Monitoring API

You can use the following two types of interfaces to monitor the TiDB cluster state:

- [The state interface](#use-the-state-interface): this interface uses the HTTP interface to get the component information.
- [The metrics interface](#use-the-metrics-interface): this interface uses Prometheus to record the detailed information of the various operations in components, and uses Grafana to view these metrics.

## Use the state interface

The state interface monitors the basic information of a specific component in the TiDB cluster. It can also act as the monitor interface for Keepalive messages. In addition, the state interface for the Placement Driver (PD) can get the details of the entire TiKV cluster.

### TiDB server

- TiDB API address: `http://${host}:${port}`
- Default port: `10080`

The following example uses `http://${host}:${port}/status` to get the current state of the TiDB server and to determine whether the server is alive. The result is returned in JSON format.

```bash
curl http://127.0.0.1:10080/status
{
connections: 0, # The current number of clients connected to the TiDB server.
version: "5.7.25-TiDB-v3.0.0-beta-250-g778c3f4a5", # The TiDB version number.
git_hash: "778c3f4a5a716880bcd1d71b257c8165685f0d70" # The Git Hash of the current TiDB code.
}
```
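
The status endpoint lends itself to scripted health checks. The following is a small sketch of parsing the response in a shell script; it uses a captured sample instead of a live `curl` call so it runs anywhere, and the `sed`-based field extraction is an illustration, not part of TiDB:

```shell
# Sample /status response, captured from the example above (strictly quoted
# here so that standard JSON tools could also consume it).
response='{"connections": 0, "version": "5.7.25-TiDB-v3.0.0-beta-250-g778c3f4a5", "git_hash": "778c3f4a5a716880bcd1d71b257c8165685f0d70"}'

# Pull the version field out with sed (a jq-free approach).
version=$(printf '%s' "$response" | sed -n 's/.*"version": *"\([^"]*\)".*/\1/p')
echo "TiDB version: $version"
```

In a real check you would feed `curl http://127.0.0.1:10080/status` into the same pipeline, or simply test the HTTP status code, since any `200` response means the tidb-server is alive.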

### PD server

- PD API address: `http://${host}:${port}/pd/api/v1/${api_name}`
- Default port: `2379`
- Details about API names: see [PD API doc](https://download.pingcap.com/pd-api-v1.html)

The PD interface provides the state of all the TiKV servers and the information about load balancing. See the following example for the information about a single-node TiKV cluster:

```bash
curl http://127.0.0.1:2379/pd/api/v1/stores
{
"count": 1, # The number of TiKV nodes.
"stores": [ # The list of TiKV nodes.
# The details about the single TiKV node.
{
"store": {
"id": 1,
"address": "127.0.0.1:20160",
"version": "3.0.0-beta",
"state_name": "Up"
},
"status": {
"capacity": "20 GiB", # The total capacity.
"available": "16 GiB", # The available capacity.
"leader_count": 17,
"leader_weight": 1,
"leader_score": 17,
"leader_size": 17,
"region_count": 17,
"region_weight": 1,
"region_score": 17,
"region_size": 17,
"start_ts": "2019-03-21T14:09:32+08:00", # The starting timestamp.
"last_heartbeat_ts": "2019-03-21T14:14:22.961171958+08:00", # The timestamp of the last heartbeat.
"uptime": "4m50.961171958s"
}
}
    ]
}
```
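
A response like this can feed a simple availability check. The following is a hedged sketch that counts stores whose `state_name` is not `Up`; it runs against a trimmed two-store sample (invented for illustration) rather than a live PD:

```shell
# Trimmed sample of a /pd/api/v1/stores response; one store is healthy,
# one is not.
stores_json='{"count": 2, "stores": [{"store": {"id": 1, "state_name": "Up"}}, {"store": {"id": 4, "state_name": "Down"}}]}'

# Extract one state_name per line, then count the lines that are not "Up".
not_up=$(printf '%s\n' "$stores_json" \
  | grep -o '"state_name": *"[^"]*"' \
  | grep -vc '"Up"')
echo "stores not Up: $not_up"
```

Substituting `curl http://127.0.0.1:2379/pd/api/v1/stores` for the sample gives a one-liner suitable for a cron-driven alert.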
## Use the metrics interface

The metrics interface monitors the state and performance of the entire TiDB cluster.

- If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system (Prometheus and Grafana) is deployed at the same time.
- If you use other deployment ways, [deploy Prometheus and Grafana](/deploy-monitoring-services.md) before using this interface.

After Prometheus and Grafana are successfully deployed, [configure Grafana](/deploy-monitoring-services.md#configure-grafana).
25 changes: 24 additions & 1 deletion tidb-monitoring-framework.md
@@ -27,4 +27,27 @@ The diagram is as follows:

Grafana is an open source project for analyzing and visualizing metrics. TiDB uses Grafana to display the performance metrics as follows:

![screenshot](/media/grafana-screenshot.png)
![Grafana monitored_groups](/media/grafana-monitored-groups.png)

- {TiDB_Cluster_name}-Backup-Restore: Monitoring metrics related to backup and restore.
- {TiDB_Cluster_name}-Binlog: Monitoring metrics related to TiDB Binlog.
- {TiDB_Cluster_name}-Blackbox_exporter: Monitoring metrics related to network probes.
- {TiDB_Cluster_name}-Disk-Performance: Monitoring metrics related to disk performance.
- {TiDB_Cluster_name}-Kafka-Overview: Monitoring metrics related to Kafka.
- {TiDB_Cluster_name}-Lightning: Monitoring metrics related to TiDB Lightning.
- {TiDB_Cluster_name}-Node_exporter: Monitoring metrics related to the operating system.
- {TiDB_Cluster_name}-Overview: Monitoring overview related to important components.
- {TiDB_Cluster_name}-PD: Monitoring metrics related to the PD server.
- {TiDB_Cluster_name}-Performance-Read: Monitoring metrics related to read performance.
- {TiDB_Cluster_name}-Performance-Write: Monitoring metrics related to write performance.
- {TiDB_Cluster_name}-TiDB: Detailed monitoring metrics related to the TiDB server.
- {TiDB_Cluster_name}-TiDB-Summary: Monitoring overview related to TiDB.
- {TiDB_Cluster_name}-TiFlash-Proxy-Summary: Monitoring overview of the proxy server that is used to replicate data to TiFlash.
- {TiDB_Cluster_name}-TiFlash-Summary: Monitoring overview related to TiFlash.
- {TiDB_Cluster_name}-TiKV-Details: Detailed monitoring metrics related to the TiKV server.
- {TiDB_Cluster_name}-TiKV-Summary: Monitoring overview related to the TiKV server.
- {TiDB_Cluster_name}-TiKV-Trouble-Shooting: Monitoring metrics related to the TiKV error diagnostics.

Each group has multiple panel labels of monitoring metrics, and each panel contains detailed information of multiple monitoring metrics. For example, the **Overview** monitoring group has five panel labels, and each label corresponds to a monitoring panel. See the following UI:

![Grafana Overview](/media/grafana-monitor-overview.png)
