docs: Guide for configuring and accessing metrics
This is a general rewrite of the Metrics page.

The page is moved from the "Concepts" section to "Installation and
Configuration": it barely touches on the concept of metrics, but it does guide
the user through the metrics configuration, so it makes more sense there.

The page covers:
* The purpose of metrics and a link to the metrics reference
* How to enable/disable metrics in Kubernetes and non-Kubernetes deployments
* How to verify that metrics are exposed
* How to configure labels on events metrics
* How to enable ServiceMonitor and scrape metrics

Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
lambdanis committed May 10, 2024
1 parent 048b164 commit e435da7
Showing 3 changed files with 119 additions and 103 deletions.
102 changes: 0 additions & 102 deletions docs/content/en/docs/concepts/metrics.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/content/en/docs/installation/configuration.md
@@ -1,7 +1,7 @@
---
title: "Configure Tetragon"
linkTitle: "Configuration"
weight: 5
weight: 6
---

Depending on your deployment mode, Tetragon configuration can be changed by:
118 changes: 118 additions & 0 deletions docs/content/en/docs/installation/metrics.md
@@ -0,0 +1,118 @@
---
title: "Metrics"
weight: 7
description: "Learn how to configure and access Prometheus metrics."
aliases: ["/docs/concepts/metrics"]
---

Tetragon exposes a number of Prometheus metrics that can be used for two main purposes:

1. Monitoring the health of Tetragon itself
2. Monitoring the activity of processes observed by Tetragon

For the full list, refer to [metrics reference]({{< ref "/docs/reference/metrics" >}}).

## Enable/Disable Metrics

### Kubernetes

In a [Kubernetes installation]({{< ref "/docs/installation/kubernetes" >}}), metrics are enabled by default and exposed
via the `tetragon` service at the `/metrics` endpoint on port `2112`.
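
You can confirm the service and its metrics port with `kubectl` (a quick check, assuming the default `kube-system`
installation namespace):

```shell
kubectl -n kube-system get svc tetragon
```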

You can change the port via Helm values:

```yaml
tetragon:
prometheus:
port: 2222 # default is 2112
```
Or entirely disable the metrics server:
```yaml
tetragon:
prometheus:
enabled: false # default is true
```
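
Either change can be applied with `helm upgrade`. A minimal sketch, assuming Tetragon was installed as the `tetragon`
release from the `cilium` Helm repository into the `kube-system` namespace:

```shell
helm upgrade tetragon cilium/tetragon -n kube-system --set tetragon.prometheus.port=2222
```
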
### Non-Kubernetes
In a non-Kubernetes installation, metrics are disabled by default. You can enable them by setting the metrics server
address, for example `:2112`, via the `--metrics-server` flag.

If using [systemd]({{< ref "/docs/installation/package" >}}), set the `metrics-address` entry in a file under the
`/etc/tetragon/tetragon.conf.d/` directory.
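
For example, a minimal sketch of such a drop-in, assuming the drop-in file is named after the setting, its content is
the value, and the systemd unit is called `tetragon`:

```shell
# Hypothetical drop-in: file name is the setting, file content is its value.
echo ":2112" | sudo tee /etc/tetragon/tetragon.conf.d/metrics-address
sudo systemctl restart tetragon
```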

## Verify that metrics are exposed

To verify that the metrics server has started, check the logs of the Tetragon Agent.
In Kubernetes, run:

```shell
kubectl -n kube-system logs ds/tetragon
```

The logs should contain a line similar to the following:
```
time="2023-09-22T23:16:24+05:30" level=info msg="Starting metrics server" addr="localhost:2112"
```

To see what metrics are exposed, you can access the metrics endpoint directly.
In Kubernetes, forward the metrics port:

```shell
kubectl -n kube-system port-forward svc/tetragon 2112:2112
```

Access the `localhost:2112/metrics` endpoint either in a browser or using a tool like `curl`.
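
For example, with the port-forward from the previous step still running:

```shell
curl -s localhost:2112/metrics
```
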
You should see a list of metrics similar to the following:
```
# HELP promhttp_metric_handler_errors_total Total number of internal errors encountered by the promhttp metric handler.
# TYPE promhttp_metric_handler_errors_total counter
promhttp_metric_handler_errors_total{cause="encoding"} 0
promhttp_metric_handler_errors_total{cause="gathering"} 0
# HELP tetragon_errors_total The total number of Tetragon errors. For internal use only.
# TYPE tetragon_errors_total counter
[...]
```

## Configure labels on events metrics

Depending on the workloads running in the environment, [Events Metrics]({{< ref "/docs/reference/metrics#tetragon-events-metrics" >}})
may have very high cardinality. This is particularly likely in Kubernetes environments, where each pod creates
a separate timeseries. To avoid overwhelming Prometheus, Tetragon provides an option to choose which labels are
populated in these metrics.

You can configure the labels via Helm values or the `--metrics-label-filter` flag. Set the value to a comma-separated
list of enabled labels:

```yaml
tetragon:
prometheus:
metricsLabelFilter: "namespace,workload,binary" # "pod" label is disabled
```
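
Outside Kubernetes, the same comma-separated list can be passed on the command line. A sketch, assuming metrics are
served on `:2112`:

```shell
tetragon --metrics-server :2112 --metrics-label-filter namespace,workload,binary
```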

## Scrape metrics

Typically, metrics are scraped by Prometheus or another compatible agent (for example the OpenTelemetry Collector),
stored in Prometheus or another compatible database, and then queried and visualized, for example using Grafana.

In Kubernetes, you can install Prometheus and Grafana using the `kube-prometheus-stack` Helm chart:

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
```

The `kube-prometheus-stack` Helm chart includes [Prometheus Operator](https://prometheus-operator.dev/), which allows
you to configure Prometheus via Kubernetes custom resources. Tetragon comes with a default `ServiceMonitor` resource
containing the scrape configuration. You can enable it via Helm values:

```yaml
tetragon:
prometheus:
serviceMonitor:
enabled: true
```
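
To confirm that Prometheus discovers the new target, you can port-forward the Prometheus service created by the chart
and query one of Tetragon's metrics. This is a sketch: the service name below assumes the `kube-prometheus-stack`
release name used above.

```shell
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090
# In another terminal, query a Tetragon metric through the Prometheus HTTP API:
curl -s 'http://localhost:9090/api/v1/query?query=tetragon_errors_total'
```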
