Add the jaeger-mixin for monitoring (#1668)
* Add the jaeger-mixin for monitoring

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Moved mixin to monitoring, removed Cassandra-specific alerts/panels

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
gouthamve authored and jpkrohling committed Aug 7, 2019
1 parent dcaea9d commit aa76cf0
Showing 6 changed files with 352 additions and 0 deletions.
95 changes: 95 additions & 0 deletions monitoring/jaeger-mixin/README.md
@@ -0,0 +1,95 @@
# Prometheus monitoring mixin for Jaeger

The Prometheus monitoring mixin for Jaeger provides a starting point for people wanting to monitor Jaeger using Prometheus, Alertmanager, and Grafana. To use it, you'll need [`jsonnet`](https://github.com/google/go-jsonnet) and [`jb` (jsonnet-bundler)](https://github.com/jsonnet-bundler/jsonnet-bundler). They can be installed using `go get`, as follows:

```console
$ go get github.com/google/go-jsonnet/cmd/jsonnet
$ go get github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb
```

Your monitoring mixin can then be initialized as follows:

```console
$ jb init
$ jb install \
  github.com/jaegertracing/jaeger/monitoring/jaeger-mixin@master \
  github.com/grafana/jsonnet-libs/grafana-builder@master \
  github.com/coreos/kube-prometheus/jsonnet/kube-prometheus@master
```

In the directory where your mixin was initialized, create a new `monitoring-setup.jsonnet` file describing what your monitoring stack should look like. This file is yours: any customizations to Prometheus, Grafana, or Alertmanager should take place here. A simple example providing only the Jaeger dashboard for Grafana would be:

```jsonnet
local jaegerDashboard = (import 'jaeger-mixin/mixin.libsonnet').grafanaDashboards;
{ ['dashboards-jaeger.json']: jaegerDashboard['jaeger.json'] }
```

The manifest files can be generated via the `jsonnet` command below. Once the command finishes, the file `manifests/dashboards-jaeger.json` should be available and can be loaded directly into Grafana.

```console
$ jsonnet -J vendor -cm manifests/ monitoring-setup.jsonnet
```
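
The same file can also emit the alert rules next to the dashboard. The following is a minimal sketch, assuming the `prometheusAlerts` field exposed by `mixin.libsonnet` can be written out as-is. The output name `jaeger-alerts.json` is an arbitrary choice, not something the mixin prescribes.

```jsonnet
// monitoring-setup.jsonnet (sketch): dashboard plus alert rules.
local mixin = import 'jaeger-mixin/mixin.libsonnet';

{
  // Grafana dashboard, as in the example above.
  'dashboards-jaeger.json': mixin.grafanaDashboards['jaeger.json'],
  // Alert rule groups. prometheusAlerts is a hidden field, so it has to be
  // referenced explicitly to appear in the generated output.
  'jaeger-alerts.json': mixin.prometheusAlerts,
}
```

Generated with the same `jsonnet -J vendor -cm manifests/ monitoring-setup.jsonnet` command, this should produce a second file with a top-level `groups` list. Because JSON is a subset of YAML, such a file can usually be referenced from the `rule_files` section of a Prometheus configuration, but verify this against your Prometheus version.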

An example producing the manifests for a complete monitoring stack is located in this directory, as `monitoring-setup.example.jsonnet`. The manifests include Prometheus, Grafana, and Alertmanager managed via the Prometheus Operator for Kubernetes.

```jsonnet
local jaegerAlerts = (import 'jaeger-mixin/alerts.libsonnet').prometheusAlerts;
local jaegerDashboard = (import 'jaeger-mixin/mixin.libsonnet').grafanaDashboards;
local kp =
  (import 'kube-prometheus/kube-prometheus.libsonnet') +
  {
    _config+:: {
      namespace: 'monitoring',
    },
    grafanaDashboards+:: {
      'jaeger.json': jaegerDashboard['jaeger.json'],
    },
    prometheusAlerts+:: jaegerAlerts,
  };
{ ['00namespace-' + name + '.json']: kp.kubePrometheus[name] for name in std.objectFields(kp.kubePrometheus) } +
{ ['0prometheus-operator-' + name + '.json']: kp.prometheusOperator[name] for name in std.objectFields(kp.prometheusOperator) } +
{ ['node-exporter-' + name + '.json']: kp.nodeExporter[name] for name in std.objectFields(kp.nodeExporter) } +
{ ['kube-state-metrics-' + name + '.json']: kp.kubeStateMetrics[name] for name in std.objectFields(kp.kubeStateMetrics) } +
{ ['alertmanager-' + name + '.json']: kp.alertmanager[name] for name in std.objectFields(kp.alertmanager) } +
{ ['prometheus-' + name + '.json']: kp.prometheus[name] for name in std.objectFields(kp.prometheus) } +
{ ['prometheus-adapter-' + name + '.json']: kp.prometheusAdapter[name] for name in std.objectFields(kp.prometheusAdapter) } +
{ ['grafana-' + name + '.json']: kp.grafana[name] for name in std.objectFields(kp.grafana) }
```

The manifest files can be generated via `jsonnet` and passed directly to `kubectl`:

```console
$ jsonnet -J vendor -cm manifests/ monitoring-setup.jsonnet
$ kubectl apply -f manifests/
```

The resulting manifests include everything needed to run Prometheus, Alertmanager, and Grafana instances. Whenever a new alert rule is needed, or a new dashboard has to be defined, change your `monitoring-setup.jsonnet`, then regenerate and reapply the manifests.
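
As an illustration of such a change, the sketch below appends one extra rule group to the groups shipped with the mixin. The group name, alert name, and threshold are hypothetical and only show the shape of the change; `jaeger_collector_queue_length` is the metric already used by the dashboard.

```jsonnet
local jaegerAlerts = (import 'jaeger-mixin/alerts.libsonnet').prometheusAlerts;

// Append a custom rule group to the groups defined by the mixin.
// The names and the threshold of 1000 are hypothetical placeholders.
jaegerAlerts {
  groups+: [{
    name: 'jaeger_custom_alerts',
    rules: [{
      alert: 'JaegerCollectorQueueBacklog',
      expr: 'max(jaeger_collector_queue_length) by (instance) > 1000',
      'for': '15m',
      labels: { severity: 'warning' },
      annotations: {
        message: 'collector {{ $labels.instance }} has {{ $value }} spans queued.',
      },
    }],
  }],
}
```

In the complete example, the result could be bound to a local (say, `customAlerts`) and passed as `prometheusAlerts+:: customAlerts` in place of `jaegerAlerts`. The `groups+:` syntax extends the inherited list rather than replacing it, so the mixin's own alerts are kept.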

Make sure your Prometheus setup is properly scraping the Jaeger components, either by creating a `ServiceMonitor` (and the backing `Service` objects), or via `PodMonitor` resources, like:

```console
$ kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: tracing
  namespace: monitoring
spec:
  podMetricsEndpoints:
  - interval: 5s
    targetPort: 14269
  selector:
    matchLabels:
      app: jaeger
EOF
```

This `PodMonitor` tells Prometheus to scrape port `14269` on all pods carrying the label `app: jaeger`. If the Jaeger Collector, Agent, and Query run in different pods, you might need to adjust this resource or create further `PodMonitor` resources to scrape metrics from the other ports.
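
When several `PodMonitor` resources are needed, they can be generated with Jsonnet as well. The sketch below is one possible approach, not part of the mixin: the `podMonitor` helper, the `app.kubernetes.io/component` label, and the Agent and Query port numbers are assumptions, so check them against the ports and labels your deployment actually uses.

```jsonnet
// Hypothetical helper: builds one PodMonitor for a given name, port and label set.
local podMonitor(name, port, labels) = {
  apiVersion: 'monitoring.coreos.com/v1',
  kind: 'PodMonitor',
  metadata: { name: name, namespace: 'monitoring' },
  spec: {
    podMetricsEndpoints: [{ interval: '5s', targetPort: port }],
    selector: { matchLabels: labels },
  },
};

// Component name -> metrics port. 14269 comes from the example above;
// the other two values are assumed and must be verified for your setup.
local components = {
  collector: 14269,
  agent: 14271,
  query: 16687,
};

// One output file per component, e.g. podmonitor-jaeger-collector.json.
{
  ['podmonitor-jaeger-' + name + '.json']: podMonitor(
    'jaeger-' + name,
    components[name],
    { app: 'jaeger', 'app.kubernetes.io/component': name }
  )
  for name in std.objectFields(components)
}
```

Saved as, for example, `podmonitors.jsonnet`, this can be rendered with `jsonnet -cm manifests/ podmonitors.jsonnet` and applied with `kubectl apply -f manifests/`, in the same way as the other manifests.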

This mixin was originally developed by [Grafana Labs](https://github.com/grafana/jsonnet-libs/tree/master/jaeger-mixin).

## Background

* For more information about monitoring mixins, see this [design doc](https://docs.google.com/document/d/1A9xvzwqnFVSOZ5fD3blKODXfsat5fg6ZhnKu9LK3lB4/view).
116 changes: 116 additions & 0 deletions monitoring/jaeger-mixin/alerts.libsonnet
@@ -0,0 +1,116 @@
local percentErrs(metric, errSelectors) = '100 * sum(rate(%(metric)s{%(errSelectors)s}[1m])) by (instance, job, namespace) / sum(rate(%(metric)s[1m])) by (instance, job, namespace)' % {
  metric: metric,
  errSelectors: errSelectors,
};

local percentErrsWithTotal(metric_errs, metric_total) = '100 * sum(rate(%(metric_errs)s[1m])) by (instance, job, namespace) / sum(rate(%(metric_total)s[1m])) by (instance, job, namespace)' % {
  metric_errs: metric_errs,
  metric_total: metric_total,
};

{
  prometheusAlerts+:: {
    groups+: [
      {
        name: 'jaeger_alerts',
        rules: [{
          alert: 'JaegerHTTPServerErrs',
          expr: percentErrsWithTotal('jaeger_agent_http_server_errors_total', 'jaeger_agent_http_server_total') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is experiencing {{ printf "%.2f" $value }}% HTTP errors.
            |||,
          },
        }, {
          alert: 'JaegerRPCRequestsErrors',
          expr: percentErrs('jaeger_client_jaeger_rpc_http_requests', 'status_code=~"4xx|5xx"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is experiencing {{ printf "%.2f" $value }}% RPC HTTP errors.
            |||,
          },
        }, {
          alert: 'JaegerClientSpansDropped',
          expr: percentErrs('jaeger_reporter_spans', 'result=~"dropped|err"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              service {{ $labels.job }} {{ $labels.instance }} is dropping {{ printf "%.2f" $value }}% spans.
            |||,
          },
        }, {
          alert: 'JaegerAgentSpansDropped',
          expr: percentErrsWithTotal('jaeger_agent_reporter_batches_failures_total', 'jaeger_agent_reporter_batches_submitted_total') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              agent {{ $labels.job }} {{ $labels.instance }} is dropping {{ printf "%.2f" $value }}% spans.
            |||,
          },
        }, {
          alert: 'JaegerCollectorDroppingSpans',
          expr: percentErrsWithTotal('jaeger_collector_spans_dropped_total', 'jaeger_collector_spans_received_total') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              collector {{ $labels.job }} {{ $labels.instance }} is dropping {{ printf "%.2f" $value }}% spans.
            |||,
          },
        }, {
          alert: 'JaegerSamplingUpdateFailing',
          expr: percentErrs('jaeger_sampler_queries', 'result="err"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is failing {{ printf "%.2f" $value }}% in updating sampling policies.
            |||,
          },
        }, {
          alert: 'JaegerThrottlingUpdateFailing',
          expr: percentErrs('jaeger_throttler_updates', 'result="err"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is failing {{ printf "%.2f" $value }}% in updating throttling policies.
            |||,
          },
        }, {
          alert: 'JaegerQueryReqsFailing',
          expr: percentErrs('jaeger_query_requests_total', 'result="err"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is seeing {{ printf "%.2f" $value }}% query errors on {{ $labels.operation }}.
            |||,
          },
        }],
      },
    ],
  },
}
102 changes: 102 additions & 0 deletions monitoring/jaeger-mixin/dashboards.libsonnet
@@ -0,0 +1,102 @@
local g = (import 'grafana-builder/grafana.libsonnet') + {
  qpsPanelErrTotal(selectorErr, selectorTotal):: {
    local expr(selector) = 'sum(rate(' + selector + '[1m]))',

    aliasColors: {
      success: '#7EB26D',
      'error': '#E24D42',
    },
    targets: [
      {
        expr: expr(selectorErr),
        format: 'time_series',
        intervalFactor: 2,
        legendFormat: 'error',
        refId: 'A',
        step: 10,
      },
      {
        expr: expr(selectorTotal) + ' - ' + expr(selectorErr),
        format: 'time_series',
        intervalFactor: 2,
        legendFormat: 'success',
        refId: 'B',
        step: 10,
      },
    ],
  } + $.stack,
};

{
  grafanaDashboards+: {
    'jaeger.json':
      g.dashboard('Jaeger')
      .addRow(
        g.row('Services')
        .addPanel(
          g.panel('span creation rate') +
          g.qpsPanelErrTotal('jaeger_reporter_spans{result=~"dropped|err"}', 'jaeger_reporter_spans') +
          g.stack
        )
        .addPanel(
          g.panel('% spans dropped') +
          g.queryPanel('sum(rate(jaeger_reporter_spans{result=~"dropped|err"}[1m])) by (namespace) / sum(rate(jaeger_reporter_spans[1m])) by (namespace)', '{{namespace}}') +
          { yaxes: g.yaxes({ format: 'percentunit', max: 1 }) } +
          g.stack
        )
      )
      .addRow(
        g.row('Agent')
        .addPanel(
          g.panel('batch ingest rate') +
          g.qpsPanelErrTotal('jaeger_agent_reporter_batches_failures_total', 'jaeger_agent_reporter_batches_submitted_total') +
          g.stack
        )
        .addPanel(
          g.panel('% batches dropped') +
          g.queryPanel('sum(rate(jaeger_agent_reporter_batches_failures_total[1m])) by (cluster) / sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (cluster)', '{{cluster}}') +
          { yaxes: g.yaxes({ format: 'percentunit', max: 1 }) } +
          g.stack
        )
      )
      .addRow(
        g.row('Collector')
        .addPanel(
          g.panel('span ingest rate') +
          g.qpsPanelErrTotal('jaeger_collector_spans_dropped_total', 'jaeger_collector_spans_received_total') +
          g.stack
        )
        .addPanel(
          g.panel('% spans dropped') +
          g.queryPanel('sum(rate(jaeger_collector_spans_dropped_total[1m])) by (instance) / sum(rate(jaeger_collector_spans_received_total[1m])) by (instance)', '{{instance}}') +
          { yaxes: g.yaxes({ format: 'percentunit', max: 1 }) } +
          g.stack
        )
      )
      .addRow(
        g.row('Collector Queue')
        .addPanel(
          g.panel('span queue length') +
          g.queryPanel('jaeger_collector_queue_length', '{{instance}}') +
          g.stack
        )
        .addPanel(
          g.panel('span queue time - 95 percentile') +
          g.queryPanel('histogram_quantile(0.95, sum(rate(jaeger_collector_in_queue_latency_bucket[1m])) by (le, instance))', '{{instance}}')
        )
      )
      .addRow(
        g.row('Query')
        .addPanel(
          g.panel('qps') +
          g.qpsPanelErrTotal('jaeger_query_requests_total{result="err"}', 'jaeger_query_requests_total') +
          g.stack
        )
        .addPanel(
          g.panel('latency - 99 percentile') +
          g.queryPanel('histogram_quantile(0.99, sum(rate(jaeger_query_latency_bucket[1m])) by (le, instance))', '{{instance}}') +
          g.stack
        )
      ),
  },
}
14 changes: 14 additions & 0 deletions monitoring/jaeger-mixin/jsonnetfile.json
@@ -0,0 +1,14 @@
{
  "dependencies": [
    {
      "name": "grafana-builder",
      "source": {
        "git": {
          "remote": "https://github.com/grafana/jsonnet-libs",
          "subdir": "grafana-builder"
        }
      },
      "version": "master"
    }
  ]
}
2 changes: 2 additions & 0 deletions monitoring/jaeger-mixin/mixin.libsonnet
@@ -0,0 +1,2 @@
(import 'dashboards.libsonnet') +
(import 'alerts.libsonnet')
23 changes: 23 additions & 0 deletions monitoring/jaeger-mixin/monitoring-setup.example.jsonnet
@@ -0,0 +1,23 @@
local jaegerAlerts = (import 'jaeger-mixin/alerts.libsonnet').prometheusAlerts;
local jaegerDashboard = (import 'jaeger-mixin/mixin.libsonnet').grafanaDashboards;

local kp =
  (import 'kube-prometheus/kube-prometheus.libsonnet') +
  {
    _config+:: {
      namespace: 'monitoring',
    },
    grafanaDashboards+:: {
      'jaeger.json': jaegerDashboard['jaeger.json'],
    },
    prometheusAlerts+:: jaegerAlerts,
  };

{ ['00namespace-' + name + '.json']: kp.kubePrometheus[name] for name in std.objectFields(kp.kubePrometheus) } +
{ ['0prometheus-operator-' + name + '.json']: kp.prometheusOperator[name] for name in std.objectFields(kp.prometheusOperator) } +
{ ['node-exporter-' + name + '.json']: kp.nodeExporter[name] for name in std.objectFields(kp.nodeExporter) } +
{ ['kube-state-metrics-' + name + '.json']: kp.kubeStateMetrics[name] for name in std.objectFields(kp.kubeStateMetrics) } +
{ ['alertmanager-' + name + '.json']: kp.alertmanager[name] for name in std.objectFields(kp.alertmanager) } +
{ ['prometheus-' + name + '.json']: kp.prometheus[name] for name in std.objectFields(kp.prometheus) } +
{ ['prometheus-adapter-' + name + '.json']: kp.prometheusAdapter[name] for name in std.objectFields(kp.prometheusAdapter) } +
{ ['grafana-' + name + '.json']: kp.grafana[name] for name in std.objectFields(kp.grafana) }
