fix(docs): combine patterns into queries doc

ssube · Feb 14, 2020 · efb91ef · efb91ef
1 parent 16001d8
commit efb91ef
Show file tree

Hide file tree

Showing 4 changed files with 159 additions and 91 deletions.
diff --git a/docs/alert.md b/docs/alert.md
diff --git a/docs/dashboards.md b/docs/dashboards.md
@@ -0,0 +1,19 @@
+# Dashboards
+
+## How To Use
+
+### Release Annotations
+
+### Variables
+
+**Note:** variables cannot be used with alerts. Combining the two would require a mapping of variable
+values to their appropriate thresholds and, combined with repeated panels, could cause an explosion of
+alert queries on the backend.
+
+## Best Practices
+
+- export and commit dashboards often
+- limit the number of panels per dashboard to 8-12
+  - each panel is a JSON request and browsers will only make 6-8 requests per hostname. Large dashboards
+    will spin as panels wait to load, and slow panels may block faster ones.
+  - with 12 panels and a 30 second query timeout, dashboards should take no longer than 60 seconds to load.
diff --git a/docs/patterns.md b/docs/patterns.md
diff --git a/docs/query.md → docs/queries.md b/docs/query.md → docs/queries.md
@@ -1,10 +1,36 @@
-# Getting Started
+# Queries
 
 ## Contents
 
-- Where did that plugin go?
-
-## Basics
+- [Queries](#queries)
+  - [Contents](#contents)
+  - [Getting Started](#getting-started)
+    - [Sampling a Single Series](#sampling-a-single-series)
+    - [Filtering By Labels](#filtering-by-labels)
+    - [Calculating a Delta](#calculating-a-delta)
+    - [Joining Multiple Timeseries](#joining-multiple-timeseries)
+  - [Details](#details)
+    - [Labels Table](#labels-table)
+    - [Samples Table](#samples-table)
+    - [Metrics View](#metrics-view)
+  - [Best Practices](#best-practices)
+    - [Continuous Aggregate Patterns](#continuous-aggregate-patterns)
+      - [Aggregate Intervals](#aggregate-intervals)
+    - [Query Patterns](#query-patterns)
+      - [CASE Statements](#case-statements)
+      - [Duplicate Samples](#duplicate-samples)
+      - [JSON Columns](#json-columns)
+      - [Quote Sensitivity](#quote-sensitivity)
+      - [Time Filter](#time-filter)
+      - [Window Functions](#window-functions)
+    - [Grafana Practices](#grafana-practices)
+      - [Grafana Errors](#grafana-errors)
+        - [JSON Body Marshal](#json-body-marshal)
+        - [Data Points Outside Time Range](#data-points-outside-time-range)
+      - [Grafana Macros](#grafana-macros)
+      - [Result Column Names](#result-column-names)
+
+## Getting Started
 
 ### Sampling a Single Series
 
@@ -159,9 +185,89 @@ View definition:
   WHERE s."time" > (now() - '06:00:00'::interval);
 ```
 
-## Gotchas
+## Best Practices
+
+### Continuous Aggregate Patterns
+
+#### Aggregate Intervals
+
+Continuous aggregates are calculated on an interval and will have a delay (gap on the right/recent edge of the graph).
+The delay will be about `1.5 * (interval + lag)`, giving time for each bucket to fill completely; expect a 15 minute
+delay for `5m` interval/lag and up to 45 minutes for `15m`.
+
+### Query Patterns
+
+#### CASE Statements
+
+Using `CASE` statements to join multiple timeseries tend to be extremely slow, since it has to fetch
+each series individually or miss the `lid` index, then sort and correlate points between the two.
+
+Avoid this pattern at all costs:
+
+```sql
+SELECT
+  AVG(bars) / MAX(bins) -- with any aggregate functions, not just AVG/MAX
+FROM (
+  SELECT
+    ...
+    MAX(CASE WHEN name = 'foo_bar' THEN value ELSE null END) AS bars,
+    MAX(CASE WHEN name = 'foo_bin' THEN value ELSE null END) AS bins
+  FROM metrics
+  WHERE
+    $__timeFilter("time")
+    AND name IN (
+      'foo_bar',
+      'foo_bin'
+    )
+  GROUP BY timeGroup, metric
+)
+```
+
+Matching points between the disparate series causes a repeated quicksort, which can quickly exceed memory
+limits and write to disk.
+
+#### Duplicate Samples
+
+Grouping duplicate metrics is necessary when running Prometheus in high-availability sets. Since each Prometheus
+instance labels metrics with its own ID, they appear as individual timeseries with different `lid`s, but can be
+grouped by time/bucket and metric.
+
+For queries with a single value, momentary or rate:
+
+```sql
+SELECT
+  labels->>'some_label' AS "metric",
+  $__timeGroup(time, ${__interval}) AS "time",
+  MAX(value) AS "value"
+FROM metrics
+WHERE
+  $__timeFilter(time) AND
+  name = 'some_metric'
+GROUP BY $__timeGroup(time, ${__interval}), metric
+ORDER BY 1, 2
+```
+
+For queries with multiple values or a complex aggregate:
 
-### Schema Gotchas
+```sql
+SELECT
+  metric,
+  time,
+  MAX(value) AS "value"
+FROM (
+  SELECT
+    labels->>'some_label' AS "metric",
+    $__timeGroup("time", ${__interval}),
+    value
+  FROM metrics
+  WHERE
+    $__timeFilter(time) AND
+    name = 'some_metric'
+  GROUP BY $__timeGroup("time", ${__interval}), labels->>'other_label'
+) AS s
+GROUP BY metric, time
+ORDER BY 1, 2
+```
 
 #### JSON Columns
 
@@ -202,12 +308,36 @@ Within the `WHERE` clause, keep your filters in order by volume of data:
 
 > https://www.postgresql.org/docs/10/tutorial-window.html
 
-Windows allow Postgres to compare the same metric, filtered by labels, across time.
+Window functions allow aggregates to view more than one row at a time, often using the previous and/or next row
+to calculate change over time. Larger windows may calculate ordered aggregates like percentiles, but also need to
+look at a larger set of rows and may not scale well.
+
+Some utility functions are provided in [the `rate_` family](../schema/utils/rate.sql) to calculate change between
+samples, handle counter resets, and adjust for time.
+
+Time buckets should be grouped in a sub-select before the window function:
 
-Any filters and most renaming should be done within an inner select, with the other select used to handle the
-window and counter resets.
+```sql
+SELECT
+  metric,
+  bucket AS time,
+  rate_time(value, lag(value) OVER w, '${__interval}') AS value
+FROM (
+  SELECT
+    CONCAT(labels->>'instance', labels->>'job') AS metric,
+    $__timeGroup("time", ${__interval}) AS bucket,
+    MAX(value) AS value
+  FROM metrics
+  WHERE
+    $__timeFilter("time") AND
+    name = 'node_disk_write_time_seconds_total'
+  GROUP BY metric, bucket
+) AS m
+WINDOW w AS (PARTITION BY metric ORDER BY bucket)
+ORDER BY metric, time;
+```
 
-### Grafana
+### Grafana Practices
 
 #### Grafana Errors