-
Notifications
You must be signed in to change notification settings - Fork 136
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Added documentation for metrics feature. The README is getting quite long, so while anticipating our package breakout I split the metrics README into the metrics directory and linked to it from the front page. I expect we'll do something along these lines for other sections in the future.
- Loading branch information
Showing
2 changed files
with
151 additions
and
11 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,113 @@ | ||
# Containerbuddy Metrics | ||
|
||
If a `metrics` option is provided, Containerbuddy will expose a [Prometheus](http://prometheus.io) HTTP client interface that can be used to scrape performance metrics. The metrics interface is advertised as a service to the discovery service similar to services configured via the `services` block. Each `sensor` for the metrics service will run periodically and record values in the [Prometheus client library](https://github.com/prometheus/client_golang). A Prometheus server can then make HTTP requests to the metrics endpoint. | ||
|
||
### Configuring Metrics | ||
|
||
The top-level metrics configuration defines the metrics HTTP endpoint. This endpoint will be advertised to Consul (or other discovery service) just as a typical Containerbuddy `service` block is. The metrics service will send periodic heartbeats to the discovery service to identify that it is still operating. Unlike a typical Containerbuddy service, there is no user-defined health check for the metrics service endpoint. | ||
|
||
A minimal configuration for Containerbuddy including metrics might look like this: | ||
|
||
```json | ||
{ | ||
"consul": "consul:8500", | ||
"metrics": { | ||
"name": "metrics_service_name", | ||
"url": "/metrics", | ||
"port": 8000, | ||
"ttl": 30, | ||
"poll": 10, | ||
"interfaces": ["eth0"], | ||
"tags": ["tag1"], | ||
"sensors": [ | ||
{ | ||
"namespace": "my_namespace", | ||
"subsystem": "my_subsystem", | ||
"name": "my_events_count", | ||
"help": "help text", | ||
"type": "counter", | ||
"poll": 5, | ||
"check": ["/bin/sensor.sh"] | ||
} | ||
] | ||
} | ||
} | ||
``` | ||
|
||
The fields are as follows: | ||
|
||
- `name` is the name of the metrics service as it will appear in the discovery service. Each instance of the service will have a unique ID made up from `name`+hostname of the container. The Prometheus server should use this name to discover containers exposing metrics. | ||
- `url` is the path to use for the metrics service. A scrape against any other URL will return 404, and the default value is `/metrics`. | ||
- `port` is the port the metrics service will advertise to the discovery service. | ||
- `interfaces` is an optional single or array of interface specifications. If given, the IP of the service will be obtained from the first interface specification that matches. (Default value is `["eth0:inet"]`) | ||
- `poll` is the time in seconds between sending heartbeats to the discovery service. | ||
- `ttl` is the time-to-live of a heartbeat to the discovery service. This should be longer than the polling rate so that the polling process and the TTL aren't racing; otherwise Consul will mark the service as unhealthy. | ||
- `tags` is an optional array of tags. If the discovery service supports it (Consul does), the service will register itself with these tags. | ||
- `sensor` is an optional array of sensor configurations (see below). If no sensors are provided, then the metrics endpoint will still be exposed and will show only metrics about Containerbuddy internals. | ||
|
||
### Configuring Sensors | ||
|
||
The `sensors` field is a list of user-defined sensors that the metrics service will use to collect metrics. Each time a sensor is polled, the user-defined `check` executable will be run. If the value that the `check` returns from stdout can be parsed as a 64-bit float, then the metrics collector will receive that value. | ||
|
||
The fields for a sensor are as follows: | ||
|
||
- `namespace`, `subsystem`, and `name` are the names that the Prometheus client library will use to construct the name for the metrics. These three names are concatenated with underscores `_` to become the final name that is scraped recorded by Prometheus. In the example above the metric recorded would be named `my_namespace_my_subsystem_my_event_count`. Please see the [Prometheus documents on naming](http://prometheus.io/docs/practices/naming/) for best practices on how to name your metrics. | ||
- `help` is the help text that will be associated with the metric recorded by Prometheus. This is useful for debugging by giving a more verbose description. | ||
- `type` is the type of collector Prometheus will use (one of `counter`, `gauge`, `histogram` or `summary`). See [below](#Collector_types) for details. | ||
- `poll` is the time in seconds between running the `check`. | ||
- `check` is the executable (and its arguments) that is called when it is time to perform a metrics collection. | ||
|
||
The check executable is expected to return via stdout a value that can be parsed as a single 64-bit float number. Whitespace will be trimmed, but any other text in the stdout of the executable will cause the metric to be dropped. If you need to return additional information for logging, you should return this via stderr (which Containerbuddy will pass along to the Docker engine). | ||
|
||
For example, a `check` field like `"check": ["/usr/bin/free"]` is not a working check because the output contains multiple fields as well as text. | ||
|
||
An example of a good check script might be: | ||
|
||
```bash | ||
#!/bin/bash | ||
# check free memory | ||
echo "checked free memory sensor" 1>&2 | ||
free | awk -F' +' '/Mem/{print $3}' | ||
``` | ||
|
||
This check script will return exactly one numeric value on stdout, and sends additional logging info to stderr where it can be safely handled. | ||
|
||
### Collector types | ||
|
||
Containerbuddy supports all four of the [metric types](http://prometheus.io/docs/concepts/metric_types/) available in the Prometheus API. Briefly these are: | ||
|
||
*Counter* | ||
|
||
A cumulative metric that represents a single numerical value that only ever goes up. A typical use case for a counter is a count of the number of of certain events. The value returned by the sensor will be added to the counter for that metric. | ||
|
||
*Gauge* | ||
|
||
A metric that represents a single numerical value that can arbitrarily go up and down. A typical use case for a gauge might be a measurement of the current memory usage. The value returned by the sensor script will be set as the new value for the gauge metric. | ||
|
||
*Histogram* | ||
|
||
A count of observations in "buckets", along with the sum of all observed values. A typical use case might be request durations or response sizes. When the Prometheus server scrapes this metrics endpoint, it will receive a list of buckets and their counts. For example: | ||
|
||
``` | ||
namespace_subsystem_response_bucket{le="1"} 0 | ||
namespace_subsystem_response_bucket{le="2.5"} 0 | ||
namespace_subsystem_response_bucket{le="5"} 1 | ||
namespace_subsystem_response_bucket{le="10"} 2 | ||
namespace_subsystem_response_bucket{le="+Inf"} 2 | ||
``` | ||
|
||
This indicates that the collector has seen 2 events in total. One event had a value less than 5 (`le="5"`), whereas a second was less than 10. | ||
|
||
*Summary* | ||
|
||
A summary is similar to a histogram, but while it also provides a total count of observations and a sum of all observed values, it calculates quantiles over a sliding time window. For example: | ||
|
||
``` | ||
namespace_subsystem_response_seconds_summary{quantile="0.5"} 0.3 | ||
namespace_subsystem_response_seconds_summary{quantile="0.9"} 0.5 | ||
namespace_subsystem_response_seconds_summary{quantile="0.99"} 2 | ||
``` | ||
|
||
This indicates that the 50th percentile response time is 0.3 seconds, the 90th percentile is 0.5 seconds, and the 99th percentile is 2 seconds. | ||
|
||
Please see the Prometheus docs on [histograms](http://prometheus.io/docs/practices/histograms/) for best practices on when you should choose histograms vs summaries. |