Added metrics README

Added documentation for the metrics feature. The README is getting quite long, so in anticipation of our package breakout I moved the metrics documentation into its own README in the metrics directory and linked to it from the front page. I expect we'll do something along these lines for other sections in the future.
TritonDataCenter · Mar 31, 2016 · 2783e5e
1 parent f24d13f

Showing 2 changed files with 151 additions and 11 deletions.
diff --git a/README.md b/README.md
@@ -26,7 +26,7 @@ Using local scripts to test health or act on backend changes means that we can r
 
 Containerbuddy is explicitly *not* a supervisor process. Although it can act as PID1 inside a container, if the shimmed process dies, so does Containerbuddy (and therefore the container itself). Containerbuddy will return the exit code of its shimmed process back to the Docker Engine or Triton, so that it appears as expected when you run `docker ps -a` and look for your exit codes. Containerbuddy also attaches stdout/stderr from your application to stdout/stderr of the container, so that `docker logs` works as expected.
 
-### Configuring Containerbuddy
+## Configuring Containerbuddy
 
 Containerbuddy takes a single file argument (or a JSON string) as its configuration. All trailing arguments will be treated as the executable to shim and that executable's arguments.
 
@@ -90,11 +90,31 @@ The format of the JSON file configuration is as follows:
       "poll": 10,
       "onChange": "/opt/containerbuddy/reload-app.sh"
     }
-  ]
+  ],
+  "metrics": {
+	"name": "metrics_service_name",
+	"url": "/metrics",
+	"port": 8000,
+	"ttl": 30,
+	"poll": 10,
+	"interfaces": ["eth0"],
+    "tags": ["tag1"],
+	"sensors": [
+       {
+		"namespace": "metrics_namespace",
+		"subsystem": "metrics_subsystem",
+		"name": "metric_id",
+		"help": "help text",
+		"type": "counter",
+		"poll": 5,
+		"check": ["/bin/sensor.sh"]
+	  }
+	]
+  }
 }
 ```
 
-Service fields:
+#### Service fields:
 
 - `name` is the name of the service as it will appear in Consul. Each instance of the service will have a unique ID made up from `name`+hostname of the container.
 - `port` is the port the service will advertise to Consul.
@@ -104,13 +124,13 @@ Service fields:
 - `ttl` is the time-to-live of a successful health check. This should be longer than the polling rate so that the polling process and the TTL aren't racing; otherwise Consul will mark the service as unhealthy.
 - `tags` is an optional array of tags. If the discovery service supports it (Consul does), the service will register itself with these tags.
 
-Backend fields:
+#### Backend fields:
 
 - `name` is the name of a backend service that this container depends on, as it will appear in Consul.
 - `poll` is the time in seconds between polling for changes.
 - `onChange` is the executable (and its arguments) that is called when there is a change in the list of IPs and ports for this backend.
 
-Service Discovery Backends:
+#### Service Discovery Backends:
 
 Must supply only one of the following
 
@@ -134,7 +154,7 @@ Must supply only one of the following
     - `endpoints` is the list of etcd nodes in your cluster
     - `prefix` is the path that will be prefixed to all service discovery keys. This key is optional. (Default: `/containerbuddy`)
 
-Logging Config (Optional):
+#### Logging Config (Optional):
 
 The logging config adjust the output format and verbosity of Containerbuddy logs.
 
@@ -181,7 +201,14 @@ exit status 1
 {"level":"fatal","msg":"The ice breaks!","number":100,"omg":true,"time":"2014-03-10 19:57:38.562543128 -0400 EDT"}
 ```
 
-Other fields:
+#### Metrics (Optional):
+
+If a `metrics` option is provided, Containerbuddy will expose a [Prometheus](http://prometheus.io) HTTP client interface that can be used to scrape performance metrics. The metrics interface is advertised as a service to the discovery service similar to services configured via the `services` block. Each `sensor` for the metrics service will run periodically and record values in the [Prometheus client library](https://github.com/prometheus/client_golang). A Prometheus server can then make HTTP requests to the metrics endpoint.
+
+Details of how to configure the metrics endpoint and how the metrics endpoint works can be found in the [metrics README](https://github.com/joyent/containerbuddy/blob/master/metrics/README.md).
+
+
+#### Other fields:
 
 - `onStart` is the executable (and its arguments) that will be called immediately prior to starting the shimmed application. This field is optional. If the `onStart` handler returns a non-zero exit code, Containerbuddy will exit.
 - `preStop` is the executable (and its arguments) that will be called immediately **before** the shimmed application exits. This field is optional. Containerbuddy will wait until this program exits before terminating the shimmed application.
@@ -232,7 +259,7 @@ All executable fields, such as `onStart` and `onChange`, accept both a string or
 ]
 ```
 
-### Template Configuration
+#### Template Configuration
 
 Containerbuddy configuration has template support. If you have an environment variable such as `FOO=BAR` then you can use `{{.FOO}}` in your configuration file and it will be substituted with `BAR`.
 
@@ -247,7 +274,7 @@ Containerbuddy configuration has template support. If you have an environment va
 
 _Note:  If you need more than just variable interpolation, check out the [Go text/template Docs](https://golang.org/pkg/text/template/)._
 
-### Operating Containerbuddy
+## Operating Containerbuddy
 
 Containerbuddy accepts POSIX signals to change its runtime behavior. Currently, Containerbuddy accepts the following signals:
 
@@ -266,11 +293,11 @@ Docker will automatically deliver a `SIGTERM` with `docker stop`, not when using
 
 *Caveat*: If Containerbuddy is wrapped as a shell command, such as: `/bin/sh -c '/opt/containerbuddy .... '` then `SIGTERM` will not reach Containerbuddy from `docker stop`.  This is important for systems like Mesos which may use a shell command as the entrypoint under default configuration.
 
-### Contributing
+## Contributing
 
 Please report any issues you encounter with Containerbuddy or its documentation by [opening a Github issue](https://github.com/joyent/containerbuddy/issues). Roadmap items will be maintained as [enhancements](https://github.com/joyent/containerbuddy/issues?q=is%3Aopen+is%3Aissue+label%3Aenhancement). PRs are welcome on any issue.
 
-### Examples
+## Examples
 
 We've published a number of example applications demonstrating how Containerbuddy works.
 

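The `ttl` and `poll` values in the new `metrics` block follow the same rule stated for the service fields: the TTL should be longer than the polling interval. A minimal sketch of a pre-flight check, assuming the two values have already been pulled out of the configuration by some other means; the function name `ttl_outlasts_poll` is hypothetical and not part of Containerbuddy:

```bash
#!/bin/bash
# Hypothetical pre-flight check; not part of Containerbuddy.
# Succeeds when the TTL (seconds) is strictly longer than the poll
# interval (seconds), leaving room for a heartbeat to arrive before
# the TTL expires.
ttl_outlasts_poll() {
    local ttl="$1" poll="$2"
    [ "$ttl" -gt "$poll" ]
}

# Values taken from the example metrics block: "ttl": 30, "poll": 10
if ttl_outlasts_poll 30 10; then
    echo "ok: ttl outlasts poll"
else
    echo "warning: ttl should be longer than poll" 1>&2
fi
```

A check like this could run in a build or deploy step before the container starts; it only compares the two integers and does not parse the JSON itself.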
diff --git a/metrics/README.md b/metrics/README.md
@@ -0,0 +1,113 @@
+# Containerbuddy Metrics
+
+If a `metrics` option is provided, Containerbuddy will expose a [Prometheus](http://prometheus.io) HTTP client interface that can be used to scrape performance metrics. The metrics interface is advertised as a service to the discovery service similar to services configured via the `services` block. Each `sensor` for the metrics service will run periodically and record values in the [Prometheus client library](https://github.com/prometheus/client_golang). A Prometheus server can then make HTTP requests to the metrics endpoint.
+
+### Configuring Metrics
+
+The top-level metrics configuration defines the metrics HTTP endpoint. This endpoint will be advertised to Consul (or other discovery service) just as a typical Containerbuddy `service` block is. The metrics service will send periodic heartbeats to the discovery service to identify that it is still operating. Unlike a typical Containerbuddy service, there is no user-defined health check for the metrics service endpoint.
+
+A minimal configuration for Containerbuddy including metrics might look like this:
+
+```json
+{
+  "consul": "consul:8500",
+  "metrics": {
+	"name": "metrics_service_name",
+	"url": "/metrics",
+	"port": 8000,
+	"ttl": 30,
+	"poll": 10,
+	"interfaces": ["eth0"],
+    "tags": ["tag1"],
+	"sensors": [
+       {
+		"namespace": "my_namespace",
+		"subsystem": "my_subsystem",
+		"name": "my_events_count",
+		"help": "help text",
+		"type": "counter",
+		"poll": 5,
+		"check": ["/bin/sensor.sh"]
+	  }
+	]
+  }
+}
+```
+
+The fields are as follows:
+
+- `name` is the name of the metrics service as it will appear in the discovery service. Each instance of the service will have a unique ID made up from `name`+hostname of the container. The Prometheus server should use this name to discover containers exposing metrics.
+- `url` is the path at which the metrics service is served (default: `/metrics`). A scrape against any other path will return a 404.
+- `port` is the port the metrics service will advertise to the discovery service.
+- `interfaces` is an optional interface specification or array of interface specifications. If given, the IP of the service will be obtained from the first interface specification that matches. (Default value is `["eth0:inet"]`)
+- `poll` is the time in seconds between sending heartbeats to the discovery service.
+- `ttl` is the time-to-live of a heartbeat to the discovery service. This should be longer than the polling rate so that the polling process and the TTL aren't racing; otherwise Consul will mark the service as unhealthy.
+- `tags` is an optional array of tags. If the discovery service supports it (Consul does), the service will register itself with these tags.
+- `sensors` is an optional array of sensor configurations (see below). If no sensors are provided, then the metrics endpoint will still be exposed and will show only metrics about Containerbuddy internals.
+
+### Configuring Sensors
+
+The `sensors` field is a list of user-defined sensors that the metrics service will use to collect metrics. Each time a sensor is polled, the user-defined `check` executable will be run. If the value that the `check` returns from stdout can be parsed as a 64-bit float, then the metrics collector will receive that value.
+
+The fields for a sensor are as follows:
+
+- `namespace`, `subsystem`, and `name` are the names that the Prometheus client library will use to construct the name for the metrics. These three names are concatenated with underscores `_` to become the final name that is recorded by Prometheus. In the example above the metric recorded would be named `my_namespace_my_subsystem_my_events_count`. Please see the [Prometheus documents on naming](http://prometheus.io/docs/practices/naming/) for best practices on how to name your metrics.
+- `help` is the help text that will be associated with the metric recorded by Prometheus. This is useful for debugging by giving a more verbose description.
+- `type` is the type of collector Prometheus will use (one of `counter`, `gauge`, `histogram` or `summary`). See [below](#collector-types) for details.
+- `poll` is the time in seconds between running the `check`.
+- `check` is the executable (and its arguments) that is called when it is time to perform a metrics collection.
+
+The check executable is expected to return via stdout a value that can be parsed as a single 64-bit float number. Whitespace will be trimmed, but any other text in the stdout of the executable will cause the metric to be dropped. If you need to return additional information for logging, you should return this via stderr (which Containerbuddy will pass along to the Docker engine).
+
+For example, a `check` field like `"check": ["/usr/bin/free"]` is not a working check because the output contains multiple fields as well as text.
+
+An example of a good check script might be:
+
+```bash
+#!/bin/bash
+# check free memory
+echo "checked free memory sensor" 1>&2
+free | awk -F' +' '/Mem/{print $4}'
+```
+
+This check script returns exactly one numeric value on stdout and sends additional logging info to stderr, where it can be safely handled.
+
+### Collector types
+
+Containerbuddy supports all four of the [metric types](http://prometheus.io/docs/concepts/metric_types/) available in the Prometheus API. Briefly these are:
+
+*Counter*
+
+A cumulative metric that represents a single numerical value that only ever goes up. A typical use case for a counter is a count of the number of certain events. The value returned by the sensor will be added to the counter for that metric.
+
+*Gauge*
+
+A metric that represents a single numerical value that can arbitrarily go up and down. A typical use case for a gauge might be a measurement of the current memory usage. The value returned by the sensor script will be set as the new value for the gauge metric.
+
+*Histogram*
+
+A count of observations in "buckets", along with the sum of all observed values. A typical use case might be request durations or response sizes. When the Prometheus server scrapes this metrics endpoint, it will receive a list of buckets and their counts. For example:
+
+```
+namespace_subsystem_response_bucket{le="1"} 0
+namespace_subsystem_response_bucket{le="2.5"} 0
+namespace_subsystem_response_bucket{le="5"} 1
+namespace_subsystem_response_bucket{le="10"} 2
+namespace_subsystem_response_bucket{le="+Inf"} 2
+```
+
+This indicates that the collector has seen 2 events in total. Bucket counts are cumulative: one event had a value greater than 2.5 and at most 5 (`le="5"`), whereas the second had a value greater than 5 and at most 10 (`le="10"`).
+
+*Summary*
+
+A summary is similar to a histogram, but while it also provides a total count of observations and a sum of all observed values, it calculates quantiles over a sliding time window. For example:
+
+```
+namespace_subsystem_response_seconds_summary{quantile="0.5"} 0.3
+namespace_subsystem_response_seconds_summary{quantile="0.9"} 0.5
+namespace_subsystem_response_seconds_summary{quantile="0.99"} 2
+```
+
+This indicates that the 50th percentile response time is 0.3 seconds, the 90th percentile is 0.5 seconds, and the 99th percentile is 2 seconds.
+
+Please see the Prometheus docs on [histograms](http://prometheus.io/docs/practices/histograms/) for best practices on when you should choose histograms vs summaries.
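
The sensor rules described above (names joined with underscores, and a `check` whose trimmed stdout must be a single number) can be sketched in shell. This is only an illustration under stated assumptions: the helper names `metric_name` and `is_single_number` are hypothetical, and the pattern accepts plain decimal and exponent notation only, which may be narrower than the float parsing Containerbuddy actually performs.

```bash
#!/bin/bash
# Hypothetical helpers for illustration; neither is part of Containerbuddy.

# Join namespace, subsystem, and name with underscores, as the sensor
# fields are described as being combined.
metric_name() {
    printf '%s_%s_%s\n' "$1" "$2" "$3"
}

# Succeed only if the argument is a single decimal number once the
# surrounding whitespace is ignored. Assumption: decimal and exponent
# forms only.
is_single_number() {
    local re='^[[:space:]]*[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?[[:space:]]*$'
    [[ $1 =~ $re ]]
}

# Name built from the example sensor configuration.
metric_name my_namespace my_subsystem my_events_count

# One sample with a single padded number, one with extra text.
for sample in " 1024 " "Mem: 16384 1024"; do
    if is_single_number "$sample"; then
        echo "would be recorded: $sample"
    else
        echo "would be dropped: $sample" 1>&2
    fi
done
```

Running a candidate `check` command through a test like `is_single_number "$(your-check)"` before deploying is one way to catch output such as the raw `free` table, which contains several fields and text.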