
[no merge] Initial implementation of telemetry #119

Closed
wants to merge 19 commits

Conversation

@tgross (Contributor) commented Mar 28, 2016

This is my initial pass at #27. Currently the integration tests pass, but the metrics feature has no tests of its own, so those tests only verify that the whole thing builds and won't break if we're not including metrics. (Update: it has tests now!) So this isn't ready to merge, but it gives an idea of the approach.

  • The Telemetry server will create a new ServiceConfig for itself and register with discovery that way.
  • Sensors expect only a string on stdout that can be parsed as a float64. Anything else on stdout will cause the metric to be rejected. Stderr is passed through to the container's stderr as usual.
  • This is the out-of-the-box Prometheus client config, which includes all the metrics of Containerbuddy itself and is quite a lot of data. We may want to look into that.
  • Wrote lots and lots of unit tests (and fixed whatever resulted from that).

Next items TODO:

  • Maybe some integration tests? I'm going to work up a Prometheus blueprint first, so I'm OK with leaving that off of this PR.

cc @justenwalker and @misterbisson
Note that this PR includes all of #118 currently, so when that's merged this PR should be cleaner.

@tgross force-pushed the gh27_metrics branch 2 times, most recently from dff75bd to f9f4ccb on March 28, 2016 18:59
@tgross (Contributor, Author) commented Mar 28, 2016

Awesome (er, "awesome")... our zombie reaping code is breaking in the integration test. But not on my machine (or in https://travis-ci.org/tgross/containerbuddy/builds/119049690 with the identical branch) even after make clean.

@justenwalker (Contributor) commented:

Seems like a race. Could be that the threshold needs to be increased, or the test needs to be tightened up some more. The assumption is that the zombies will be reaped in a timely manner, which may not be the case on Travis (and may not be deterministic).

Actually, it looks like we should check just the number of zombies owned by PID 1, since that's the reaping process. It can't reap processes that it doesn't own:

STAT PPID  PID   COMMAND
S        0     1 /bin/containerbuddy -config=file:///app.json tail -f
S        1    10 tail -f
Z        1    16 [sleep]
S        1    17 /bin/sleep 2
Z       17    19 [sleep]
R        0    25 ps -o stat,ppid,pid,args

The additional zombie hasn't been re-parented, so of course it won't get reaped.

@tgross (Contributor, Author) commented Mar 29, 2016

The additional zombie hasn't been re-parented, so of course it won't get reaped.

Good catch, and I think your suggested fix for checking zombies owned only by PID1 makes sense. I'll open a new PR for that and rebase this PR on that so we can review it cleanly.

My next question is how the zombie wasn't reparented, but that's probably out-of-scope for us (this is a kernel mechanism as far as I know).

Making PollAction a method of Pollable will allow us to avoid a circular
import when we split out packages from the main containerbuddy package.
Packages that include Pollables are going to want to validate their own
configuration, so by moving the parsing and IPs functions into a utils package
we can do so without any circular references.
Created a metrics package that leans on the official Prometheus client to serve
metrics. Adds a new Pollable for Sensors that record metrics received from user-
defined functions.
@justenwalker (Contributor) commented:

My next question is how the zombie wasn't reparented, but that's probably out-of-scope for us (this is a kernel mechanism as far as I know).

I think it hasn't been re-parented because its parent still exists - it's still sleeping:

S        1    17 /bin/sleep 2
Z       17    19 [sleep]

MustRegister panics rather than returning an error (thanks!) if we
register an invalid collector, even one with just an invalid name. This would
give end users a big ugly golang stack trace that they have to dig the error
message out of, so we catch the panic and return an error instead.
Also moved the similar resets we do for Service.CheckHealth and Backend.OnChange
into defers so that we can ensure they happen even if the function errors out.

Includes tests and new testdata script for metrics package.
Unit tests of recording metrics for each of the four collector types and
fetching them via the web API. Cleaned up some of the existing tests so that
metrics names are more directly tied-back to test cases.

Fixed a bug in `Sensor.record()` where the type switch doesn't work because
the prometheus collector structs are actually interfaces. So we end up having
to do some ugly string checking.
@tgross (Contributor, Author) commented Mar 30, 2016

I've added a whole mess of unit tests, and fixed two minor bugs along the way: making sure metrics check commands reset, and working around some odd Prometheus API choices with their collectors. Next big step is to add a README.

@tgross (Contributor, Author) commented Mar 31, 2016

@misterbisson I've added README updates in this commit 2783e5e

Added documentation for metrics feature. The README is getting quite long, so
while anticipating our package breakout I split the metrics README into the
metrics directory and linked to it from the front page. I expect we'll do
something along these lines for other sections in the future.
@justenwalker (Contributor) commented:

@tgross Haven't looked yet.

But to the README point: it seems like it's getting quite big and will continue to grow. Is there a call for generating some kind of webpage and splitting the concepts out into a more organized format? It would be good to include such a 'website' in the same repo and generate the page somehow (isn't that what github.io does? I haven't tried it).

@tgross tgross mentioned this pull request Apr 1, 2016
@tgross (Contributor, Author) commented Apr 1, 2016

Opened #124 to discuss the documentation issue.

@tgross tgross mentioned this pull request Apr 1, 2016
@tgross (Contributor, Author) commented Apr 5, 2016

Had a sidebar conversation with @misterbisson and we're going to rename this feature to "telemetry." Will have a commit to update this PR today.

@tgross tgross changed the title Initial implementation of metrics Initial implementation of telemetry Apr 5, 2016
@tgross (Contributor, Author) commented Apr 5, 2016

We're going to ship this branch as a release candidate for 1.4.0, so that we can use the implementation in an autopilot pattern we're planning on publishing this week. I'd like to get this and our refactor merged in prior to the actual release.

@justenwalker (Contributor) commented:

@tgross I was only able to spare a couple of minutes today, but I did look over the code and it all looked good. If you're waiting for my feedback, I'd say it's a good RC just to get something out there and find the actual bugs in the implementation, if any.

In that respect, do you have an example containerbuddy.json that I could use to tweak and test locally? With the example shell scripts and program - I guess a docker container on hub would cover it :)

@tgross (Contributor, Author) commented Apr 5, 2016

I'd say it's a good RC just to get something out there and find the actual bugs in the implementation - if any.

What I'd like to do then is wait until I've got at least one known-working example (see below) and then merge this to master.

In that respect, do you have an example containerbuddy.json that I could use to tweak and test locally? With the example shell scripts and program - I guess a docker container on hub would cover it :)

That's what I'm working on now, so I should definitely have something for you tomorrow. I've got https://github.com/autopilotpattern/prometheus as a collector and I'm adding the use case to https://github.com/autopilotpattern/touchbase now.

Sensors were polling but the telemetry server was blocking its thread
because we didn't put it into its own goroutine. Also needed to take
a pointer for PollAction on Sensor so that we can update the exec/Cmd
struct after execution.

Added an integration test exercising these.
@tgross (Contributor, Author) commented Apr 6, 2016

I've pushed up an integration test that exercises a couple of bugs I found. (ref https://github.com/joyent/containerbuddy/releases/tag/1.4.0-rc2)

@tgross (Contributor, Author) commented Apr 7, 2016

Comments from @misterbisson in autopilotpattern/touchbase#18:

Should the URL and name be user-configured options? My reason for asking is how discovery would work in an environment where these may be different for each application.

Proposal: remove name, url, poll, and ttl as user-configurable options, so that the telemetry stanza is just the following:

  "telemetry": {
    "port": 9090,
    "sensors": [
      {...}
    ]
  }

Just so that it's clear: the name, poll, and ttl fields are necessary to advertise telemetry as a service. But I think you're asking whether we should define these values and not allow the user to configure them. While I appreciate the desire to simplify the configuration syntax, forcing a particular choice on the user is going to limit the usefulness of the feature. The major use case I'm thinking of here is having separate collectors for separate dev/prod/business-unit stacks (Prometheus won't listen for Consul "tags"). Polling requirements are likely to be application-specific. Philosophically speaking, it's the responsibility of the application development team to determine the appropriate values for configuring the telemetry service for their application.

That being said, having sane defaults couldn't hurt. In the common case someone can just default to advertising a telemetry service at the URL /telemetry on port 9090, with a heartbeat every 10 seconds and a 30-second TTL.


Along these same lines, we've mirrored the Prometheus golang API in terms of the sensor naming definitions. They give you the option to have namespace, subsystem, and name for each metric so we copied that dutifully:

    "sensors": [
       {
         "namespace": "tb",
         "subsystem": "nginx",
         "name": "connections_active",
         "help": "Current number of active client connections",
         "type": "gauge",
         "poll": 5,
         "check": ["/opt/containerbuddy/sensor.sh", "connections_active"]
       },

But the Prometheus API doesn't actually care about that, and this would be the same thing as leaving out the namespace and subsystem fields and just naming the metric tb_nginx_connections_active. The golang client's BuildFQName just concatenates the values with _ anyway. Presumably they include this to encourage good practices and to make it easier when you're calling the code as a library. But our end users aren't calling into the golang API as a library; they're just giving us strings.

It looks like it's entirely safe with our existing implementation for an end user to leave off namespace and subsystem. No code changes required. Maybe we encourage that in the docs (and the demonstration application). Leaving the option to have separate namespace and subsystem fields might make automatically generating Containerbuddy configurations in a CI pipeline easier, so I'd like to leave that in.

@tgross (Contributor, Author) commented Apr 7, 2016

It looks like it's entirely safe with our existing implementation here for an end user to leave off namespace and subsystem. No code changes required.

I've verified that was the case in this new unit test: 168df0d

@misterbisson (Contributor) commented:

Regarding configuration, we started that discussion in #27 (comment), where I proposed the following in the context of our typical service definition:

  "services": [
    {
      "name": "containerbuddy-telemetry",
      "port": <user defined>,
      "health": <not a user-configurable value>,
      "interfaces": <user defined>,
      "poll": <not a user-configurable value>,
      "ttl": <not a user-configurable value>,
      "tags": <maybe user defined>
    }
  ],

The reason I feel strongly that name should not be user-configurable (and why I didn't even suggest url) is that having this be identical across all applications is critical for anybody's ability to reasonably build monitoring tooling. @tgross outlines the argument against this:

The major use case I'm thinking of here is having separate collectors for separate dev/prod/business-unit stacks

However, I believe that argument is inconsistent with what we're arguing elsewhere, which is to isolate dev from prod, etc. with lower-level solutions and not share the discovery catalog between environments, or to use tags if shared discovery catalogs are required.

Prometheus won't listen for Consul "tags"

I would explicitly not design for that limitation. Prometheus' Consul discovery can be improved.

This eliminates any remaining parsing in config.go and moves it into
the telemetry package by passing json.RawMessage into a NewTelemetry
function.

Also fix the integration tests.
@tgross (Contributor, Author) commented Apr 8, 2016

@misterbisson I've updated per your comments. Do me a favor and review the two READMEs. I'm going to fold these changes back into #123 as well.

@tgross tgross changed the title Initial implementation of telemetry [no merge] Initial implementation of telemetry Apr 8, 2016
@tgross (Contributor, Author) commented Apr 8, 2016

Marking this [no merge]. Ref #123 (comment)

@misterbisson (Contributor) commented:

🏡 🚶
