
[no merge] Initial implementation of telemetry #119

Closed
wants to merge 19 commits

Conversation

@tgross (Contributor) commented Mar 28, 2016

This is my initial pass at #27. Currently the integration tests pass, but the metrics feature has no tests of its own, so those tests only verify that the whole thing builds and won't break if we're not including metrics. (Update: it has tests now!) So this isn't ready to merge, but it gives an idea of the approach.

  • The Telemetry server will create a new ServiceConfig for itself and register with discovery that way.
  • Sensors expect only a string on stdout that can be parsed as a float64. Anything else on stdout will cause the metric to be rejected. Stderr is passed through to the container's stderr as usual.
  • This is the out-of-the-box Prometheus client config, which includes all the metrics of Containerbuddy itself and is quite a lot of data. We may want to look into that.
  • Wrote lots and lots of unit tests (and fixed whatever resulted from that).

Next items TODO:

  • Maybe some integration tests? I'm going to work up a Prometheus blueprint first, so I'm OK with leaving that off of this PR.

cc @justenwalker and @misterbisson
Note that this PR includes all of #118 currently, so when that's merged this PR should be cleaner.

@tgross force-pushed the gh27_metrics branch 2 times, most recently from dff75bd to f9f4ccb on March 28, 2016 18:59
@tgross (Contributor, Author) commented Mar 28, 2016

Awesome (er, "awesome")... our zombie reaping code is breaking in the integration test. But not on my machine (or in https://travis-ci.org/tgross/containerbuddy/builds/119049690 with the identical branch) even after make clean.

@justenwalker (Contributor) commented:

Seems like a race. Could be that the threshold needs to be increased, or the test needs to be tightened up some more. The assumption is that the zombies will be reaped in a timely manner, which may not be the case on Travis (and may not be deterministic).

Actually, it looks like we should check just the number of zombies owned by PID 1, since that's the reaping process. It can't reap processes that it doesn't own:

STAT PPID  PID   COMMAND
S        0     1 /bin/containerbuddy -config=file:///app.json tail -f
S        1    10 tail -f
Z        1    16 [sleep]
S        1    17 /bin/sleep 2
Z       17    19 [sleep]
R        0    25 ps -o stat,ppid,pid,args

The additional zombie hasn't been re-parented, so of course it won't get reaped.

@tgross (Contributor, Author) commented Mar 29, 2016

The additional zombie hasn't been re-parented, so of course it won't get reaped.

Good catch, and I think your suggested fix for checking zombies owned only by PID1 makes sense. I'll open a new PR for that and rebase this PR on that so we can review it cleanly.

My next question is how the zombie wasn't reparented, but that's probably out-of-scope for us (this is a kernel mechanism as far as I know).

Making PollAction a method of Pollable will allow us to avoid a circular
import when we split out packages from the main containerbuddy package.
Packages that include Pollables are going to want to validate their own
configuration, so by moving the parsing and IPs functions into a utils package
we can do so without any circular references.
Created a metrics package that leans on the official Prometheus client to serve
metrics. Adds a new Pollable for Sensors that record metrics received from user-
defined functions.
@justenwalker (Contributor) commented:

My next question is how the zombie wasn't reparented, but that's probably out-of-scope for us (this is a kernel mechanism as far as I know).

I think it hasn't been re-parented because its parent still exists - it's still sleeping:

S        1    17 /bin/sleep 2
Z       17    19 [sleep]

MustRegister panics rather than returning an error (thanks!) if we
register an invalid collector, even one with just an invalid name. This would
give end users a big ugly golang stack trace that they have to dig the error
message out of, so we catch the panic and return an error instead.
Also moved the similar resets we do for Service.CheckHealth and Backend.OnChange
into defers so that we can ensure they happen even if the function errors out.

Includes tests and new testdata script for metrics package.
Unit tests of recording metrics for each of the four collector types and
fetching them via the web API. Cleaned up some of the existing tests so that
metrics names are more directly tied-back to test cases.

Fixed a bug in `Sensor.record()` where the type switch doesn't work because
the prometheus collector structs are actually interfaces. So we end up having
to do some ugly string checking.
@tgross (Contributor, Author) commented Mar 30, 2016

I've added a whole mess of unit tests, and fixed two minor bugs along the way: making sure metrics check commands reset, and working around some odd Prometheus API choices with their collectors. Next big step is to add a README.

@tgross (Contributor, Author) commented Mar 31, 2016

@misterbisson I've added README updates in this commit 2783e5e

Added documentation for metrics feature. The README is getting quite long, so
while anticipating our package breakout I split the metrics README into the
metrics directory and linked to it from the front page. I expect we'll do
something along these lines for other sections in the future.
@justenwalker (Contributor) commented:

@tgross Haven't looked yet.

But to the README point: it seems like it's getting quite big and will continue to grow. Is there a call for generating some kind of webpage and splitting the concepts out into a more organized format? It would be good to include such a 'website' in the same repo and generate the page somehow (isn't that what github.io does? I haven't tried it).

@tgross tgross mentioned this pull request Apr 1, 2016
@tgross (Contributor, Author) commented Apr 1, 2016

Opened #124 to discuss the documentation issue.

@tgross tgross mentioned this pull request Apr 1, 2016
@tgross (Contributor, Author) commented Apr 5, 2016

Had a sidebar conversation with @misterbisson and we're going to rename this feature to "telemetry." Will have a commit to update this PR today.

@tgross tgross changed the title Initial implementation of metrics Initial implementation of telemetry Apr 5, 2016
@tgross (Contributor, Author) commented Apr 5, 2016

We're going to ship this branch as a release candidate for 1.4.0, so that we can use the implementation in an autopilot pattern we're planning on publishing this week. I'd like to get this and our refactor merged in prior to the actual release.

@justenwalker (Contributor) commented:

@tgross I was only able to spare a couple of minutes today, but I did look over the code and it all looked good. If you're waiting for my feedback, I'd say it's a good RC just to get something out there and find the actual bugs in the implementation, if any.

In that respect, do you have an example containerbuddy.json that I could use to tweak and test locally? With the example shell scripts and program - I guess a docker container on hub would cover it :)

@tgross (Contributor, Author) commented Apr 5, 2016

I'd say it's a good RC just to get something out there and find the actual bugs in the implementation - if any.

What I'd like to do then is wait until I've got at least one known-working example (see below) and then merge this to master.

In that respect, do you have an example containerbuddy.json that I could use to tweak and test locally? With the example shell scripts and program - I guess a docker container on hub would cover it :)

That's what I'm working on now, so I should definitely have something for you tomorrow. I've got https://github.com/autopilotpattern/prometheus as a collector and I'm adding the use case to https://github.com/autopilotpattern/touchbase now.

Sensors were polling but the telemetry server was blocking its thread
because we didn't put it into its own goroutine. Also needed to take
a pointer for PollAction on Sensor so that we can update the exec/Cmd
struct after execution.

Added an integration test exercising these.
@tgross (Contributor, Author) commented Apr 6, 2016

I've pushed up an integration test that exercises a couple of bugs I found. (ref https://github.com/joyent/containerbuddy/releases/tag/1.4.0-rc2)

@tgross (Contributor, Author) commented Apr 7, 2016

Comments from @misterbisson in autopilotpattern/touchbase#18:

Should the URL and name be user-configured options? My reason for asking is how discovery would work in an environment where these may be different for each application.

Proposal: remove name, url, poll, and ttl as user-configurable options, so that the telemetry stanza is just the following:

  "telemetry": {
    "port": 9090,
    "sensors": [
      {...}
    ]
  }

Just so that it's clear: the name, poll, and ttl fields are necessary to advertise telemetry as a service. But I think you're asking whether we should define these values and not allow the user to configure them. While I appreciate the desire to simplify the configuration syntax, forcing a particular choice on the user is going to limit the usefulness of the feature. The major use case I'm thinking of here is having separate collectors for separate dev/prod/business-unit stacks (Prometheus won't listen for Consul "tags"). Polling requirements are likely to be application-specific. Philosophically speaking, it's the responsibility of the application development team to determine the appropriate values for configuring the telemetry service for their application.

That being said, having sane defaults couldn't hurt. In the common case someone can just default to advertising a telemetry service at the URL /telemetry on port 9090, with a heartbeat every 10 seconds and a 30-second TTL.


Along these same lines, we've mirrored the Prometheus golang API in terms of the sensor naming definitions. They give you the option to have namespace, subsystem, and name for each metric so we copied that dutifully:

    "sensors": [
       {
         "namespace": "tb",
         "subsystem": "nginx",
         "name": "connections_active",
         "help": "Current number of active client connections",
         "type": "gauge",
         "poll": 5,
         "check": ["/opt/containerbuddy/sensor.sh", "connections_active"]
       },

But the Prometheus API doesn't actually care about that, and this would be the same thing as leaving out the namespace and subsystem fields and just naming the metric tb_nginx_connections_active. The golang client's BuildFQName just concatenates the values with _ anyway. Presumably they include this to encourage good practices and to make it easier when you're calling the code as a library. But our end users aren't calling into the golang API as a library; they're just giving us strings.

It looks like it's entirely safe with our existing implementation for an end user to leave off namespace and subsystem. No code changes required. Maybe we encourage that in the docs (and the demonstration application). Leaving the option to have separate namespace and subsystem fields might make automatically generating Containerbuddy configurations in a CI pipeline easier, so I'd like to leave that in.

@tgross (Contributor, Author) commented Apr 7, 2016

It looks like it's entirely safe with our existing implementation here for an end user to leave off namespace and subsystem. No code changes required.

I've verified that was the case in this new unit test: 168df0d

@misterbisson (Contributor) commented:

Regarding configuration, we started that discussion in #27 (comment), where I proposed the following in the context of our typical service definition:

  "services": [
    {
      "name": "containerbuddy-telemetry",
      "port": <user defined>,
      "health": <not a user-configurable value>,
      "interfaces": <user defined>,
      "poll": <not a user-configurable value>,
      "ttl": <not a user-configurable value>,
      "tags": <maybe user defined>
    }
  ],

The reason I feel strongly that name should not be user-configurable (and why I didn't even suggest url) is that having this be identical across all applications is critical for anybody's ability to reasonably build monitoring tooling. @tgross outlines the argument against this:

The major use case I'm thinking of here is having separate collectors for separate dev/prod/business-unit stacks

However, I believe that argument is inconsistent with what we're arguing elsewhere, which is to isolate dev from prod, etc. with lower-level solutions and not share the discovery catalog between environments, or to use tags if shared discovery catalogs are required.

Prometheus won't listen for Consul "tags"

I would explicitly not design for that limitation. Prometheus' Consul discovery can be improved.

This eliminates any remaining parsing in config.go and moves it into
the telemetry package by passing json.RawMessage into a NewTelemetry
function.

Also fix the integration tests.
@tgross (Contributor, Author) commented Apr 8, 2016

@misterbisson I've updated per your comments. Do me a favor and review the two READMEs. I'm going to fold these changes back into #123 as well.

@tgross tgross changed the title Initial implementation of telemetry [no merge] Initial implementation of telemetry Apr 8, 2016
@tgross (Contributor, Author) commented Apr 8, 2016

Marking this [no merge]. Ref #123 (comment)

@misterbisson (Contributor) commented:

🏡 🚶
