-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[no merge] Initial implementation of telemetry #119
Conversation
dff75bd
to
f9f4ccb
Compare
Awesome (er, "awesome")... our zombie reaping code is breaking in the integration test. But not on my machine (or in https://travis-ci.org/tgross/containerbuddy/builds/119049690 with the identical branch) even after |
Seems like a race. could be that the threshold needs to be increased or the test needs to be tightened up some more. The assumption is that the zombies will be reaped in a timely manner - which may not be the case on travis (and may not be deterministic). Actually - It looks like we should check just # of zombies owned by PID1 - since that's the reaping process. It can't reap processes that it doesn't own:
The additional zombie hasn't been re-parented, so of course it won't get reaped. |
Good catch, and I think your suggested fix for checking zombies owned only by PID1 makes sense. I'll open a new PR for that and rebase this PR on that so we can review it cleanly. My next question is how the zombie wasn't reparented, but that's probably out-of-scope for us (this is a kernel mechanism as far as I know). |
Making PollAction a method of Pollable will allow us to avoid a circular import when we split out packages from the main containerbuddy package.
Packages that include Pollables are going to want to validation their own configuration, so by moving the parsing and IPs functions into a utils package we can do so without any circular references.
Created a metrics package that leans on the official Prometheus client to serve metrics. Adds a new Pollable for Sensors that record metrics received from user- defined functions.
I think it hasn't been re-parented because the parent already exists - still sleeping
|
MustRegister panics rather than returning an error (thanks!) if we register an invalid collector, even just an invalid name. This would give end-users a big ugly golang stack trace that they have to dig the error message out of. So we'll catch the panic so we can return an error.
Also moved the similar reset we do for Service.CheckHealth and Backend.OnChange into defers so that we can ensure they happen even if the function errors-out. Includes tests and new testdata script for metrics package.
Unit tests of recording metrics for each of the four collector types and fetching them via the web API. Cleaned up some of the existing tests so that metrics names are more directly tied-back to test cases. Fixed a bug in `Sensor.record()` where the type switch doesn't work because the prometheus collector structs are actually interfaces. So we end up having to do some ugly string checking.
I've added a whole mess of unit tests, and fixed two minor bugs along the way with making sure metrics check commands reset and not noticing some odd prometheus API choices with their collectors. Next big step is to add a README. |
@misterbisson I've added README updates in this commit 2783e5e |
Added documentation for metrics feature. The README is getting quite long, so while anticipating our package breakout I split the metrics README into the metrics directory and linked to it from the front page. I expect we'll do something along these lines for other sections in the future.
@tgross Haven't looked yet. But to the README point - it seems like that is getting quite big and will continue to grow. Is there call for generating some kind of webpage and splitting out the concepts into a more organized format? Would be good to include such a 'website' in the same repo and generate the page somehow. (isn't that what github.io does? I haven't tried it) |
Opened #124 to discuss the documentation issue. |
Had a sidebar conversation with @misterbisson and we're going to rename this feature to "telemetry." Will have a commit to update this PR today. |
We're going to ship this branch as a release candidate for 1.4.0, so that we can use the implementation in an autopilot pattern we're planning on publishing this week. I'd like to get this and our refactor merged in prior to the actual release. |
@tgross I only was able to spare a couple minutes today, but I did a look over of the code and It all looked good. If you're waiting for my feedback, I'd say it's a good RC just to get something out there and find the actual bugs in the implementation - if any. In that respect, do you have an example |
What I'd like to do then is wait until I've got at least one known-working example (see below) and then merge this to master.
That's what I'm working on now, so I should definitely have something for you tomorrow. I've got https://github.com/autopilotpattern/prometheus as a collector and I'm adding the use case to https://github.com/autopilotpattern/touchbase now. |
Sensors were polling but the telemetry server was blocking its thread because we didn't put it into its own goroutine. Also needed to take a pointer for PollAction on Sensor so that we can update the exec/Cmd struct after execution. Added an integration test exercising these.
I've pushed up an integration test that exercises a couple of bugs I found. (ref https://github.com/joyent/containerbuddy/releases/tag/1.4.0-rc2) |
Comments from @misterbisson in autopilotpattern/touchbase#18:
Just so that it's clear, the That being said, having sane defaults couldn't hurt. In the common case someone can just default to advertising a Along these same lines, we've mirrored the Prometheus golang API in terms of the sensor naming definitions. They give you the option to have "sensors": [
{
"namespace": "tb",
"subsystem": "nginx",
"name": "connections_active",
"help": "Current number of active client connections",
"type": "gauge",
"poll": 5,
"check": ["/opt/containerbuddy/sensor.sh", "connections_active"]
}, But the Prometheus API doesn't actually care about that, and this would be the same thing as leaving out the It looks like it's entirely safe with our existing implementation here for an end user to leave off |
I've verified that was the case in this new unit test: 168df0d |
Regarding configuration, we started that discussion in #27 (comment), where I proposed the following in the context of our typical service definition: "services": [
{
"name": "containerbuddy-telemetry",
"port": <user defined>,
"health": <not a user-configurable value>,
"interfaces": <user defined>,
"poll": <not a user-configurable value>,
"ttl": <not a user-configurable value>,
"tags": <maybe user defined>
}
], The reason I I feel strongly that
However, I believe that argument is inconsistent with what we're arguing elsewhere, which is to isolate dev from prod, etc, lower level solutions and not sharing the discovery catalog between environments or using tags if shared discovery catalogs are required.
I would explicitly not design for that limitation. Prometheus' Consul discovery can be improved. |
This eliminates any remaining parsing in config.go and moves it into the telemetry package by passing json.RawMessage into a NewTelemetry function. Also fix the integration tests.
@misterbisson I've updated per your comments. Do me a favor and review the two READMEs. I'm going to fold these changes back into #123 as well. |
Marking this |
🏡 🚶 |
This is my initial pass at #27.
Currently integration tests will pass but the metrics feature has no tests itself so those tests just verify that the whole thing builds and won't break if we're not including metrics.(update: has tests now!) So this isn't ready to merge but it gives an idea of the approach.Next items TODO:
maybe some integration tests?going to work up a prometheus blueprint first so I'm ok with leaving that off from this PR.cc @justenwalker and @misterbisson
Note that this PR includes all of #118 currently, so when that's merged this PR should be cleaner.