
Add etcd metrics, Prometheus scrapes, and Grafana dash #175

Merged
dghubble merged 1 commit into master from etcd-monitoring on Apr 4, 2018

Conversation

dghubble (Member) commented on Mar 29, 2018

  • Use etcd v3.3 --listen-metrics-urls to expose only metrics data via http://0.0.0.0:2381 on controllers
  • Add Prometheus discovery for etcd peers on controller nodes (see the scrape sketch after this list)
  • Enables etcd-related alerts and populates the etcd Grafana dashboard
  • Note: these benefits require the optional Prometheus and Grafana addons to be applied.
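
A minimal sketch of what the Prometheus side could look like, assuming each controller exposes plain-HTTP etcd metrics on port 2381 via --listen-metrics-urls. The job name and controller hostnames are illustrative placeholders; the PR's actual configuration may use a different discovery mechanism than static targets.

```yaml
# prometheus.yaml (excerpt) - hypothetical scrape job for etcd metrics.
# Hostnames below are placeholders, not names from this repository.
scrape_configs:
  - job_name: etcd
    scheme: http
    metrics_path: /metrics
    static_configs:
      - targets:
          - controller-0.example.com:2381
          - controller-1.example.com:2381
          - controller-2.example.com:2381
```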

  • Hold off on allowing workers firewall access (can't think of any concrete concern with workloads seeing this)
  • Move Prometheus to a controller node for a while (maybe drop)
  • Adjust firewall rules now that Prometheus can run on a controller, rather than a worker

Made possible by:

[screenshot from 2018-03-28 22-30-52]

Closes #114

dghubble (Member, Author) commented on Mar 30, 2018

The HighNumberOfFailedGRPCRequests Prometheus rule triggers noisy alerts for etcdserverpb.Watch with grpc_code Unavailable. The rule gets triggered across platforms, even with very fast SSDs, and I can manually use etcdctl to watch keys without issue. That rule probably hasn't been exercised much, since few distros monitor etcd yet.
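
For context, the upstream rule is roughly a ratio of non-OK gRPC responses per service and method, along these lines (a sketch, not the verbatim rule; metric and label names follow etcd's gRPC instrumentation, and the threshold is illustrative):

```yaml
# Approximate shape of HighNumberOfFailedGRPCRequests. Cancelled Watch
# streams reported as grpc_code="Unavailable" inflate the numerator even
# on an otherwise healthy cluster.
- alert: HighNumberOfFailedGRPCRequests
  expr: |
    100 * sum(rate(grpc_server_handled_total{job="etcd",grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method)
      /
    sum(rate(grpc_server_handled_total{job="etcd"}[5m])) BY (grpc_service, grpc_method)
      > 1
  for: 10m
```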

#176 was opened to address an alert that fired specifically on AWS because of slow disks.

dghubble force-pushed the etcd-monitoring branch 2 times, most recently from 5adc367 to 7eb1f0a on March 31, 2018
brancz commented on Apr 3, 2018

We've seen that a couple of times before, and I believe it's because of etcd-io/etcd#9166, for which the "fix" doesn't seem to have landed in 3.3. Even then, I believe it will require a change to the alerting rules to ignore cancelled connections.
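
One possible mitigation while the upstream fix is pending, assuming the noise really is cancelled Watch streams surfacing as Unavailable: exclude that gRPC service from the failed-request numerator. The matcher below is a hypothetical sketch, not the change that eventually landed upstream.

```yaml
# Hypothetical tweak: ignore etcdserverpb.Watch responses, since cancelled
# watch streams show up as grpc_code="Unavailable".
- alert: HighNumberOfFailedGRPCRequests
  expr: |
    100 * sum(rate(grpc_server_handled_total{job="etcd",grpc_code!="OK",grpc_service!="etcdserverpb.Watch"}[5m])) BY (grpc_service, grpc_method)
      /
    sum(rate(grpc_server_handled_total{job="etcd"}[5m])) BY (grpc_service, grpc_method)
      > 1
  for: 10m
```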

dghubble (Member, Author) commented on Apr 4, 2018

For now I think I'll drop the two noisy alerts. Even without them, this change adds etcd alerts which weren't active before and populates the one page in Grafana that was empty before.

It's a good feedback loop too: using the new etcd alerts motivated switching to faster disks on AWS for the v1.10 release (other platforms were fast enough).

* Use etcd v3.3 --listen-metrics-urls to expose only metrics
data via http://0.0.0.0:2381 on controllers
* Add Prometheus discovery for etcd peers on controller nodes
* Temporarily drop two noisy Prometheus alerts
dghubble merged commit d770393 into master on Apr 4, 2018
dghubble deleted the etcd-monitoring branch on April 5, 2018
dghubble added a commit that referenced this pull request on Apr 8, 2018
* Expose etcd metrics to workers so Prometheus can
run on a worker, rather than a controller
* Drop temporary firewall rules allowing Prometheus
to run on a controller and scrape targets
* Related to #175
dghubble-robot pushed a commit to poseidon/terraform-aws-kubernetes that referenced this pull request on Apr 8, 2018
* Expose etcd metrics to workers so Prometheus can
run on a worker, rather than a controller
* Drop temporary firewall rules allowing Prometheus
to run on a controller and scrape targets
* Related to poseidon/typhoon#175
dghubble-robot pushed a commit to poseidon/terraform-google-kubernetes that referenced this pull request on Apr 8, 2018
* Expose etcd metrics to workers so Prometheus can
run on a worker, rather than a controller
* Drop temporary firewall rules allowing Prometheus
to run on a controller and scrape targets
* Related to poseidon/typhoon#175