
Add etcd metrics, Prometheus scrapes, and Grafana dash #175

Merged
dghubble merged 1 commit into master from etcd-monitoring on Apr 4, 2018

Conversation

dghubble (Member) commented on Mar 29, 2018

  • Use etcd v3.3 --listen-metrics-urls to expose only metrics data via http://0.0.0.0:2381 on controllers
  • Add Prometheus discovery for etcd peers on controller nodes (see the scrape sketch after this list)
  • Enables etcd-related alerts and populates the etcd Grafana dashboard
  • Note: these benefits require the optional Prometheus and Grafana addons to be applied.
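
A minimal sketch of what the Prometheus side could look like, assuming each controller exposes plain-HTTP etcd metrics on port 2381 via --listen-metrics-urls. The job name and controller hostnames are illustrative placeholders; the PR's actual configuration may use a different discovery mechanism than static targets.

```yaml
# prometheus.yaml (excerpt) - hypothetical scrape job for etcd metrics.
# Hostnames below are placeholders, not names from this repository.
scrape_configs:
  - job_name: etcd
    scheme: http
    metrics_path: /metrics
    static_configs:
      - targets:
          - controller-0.example.com:2381
          - controller-1.example.com:2381
          - controller-2.example.com:2381
```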

  • Hold off on allowing workers firewall access (can't think of any concrete concern with workloads seeing this)
  • Move Prometheus to a controller node for a while (maybe drop)
  • Adjust firewall rules now that Prometheus can run on a controller, rather than a worker

Made possible by:

[screenshot from 2018-03-28 22-30-52]

Closes #114

dghubble (Member, Author) commented on Mar 30, 2018

The HighNumberOfFailedGRPCRequests Prometheus rule triggers noisy alerts for etcdserverpb.Watch with grpc_code Unavailable. The rule gets triggered across platforms, even with very fast SSDs, and I can manually use etcdctl to watch keys without issue. That rule probably hasn't been exercised much, since few distros monitor etcd yet.
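
For context, the upstream rule is roughly a ratio of non-OK gRPC responses per service and method, along these lines (a sketch, not the verbatim rule; metric and label names follow etcd's gRPC instrumentation, and the threshold is illustrative):

```yaml
# Approximate shape of HighNumberOfFailedGRPCRequests. Cancelled Watch
# streams reported as grpc_code="Unavailable" inflate the numerator even
# on an otherwise healthy cluster.
- alert: HighNumberOfFailedGRPCRequests
  expr: |
    100 * sum(rate(grpc_server_handled_total{job="etcd",grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method)
      /
    sum(rate(grpc_server_handled_total{job="etcd"}[5m])) BY (grpc_service, grpc_method)
      > 1
  for: 10m
```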

#176 was opened to address an alert that fired specifically on AWS because of slow disks.

dghubble force-pushed the etcd-monitoring branch 2 times, most recently from 5adc367 to 7eb1f0a on March 31, 2018
brancz commented on Apr 3, 2018

We've seen that a couple of times before, and I believe it's because of etcd-io/etcd#9166, for which the "fix" doesn't seem to have landed in 3.3. Even then, I believe it will require a change to the alerting rules to ignore cancelled connections.
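
One possible mitigation while the upstream fix is pending, assuming the noise really is cancelled Watch streams surfacing as Unavailable: exclude that gRPC service from the failed-request numerator. The matcher below is a hypothetical sketch, not the change that eventually landed upstream.

```yaml
# Hypothetical tweak: ignore etcdserverpb.Watch responses, since cancelled
# watch streams show up as grpc_code="Unavailable".
- alert: HighNumberOfFailedGRPCRequests
  expr: |
    100 * sum(rate(grpc_server_handled_total{job="etcd",grpc_code!="OK",grpc_service!="etcdserverpb.Watch"}[5m])) BY (grpc_service, grpc_method)
      /
    sum(rate(grpc_server_handled_total{job="etcd"}[5m])) BY (grpc_service, grpc_method)
      > 1
  for: 10m
```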

dghubble (Member, Author) commented on Apr 4, 2018

For now I think I'll drop the two noisy alerts. Even without them, this change adds etcd alerts which weren't active before and populates the one page in Grafana that was empty before.

It's a good feedback loop too: using the new etcd alerts motivated switching to faster disks on AWS for the v1.10 release (other platforms were fast enough).

* Use etcd v3.3 --listen-metrics-urls to expose only metrics
data via http://0.0.0.0:2381 on controllers
* Add Prometheus discovery for etcd peers on controller nodes
* Temporarily drop two noisy Prometheus alerts
dghubble merged commit d770393 into master on Apr 4, 2018
dghubble deleted the etcd-monitoring branch on April 5, 2018
dghubble added a commit that referenced this pull request on Apr 8, 2018
* Expose etcd metrics to workers so Prometheus can
run on a worker, rather than a controller
* Drop temporary firewall rules allowing Prometheus
to run on a controller and scrape targets
* Related to #175
dghubble-robot pushed a commit to poseidon/terraform-aws-kubernetes that referenced this pull request on Apr 8, 2018
* Expose etcd metrics to workers so Prometheus can
run on a worker, rather than a controller
* Drop temporary firewall rules allowing Prometheus
to run on a controller and scrape targets
* Related to poseidon/typhoon#175
dghubble-robot pushed a commit to poseidon/terraform-google-kubernetes that referenced this pull request on Apr 8, 2018
* Expose etcd metrics to workers so Prometheus can
run on a worker, rather than a controller
* Drop temporary firewall rules allowing Prometheus
to run on a controller and scrape targets
* Related to poseidon/typhoon#175