This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

monitoring: cadvisor observables review #17239

Merged 16 commits into main from container-monitoring on Jan 13, 2021

Conversation

bobheadxi (Member) commented Jan 13, 2021

Closes https://github.com/sourcegraph/sourcegraph/issues/17158

I did some flyby changes of sorts while checking. tl;dr: we no longer use any of the disk-category metrics from https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md#prometheus-container-metrics

  • Removed container fs inodes: it seems disk metrics are not supported in OCI (Disk usage metrics for containerd google/cadvisor#2785), and the values reported in docker-compose feel rather dubious at times. Instead, I've made ContainerIOUsage a shared observable (sketched below, after this list), and the services that had relevant uses for the inodes monitoring now have this instead.
  • Docs changes and improvements, including a new cAdvisor page in the docsite
  • Reworked container restarts into something that I think works using cAdvisor metrics - demo. The left graph is the new restart query; the right graph is version "changes". The same pod re-reporting its version indicates a gap in the data, which in turn indicates a restart: at 7:17 UTC pod sourcegraph-frontend-54c86475d6-76vfb was reported OOMKilled, at 7:18 UTC the pod re-reported the same version (see right graph), and the new query reported a non-zero value at the same time (see left graph - note that it does not report non-zero for the initial pod startup at 5:27 UTC). I'm pretty optimistic this works, and it will let us get rid of the somewhat gnarly recording rule used now (which doesn't work in k8s). The alert is just a warning anyway, so it can be adjusted as needed. Closes https://github.com/sourcegraph/sourcegraph/issues/9793
  • Titles are now shared for consistency and ease of editing
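
For reference, here is a minimal sketch of what the shared ContainerIOUsage observable could look like. The field names follow the monitoring package's conventions and the query is illustrative, not this PR's literal diff; container_fs_reads_total and container_fs_writes_total are standard cAdvisor disk-I/O counters:

```go
import (
	"fmt"

	"github.com/sourcegraph/sourcegraph/monitoring/monitoring"
)

// ContainerIOUsage sketches the new shared observable: instead of the
// dubious inodes metric, chart the rate of filesystem reads and writes
// reported by cAdvisor, which works in both docker-compose and Kubernetes.
func ContainerIOUsage(containerName string, owner monitoring.ObservableOwner) monitoring.Observable {
	return monitoring.Observable{
		Name:        "fs_io_operations",
		Description: "filesystem reads and writes rate by instance over 1h",
		Query: fmt.Sprintf(
			`sum by(name) (rate(container_fs_reads_total{name=~"^%[1]s.*"}[1h]) + rate(container_fs_writes_total{name=~"^%[1]s.*"}[1h]))`,
			containerName,
		),
		NoAlert:        true,
		Owner:          owner,
		Interpretation: "Filesystem read/write activity by containers of this service.",
	}
}
```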

bobheadxi requested a review from a team on January 13, 2021 11:33
sourcegraph-bot (Contributor) commented Jan 13, 2021

Notifying subscribers in CODENOTIFY files for diff 020c4f2...1615ba6.

Notify File(s)
@christinaforney doc/admin/observability/alert_solutions.md
doc/admin/observability/dashboards.md
doc/dev/background-information/observability/cadvisor.md
doc/dev/background-information/observability/index.md
@slimsag monitoring/definitions/executor_queue.go
monitoring/definitions/frontend.go
monitoring/definitions/git_server.go
monitoring/definitions/github_proxy.go
monitoring/definitions/postgres.go
monitoring/definitions/precise_code_intel_indexer.go
monitoring/definitions/precise_code_intel_worker.go
monitoring/definitions/prometheus.go
monitoring/definitions/query_runner.go
monitoring/definitions/repo_updater.go
monitoring/definitions/searcher.go
monitoring/definitions/shared/container.go
monitoring/definitions/shared/go.go
monitoring/definitions/shared/kubernetes.go
monitoring/definitions/shared/provisioning.go
monitoring/definitions/shared/shared.go
monitoring/definitions/symbols.go
monitoring/definitions/syntect_server.go
monitoring/definitions/zoekt_index_server.go
monitoring/definitions/zoekt_web_server.go
monitoring/monitoring/README.md
monitoring/monitoring/monitoring.go
@sourcegraph/distribution monitoring/definitions/executor_queue.go
monitoring/definitions/frontend.go
monitoring/definitions/git_server.go
monitoring/definitions/github_proxy.go
monitoring/definitions/postgres.go
monitoring/definitions/precise_code_intel_indexer.go
monitoring/definitions/precise_code_intel_worker.go
monitoring/definitions/prometheus.go
monitoring/definitions/query_runner.go
monitoring/definitions/repo_updater.go
monitoring/definitions/searcher.go
monitoring/definitions/shared/container.go
monitoring/definitions/shared/go.go
monitoring/definitions/shared/kubernetes.go
monitoring/definitions/shared/provisioning.go
monitoring/definitions/shared/shared.go
monitoring/definitions/symbols.go
monitoring/definitions/syntect_server.go
monitoring/definitions/zoekt_index_server.go
monitoring/definitions/zoekt_web_server.go
monitoring/monitoring/README.md
monitoring/monitoring/monitoring.go

bobheadxi force-pushed the container-monitoring branch from 54c4573 to b58302c on January 13, 2021 12:00
codecov bot commented Jan 13, 2021

Codecov Report

Merging #17239 (1615ba6) into main (d3d3646) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main   #17239      +/-   ##
==========================================
- Coverage   51.98%   51.98%   -0.01%     
==========================================
  Files        1707     1707              
  Lines       84937    84937              
  Branches     7519     7665     +146     
==========================================
- Hits        44155    44153       -2     
- Misses      36879    36880       +1     
- Partials     3903     3904       +1     
Flag         Coverage Δ
go           51.03% <ø> (-0.01%) ⬇️
integration  30.53% <ø> (ø)
storybook    30.05% <ø> (ø)
typescript   54.29% <ø> (ø)
unit         34.80% <ø> (ø)

Impacted Files                                         Coverage Δ
.../internal/codeintel/resolvers/graphql/locations.go  83.50% <0.00%> (-2.07%) ⬇️

pecigonzalo (Contributor) commented Jan 13, 2021

I did not get a chance to fully review it yet, but have you checked https://awesome-prometheus-alerts.grep.to/rules#docker-containers?

time() - container_last_seen > 60

bobheadxi (Member, author)

> I did not get a chance to fully review it yet, but have you checked awesome-prometheus-alerts.grep.to/rules#docker-containers?
>
> time() - container_last_seen > 60

@pecigonzalo demo - this seems to return some false positives :( neat website though!

pecigonzalo (Contributor)

@bobheadxi how do you know it's a false positive? In #bots-production I can see the OOM.

bobheadxi (Member, author)

Oh, hm. I was assuming the version change would catch all restarts, so I was aligning the two, but I think the query from this site might be more accurate.

bobheadxi (Member, author) commented Jan 13, 2021

Oh, the event at the left seems to be a false positive - I don't see an OOM event in #bots-production for sourcegraph-frontend-54c86475d6-f6g5s at 5:30 UTC, and I believe it's an initial deploy, but the above query indicates the container is missing:

[screenshot]

Might be something to that though 🤔 Or maybe Slack search is bad?

pecigonzalo (Contributor)

Checking https://console.cloud.google.com/logs/query;query=resource.type%3D%22k8s_pod%22%20resource.labels.location%3D%22us-central1-f%22%20resource.labels.cluster_name%3D%22cloud%22%20frontend;timeRange=2021-01-13T05:46:51.764Z%2F2021-01-13T12:46:51.764441Z?project=sourcegraph-dev I can match many of the spikes to liveness probe failures.

I would also consider the up or container_last_seen metrics to be more accurate than version diffing, as you can have downtime without version changes.

bobheadxi (Member, author) commented Jan 13, 2021

I think you're right - switched in 7570ecb

> I would also consider the up or container_last_seen metrics to be more accurate than version diffing, as you can have downtime without version changes.

It would be, if the gauge never disappeared - but you're right regardless: the presence of the gauge is not a reliable indicator.

Seems we'll be getting a lot of alerts :)
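
For illustration, here is a minimal sketch of the gap-based check this thread converges on; the helper name, label matcher, and threshold are assumptions, not the literal contents of 7570ecb:

```go
import "fmt"

// containerMissingQuery (hypothetical helper): cAdvisor stops updating a
// container's last-seen timestamp while the container is down, so a value
// more than 60s stale suggests the container restarted or disappeared.
// 60s roughly covers a couple of missed scrape intervals; tune per setup.
func containerMissingQuery(containerName string) string {
	return fmt.Sprintf(
		`count by(name) ((time() - container_last_seen{name=~"^%s.*"}) > 60)`,
		containerName,
	)
}
```

One caveat, hinted at by the alert noise above: cAdvisor appears to keep exporting container_last_seen for a while after a container is removed, with the timestamp frozen, so the check also fires for containers removed deliberately, such as during a deploy - hence the "unless deploying" idea below.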

bobheadxi force-pushed the container-monitoring branch from 27d7022 to 4b13e97 on January 13, 2021 13:03
pecigonzalo (Contributor) commented Jan 13, 2021

> Seems we'll be getting a lot of alerts :)

Yep. It might be good to add an "unless deploying" sort of check there. I can't recall how I did this in the past, but I'll check if I can find any references. There is something to be said for not alerting on missing pods unless it's happening over a period of time: as long as the deployment is healthy, if a pod dies every now and then, it might not be relevant. Something like kube_pod_container_status_restarts_total for crashlooping, e.g. https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet and 5.1.17 in the previous link.

There is some alert grouping we could do as well (in the future, as it's a bit more complex) to ensure that, e.g., if a pod is crashlooping, you don't also get a healthcheck alert, or the other way around.

On a side note, I really like the idea of alerts like KubernetesStatefulsetUpdateNotRolledOut on that linked website.
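
For reference, the kubernetes-mixin crashloop alert mentioned above is roughly the following; the exact window, scaling, and for: duration in the linked libsonnet may differ:

```go
// Roughly kubernetes-mixin's KubePodCrashLooping expression: the restart
// rate over 15m, scaled to restarts-per-5m, must stay above zero for the
// rule's `for:` duration before the alert fires. Assumes the counter is
// scraped from kube-state-metrics under job="kube-state-metrics".
const kubePodCrashLooping = `rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0`
```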

bobheadxi (Member, author) commented Jan 13, 2021

> if a pod dies every now and then, it might not be relevant.

Fits the description of a warning then!

> Something like kube_pod_container_status_restarts_total for crashlooping

I wonder if we can get this with Critical: monitoring.Alert().GreaterOrEqual(1).For(10 * time.Minute)?
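
A minimal sketch of how that could look, assuming kube_pod_container_status_restarts_total is available (it comes from kube-state-metrics, suggested further down); the observable's name, description, and query are illustrative:

```go
import (
	"time"

	"github.com/sourcegraph/sourcegraph/monitoring/monitoring"
)

// Hypothetical crashloop observable: go critical once any restart has been
// observed within the last 10m and the condition holds for 10m straight.
var podRestarts = monitoring.Observable{
	Name:        "pod_restarts",
	Description: "pod restarts over 10m",
	Query:       `increase(kube_pod_container_status_restarts_total[10m])`,
	Critical:    monitoring.Alert().GreaterOrEqual(1).For(10 * time.Minute),
}
```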

bobheadxi (Member, author) commented Jan 13, 2021

> There is some alert grouping we could do as well (in the future, as it's a bit more complex) to ensure that, e.g., if a pod is crashlooping, you don't also get a healthcheck alert, or the other way around.

Maybe https://github.com/sourcegraph/sourcegraph/issues/13975?

> On a side note, I really like the idea of alerts like KubernetesStatefulsetUpdateNotRolledOut on that linked website.

https://github.com/sourcegraph/sourcegraph/issues/16309

pecigonzalo (Contributor)

Like 13975, but not just for .com down or up.

daxmc99 (Contributor) commented Jan 13, 2021

I wonder if adding https://github.com/kubernetes/kube-state-metrics would make sense, to allow determining when pods restart frequently.

bobheadxi added a commit that referenced this pull request Jan 15, 2021
Working container restart alerts were recently added for Kubernetes deployments as part of #17239

However, it seems that periods of container restarts during deployments are expected for certain services. This change removes the critical alert and warns on extended restarts instead.
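
In other words, roughly this shape; a sketch, not the commit's literal diff, and restartQuery is a hypothetical stand-in for the gap-based query sketched earlier:

```go
// Sketch: drop the critical alert and only warn when restarts persist well
// past a normal rollout window.
containerRestarts := monitoring.Observable{
	Name:        "container_restarts",
	Description: "container restarts",
	Query:       restartQuery, // hypothetical: the gap-based cAdvisor query
	Warning:     monitoring.Alert().GreaterOrEqual(1).For(10 * time.Minute),
}
```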