This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

monitoring: cadvisor observables review #17239

Merged 16 commits into main from container-monitoring on Jan 13, 2021

Conversation

bobheadxi (Member) commented Jan 13, 2021

Closes https://github.com/sourcegraph/sourcegraph/issues/17158

I did some flyby changes of sorts while checking. tl;dr: we no longer use any of the disk-category metrics from https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md#prometheus-container-metrics

  • Removed container fs inodes: it seems disk metrics are not supported in OCI (Disk usage metrics for containerd google/cadvisor#2785), and the values reported in docker-compose feel rather dubious at times. Instead, I've made ContainerIOUsage a shared observable (sketched below, after this list), and the services that had relevant uses for the inodes monitoring now have this instead.
  • Docs changes and improvements, including a new cAdvisor page in the docsite
  • Reworked container restarts into something that I think works using cAdvisor metrics - demo. The left graph is the new restart query; the right graph is version "changes". The same pod re-reporting its version indicates a gap in the data, which in turn indicates a restart: at 7:17 UTC pod sourcegraph-frontend-54c86475d6-76vfb was reported OOMKilled, at 7:18 UTC the pod re-reported the same version (see right graph), and the new query reported a non-zero value at the same time (see left graph - note that it does not report non-zero for the initial pod startup at 5:27 UTC). I'm pretty optimistic this works, and it will let us get rid of the somewhat gnarly recording rule used now (which doesn't work in k8s). The alert is just a warning anyway, so it can be adjusted as needed. Closes https://github.com/sourcegraph/sourcegraph/issues/9793
  • Titles are now shared for consistency and ease of editing
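
For reference, here is a minimal sketch of what the shared ContainerIOUsage observable could look like. The field names follow the monitoring package's conventions and the query is illustrative, not this PR's literal diff; container_fs_reads_total and container_fs_writes_total are standard cAdvisor disk-I/O counters:

```go
import (
	"fmt"

	"github.com/sourcegraph/sourcegraph/monitoring/monitoring"
)

// ContainerIOUsage sketches the new shared observable: instead of the
// dubious inodes metric, chart the rate of filesystem reads and writes
// reported by cAdvisor, which works in both docker-compose and Kubernetes.
func ContainerIOUsage(containerName string, owner monitoring.ObservableOwner) monitoring.Observable {
	return monitoring.Observable{
		Name:        "fs_io_operations",
		Description: "filesystem reads and writes rate by instance over 1h",
		Query: fmt.Sprintf(
			`sum by(name) (rate(container_fs_reads_total{name=~"^%[1]s.*"}[1h]) + rate(container_fs_writes_total{name=~"^%[1]s.*"}[1h]))`,
			containerName,
		),
		NoAlert:        true,
		Owner:          owner,
		Interpretation: "Filesystem read/write activity by containers of this service.",
	}
}
```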

bobheadxi requested a review from a team on January 13, 2021 11:33
sourcegraph-bot (Contributor) commented Jan 13, 2021

Notifying subscribers in CODENOTIFY files for diff 020c4f2...1615ba6.

Notify File(s)
@christinaforney doc/admin/observability/alert_solutions.md
doc/admin/observability/dashboards.md
doc/dev/background-information/observability/cadvisor.md
doc/dev/background-information/observability/index.md
@slimsag monitoring/definitions/executor_queue.go
monitoring/definitions/frontend.go
monitoring/definitions/git_server.go
monitoring/definitions/github_proxy.go
monitoring/definitions/postgres.go
monitoring/definitions/precise_code_intel_indexer.go
monitoring/definitions/precise_code_intel_worker.go
monitoring/definitions/prometheus.go
monitoring/definitions/query_runner.go
monitoring/definitions/repo_updater.go
monitoring/definitions/searcher.go
monitoring/definitions/shared/container.go
monitoring/definitions/shared/go.go
monitoring/definitions/shared/kubernetes.go
monitoring/definitions/shared/provisioning.go
monitoring/definitions/shared/shared.go
monitoring/definitions/symbols.go
monitoring/definitions/syntect_server.go
monitoring/definitions/zoekt_index_server.go
monitoring/definitions/zoekt_web_server.go
monitoring/monitoring/README.md
monitoring/monitoring/monitoring.go
@sourcegraph/distribution monitoring/definitions/executor_queue.go
monitoring/definitions/frontend.go
monitoring/definitions/git_server.go
monitoring/definitions/github_proxy.go
monitoring/definitions/postgres.go
monitoring/definitions/precise_code_intel_indexer.go
monitoring/definitions/precise_code_intel_worker.go
monitoring/definitions/prometheus.go
monitoring/definitions/query_runner.go
monitoring/definitions/repo_updater.go
monitoring/definitions/searcher.go
monitoring/definitions/shared/container.go
monitoring/definitions/shared/go.go
monitoring/definitions/shared/kubernetes.go
monitoring/definitions/shared/provisioning.go
monitoring/definitions/shared/shared.go
monitoring/definitions/symbols.go
monitoring/definitions/syntect_server.go
monitoring/definitions/zoekt_index_server.go
monitoring/definitions/zoekt_web_server.go
monitoring/monitoring/README.md
monitoring/monitoring/monitoring.go

bobheadxi force-pushed the container-monitoring branch from 54c4573 to b58302c on January 13, 2021 12:00
codecov bot commented Jan 13, 2021

Codecov Report

Merging #17239 (1615ba6) into main (d3d3646) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main   #17239      +/-   ##
==========================================
- Coverage   51.98%   51.98%   -0.01%     
==========================================
  Files        1707     1707              
  Lines       84937    84937              
  Branches     7519     7665     +146     
==========================================
- Hits        44155    44153       -2     
- Misses      36879    36880       +1     
- Partials     3903     3904       +1     
Flag         Coverage Δ
go           51.03% <ø> (-0.01%) ⬇️
integration  30.53% <ø> (ø)
storybook    30.05% <ø> (ø)
typescript   54.29% <ø> (ø)
unit         34.80% <ø> (ø)

Impacted Files                                         Coverage Δ
.../internal/codeintel/resolvers/graphql/locations.go  83.50% <0.00%> (-2.07%) ⬇️

pecigonzalo (Contributor) commented Jan 13, 2021

I did not get a chance to fully review it yet, but have you checked https://awesome-prometheus-alerts.grep.to/rules#docker-containers?

time() - container_last_seen > 60

bobheadxi (Member, author)

> I did not get a chance to fully review it yet, but have you checked awesome-prometheus-alerts.grep.to/rules#docker-containers?
>
> time() - container_last_seen > 60

@pecigonzalo demo - this seems to return some false positives :( neat website though!

pecigonzalo (Contributor)

@bobheadxi how do you know it's a false positive? In #bots-production I can see the OOM.

bobheadxi (Member, author)

Oh, hm. I was assuming the version change would catch all restarts, so I was aligning the two, but I think the query from this site might be more accurate.

bobheadxi (Member, author) commented Jan 13, 2021

Oh, the event at the left seems to be a false positive - I don't see an OOM event in #bots-production for sourcegraph-frontend-54c86475d6-f6g5s at 5:30 UTC, and I believe it's an initial deploy, but the above query indicates the container is missing:

[screenshot]

Might be something to that though 🤔 Or maybe Slack search is bad?

pecigonzalo (Contributor)

Checking https://console.cloud.google.com/logs/query;query=resource.type%3D%22k8s_pod%22%20resource.labels.location%3D%22us-central1-f%22%20resource.labels.cluster_name%3D%22cloud%22%20frontend;timeRange=2021-01-13T05:46:51.764Z%2F2021-01-13T12:46:51.764441Z?project=sourcegraph-dev I can match many of the spikes to liveness probe failures.

I would also consider the up or container_last_seen metrics to be more accurate than version diffing, as you can have downtime without version changes.

bobheadxi (Member, author) commented Jan 13, 2021

I think you're right - switched in 7570ecb

> I would also consider the up or container_last_seen metrics to be more accurate than version diffing, as you can have downtime without version changes.

It would be, if the gauge never disappeared - but you're right regardless: the presence of the gauge is not a reliable indicator.

Seems we'll be getting a lot of alerts :)
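
For illustration, here is a minimal sketch of the gap-based check this thread converges on; the helper name, label matcher, and threshold are assumptions, not the literal contents of 7570ecb:

```go
import "fmt"

// containerMissingQuery (hypothetical helper): cAdvisor stops updating a
// container's last-seen timestamp while the container is down, so a value
// more than 60s stale suggests the container restarted or disappeared.
// 60s roughly covers a couple of missed scrape intervals; tune per setup.
func containerMissingQuery(containerName string) string {
	return fmt.Sprintf(
		`count by(name) ((time() - container_last_seen{name=~"^%s.*"}) > 60)`,
		containerName,
	)
}
```

One caveat, hinted at by the alert noise above: cAdvisor appears to keep exporting container_last_seen for a while after a container is removed, with the timestamp frozen, so the check also fires for containers removed deliberately, such as during a deploy - hence the "unless deploying" idea below.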

bobheadxi force-pushed the container-monitoring branch from 27d7022 to 4b13e97 on January 13, 2021 13:03
pecigonzalo (Contributor) commented Jan 13, 2021

> Seems we'll be getting a lot of alerts :)

Yep. It might be good to add an "unless deploying" sort of check there. I can't recall how I did this in the past, but I'll check if I can find any references. There is something to be said for not alerting on missing pods unless it's happening over a period of time: as long as the deployment is healthy, if a pod dies every now and then, it might not be relevant. Something like kube_pod_container_status_restarts_total for crashlooping, e.g. https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet and 5.1.17 in the previous link.

There is some alert grouping we could do as well (in the future, as it's a bit more complex) to ensure that, e.g., if a pod is crashlooping, you don't also get a healthcheck alert, or the other way around.

On a side note, I really like the idea of alerts like KubernetesStatefulsetUpdateNotRolledOut on that linked website.
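
For reference, the kubernetes-mixin crashloop alert mentioned above is roughly the following; the exact window, scaling, and for: duration in the linked libsonnet may differ:

```go
// Roughly kubernetes-mixin's KubePodCrashLooping expression: the restart
// rate over 15m, scaled to restarts-per-5m, must stay above zero for the
// rule's `for:` duration before the alert fires. Assumes the counter is
// scraped from kube-state-metrics under job="kube-state-metrics".
const kubePodCrashLooping = `rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0`
```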

bobheadxi (Member, author) commented Jan 13, 2021

> if a pod dies every now and then, it might not be relevant.

Fits the description of a warning then!

> Something like kube_pod_container_status_restarts_total for crashlooping

I wonder if we can get this with Critical: monitoring.Alert().GreaterOrEqual(1).For(10 * time.Minute)?
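
A minimal sketch of how that could look, assuming kube_pod_container_status_restarts_total is available (it comes from kube-state-metrics, suggested further down); the observable's name, description, and query are illustrative:

```go
import (
	"time"

	"github.com/sourcegraph/sourcegraph/monitoring/monitoring"
)

// Hypothetical crashloop observable: go critical once any restart has been
// observed within the last 10m and the condition holds for 10m straight.
var podRestarts = monitoring.Observable{
	Name:        "pod_restarts",
	Description: "pod restarts over 10m",
	Query:       `increase(kube_pod_container_status_restarts_total[10m])`,
	Critical:    monitoring.Alert().GreaterOrEqual(1).For(10 * time.Minute),
}
```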

bobheadxi (Member, author) commented Jan 13, 2021

> There is some alert grouping we could do as well (in the future, as it's a bit more complex) to ensure that, e.g., if a pod is crashlooping, you don't also get a healthcheck alert, or the other way around.

Maybe https://github.com/sourcegraph/sourcegraph/issues/13975?

> On a side note, I really like the idea of alerts like KubernetesStatefulsetUpdateNotRolledOut on that linked website.

https://github.com/sourcegraph/sourcegraph/issues/16309

pecigonzalo (Contributor)

Like 13975, but not just for .com down or up.

daxmc99 (Contributor) commented Jan 13, 2021

I wonder if adding https://github.com/kubernetes/kube-state-metrics would make sense, to allow determining when pods restart frequently.

bobheadxi added a commit that referenced this pull request Jan 15, 2021
Working container restart alerts were recently added for Kubernetes deployments as part of #17239

However, it seems that periods of container restarts during deployments are expected for certain services. This change removes the critical alert and warns on extended restarts instead.
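
In other words, roughly this shape; a sketch, not the commit's literal diff, and restartQuery is a hypothetical stand-in for the gap-based query sketched earlier:

```go
// Sketch: drop the critical alert and only warn when restarts persist well
// past a normal rollout window.
containerRestarts := monitoring.Observable{
	Name:        "container_restarts",
	Description: "container restarts",
	Query:       restartQuery, // hypothetical: the gap-based cAdvisor query
	Warning:     monitoring.Alert().GreaterOrEqual(1).For(10 * time.Minute),
}
```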