monitoring: cadvisor observables review #17239
Conversation
Notifying subscribers in CODENOTIFY files for diff 020c4f2...1615ba6.
Codecov Report

@@            Coverage Diff             @@
##             main   #17239      +/-   ##
==========================================
- Coverage   51.98%   51.98%   -0.01%
==========================================
  Files        1707     1707
  Lines       84937    84937
  Branches     7519     7665    +146
==========================================
- Hits        44155    44153       -2
- Misses      36879    36880       +1
- Partials     3903     3904       +1
I did not get a chance to fully review it yet, but have you checked https://awesome-prometheus-alerts.grep.to/rules#docker-containers?
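(For reference, the rules on that page are plain cAdvisor-based expressions. A representative one, paraphrased from memory - the 60s threshold and labels here are assumptions, not a verbatim copy of the site:)

```yaml
groups:
  - name: docker-containers
    rules:
      - alert: ContainerKilled
        # container_last_seen is a standard cAdvisor metric; if Prometheus
        # has not seen the container for over a minute, assume it was killed.
        expr: time() - container_last_seen > 60
        labels:
          severity: warning
```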
@pecigonzalo demo - this seems to return some false positives :( Neat website though!
@bobheadxi how do you know it's a false positive? In #bots-production I can see the OOM.
Oh hm, I was assuming the version change would catch all restarts, so I was aligning the two, but I think the query on this site might be more accurate.
Oh, the event at the left seems to be a false positive - I don't see a corresponding OOM event in #bots-production. Might be something to that though 🤔 Or maybe Slack search is bad?
Checking https://console.cloud.google.com/logs/query;query=resource.type%3D%22k8s_pod%22%20resource.labels.location%3D%22us-central1-f%22%20resource.labels.cluster_name%3D%22cloud%22%20frontend;timeRange=2021-01-13T05:46:51.764Z%2F2021-01-13T12:46:51.764441Z?project=sourcegraph-dev, I can match many of the spikes to liveness probe failures. I would also consider …
I think you're right - switched in 7570ecb
It would be, without the gauge disappearing, but you're right regardless in that the presence of the gauge is not a reliable indicator. Seems we'll be getting a lot of alerts :)
Yep. It might be good to add an … There is some alert grouping we could do as well (in the future, as it's a bit more complex) to ensure that, e.g., if a pod is crashlooping, you don't get a healthcheck alert as well, or the other way around. On a side note, I really like the idea of alerts like the one in …
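For context, that kind of grouping maps onto Alertmanager's inhibition rules. A minimal sketch - the alert names `PodCrashLooping` and `HealthcheckFailure` are hypothetical placeholders, not alerts we have today:

```yaml
# alertmanager.yml (excerpt)
inhibit_rules:
  # While the crashloop alert fires for a pod, suppress the healthcheck
  # alert for the same namespace/pod so only one notification goes out.
  - source_match:
      alertname: PodCrashLooping
    target_match:
      alertname: HealthcheckFailure
    equal: ['namespace', 'pod']
```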
Fits the description of a …
I wonder if we can get this with …
maybe https://github.com/sourcegraph/sourcegraph/issues/13975?
Like 13975, but not just for .com down or up.
I wonder if adding https://github.com/kubernetes/kube-state-metrics would make sense, to allow determining when pods are restarting frequently.
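kube-state-metrics exposes a `kube_pod_container_status_restarts_total` counter, so frequent-restart detection could be a one-line rule. A sketch, with the 5-restarts-per-hour threshold chosen arbitrarily:

```yaml
groups:
  - name: pod-restarts
    rules:
      - alert: PodRestartingFrequently
        # increase() over an hour approximates the per-container restart
        # rate from the kube-state-metrics restart counter.
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        labels:
          severity: warning
```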
Working container restart alerts were recently added for Kubernetes deployments as part of #17239. However, it seems that periods of container restarts during deployments are expected for certain services. This change removes the critical alert and warns on extended restarts instead (see the sketch after this comment).
Closes https://github.com/sourcegraph/sourcegraph/issues/17158
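One way to express "extended restarts" rather than any restart is to require the condition to hold before firing. A hedged sketch reusing the kube-state-metrics counter - the actual change may use different metrics and thresholds:

```yaml
- alert: ContainerRestartingTooLong
  # Restarts still appearing after 10 straight minutes; the `for` clause
  # keeps the expected restarts during a normal deploy from alerting.
  expr: increase(kube_pod_container_status_restarts_total[10m]) > 0
  for: 10m
  labels:
    severity: warning
```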
I did some flyby changes of sorts while checking. tl;dr: no `disk` category metrics from https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md#prometheus-container-metrics. `ContainerIOUsage` is now a shared observable, and the services that had relevant uses for the inodes monitoring now have this instead.

`sourcegraph-frontend-54c86475d6-76vfb` was reported OOMKilled; at 7:18 UTC the pod re-reports the same version (see right graph), and the new query reports a non-zero value at the same time (see left graph - note that it doesn't report non-zero for the initial pod startup at 5:27 UTC). I'm pretty optimistic this works, and it'll let us get rid of the somewhat gnarly recording rule used now (which doesn't work in k8s). The alert is just a warning anyway, so it can be adjusted as needed.

Closes https://github.com/sourcegraph/sourcegraph/issues/9793
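As an illustration of the general approach (not the actual query shipped in this PR), a cAdvisor-based restart signal could look roughly like this:

```yaml
- alert: ContainerRestarted
  # container_start_time_seconds is a standard cAdvisor metric; a changed
  # start time within the window suggests a restart. Caveat: in Kubernetes
  # a restarted container can show up as a new series (new container ID),
  # so changes() per series may undercount - treat this as a sketch only.
  expr: changes(container_start_time_seconds{name!=""}[10m]) > 0
  labels:
    severity: warning
```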