This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Remove redundant alerts #5052

Merged
merged 4 commits into from
Nov 10, 2020

Conversation

suiguoxin
Member

  • PaiServicePodNotRunning / PaiServicePodNotRunning inhibits NodeNotReady;
  • PaiServiceNotUp detects the status of job-exporter, node-exporter, and watchdog. Because job-exporter and node-exporter are already covered by PaiServicePodNotRunning / PaiServicePodNotRunning, only watchdog needs to be monitored by this alert.
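
For readers unfamiliar with inhibition, here is a minimal Python sketch of the general idea: a firing "source" alert mutes a "target" alert when both carry the same values for a chosen set of labels. This is an illustration only, not the Alertmanager implementation and not the rule added in this PR. The function name `is_inhibited` is hypothetical, the source/target roles simply follow the wording of the first bullet (swap them if the actual rule points the other way), and `node_name` is the label discussed later in the review.

```python
from typing import Dict, Iterable, List

# An alert is modelled here as nothing more than its label set.
Alert = Dict[str, str]


def is_inhibited(target: Alert, firing: Iterable[Alert],
                 source_name: str, target_name: str,
                 equal: List[str]) -> bool:
    """Return True if `target` is muted by some firing source alert.

    Hypothetical helper: a target is muted when a firing alert named
    `source_name` has the same value as the target for every label
    listed in `equal` (a missing label is treated as an empty value).
    """
    if target.get("alertname") != target_name:
        return False
    for alert in firing:
        if alert.get("alertname") != source_name:
            continue
        if all(alert.get(label, "") == target.get(label, "") for label in equal):
            return True
    return False


# Assumed example data: node names are made up for illustration.
firing_alerts = [{"alertname": "PaiServicePodNotRunning", "node_name": "node-a"}]
same_node = {"alertname": "NodeNotReady", "node_name": "node-a"}
other_node = {"alertname": "NodeNotReady", "node_name": "node-b"}

muted_same = is_inhibited(same_node, firing_alerts,
                          "PaiServicePodNotRunning", "NodeNotReady", ["node_name"])
muted_other = is_inhibited(other_node, firing_alerts,
                           "PaiServicePodNotRunning", "NodeNotReady", ["node_name"])
```

With this data, only the alert on the same node is muted; the alert on the other node still fires, which is why the choice of the matching label matters.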

@suiguoxin suiguoxin requested a review from Binyang2014 November 4, 2020 05:03
@coveralls

coveralls commented Nov 4, 2020

Coverage Status

Coverage remained the same at 34.223% when pulling 0315bea on suiguoxin:alert-inhibit into 6cb7f8d on microsoft:master.

src/prometheus/deploy/alerting/pai-services.rules (outdated)
@@ -23,7 +23,7 @@ Solutions:
This is a kind of alert from alert manager, and is reported by watchdog service. Watchdog gets such metrics from Kubernetes API. Example metrics is like:

```
-pai_node_count{disk_pressure="false",instance="10.0.0.1:9101",job="pai_serivce_exporter",memory_pressure="false",name="10.0.0.2",out_of_disk="false",pai_service_name="watchdog",ready="true",scraped_from="watchdog-5ddd945975-kwhpr"}
+pai_node_count{disk_pressure="false",instance="10.0.0.1:9101",job="pai_serivce_exporter",memory_pressure="false",host_ip="10.0.0.2",out_of_disk="false",pai_service_name="watchdog",ready="true",scraped_from="watchdog-5ddd945975-kwhpr"}
```
Contributor

@Binyang2014 Binyang2014 Nov 5, 2020

Can we change it to node_name? That would be easier to understand.

Member Author

> Can we change it to node_name? That would be easier to understand.

Added node_name as another label. Used node_name instead in inhibit rules.
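
To make "another label" concrete, here is a small Python sketch that renders a sample carrying both `host_ip` and `node_name` in the Prometheus text exposition format. It is only an illustration: `render_sample` is a hypothetical helper rather than watchdog code, the value `node-a` and the sample value `1` are made up, and most labels from the real metric are omitted for brevity.

```python
from typing import Dict


def render_sample(metric: str, labels: Dict[str, str], value: float) -> str:
    """Render one sample line in the Prometheus text exposition format."""

    def escape(text: str) -> str:
        # Label values escape backslash, newline and double quote.
        return text.replace("\\", "\\\\").replace("\n", "\\n").replace('"', '\\"')

    # Labels are sorted only to make the output deterministic.
    body = ",".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
    return f"{metric}{{{body}}} {value:g}"


# Assumed label values: the node name is hypothetical.
line = render_sample(
    "pai_node_count",
    {"host_ip": "10.0.0.2", "node_name": "node-a", "ready": "true"},
    1,
)
print(line)
```

Keeping `host_ip` next to `node_name` means either label remains available for queries, while inhibit rules can match on the more readable `node_name`.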

Contributor

Add node_name in this doc?

@suiguoxin suiguoxin force-pushed the alert-inhibit branch 4 times, most recently from b532fd5 to b177fdc Compare November 9, 2020 09:55
@suiguoxin suiguoxin merged commit 8d4ab70 into microsoft:master Nov 10, 2020
@suiguoxin suiguoxin deleted the alert-inhibit branch November 10, 2020 01:32