Commit

Merge branch 'main' into NR-89144-Apache-Hadoop
pkudikyala authored Jun 30, 2023
2 parents bc18f18 + 7ec2621 commit bade645
Showing 249 changed files with 25,399 additions and 6,377 deletions.
@@ -0,0 +1,31 @@
name: Active Directory Replication Failures
description: |+
  This alert is triggered when the Attempt timestamp != the Success timestamp, indicating a failure in replication between domain controllers.
type: STATIC
nrql:
  query: "FROM activeDirectoryReplicationPartners SELECT count(*) FACET server, partner WHERE lastReplicationSuccess != lastReplicationAttempt"

valueFunction: SINGLE_VALUE
terms:
  - priority: CRITICAL
    operator: ABOVE
    threshold: 0
    thresholdDuration: 120
    thresholdOccurrences: ALL

expiration:
  closeViolationsOnExpiration: false
  openViolationOnExpiration: false
  expirationDuration: null

signal:
  aggregationDelay: 120
  aggregationMethod: EVENT_FLOW
  aggregationTimer: null
  aggregationWindow: 60
  fillOption: NONE
  fillValue: null
  slideBy: null

violationTimeLimitSeconds: 86400
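
The WHERE and FACET clauses of the query above can be illustrated with a small Python sketch. The sample rows and host names are hypothetical; only the attribute names (`server`, `partner`, `lastReplicationSuccess`, `lastReplicationAttempt`) come from the query. This is not how New Relic evaluates NRQL, just the same comparison written out by hand.

```python
from collections import Counter

# Hypothetical sample rows; attribute names mirror those used in the NRQL query.
samples = [
    {"server": "dc01", "partner": "dc02",
     "lastReplicationAttempt": 1688100000, "lastReplicationSuccess": 1688100000},
    {"server": "dc01", "partner": "dc03",
     "lastReplicationAttempt": 1688100300, "lastReplicationSuccess": 1688096700},
    {"server": "dc02", "partner": "dc01",
     "lastReplicationAttempt": 1688100300, "lastReplicationSuccess": 1688100300},
]


def failed_replications(rows):
    """Count rows per (server, partner) whose last attempt was not a success."""
    counts = Counter()
    for row in rows:
        if row["lastReplicationSuccess"] != row["lastReplicationAttempt"]:
            counts[(row["server"], row["partner"])] += 1
    return counts


failures = failed_replications(samples)
print(dict(failures))
```

Any facet with a count above the threshold of 0 would be a candidate for a violation.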
@@ -0,0 +1,32 @@
name: Active Directory Windows Services
description: |+
  This alert is triggered when any of the targeted Windows Services are in a state other than "running".
  The scope of this alert is Windows Services using the 'label.primary_app = active_directory' decoration.
type: STATIC
nrql:
  query: "FROM Metric SELECT count(*) FACET hostname, entity.name WHERE metricName = 'windows_service_state' AND state != 'running' AND label.primary_app = 'active_directory'"

valueFunction: SINGLE_VALUE
terms:
  - priority: CRITICAL
    operator: ABOVE
    threshold: 0
    thresholdDuration: 300
    thresholdOccurrences: ALL

expiration:
  closeViolationsOnExpiration: false
  openViolationOnExpiration: false
  expirationDuration: null

signal:
  aggregationDelay: 120
  aggregationMethod: EVENT_FLOW
  aggregationTimer: null
  aggregationWindow: 60
  fillOption: NONE
  fillValue: null
  slideBy: null

violationTimeLimitSeconds: 86400
@@ -0,0 +1,27 @@
name: High MagneticStoreRejectedUploadSystemFailures

description: |+
  This alert is triggered when the MagneticStoreRejectedUploadSystemFailures is above 100 in 10 minutes.
type: STATIC
nrql:
  query: "SELECT count(`aws.timestream.MagneticStoreRejectedUploadSystemFailures`) as 'Query' FROM Metric"

# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
valueFunction: SINGLE_VALUE

# List of Critical and Warning thresholds for the condition
terms:
  - priority: CRITICAL
    # Operator used to compare against the threshold.
    operator: ABOVE
    # Value that triggers a violation
    threshold: 100
    # Time in seconds; 120 - 3600
    thresholdDuration: 600
    # How many data points must be in violation for the duration
    thresholdOccurrences: ALL

# Duration after which a violation automatically closes
# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
violationTimeLimitSeconds: 86400
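
As a rough illustration of how `threshold`, `thresholdDuration`, and `thresholdOccurrences: ALL` combine, the sketch below checks whether every aggregated data point in the most recent duration is above the threshold. It assumes a 60-second aggregation window, which this file does not set, and the function name `violates` is hypothetical. It is a simplified model, not New Relic's evaluation engine.

```python
def violates(window_values, threshold, duration_s, window_s=60):
    """Return True if every window covering the last duration_s is above threshold.

    window_values holds one aggregated value per window, oldest first.
    window_s is an assumed aggregation window length in seconds.
    """
    needed = duration_s // window_s
    if len(window_values) < needed:
        # Not enough data points yet to cover the whole duration.
        return False
    recent = window_values[-needed:]
    return all(value > threshold for value in recent)


# Ten one-minute windows, all above 100: the 600-second duration is satisfied.
print(violates([101] * 10, threshold=100, duration_s=600))
# One window at exactly 100 is not "above", so ALL is not satisfied.
print(violates([101] * 9 + [100], threshold=100, duration_s=600))
```

Under this reading, the condition fires only when each window's count exceeds 100 for the full 10 minutes, which is stricter than a total of 100 across 10 minutes.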
@@ -0,0 +1,27 @@
name: High MagneticStoreRejectedUploadUserFailures

description: |+
  This alert is triggered when the MagneticStoreRejectedUploadUserFailures is above 100 in 10 minutes.
type: STATIC
nrql:
  query: "SELECT count(`aws.timestream.MagneticStoreRejectedUploadUserFailures`) as 'Query' FROM Metric"

# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
valueFunction: SINGLE_VALUE

# List of Critical and Warning thresholds for the condition
terms:
  - priority: CRITICAL
    # Operator used to compare against the threshold.
    operator: ABOVE
    # Value that triggers a violation
    threshold: 100
    # Time in seconds; 120 - 3600
    thresholdDuration: 600
    # How many data points must be in violation for the duration
    thresholdOccurrences: ALL

# Duration after which a violation automatically closes
# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
violationTimeLimitSeconds: 86400
27 changes: 27 additions & 0 deletions alert-policies/amazon-timestream/SystemErrors.yml
@@ -0,0 +1,27 @@
name: High SystemErrors

description: |+
  This alert is triggered when the number of system errors is above 100 in 10 minutes.
type: STATIC
nrql:
  query: "SELECT count(`aws.timestream.SystemErrors`) as 'Query' FROM Metric"

# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
valueFunction: SINGLE_VALUE

# List of Critical and Warning thresholds for the condition
terms:
  - priority: CRITICAL
    # Operator used to compare against the threshold.
    operator: ABOVE
    # Value that triggers a violation
    threshold: 100
    # Time in seconds; 120 - 3600
    thresholdDuration: 600
    # How many data points must be in violation for the duration
    thresholdOccurrences: ALL

# Duration after which a violation automatically closes
# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
violationTimeLimitSeconds: 86400
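
The comments in this file give allowed ranges for `thresholdDuration` (120 - 3600) and `violationTimeLimitSeconds` (300 - 2592000). A small check along the following lines could catch out-of-range values before a condition is submitted. The `LIMITS` table and the `out_of_range` helper are hypothetical, and the condition is represented as a flat dictionary for brevity, although `thresholdDuration` actually sits under `terms`.

```python
# Ranges copied from the comments in the condition file.
LIMITS = {
    "thresholdDuration": (120, 3600),
    "violationTimeLimitSeconds": (300, 2592000),
}


def out_of_range(settings):
    """Return the names of settings whose values fall outside their documented range."""
    bad = []
    for key, (low, high) in LIMITS.items():
        value = settings.get(key)
        if value is not None and not (low <= value <= high):
            bad.append(key)
    return bad


# Values as they appear in this condition.
system_errors = {"thresholdDuration": 600, "violationTimeLimitSeconds": 86400}
print(out_of_range(system_errors))  # prints []
```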
27 changes: 27 additions & 0 deletions alert-policies/amazon-timestream/UserErrors.yml
@@ -0,0 +1,27 @@
name: High UserErrors

description: |+
  This alert is triggered when the number of user errors is above 100 in 10 minutes.
type: STATIC
nrql:
  query: "SELECT count(`aws.timestream.UserErrors`) as 'Query' FROM Metric"

# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
valueFunction: SINGLE_VALUE

# List of Critical and Warning thresholds for the condition
terms:
  - priority: CRITICAL
    # Operator used to compare against the threshold.
    operator: ABOVE
    # Value that triggers a violation
    threshold: 100
    # Time in seconds; 120 - 3600
    thresholdDuration: 600
    # How many data points must be in violation for the duration
    thresholdOccurrences: ALL

# Duration after which a violation automatically closes
# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
violationTimeLimitSeconds: 86400
2 changes: 1 addition & 1 deletion alert-policies/aws-chatbot/MessageDeliveryFailure.yml
@@ -1,4 +1,4 @@
-name: High Message Delivery Failure - Query
+name: High MessageDeliveryFailure

 description: |+
   This alert is triggered when the message delivery failure is above 100 in 10 minutes.
40 changes: 40 additions & 0 deletions alert-policies/aws-lex/HighRuntimeSystemErrors.yml
@@ -0,0 +1,40 @@
# Name of the alert
name: High Runtime System Errors

# Description and details
description: |+
  This alert occurs when the number of system errors is more than 10 in 300 seconds.
# Type of alert
type: STATIC

# NRQL query
nrql:
  query: "SELECT count(aws.lex.RuntimeSystemErrors) from Metric"

# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
valueFunction: SINGLE_VALUE

# List of Critical and Warning thresholds for the condition
terms:
  - priority: CRITICAL
    # Operator used to compare against the threshold.
    operator: ABOVE
    # Value that triggers a violation; float value
    threshold: 10
    # Time in seconds; 120 - 3600
    thresholdDuration: 300
    # How many data points must be in violation for the duration
    thresholdOccurrences: ALL

  # Adding a Warning threshold is optional
  - priority: WARNING
    operator: ABOVE
    threshold: 8
    thresholdDuration: 300
    thresholdOccurrences: ALL

# Duration after which a violation automatically closes
# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
violationTimeLimitSeconds: 86400
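
With both a CRITICAL and a WARNING term, one way to picture the outcome is to test the terms from most to least severe and report the first one that is breached. The sketch below does that under two assumptions not stated in this file: a 60-second aggregation window, and `ALL` meaning every window in the duration must be above the threshold. The `TERMS` list copies the values from the condition; the function name `highest_priority` is hypothetical.

```python
# Terms copied from the condition, ordered most severe first.
TERMS = [
    {"priority": "CRITICAL", "threshold": 10, "thresholdDuration": 300},
    {"priority": "WARNING", "threshold": 8, "thresholdDuration": 300},
]


def highest_priority(window_values, terms, window_s=60):
    """Return the priority of the first breached term, or None if none is breached.

    window_values holds one aggregated value per window, oldest first.
    window_s is an assumed aggregation window length in seconds.
    """
    for term in terms:
        needed = term["thresholdDuration"] // window_s
        recent = window_values[-needed:]
        if len(recent) == needed and all(v > term["threshold"] for v in recent):
            return term["priority"]
    return None


print(highest_priority([11, 11, 11, 11, 11], TERMS))
print(highest_priority([11, 11, 11, 11, 9], TERMS))
```

In the second call one window drops to 9, so the CRITICAL term no longer holds for the full duration while the WARNING term still does.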

43 changes: 43 additions & 0 deletions alert-policies/aws-lex/LatencyInResponse.yml
@@ -0,0 +1,43 @@
# Name of the alert
name: Latency In Response

# Description and details
description: |+
  The latency for successful requests between the time that the request was made and the response was passed back.
# Type of alert
type: STATIC

# NRQL query
nrql:
  query: "SELECT average(aws.lex.RuntimeSuccessfulRequestLatency) from Metric"

# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
valueFunction: SINGLE_VALUE

# List of Critical and Warning thresholds for the condition
terms:
  - priority: CRITICAL
    # Operator used to compare against the threshold.
    operator: ABOVE
    # Value that triggers a violation; float value
    threshold: 0.9
    # Time in seconds; 120 - 3600
    thresholdDuration: 300
    # How many data points must be in violation for the duration
    thresholdOccurrences: ALL

  # Adding a Warning threshold is optional
  - priority: WARNING
    operator: ABOVE
    threshold: 0.8
    thresholdDuration: 300
    thresholdOccurrences: ALL


# OPTIONAL: URL of runbook to be sent with notification
runbookUrl:

# Duration after which a violation automatically closes
# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
violationTimeLimitSeconds: 86400
27 changes: 27 additions & 0 deletions alert-policies/aws-transcribe/AsyncServerErrorCount.yml
@@ -0,0 +1,27 @@
name: High AsyncServerErrorCount

description: |+
  This alert is triggered when the Async Server Error Count is above 100 in 10 minutes.
type: STATIC
nrql:
  query: "SELECT count(`aws.transcribe.AsyncServerErrorCount`) as 'Query' FROM Metric WHERE aws.Namespace = 'AWS/Transcribe'"

# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
valueFunction: SINGLE_VALUE

# List of Critical and Warning thresholds for the condition
terms:
  - priority: CRITICAL
    # Operator used to compare against the threshold.
    operator: ABOVE
    # Value that triggers a violation
    threshold: 100
    # Time in seconds; 120 - 3600
    thresholdDuration: 600
    # How many data points must be in violation for the duration
    thresholdOccurrences: ALL

# Duration after which a violation automatically closes
# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
violationTimeLimitSeconds: 86400
27 changes: 27 additions & 0 deletions alert-policies/aws-transcribe/AsyncUserErrorCount.yml
@@ -0,0 +1,27 @@
name: High AsyncUserErrorCount

description: |+
  This alert is triggered when the Async User Error Count is above 100 in 10 minutes.
type: STATIC
nrql:
  query: "SELECT count(`aws.transcribe.AsyncUserErrorCount`) as 'Query' FROM Metric WHERE aws.Namespace = 'AWS/Transcribe'"

# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
valueFunction: SINGLE_VALUE

# List of Critical and Warning thresholds for the condition
terms:
  - priority: CRITICAL
    # Operator used to compare against the threshold.
    operator: ABOVE
    # Value that triggers a violation
    threshold: 100
    # Time in seconds; 120 - 3600
    thresholdDuration: 600
    # How many data points must be in violation for the duration
    thresholdOccurrences: ALL

# Duration after which a violation automatically closes
# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
violationTimeLimitSeconds: 86400
27 changes: 27 additions & 0 deletions alert-policies/aws-transcribe/SyncServerErrorCount.yml
@@ -0,0 +1,27 @@
name: High SyncServerErrorCount

description: |+
  This alert is triggered when the Sync Server Error Count is above 100 in 10 minutes.
type: STATIC
nrql:
  query: "SELECT count(`aws.transcribe.SyncServerErrorCount`) as 'Query' FROM Metric WHERE aws.Namespace = 'AWS/Transcribe'"

# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
valueFunction: SINGLE_VALUE

# List of Critical and Warning thresholds for the condition
terms:
  - priority: CRITICAL
    # Operator used to compare against the threshold.
    operator: ABOVE
    # Value that triggers a violation
    threshold: 100
    # Time in seconds; 120 - 3600
    thresholdDuration: 600
    # How many data points must be in violation for the duration
    thresholdOccurrences: ALL

# Duration after which a violation automatically closes
# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
violationTimeLimitSeconds: 86400
27 changes: 27 additions & 0 deletions alert-policies/aws-transcribe/SyncUserErrorCount.yml
@@ -0,0 +1,27 @@
name: High SyncUserErrorCount

description: |+
  This alert is triggered when the Sync User Error Count is above 100 in 10 minutes.
type: STATIC
nrql:
  query: "SELECT count(`aws.transcribe.SyncUserErrorCount`) as 'Query' FROM Metric WHERE aws.Namespace = 'AWS/Transcribe'"

# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
valueFunction: SINGLE_VALUE

# List of Critical and Warning thresholds for the condition
terms:
  - priority: CRITICAL
    # Operator used to compare against the threshold.
    operator: ABOVE
    # Value that triggers a violation
    threshold: 100
    # Time in seconds; 120 - 3600
    thresholdDuration: 600
    # How many data points must be in violation for the duration
    thresholdOccurrences: ALL

# Duration after which a violation automatically closes
# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
violationTimeLimitSeconds: 86400