Merge branch 'master' into sharing-saved-objects-phase-3.5

elastic · Jun 28, 2021 · 9dc1612
2 parents c38dcb6 + 928c4b8
commit 9dc1612
Showing 35 changed files with 1,235 additions and 254 deletions.
diff --git a/docs/api/task-manager/health.asciidoc b/docs/api/task-manager/health.asciidoc
@@ -1,5 +1,5 @@
 [[task-manager-api-health]]
-=== Get Task Manager health API
+== Task Manager health API
 ++++
 <titleabbrev>Get Task Manager health</titleabbrev>
 ++++
@@ -8,20 +8,20 @@ Retrieve the health status of the {kib} Task Manager.
 
 [float]
 [[task-manager-api-health-request]]
-==== Request
+=== Request
 
 `GET <kibana host>:<port>/api/task_manager/_health`
 
 [float]
 [[task-manager-api-health-codes]]
-==== Response code
+=== Response code
 
 `200`::
     Indicates a successful call.
 
 [float]
 [[task-manager-api-health-example]]
-==== Example
+=== Example
 
 Retrieve the health status of the {kib} Task Manager:
 

diff --git a/docs/user/alerting/alerting-troubleshooting.asciidoc b/docs/user/alerting/alerting-troubleshooting.asciidoc
diff --git a/docs/user/alerting/images/connector-save-and-test.png b/docs/user/alerting/images/connector-save-and-test.png
diff --git a/docs/user/alerting/images/email-connector-test.png b/docs/user/alerting/images/email-connector-test.png
diff --git a/docs/user/alerting/images/index-threshold-chart.png b/docs/user/alerting/images/index-threshold-chart.png
diff --git a/docs/user/alerting/images/rules-details-health.png b/docs/user/alerting/images/rules-details-health.png
diff --git a/docs/user/alerting/images/rules-management-health.png b/docs/user/alerting/images/rules-management-health.png
diff --git a/docs/user/alerting/images/teams-connector-test.png b/docs/user/alerting/images/teams-connector-test.png
diff --git a/docs/user/alerting/troubleshooting/alerting-common-issues.asciidoc b/docs/user/alerting/troubleshooting/alerting-common-issues.asciidoc
@@ -0,0 +1,253 @@
+[role="xpack"]
+[[alerting-common-issues]]
+=== Common Issues
+
+This page describes how to resolve common problems you might encounter with Alerting.
+
+[float]
+[[rules-small-check-interval-run-late]]
+==== Rules with small check intervals run late
+
+*Problem*
+
+Rules with a small check interval, such as every two seconds, run later than scheduled.
+
+*Solution*
+
+Rules run as background tasks at a cadence defined by their *check interval*.
+When a rule's *check interval* is smaller than the Task Manager <<task-manager-settings,`poll_interval`>>, the rule will run late.
+
+Either tweak the <<task-manager-settings,{kib} Task Manager settings>> or increase the *check interval* of the rules in question.
+
+For more details, see <<task-manager-health-scheduled-tasks-small-schedule-interval-run-late>>.
+
+
+[float]
+[[scheduled-rules-run-late]]
+==== Rules run at an inconsistent cadence
+
+*Problem*
+
+Scheduled rules run at an inconsistent cadence, often running late.
+
+Actions run long after the status of a rule changes, sending a notification of the change too late.
+
+*Solution*
+
+Each {kib} instance runs rules and actions as background tasks at a default rate of ten tasks every three seconds.
+When diagnosing issues related to Alerting, focus on the tasks that begin with `alerting:` and `actions:`.
+
+Alerting tasks always begin with `alerting:`. For example, the `alerting:.index-threshold` tasks back the <<rule-type-index-threshold, index threshold stack rule>>.
+Action tasks always begin with `actions:`. For example, the `actions:.index` tasks back the <<index-action-type, index action>>.
+
+For more details on monitoring and diagnosing task execution in Task Manager, see <<task-manager-health-monitoring>>.
+
+[float]
+[[connector-tls-settings]]
+==== Connectors have TLS errors when executing actions
+
+*Problem*
+
+When executing actions, a connector gets a TLS socket error when connecting to
+the server.
+
+*Solution*
+
+Configuration options are available to specialize connections to TLS servers,
+including ignoring server certificate validation and providing certificate
+authority data to verify servers using custom certificates. For more details,
+see <<action-settings,Action settings>>.
+
+[float]
+[[rules-long-execution-time]]
+==== Rules take a long time to run
+
+*Problem*
+
+Rules are taking a long time to execute and are impacting the overall health of your deployment.
+
+[IMPORTANT]
+==============================================
+By default, only users with a `superuser` role can query the {kib} event log because it is a system index. To enable additional users to execute this query, assign `read` privileges to the `.kibana-event-log*` index.
+==============================================
+
+*Solution*
+
+Query for a list of rule ids, bucketed by their execution times:
+
+[source,console]
+--------------------------------------------------
+GET /.kibana-event-log*/_search
+{
+  "size": 0,
+  "query": {
+    "bool": {
+      "filter": [
+        {
+          "range": {
+            "@timestamp": {
+              "gte": "now-1d", <1>
+              "lte": "now"
+            }
+          }
+        },
+        {
+          "term": {
+            "event.action": {
+              "value": "execute"
+            }
+          }
+        },
+        {
+          "term": {
+            "event.provider": {
+              "value": "alerting" <2>
+            }
+          }
+        }
+      ]
+    }
+  },
+  "runtime_mappings": { <3>
+    "event.duration_in_seconds": {
+      "type": "double",
+      "script": {
+        "source": "emit(doc['event.duration'].value / 1E9)"
+      }
+    }
+  },
+  "aggs": {
+    "ruleIdsByExecutionDuration": {
+      "histogram": {
+        "field": "event.duration_in_seconds",
+        "min_doc_count": 1,
+        "interval": 1 <4>
+      },
+      "aggs": {
+        "ruleId": {
+          "nested": {
+            "path": "kibana.saved_objects"
+          },
+          "aggs": {
+            "ruleId": {
+              "terms": {
+                "field": "kibana.saved_objects.id",
+                "size": 10 <5>
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+}
+--------------------------------------------------
+// TEST
+
+<1> This queries for rules executed in the last day. Update the values of `lte` and `gte` to query over a different time range.
+<2> Use `event.provider: actions` to query for long-running action executions.
+<3> Execution durations are stored as nanoseconds. This adds a runtime field to convert that duration into seconds.
+<4> This interval buckets the `event.duration_in_seconds` runtime field into 1 second intervals. Update this value to change the granularity of the buckets. If you are unable to use runtime fields, make sure this aggregation targets `event.duration` and use nanoseconds for the interval.
+<5> This retrieves the top 10 rule ids for this duration interval. Update this value to retrieve more rule ids.
+
+This query returns the following:
+
+[source,json]
+--------------------------------------------------
+{
+  "took" : 322,
+  "timed_out" : false,
+  "_shards" : {
+    "total" : 1,
+    "successful" : 1,
+    "skipped" : 0,
+    "failed" : 0
+  },
+  "hits" : {
+    "total" : {
+      "value" : 326,
+      "relation" : "eq"
+    },
+    "max_score" : null,
+    "hits" : [ ]
+  },
+  "aggregations" : {
+    "ruleIdsByExecutionDuration" : {
+      "buckets" : [
+        {
+          "key" : 0.0, <1>
+          "doc_count" : 320,
+          "ruleId" : {
+            "doc_count" : 320,
+            "ruleId" : {
+              "doc_count_error_upper_bound" : 0,
+              "sum_other_doc_count" : 0,
+              "buckets" : [
+                {
+                  "key" : "1923ada0-a8f3-11eb-a04b-13d723cdfdc5",
+                  "doc_count" : 140
+                },
+                {
+                  "key" : "15415ecf-cdb0-4fef-950a-f824bd277fe4",
+                  "doc_count" : 130
+                },
+                {
+                  "key" : "dceeb5d0-6b41-11eb-802b-85b0c1bc8ba2",
+                  "doc_count" : 50
+                }
+              ]
+            }
+          }
+        },
+        {
+          "key" : 30.0, <2>
+          "doc_count" : 6,
+          "ruleId" : {
+            "doc_count" : 6,
+            "ruleId" : {
+              "doc_count_error_upper_bound" : 0,
+              "sum_other_doc_count" : 0,
+              "buckets" : [
+                {
+                  "key" : "41893910-6bca-11eb-9e0d-85d233e3ee35",
+                  "doc_count" : 6
+                }
+              ]
+            }
+          }
+        }
+      ]
+    }
+  }
+}
+--------------------------------------------------
+<1> Most rule execution durations fall within the first bucket (0 - 1 seconds).
+<2> A single rule with id `41893910-6bca-11eb-9e0d-85d233e3ee35` took between 30 and 31 seconds to execute.
+
+Use the <<get-rule-api,Get Rule API>> to retrieve additional information about rules that take a long time to execute.
+
+[float]
+[[rule-cannot-decrypt-api-key]]
+==== Rule cannot decrypt apiKey
+
+*Problem*
+
+The rule fails to execute and has an `Unable to decrypt attribute "apiKey"` error.
+
+*Solution*
+
+This error happens when the `xpack.encryptedSavedObjects.encryptionKey` value used to create the rule does not match the value used during rule execution. Depending on the scenario, there are different ways to solve this problem:
+
+[cols="2*<"]
+|===
+
+| If the value in `xpack.encryptedSavedObjects.encryptionKey` was manually changed, and the previous encryption key is still known.
+| Ensure any previous encryption key is included in the keys used for <<xpack-encryptedSavedObjects-keyRotation-decryptionOnlyKeys, decryption only>>.
+
+| If another {kib} instance with a different encryption key connects to the cluster.
+| The other {kib} instance might be trying to run the rule using a different encryption key than what the rule was created with. Ensure the encryption keys among all the {kib} instances are the same, and set <<xpack-encryptedSavedObjects-keyRotation-decryptionOnlyKeys, decryption only keys>> for previously used encryption keys.
+
+| If other scenarios don't apply.
+| Generate a new API key for the rule by disabling then enabling the rule.
+
+|===
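
Note on the "Rules take a long time to run" section added in alerting-common-issues.asciidoc: the bucketed response shown there can be post-processed to list only the slow rules. The following is a minimal sketch, not part of this commit. The function names `nanos_to_seconds` and `slow_rules` are hypothetical, and `SAMPLE_RESPONSE` is a trimmed copy of the example response in the doc. It assumes the response keeps the `ruleIdsByExecutionDuration` / `ruleId` aggregation names used by the query.

```python
from typing import Any, Dict, List, Tuple

NANOS_PER_SECOND = 1_000_000_000


def nanos_to_seconds(duration_ns: int) -> float:
    """Convert an event.duration value (nanoseconds) to seconds."""
    return duration_ns / NANOS_PER_SECOND


def slow_rules(
    response: Dict[str, Any], min_seconds: float
) -> List[Tuple[str, float, int]]:
    """Return (rule_id, bucket_start_seconds, execution_count) tuples for
    every histogram bucket whose lower bound is at least min_seconds,
    slowest buckets first."""
    buckets = response["aggregations"]["ruleIdsByExecutionDuration"]["buckets"]
    slow: List[Tuple[str, float, int]] = []
    for bucket in buckets:
        if bucket["key"] < min_seconds:
            continue
        # The rule ids sit inside the nested aggregation, hence the
        # repeated "ruleId" level.
        for rule in bucket["ruleId"]["ruleId"]["buckets"]:
            slow.append((rule["key"], bucket["key"], rule["doc_count"]))
    # Stable sort: rules within one bucket keep their response order.
    slow.sort(key=lambda entry: entry[1], reverse=True)
    return slow


# Trimmed copy of the example response from the documentation.
SAMPLE_RESPONSE: Dict[str, Any] = {
    "aggregations": {
        "ruleIdsByExecutionDuration": {
            "buckets": [
                {
                    "key": 0.0,
                    "doc_count": 320,
                    "ruleId": {
                        "doc_count": 320,
                        "ruleId": {
                            "buckets": [
                                {"key": "1923ada0-a8f3-11eb-a04b-13d723cdfdc5", "doc_count": 140},
                                {"key": "15415ecf-cdb0-4fef-950a-f824bd277fe4", "doc_count": 130},
                                {"key": "dceeb5d0-6b41-11eb-802b-85b0c1bc8ba2", "doc_count": 50},
                            ]
                        },
                    },
                },
                {
                    "key": 30.0,
                    "doc_count": 6,
                    "ruleId": {
                        "doc_count": 6,
                        "ruleId": {
                            "buckets": [
                                {"key": "41893910-6bca-11eb-9e0d-85d233e3ee35", "doc_count": 6},
                            ]
                        },
                    },
                },
            ]
        }
    }
}

if __name__ == "__main__":
    for rule_id, seconds, count in slow_rules(SAMPLE_RESPONSE, min_seconds=10):
        print(f"{rule_id}: {count} execution(s) in the {seconds:.0f}s bucket")
```

The rule ids returned this way could then be passed to the Get Rule API mentioned in the doc. If runtime fields are unavailable and the histogram targets `event.duration` directly, the bucket keys are in nanoseconds and would need `nanos_to_seconds` applied before comparing against the threshold.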