Merge branch 'master' into sharing-saved-objects-phase-3.5
kibanamachine authored Jun 28, 2021
2 parents c38dcb6 + 928c4b8 commit 9dc1612
Showing 35 changed files with 1,235 additions and 254 deletions.
8 changes: 4 additions & 4 deletions docs/api/task-manager/health.asciidoc
@@ -1,5 +1,5 @@
[[task-manager-api-health]]
=== Get Task Manager health API
== Task Manager health API
++++
<titleabbrev>Get Task Manager health</titleabbrev>
++++
@@ -8,20 +8,20 @@ Retrieve the health status of the {kib} Task Manager.

[float]
[[task-manager-api-health-request]]
==== Request
=== Request

`GET <kibana host>:<port>/api/task_manager/_health`

[float]
[[task-manager-api-health-codes]]
==== Response code
=== Response code

`200`::
Indicates a successful call.

[float]
[[task-manager-api-health-example]]
==== Example
=== Example

Retrieve the health status of the {kib} Task Manager:

366 changes: 152 additions & 214 deletions docs/user/alerting/alerting-troubleshooting.asciidoc

Large diffs are not rendered by default.

253 changes: 253 additions & 0 deletions docs/user/alerting/troubleshooting/alerting-common-issues.asciidoc
@@ -0,0 +1,253 @@
[role="xpack"]
[[alerting-common-issues]]
=== Common Issues

This page describes how to resolve common problems you might encounter with Alerting.

[float]
[[rules-small-check-interval-run-late]]
==== Rules with small check intervals run late

*Problem*

Rules with a small check interval, such as every two seconds, run later than scheduled.

*Solution*

Rules run as background tasks at a cadence defined by their *check interval*.
When a rule's *check interval* is smaller than the Task Manager <<task-manager-settings,`poll_interval`>>, the rule runs late.

Either adjust the <<task-manager-settings,{kib} Task Manager settings>> or increase the *check interval* of the affected rules.

For more details, see <<task-manager-health-scheduled-tasks-small-schedule-interval-run-late>>.
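
As a rough illustration of the comparison described above, the following Python sketch converts a rule's check interval to milliseconds and compares it with the Task Manager poll interval. The helper names and the interval parser are hypothetical and are not part of {kib}. The 3000 ms default is an assumption that matches a poll every three seconds.

```python
import re

# Assumed default: Task Manager polls for work every three seconds.
DEFAULT_POLL_INTERVAL_MS = 3000

# Supported interval units for this sketch, expressed in milliseconds.
_UNIT_TO_MS = {"s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def interval_to_ms(interval: str) -> int:
    """Convert an interval string such as "2s" or "5m" to milliseconds."""
    match = re.fullmatch(r"(\d+)([smhd])", interval.strip())
    if match is None:
        raise ValueError(f"unsupported interval: {interval!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_TO_MS[unit]


def runs_late(check_interval: str, poll_interval_ms: int = DEFAULT_POLL_INTERVAL_MS) -> bool:
    """Return True when the check interval is shorter than the poll interval."""
    return interval_to_ms(check_interval) < poll_interval_ms


for interval in ("2s", "3s", "1m"):
    print(interval, "-> runs late:", runs_late(interval))
```

A rule that checks every two seconds is flagged, while rules that check every three seconds or every minute are not.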


[float]
[[scheduled-rules-run-late]]
==== Rules run at an inconsistent cadence

*Problem*

Scheduled rules run at an inconsistent cadence, often running late.

Actions run long after the status of a rule changes, sending a notification of the change too late.

*Solution*

Each {kib} instance runs rules and actions as background tasks at a default rate of ten tasks every three seconds.
When diagnosing issues related to Alerting, focus on the tasks that begin with `alerting:` and `actions:`.

Alerting tasks always begin with `alerting:`. For example, the `alerting:.index-threshold` tasks back the <<rule-type-index-threshold, index threshold stack rule>>.
Action tasks always begin with `actions:`. For example, the `actions:.index` tasks back the <<index-action-type, index action>>.

For more details on monitoring and diagnosing task execution in Task Manager, see <<task-manager-health-monitoring>>.
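
To narrow a list of task types down to the ones that matter for Alerting, you can filter on the two prefixes described above. The following Python sketch is only an illustration: the function name is hypothetical, and the sample list, including the unrelated `session_cleanup` entry, is made up for the example.

```python
from collections import Counter

# Prefixes that identify Alerting-related task types.
ALERTING_PREFIXES = ("alerting:", "actions:")


def alerting_task_counts(task_types):
    """Count task types that start with an Alerting prefix, grouped by prefix."""
    counts = Counter()
    for task_type in task_types:
        for prefix in ALERTING_PREFIXES:
            if task_type.startswith(prefix):
                counts[prefix] += 1
    return dict(counts)


# Hypothetical sample of task types; only the first, second, and fourth match.
sample = [
    "alerting:.index-threshold",
    "actions:.index",
    "session_cleanup",
    "alerting:.index-threshold",
]
print(alerting_task_counts(sample))
```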

[float]
[[connector-tls-settings]]
==== Connectors have TLS errors when executing actions

*Problem*

When executing actions, a connector gets a TLS socket error when connecting to
the server.

*Solution*

Configuration options are available to customize connections to TLS servers,
such as ignoring server certificate validation or providing certificate
authority data to verify servers that use custom certificates. For more details,
see <<action-settings,Action settings>>.

[float]
[[rules-long-execution-time]]
==== Rules take a long time to run

*Problem*

Rules are taking a long time to execute and are impacting the overall health of your deployment.

[IMPORTANT]
==============================================
By default, only users with a `superuser` role can query the {kib} event log because it is a system index. To enable additional users to run the query in the solution, assign `read` privileges to the `.kibana-event-log*` index.
==============================================

*Solution*

Query for a list of rule IDs, bucketed by their execution times:

[source,console]
--------------------------------------------------
GET /.kibana-event-log*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1d", <1>
              "lte": "now"
            }
          }
        },
        {
          "term": {
            "event.action": {
              "value": "execute"
            }
          }
        },
        {
          "term": {
            "event.provider": {
              "value": "alerting" <2>
            }
          }
        }
      ]
    }
  },
  "runtime_mappings": { <3>
    "event.duration_in_seconds": {
      "type": "double",
      "script": {
        "source": "emit(doc['event.duration'].value / 1E9)"
      }
    }
  },
  "aggs": {
    "ruleIdsByExecutionDuration": {
      "histogram": {
        "field": "event.duration_in_seconds",
        "min_doc_count": 1,
        "interval": 1 <4>
      },
      "aggs": {
        "ruleId": {
          "nested": {
            "path": "kibana.saved_objects"
          },
          "aggs": {
            "ruleId": {
              "terms": {
                "field": "kibana.saved_objects.id",
                "size": 10 <5>
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST

<1> This queries for rules executed in the last day. Update the values of `lte` and `gte` to query over a different time range.
<2> Use `event.provider: actions` to query for long-running action executions.
<3> Execution durations are stored as nanoseconds. This adds a runtime field to convert that duration into seconds.
<4> This interval buckets the `event.duration_in_seconds` runtime field into 1-second intervals. Update this value to change the granularity of the buckets. If you are unable to use runtime fields, make sure this aggregation targets `event.duration` and use nanoseconds for the interval.
<5> This retrieves the top 10 rule IDs for this duration interval. Update this value to retrieve more rule IDs.

This query returns the following:

[source,json]
--------------------------------------------------
{
  "took" : 322,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 326,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "ruleIdsByExecutionDuration" : {
      "buckets" : [
        {
          "key" : 0.0, <1>
          "doc_count" : 320,
          "ruleId" : {
            "doc_count" : 320,
            "ruleId" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "1923ada0-a8f3-11eb-a04b-13d723cdfdc5",
                  "doc_count" : 140
                },
                {
                  "key" : "15415ecf-cdb0-4fef-950a-f824bd277fe4",
                  "doc_count" : 130
                },
                {
                  "key" : "dceeb5d0-6b41-11eb-802b-85b0c1bc8ba2",
                  "doc_count" : 50
                }
              ]
            }
          }
        },
        {
          "key" : 30.0, <2>
          "doc_count" : 6,
          "ruleId" : {
            "doc_count" : 6,
            "ruleId" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "41893910-6bca-11eb-9e0d-85d233e3ee35",
                  "doc_count" : 6
                }
              ]
            }
          }
        }
      ]
    }
  }
}
--------------------------------------------------
<1> Most rule execution durations fall within the first bucket (0 to 1 seconds).
<2> A single rule with ID `41893910-6bca-11eb-9e0d-85d233e3ee35` took between 30 and 31 seconds to execute.

Use the <<get-rule-api,Get Rule API>> to retrieve additional information about rules that take a long time to execute.
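
If you process the response programmatically, the slow rules can be pulled out of the nested buckets. The following Python sketch is one possible approach, not an official client: the function name and the 10-second threshold are assumptions, and the `response` dictionary is a trimmed copy of the example response above.

```python
def slow_rule_ids(response, min_seconds):
    """Return {rule_id: execution_count} for duration buckets at or above min_seconds."""
    slow = {}
    buckets = response["aggregations"]["ruleIdsByExecutionDuration"]["buckets"]
    for duration_bucket in buckets:
        # The histogram key is the lower bound of the bucket, in seconds.
        if duration_bucket["key"] < min_seconds:
            continue
        # Rule IDs sit two levels down: the nested aggregation, then the terms aggregation.
        for rule_bucket in duration_bucket["ruleId"]["ruleId"]["buckets"]:
            rule_id = rule_bucket["key"]
            slow[rule_id] = slow.get(rule_id, 0) + rule_bucket["doc_count"]
    return slow


# Trimmed copy of the example response, keeping only the fields the function reads.
response = {
    "aggregations": {
        "ruleIdsByExecutionDuration": {
            "buckets": [
                {
                    "key": 0.0,
                    "ruleId": {"ruleId": {"buckets": [
                        {"key": "1923ada0-a8f3-11eb-a04b-13d723cdfdc5", "doc_count": 140},
                        {"key": "15415ecf-cdb0-4fef-950a-f824bd277fe4", "doc_count": 130},
                        {"key": "dceeb5d0-6b41-11eb-802b-85b0c1bc8ba2", "doc_count": 50},
                    ]}},
                },
                {
                    "key": 30.0,
                    "ruleId": {"ruleId": {"buckets": [
                        {"key": "41893910-6bca-11eb-9e0d-85d233e3ee35", "doc_count": 6},
                    ]}},
                },
            ]
        }
    }
}

# Arbitrary threshold for this example: report rules that ran for 10 seconds or longer.
print(slow_rule_ids(response, min_seconds=10))
```

With the example data, only the rule in the 30-second bucket is reported, along with its six executions.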

[float]
[[rule-cannot-decrypt-api-key]]
==== Rule cannot decrypt apiKey

*Problem*

The rule fails to execute and has an `Unable to decrypt attribute "apiKey"` error.

*Solution*

This error happens when the `xpack.encryptedSavedObjects.encryptionKey` value used to create the rule does not match the value used during rule execution. Depending on the scenario, there are different ways to solve this problem:

[cols="2*<"]
|===

| If the value in `xpack.encryptedSavedObjects.encryptionKey` was manually changed, and the previous encryption key is still known.
| Ensure any previous encryption key is included in the keys used for <<xpack-encryptedSavedObjects-keyRotation-decryptionOnlyKeys, decryption only>>.

| If another {kib} instance with a different encryption key connects to the cluster.
| The other {kib} instance might be trying to run the rule using a different encryption key than the one the rule was created with. Ensure the encryption keys are the same across all {kib} instances, and set <<xpack-encryptedSavedObjects-keyRotation-decryptionOnlyKeys, decryption only keys>> for previously used encryption keys.

| If other scenarios don't apply.
| Generate a new API key for the rule by disabling and then re-enabling the rule.

|===
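
The first two scenarios come down to the same check: the key used to create the rule must be either the current encryption key of the instance that runs the rule or one of its decryption-only keys. The following Python sketch models that check. It is a hypothetical helper, not part of {kib}: the dictionary layout is a simplified stand-in for the settings in `kibana.yml`, and the key values are placeholders.

```python
def can_decrypt(creation_key, instance_settings):
    """Return True if an instance can decrypt a rule created with creation_key."""
    primary_key = instance_settings.get("encryptionKey")
    old_keys = instance_settings.get("decryptionOnlyKeys", [])
    return creation_key == primary_key or creation_key in old_keys


# Placeholder keys and a simplified, assumed settings layout per instance.
rule_key = "old-key-placeholder"
instances = {
    "kibana-1": {"encryptionKey": "new-key-placeholder",
                 "decryptionOnlyKeys": ["old-key-placeholder"]},
    "kibana-2": {"encryptionKey": "new-key-placeholder"},
}

for name, settings in instances.items():
    print(name, "can decrypt:", can_decrypt(rule_key, settings))
```

In this example, `kibana-1` can still run the rule because the old key is listed for decryption only, while `kibana-2` cannot.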