[Uptime] Alerts - Monitor status alert - check monitor status by monitor.timespan #104541

dominiqueclarke · 2021-07-06T20:11:21Z

This PR updates the Monitor Status rule type to check for down monitors by monitor.timespan, rather than @timestamp.

The monitor.timespan field is a date range field that represents the time at which the ping started all the way through the time of the next scheduled monitor. By using this field, we are essentially querying against the entire time the ping is "effective".

To execute this query in a performant way, we must first filter the number of documents we are searching against, otherwise we'd search through the entire history of documents. To achieve this, we first filter documents by @timestamp for one of the following timeframes (whichever time frame is larger)

Now to now minus the user configured timerange (for example, monitors are down in the last 15 minutes) minus 24 hrs
Now to now minus the scheduled rule interval

Why timeframe minus 24 hours?

The timeframe represents the time in which the user desires to query for down monitors, but those documents may have been indexed farther back the timeframe and still have an effectively down monitor according to the monitor.timespan field. For this reason, we create a buffer of 24h behind the timeframe to check for documents that may contain a matching down document for monitor.timespan. For pings with large scheduled intervals, we are at risk of missing down documents if we do not have a sufficiently large window we are looking within.

Why use the larger of the two timeframes?

Edge cases exist for scenarios various rule configuration scenarios. These edge cases are only present for monitors with large schedule intervals for example:

Rule schedule interval: Check every 2 days
User configured timeframe: Check if monitors are down 3 times within the last 15 minutes

When the rule schedule interval is greater than the timeframe-24hrs, there is the potential to miss documents indexed at the beginning of the rule schedule interval. In this case, we filter @timestamp by now to now minus rule schedule interval

Rule schedule interval: Check every 1 minute
User configured timeframe: Check if monitors are down 3 times within the last 2 days

When the user configured timeframe is greater than the rule interval, there is the potential to miss documents indexed within the timeframe. In this case, we filter @timestamp by now to now minus the timeframe minus 24hrs

Edge cases

Filtering first by @timestamp inherently creates a scenario in which documents could be missed. By creating look-back logic, we capture most of the potential documents which may be down, but are still at risk of missing documents with high scheduled intervals.

Outstanding questions

Should we also be filtering for availability using the monitor.timespan
How, if at all, should we update our documentation to convey this change to our users? (Also, should we contain this in the release notes)

Testing

elasticmachine · 2021-07-06T20:11:23Z

Pinging @elastic/uptime (Team:uptime)

paulb-elastic · 2021-07-07T10:30:50Z

@dominiqueclarke just checking, this is the (draft) PR for #103428?

dominiqueclarke · 2021-07-07T14:17:55Z

@dominiqueclarke just checking, this is the (draft) PR for #103428?

Yep, getting ready to finalize draft.

dominiqueclarke · 2021-07-08T14:45:14Z

@elasticmachine merge upstream

…us-by-monitor.timespan

justinkambic

I tested this locally with a monitor, it seemed to function alright for me. The code looks good, too; I had one minor recommendation that you can take or leave.

I would like a little more time to test it some more before we merge.

justinkambic · 2021-07-08T17:45:42Z

x-pack/plugins/uptime/server/lib/alerts/status_check.ts

+    .parse(timerangeLookback)
+    ?.subtract('24', 'hours')
+    .valueOf();
+  const absoluteFrom = min([scheduleIntervalAbsoluteTime, defaultIntervalAbsoluteTime]);


Suggested change

const absoluteFrom = min([scheduleIntervalAbsoluteTime, defaultIntervalAbsoluteTime]);

const from = min([scheduleIntervalAbsoluteTime, defaultIntervalAbsoluteTime]) ?? 'now-24h';

Just a suggestion, do you think this is cleaner? My thinking is it simplifies the return object so much that you don't really need to look at it.

….timespan' of https://github.com/dominiqueclarke/kibana into fix/99660-uptime-alerts-check-monitor-status-by-monitor.timespan

justinkambic

I've done some additional checking locally and I think this will give us some improvement in the behavior of these alerts.

LGTM

dominiqueclarke · 2021-07-12T12:23:13Z

@elasticmachine merge upstream

…us-by-monitor.timespan

kibanamachine · 2021-07-12T14:32:29Z

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

💚 Build #137556 succeeded 0008224
💚 Build #137255 succeeded 6b678fe
💛 Build #136905 was flaky 22fb1cd
💚 Build #136720 succeeded d813e2e
💔 Build #136706 failed d7bb281

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

shahzad31

Looks good !!
Smoke tested and code review !!

…tor.timespan (elastic#104541) * check monitor status by monitor.timespan * Delete mappings.json * adjust logic Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

kibanamachine · 2021-07-13T12:52:35Z

💚 Backport successful

Status	Branch	Result
✅	7.14
✅	7.x

The backport PRs will be merged automatically after passing CI.

…-png-pdf-report-type * 'master' of github.com:elastic/kibana: (292 commits) bring back KQL autocomplete in timeline + fix last updated (elastic#105380) [Maps] Change TOC pop-up wording to reflect filter change, not search bar change (elastic#105163) Updating urls to upstream elastic repo (elastic#105250) [Maps] Move new vector layer wizard card down (elastic#104797) Exclude registering the cases feature if not enabled (elastic#105292) [Uptime] Alerts - Monitor status alert - check monitor status by monitor.timespan (elastic#104541) updated UI copy (elastic#105184) Log a warning when documents of unknown types are detected during migration (elastic#105213) [Logs UI] Register log threshold rule as lifecycle rule (elastic#104341) [Ingest pipelines] add network direction processor (elastic#103436) [Console] Autocomplete definitions (manual backport) (elastic#105086) [Security Solution] User can make Exceptions for Memory protection alerts (elastic#102196) [Lens] Formula: add validation for multiple field/metrics (elastic#104092) Removing async from file upload and data visualizer plugins start lifecycle (elastic#105197) Fix error when validating the form with non blocking validations (elastic#103629) [ML] Fix "View by" swim lane with applied filter and sorting by score (elastic#105217) Update dependency @elastic/charts to v32 (elastic#104625) [CTI] shortens large numbers on Dashboard Link Panel (elastic#105269) [Security Solution][Endpoint][Host Isolation] Fixes bug to remove excess host metadata status toasts on non user initiated errors (elastic#105331) [Cases] Fix pushing alerts count on every push to external service (elastic#105030) ... # Conflicts: # x-pack/plugins/reporting/common/types.ts

…tor.timespan (#104541) (#105425) * check monitor status by monitor.timespan * Delete mappings.json * adjust logic Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Dominique Clarke <doclarke71@gmail.com>

…tor.timespan (#104541) (#105426) * check monitor status by monitor.timespan * Delete mappings.json * adjust logic Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Dominique Clarke <doclarke71@gmail.com>

dominiqueclarke requested a review from a team as a code owner July 6, 2021 20:11

dominiqueclarke marked this pull request as draft July 6, 2021 20:11

check monitor status by monitor.timespan

d813e2e

dominiqueclarke force-pushed the fix/99660-uptime-alerts-check-monitor-status-by-monitor.timespan branch from d7bb281 to d813e2e Compare July 6, 2021 21:03

Delete mappings.json

22fb1cd

dominiqueclarke marked this pull request as ready for review July 7, 2021 15:27

shahzad31 added the v7.15.0 label Jul 7, 2021

Merge branch 'master' into fix/99660-uptime-alerts-check-monitor-stat…

6b678fe

…us-by-monitor.timespan

justinkambic reviewed Jul 8, 2021

View reviewed changes

dominiqueclarke added 2 commits July 9, 2021 10:24

adjust logic

8623c9b

Merge branch 'fix/99660-uptime-alerts-check-monitor-status-by-monitor…

0008224

….timespan' of https://github.com/dominiqueclarke/kibana into fix/99660-uptime-alerts-check-monitor-status-by-monitor.timespan

justinkambic approved these changes Jul 9, 2021

View reviewed changes

Merge branch 'master' into fix/99660-uptime-alerts-check-monitor-stat…

06a3d81

…us-by-monitor.timespan

shahzad31 approved these changes Jul 13, 2021

View reviewed changes

dominiqueclarke merged commit 3743eb8 into elastic:master Jul 13, 2021

dominiqueclarke deleted the fix/99660-uptime-alerts-check-monitor-status-by-monitor.timespan branch July 13, 2021 12:50

kibanamachine mentioned this pull request Jul 13, 2021

[7.14] [Uptime] Alerts - Monitor status alert - check monitor status by monitor.timespan (#104541) #105425

Merged

kibanamachine mentioned this pull request Jul 13, 2021

[7.x] [Uptime] Alerts - Monitor status alert - check monitor status by monitor.timespan (#104541) #105426

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Uptime] Alerts - Monitor status alert - check monitor status by monitor.timespan #104541

[Uptime] Alerts - Monitor status alert - check monitor status by monitor.timespan #104541

dominiqueclarke commented Jul 6, 2021 •

edited

Loading

elasticmachine commented Jul 6, 2021

paulb-elastic commented Jul 7, 2021

dominiqueclarke commented Jul 7, 2021

dominiqueclarke commented Jul 8, 2021

justinkambic left a comment

justinkambic Jul 8, 2021

justinkambic left a comment

dominiqueclarke commented Jul 12, 2021

kibanamachine commented Jul 12, 2021

shahzad31 left a comment

kibanamachine commented Jul 13, 2021

	const absoluteFrom = min([scheduleIntervalAbsoluteTime, defaultIntervalAbsoluteTime]);
	const from = min([scheduleIntervalAbsoluteTime, defaultIntervalAbsoluteTime]) ?? 'now-24h';

[Uptime] Alerts - Monitor status alert - check monitor status by monitor.timespan #104541

[Uptime] Alerts - Monitor status alert - check monitor status by monitor.timespan #104541

Conversation

dominiqueclarke commented Jul 6, 2021 • edited Loading

Why timeframe minus 24 hours?

Why use the larger of the two timeframes?

Edge cases

Outstanding questions

Testing

elasticmachine commented Jul 6, 2021

paulb-elastic commented Jul 7, 2021

dominiqueclarke commented Jul 7, 2021

dominiqueclarke commented Jul 8, 2021

justinkambic left a comment

Choose a reason for hiding this comment

justinkambic Jul 8, 2021

Choose a reason for hiding this comment

justinkambic left a comment

Choose a reason for hiding this comment

dominiqueclarke commented Jul 12, 2021

kibanamachine commented Jul 12, 2021

💚 Build Succeeded

Metrics [docs]

History

shahzad31 left a comment

Choose a reason for hiding this comment

kibanamachine commented Jul 13, 2021

💚 Backport successful

dominiqueclarke commented Jul 6, 2021 •

edited

Loading