Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Uptime] Alerts - Monitor status alert - check monitor status by monitor.timespan #104541

Conversation

dominiqueclarke
Copy link
Contributor

@dominiqueclarke dominiqueclarke commented Jul 6, 2021

Resolves #103428

This PR updates the Monitor Status rule type to check for down monitors by monitor.timespan, rather than @timestamp.

The monitor.timespan field is a date range field that represents the time at which the ping started all the way through the time of the next scheduled monitor. By using this field, we are essentially querying against the entire time the ping is "effective".

To execute this query in a performant way, we must first filter the number of documents we are searching against, otherwise we'd search through the entire history of documents. To achieve this, we first filter documents by @timestamp for one of the following timeframes (whichever time frame is larger)

  • Now to now minus the user configured timerange (for example, monitors are down in the last 15 minutes) minus 24 hrs
  • Now to now minus the scheduled rule interval

Why timeframe minus 24 hours?

The timeframe represents the time in which the user desires to query for down monitors, but those documents may have been indexed farther back the timeframe and still have an effectively down monitor according to the monitor.timespan field. For this reason, we create a buffer of 24h behind the timeframe to check for documents that may contain a matching down document for monitor.timespan. For pings with large scheduled intervals, we are at risk of missing down documents if we do not have a sufficiently large window we are looking within.

Why use the larger of the two timeframes?

Edge cases exist for scenarios various rule configuration scenarios. These edge cases are only present for monitors with large schedule intervals for example:

  • Rule schedule interval: Check every 2 days
  • User configured timeframe: Check if monitors are down 3 times within the last 15 minutes

When the rule schedule interval is greater than the timeframe-24hrs, there is the potential to miss documents indexed at the beginning of the rule schedule interval. In this case, we filter @timestamp by now to now minus rule schedule interval

  • Rule schedule interval: Check every 1 minute
  • User configured timeframe: Check if monitors are down 3 times within the last 2 days

When the user configured timeframe is greater than the rule interval, there is the potential to miss documents indexed within the timeframe. In this case, we filter @timestamp by now to now minus the timeframe minus 24hrs

Edge cases

Filtering first by @timestamp inherently creates a scenario in which documents could be missed. By creating look-back logic, we capture most of the potential documents which may be down, but are still at risk of missing documents with high scheduled intervals.

Outstanding questions

  • Should we also be filtering for availability using the monitor.timespan
  • How, if at all, should we update our documentation to convey this change to our users? (Also, should we contain this in the release notes)

Testing

  • Auto generated monitor status alert
    • Create a heartbeat monitor that is always down with an interval greater than 1 minute, which is the default rule schedule interval for auto generated alerts.
    • Add an action connector in the Settings page
    • Toggle the alert toggle for this monitor
    • Observe that alert is triggered
    • Observe that the alert is not resolved after the alert is executed a second time
    • Change the heartbeat config so that the monitor comes up, and restart heartbeat
    • Observe that the alert is resolved
  • User generated monitor status alert

@dominiqueclarke dominiqueclarke added bug Fixes for quality problems that affect the customer experience v8.0.0 Team:Uptime - DEPRECATED Synthetics & RUM sub-team of Application Observability release_note:skip Skip the PR/issue when compiling release notes v7.14.0 auto-backport Deprecated - use backport:version if exact versions are needed labels Jul 6, 2021
@dominiqueclarke dominiqueclarke requested a review from a team as a code owner July 6, 2021 20:11
@elasticmachine
Copy link
Contributor

Pinging @elastic/uptime (Team:uptime)

@dominiqueclarke dominiqueclarke marked this pull request as draft July 6, 2021 20:11
@dominiqueclarke dominiqueclarke force-pushed the fix/99660-uptime-alerts-check-monitor-status-by-monitor.timespan branch from d7bb281 to d813e2e Compare July 6, 2021 21:03
@paulb-elastic
Copy link
Contributor

@dominiqueclarke just checking, this is the (draft) PR for #103428?

@dominiqueclarke
Copy link
Contributor Author

@dominiqueclarke just checking, this is the (draft) PR for #103428?

Yep, getting ready to finalize draft.

@dominiqueclarke dominiqueclarke marked this pull request as ready for review July 7, 2021 15:27
@dominiqueclarke
Copy link
Contributor Author

@elasticmachine merge upstream

Copy link
Contributor

@justinkambic justinkambic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this locally with a monitor, it seemed to function alright for me. The code looks good, too; I had one minor recommendation that you can take or leave.

I would like a little more time to test it some more before we merge.

.parse(timerangeLookback)
?.subtract('24', 'hours')
.valueOf();
const absoluteFrom = min([scheduleIntervalAbsoluteTime, defaultIntervalAbsoluteTime]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const absoluteFrom = min([scheduleIntervalAbsoluteTime, defaultIntervalAbsoluteTime]);
const from = min([scheduleIntervalAbsoluteTime, defaultIntervalAbsoluteTime]) ?? 'now-24h';

Just a suggestion, do you think this is cleaner? My thinking is it simplifies the return object so much that you don't really need to look at it.

Copy link
Contributor

@justinkambic justinkambic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done some additional checking locally and I think this will give us some improvement in the behavior of these alerts.

LGTM

@dominiqueclarke
Copy link
Contributor Author

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Copy link
Contributor

@shahzad31 shahzad31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good !!
Smoke tested and code review !!

@dominiqueclarke dominiqueclarke merged commit 3743eb8 into elastic:master Jul 13, 2021
@dominiqueclarke dominiqueclarke deleted the fix/99660-uptime-alerts-check-monitor-status-by-monitor.timespan branch July 13, 2021 12:50
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Jul 13, 2021
…tor.timespan (elastic#104541)

* check monitor status by monitor.timespan

* Delete mappings.json

* adjust logic

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Jul 13, 2021
…tor.timespan (elastic#104541)

* check monitor status by monitor.timespan

* Delete mappings.json

* adjust logic

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@kibanamachine
Copy link
Contributor

💚 Backport successful

Status Branch Result
7.14
7.x

The backport PRs will be merged automatically after passing CI.

jloleysens added a commit to jloleysens/kibana that referenced this pull request Jul 13, 2021
…-png-pdf-report-type

* 'master' of github.com:elastic/kibana: (292 commits)
  bring back KQL autocomplete in timeline + fix last updated (elastic#105380)
  [Maps] Change TOC pop-up wording to reflect filter change, not search bar change (elastic#105163)
  Updating urls to upstream elastic repo (elastic#105250)
  [Maps] Move new vector layer wizard card down (elastic#104797)
  Exclude registering the cases feature if not enabled (elastic#105292)
  [Uptime] Alerts - Monitor status alert - check monitor status by monitor.timespan (elastic#104541)
  updated UI copy (elastic#105184)
  Log a warning when documents of unknown types are detected during migration (elastic#105213)
  [Logs UI] Register log threshold rule as lifecycle rule (elastic#104341)
  [Ingest pipelines] add network direction processor (elastic#103436)
  [Console] Autocomplete definitions (manual backport) (elastic#105086)
  [Security Solution] User can make Exceptions for Memory protection alerts (elastic#102196)
  [Lens] Formula: add validation for multiple field/metrics (elastic#104092)
  Removing async from file upload and data visualizer plugins start lifecycle (elastic#105197)
  Fix error when validating the form with non blocking validations (elastic#103629)
  [ML] Fix "View by" swim lane with applied filter and sorting by score  (elastic#105217)
  Update dependency @elastic/charts to v32 (elastic#104625)
  [CTI] shortens large numbers on Dashboard Link Panel (elastic#105269)
  [Security Solution][Endpoint][Host Isolation] Fixes bug to remove excess host metadata status toasts on non user initiated errors (elastic#105331)
  [Cases] Fix pushing alerts count on every push to external service (elastic#105030)
  ...

# Conflicts:
#	x-pack/plugins/reporting/common/types.ts
kibanamachine added a commit that referenced this pull request Jul 13, 2021
…tor.timespan (#104541) (#105425)

* check monitor status by monitor.timespan

* Delete mappings.json

* adjust logic

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Dominique Clarke <doclarke71@gmail.com>
kibanamachine added a commit that referenced this pull request Jul 13, 2021
…tor.timespan (#104541) (#105426)

* check monitor status by monitor.timespan

* Delete mappings.json

* adjust logic

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Dominique Clarke <doclarke71@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed bug Fixes for quality problems that affect the customer experience release_note:skip Skip the PR/issue when compiling release notes Team:Uptime - DEPRECATED Synthetics & RUM sub-team of Application Observability v7.14.0 v7.15.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Uptime] Check monitor status by monitor.timespan for monitor status alerts
6 participants