[Monitoring] Missing monitoring data alert #78208

chrisronline · 2020-09-22T19:50:33Z

Resolves #74823

This PR introduces a new out-of-the-box alert for Stack Monitoring that identifies periods of missing monitoring data.

Copy

Firing message

We have not detected any monitoring data for 2 stack product(s) in cluster: abc123

Firing UI message

For the past 2m, we have not detected any monitoring data from the Kibana instance: kib-01, starting at September 23, 2020 3:07 PM EDT

Screenshots

elasticmachine · 2020-09-24T19:23:01Z

Pinging @elastic/stack-monitoring (Team:Monitoring)

igoristic · 2020-09-27T09:43:30Z

x-pack/plugins/monitoring/server/lib/alerts/fetch_missing_data.ts

+          }
+        }
+
+        uniqueList[`${clusterUuid}::${stackProduct}::${stackProductUuid}`] = {


You should only overwrite when the new gap is larger: if (differenceInMs > uniqueList[key]?.gapDuration). Otherwise you might overwrite a big gap (that could have triggered the alert) with a smaller one (that would not).
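
A minimal sketch of the guard suggested above, assuming `uniqueList` is a plain object keyed by `clusterUuid::stackProduct::stackProductUuid` and that each entry carries a `gapDuration` in milliseconds. The `MissingEntry` type and the `recordGap` helper are hypothetical names used for illustration, not code from the PR. One detail worth noting: comparing a number against `undefined` with `>` is always false, so the first gap seen for a key needs its own branch.

```typescript
// Hypothetical entry shape; the real AlertMissingData type in the PR may differ.
interface MissingEntry {
  clusterUuid: string;
  stackProduct: string;
  stackProductUuid: string;
  gapDuration: number; // milliseconds since the most recent document
}

type UniqueList = { [key: string]: MissingEntry | undefined };

// Keep only the largest gap seen for each product instance.
function recordGap(uniqueList: UniqueList, entry: MissingEntry): void {
  const key = `${entry.clusterUuid}::${entry.stackProduct}::${entry.stackProductUuid}`;
  const existing = uniqueList[key];
  // Write when there is no entry yet, or when the new gap is strictly larger.
  if (existing === undefined || entry.gapDuration > existing.gapDuration) {
    uniqueList[key] = entry;
  }
}
```

With this shape, a later bucket reporting a smaller gap for the same instance leaves the stored entry untouched.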

igoristic · 2020-09-27T09:43:38Z

x-pack/plugins/monitoring/server/lib/alerts/fetch_missing_data.ts

+  size: number
+): Promise<AlertMissingData[]> {
+  const endMs = +new Date();
+  const startMs = endMs - limit - limit * 0.25; // Go a bit farther back because we need to detect the difference between seeing the monitoring data versus just not looking far enough back


I don't think this is a good idea, since 25% is a pretty big padding for the one day default. How about we just have a minimum limit of 3 minutes? That way we can do something like: endMs - (limit + 180000)

Yea, that's probably fair. I guess I just wanted to be sure I accounted for various changes to the default collection period but 3m is probably a good enough distance to go back. Thanks!
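
To make the trade-off concrete: with a one-day `limit` (86,400,000 ms), the 25% padding reaches back an extra 6 hours, while the fixed padding reaches back an extra 3 minutes. The sketch below assumes `limit` is expressed in milliseconds; `PADDING_MS` and `getSearchWindow` are hypothetical names, and the `now` parameter exists only to make the function testable.

```typescript
// Fixed look-back padding proposed in the review: 3 minutes.
const PADDING_MS = 3 * 60 * 1000; // 180000

interface SearchWindow {
  startMs: number;
  endMs: number;
}

// Compute the query window: the configured limit plus a fixed padding,
// instead of the limit plus 25% of the limit.
function getSearchWindow(limit: number, now: number = Date.now()): SearchWindow {
  const endMs = now;
  const startMs = endMs - (limit + PADDING_MS);
  return { startMs, endMs };
}
```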

igoristic · 2020-09-27T10:52:36Z

x-pack/plugins/monitoring/server/lib/alerts/fetch_missing_data.ts

+        const differenceInMs = +new Date() - uuidBucket.most_recent.value;
+        let stackProductName = stackProductUuid;
+        for (const nameField of nameFields) {
+          stackProductName = get(uuidBucket, `top.hits.hits[0]._source.${nameField}`);


Suggested change

-          stackProductName = get(uuidBucket, `top.hits.hits[0]._source.${nameField}`);
+          stackProductName = get(uuidBucket, `document.hits.hits[0]._source.${nameField}`);

There was no top.* field name in my results, so I assume you wanted the above?

Yes, thank you!

igoristic · 2020-09-27T11:48:54Z

x-pack/plugins/monitoring/server/alerts/missing_data_alert.ts

+      return {
+        instanceKey: `${missing.clusterUuid}:${missing.stackProduct}:${missing.stackProductUuid}`,
+        clusterUuid: missing.clusterUuid,
+        shouldFire: missing.gapDuration > duration && missing.gapDuration <= limit,


I don't think you need the ... && missing.gapDuration <= limit check, since your search query is already within the limit range (give or take). And even if a gap did exceed the limit, wouldn't it still qualify as a valid trigger, since duration would always be less than limit?
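
The two predicates under discussion can be compared side by side. This is only a sketch: the function names are hypothetical, and all values are assumed to be in milliseconds. The two versions agree whenever the gap is within `limit` and differ only for gaps longer than `limit`.

```typescript
// Predicate as written in the diff: fire only while the gap is inside (duration, limit].
function shouldFireWithUpperBound(gapDuration: number, duration: number, limit: number): boolean {
  return gapDuration > duration && gapDuration <= limit;
}

// Predicate as suggested in the review: any gap longer than duration fires.
function shouldFireWithoutUpperBound(gapDuration: number, duration: number): boolean {
  return gapDuration > duration;
}
```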

igoristic · 2020-09-27T11:59:10Z

x-pack/plugins/monitoring/server/alerts/missing_data_alert.ts

+  ];
+
+  protected async fetchData(
+    params: CommonAlertParams,


You can also set the type as params: MissingDataParams here, and params: CommonAlertParams | unknown in base_alerts.ts. That way you won't need to do any funky casting/recasting.
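
A rough sketch of the typing suggestion, under stated assumptions: the field names on both interfaces and the two class names are hypothetical, and the real `base_alerts.ts` is not reproduced here. Two TypeScript facts apply. `CommonAlertParams | unknown` simplifies to `unknown`, and a subclass may narrow a method parameter only because method parameters are checked bivariantly, so the compiler will not catch a caller that passes the wrong shape.

```typescript
// Hypothetical parameter shapes, for illustration only.
interface CommonAlertParams {
  duration: string;
}

interface MissingDataParams extends CommonAlertParams {
  limit: string;
}

abstract class BaseAlertSketch {
  // The union with `unknown` collapses to `unknown`, so any subclass
  // parameter type is accepted here.
  protected abstract fetchData(params: CommonAlertParams | unknown): Promise<string[]>;

  public run(params: unknown): Promise<string[]> {
    return this.fetchData(params);
  }
}

class MissingDataAlertSketch extends BaseAlertSketch {
  // Narrowed parameter type: no cast is needed inside the method body.
  protected async fetchData(params: MissingDataParams): Promise<string[]> {
    return [`duration=${params.duration}`, `limit=${params.limit}`];
  }
}
```

The convenience comes at the cost of type safety at the call site, which is a design choice rather than a free win.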

igoristic

@chrisronline

This is good effort, and I like the UI/UX feel of it, however I have some concerns/opinions:

I think these should be separate alerts based on individual product, so we can set different thresholds and have the ability to enable/disable them for each specific product (es, beats, kibana, etc)
Probably an oversight on our part, but the term “Missing Data” is kind of confusing. It makes it seem as though there’s missing data in the production cluster. I think we should rename it to something like "Intermittent Monitoring" or "Monitoring Collectors" alert and avoid the word “data”.
I can’t get it to trigger most of the time even though my gaps are bigger than the threshold. This is how I tested it:

First I made my threshold values more sensitive: gaps 1min, range 1 day
Then I did: "xpack.monitoring.collection.enabled": false via cluster settings (for about ~10min)
Waited for about 5min for the notification to show up on the Overview page
Then went to the node’s detail page and confirmed that the gaps are indeed there

I did get it to trigger one time (the next day), but I don’t know what I did differently.

I feel like the code for these alerts (including the CPU and disk usage alerts) is pretty bulky, even though a lot of the functionality/features are very similar (if not exactly the same). I'm only bringing this up because I see us adding a lot of these alerts in the future, and I don't know how well it will scale if each alert/PR is 1-2k lines of code and has some custom logic/flow. I sort of tried to address this in my "Disk Usage" PR ([Monitoring] Disk usage alerting #75419), but I'm starting to feel like maybe the one-size-fits-all approach is too good to be true. This might just be the nature of things, and I'm probably going off on a rant/tangent here. Would like to hear your opinion though.

igoristic · 2020-09-27T17:56:46Z

Just figured out why I wasn't getting any triggers

Notice 1.601228437628E12, which is odd, since it should look something like 1601228946214. It looks like the timestamp is being cast to a float (probably because of a precision/rounding error somewhere, maybe an ES bug).

This is the query I used

GET .monitoring-*-7-*/_search?filter_path=aggregations.index.buckets,took
{
"size": 0,
"query": {
  "bool": {
    "filter": [
      {
        "terms": {
          "cluster_uuid": [
            "wuXG3QJKThmajyOWMx20hw"
          ]
        }
      },
      {
        "range": {
          "timestamp": {
            "gte": "now-1d"
          }
        }
      }
    ]
  }
},
"aggs": {
  "index": {
    "terms": {
      "field": "_index",
      "size": 10000
    },
    "aggs": {
      "clusters": {
        "terms": {
          "field": "cluster_uuid",
          "size": 10000
        },
        "aggs": {
          "es_uuids": {
            "terms": {
              "field": "node_stats.node_id",
              "size": 10000
            },
            "aggs": {
              "most_recent": {
                "max": {
                  "field": "timestamp"
                }
              },
              "document": {
                "top_hits": {
                  "size": 1,
                  "sort": [
                    {
                      "timestamp": {
                        "order": "desc"
                      }
                    }
                  ],
                  "_source": {
                    "includes": [
                      "source_node.name",
                      "kibana_stats.kibana.name",
                      "logstash_stats.logstash.host",
                      "beats_stats.beat.name"
                    ]
                  }
                }
              }
            }
          },
          "kibana_uuids": {
            "terms": {
              "field": "kibana_stats.kibana.uuid",
              "size": 10000
            },
            "aggs": {
              "most_recent": {
                "max": {
                  "field": "timestamp"
                }
              },
              "document": {
                "top_hits": {
                  "size": 1,
                  "sort": [
                    {
                      "timestamp": {
                        "order": "desc"
                      }
                    }
                  ],
                  "_source": {
                    "includes": [
                      "source_node.name",
                      "kibana_stats.kibana.name",
                      "logstash_stats.logstash.host",
                      "beats_stats.beat.name"
                    ]
                  }
                }
              }
            }
          },
          "beats": {
            "filter": {
              "bool": {
                "must_not": {
                  "term": {
                    "beats_stats.beat.type": "apm-server"
                  }
                }
              }
            },
            "aggs": {
              "beats_uuids": {
                "terms": {
                  "field": "beats_stats.beat.uuid",
                  "size": 10000
                },
                "aggs": {
                  "most_recent": {
                    "max": {
                      "field": "timestamp"
                    }
                  },
                  "document": {
                    "top_hits": {
                      "size": 1,
                      "sort": [
                        {
                          "timestamp": {
                            "order": "desc"
                          }
                        }
                      ],
                      "_source": {
                        "includes": [
                          "source_node.name",
                          "kibana_stats.kibana.name",
                          "logstash_stats.logstash.host",
                          "beats_stats.beat.name"
                        ]
                      }
                    }
                  }
                }
              }
            }
          },
          "apms": {
            "filter": {
              "bool": {
                "must": {
                  "term": {
                    "beats_stats.beat.type": "apm-server"
                  }
                }
              }
            },
            "aggs": {
              "apm_uuids": {
                "terms": {
                  "field": "beats_stats.beat.uuid",
                  "size": 10000
                },
                "aggs": {
                  "most_recent": {
                    "max": {
                      "field": "timestamp"
                    }
                  },
                  "document": {
                    "top_hits": {
                      "size": 1,
                      "sort": [
                        {
                          "timestamp": {
                            "order": "desc"
                          }
                        }
                      ],
                      "_source": {
                        "includes": [
                          "source_node.name",
                          "kibana_stats.kibana.name",
                          "logstash_stats.logstash.host",
                          "beats_stats.beat.name"
                        ]
                      }
                    }
                  }
                }
              }
            }
          },
          "logstash_uuids": {
            "terms": {
              "field": "logstash_stats.logstash.uuid",
              "size": 10000
            },
            "aggs": {
              "most_recent": {
                "max": {
                  "field": "timestamp"
                }
              },
              "document": {
                "top_hits": {
                  "size": 1,
                  "sort": [
                    {
                      "timestamp": {
                        "order": "desc"
                      }
                    }
                  ],
                  "_source": {
                    "includes": [
                      "source_node.name",
                      "kibana_stats.kibana.name",
                      "logstash_stats.logstash.host",
                      "beats_stats.beat.name"
                    ]
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
}

I tried it with epoch time range as well, and still got the same results
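
For reference, `1.601228437628E12` and `1601228437628` denote the same number; only the rendering differs, so the notation by itself does not imply lost precision. The sketch below assumes the value may reach the code either as a number or as a string, and `toEpochMs` is a hypothetical helper. Whether this has anything to do with the missing triggers is not established by this example.

```typescript
// Normalize a timestamp that may arrive as a number or as a string in
// scientific notation into epoch milliseconds.
function toEpochMs(value: number | string): number {
  const ms = typeof value === "number" ? value : Number(value);
  if (!Number.isFinite(ms)) {
    throw new Error(`Not a numeric timestamp: ${value}`);
  }
  return ms;
}

const fromScientific = toEpochMs("1.601228437628E12");
console.log(fromScientific === 1601228437628); // true
console.log(new Date(fromScientific).toISOString());
```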

chrisronline · 2020-09-28T18:30:10Z

> I think these should be separate alerts based on individual product, so we can set different thresholds and have the ability to enable/disable them for each specific product (es, beats, kibana, etc)

@ravikesarwani Do you have any thoughts about this? Should we have a single alert for any missing monitoring data, or a separate alert for each product?

> Probably an oversight on our part, but the term: “Missing Data” is kinda confusing. It makes it seem as though there’s missing data in the production cluster. I think we should rename it to something like "Intermittent Monitoring" or "Monitoring Collectors" alert and avoid the word “data”

💯 I absolutely agree and it never crossed my mind once so thank you for pointing it out! I will update the label in the UI.

> I can’t get it to trigger most of the time even though my gaps are bigger than the threshold. This is how I tested it:
>
> Then I did: "xpack.monitoring.collection.enabled": false via cluster settings (for about ~10min)

I got this to work for me, so I'm not sure. Maybe do a screen recording?

> Would like to hear your opinion though

I definitely agree and don't like the copy/paste for a new alert. My goal was to build a few of these alerts out to truly see what abstraction made sense - I think after 7.10, we can take a look at what we have and make a pass at some abstraction layer that will make it easier to build new alerts.

igoristic · 2020-09-29T14:19:51Z

@chrisronline

Now that I understand we don't look for specific gaps in a range, but rather for when the data stops for a specific time span, the limit setting ("Look this far back in time for any data") does not make any sense. I think we can remove it (from the UI) and use something like: limit = duration * 1.25. But maybe I'm missing something?

chrisronline · 2020-09-29T14:22:00Z

@igoristic Both levers feel important to me. I feel a user should be able to configure how long the alert should be left to fire until we basically give up. Perhaps we need to rename the parameter, but I don't feel we can define a threshold that applies to all users. I recall @ravikesarwani feeling both levers had value too.

igoristic · 2020-09-29T14:32:33Z

@chrisronline I see your point, but the wording is actually what made me assume we're looking for gaps

I still feel like it's not really needed, and we're also giving the user more options that could potentially break the trigger (e.g. limit < duration, or limit > 7d, which can time out a query).
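
If both levers stay, the two failure modes named above could be caught by validating the parameters up front. This is a sketch under assumptions: values are in milliseconds, the 7-day cap is taken from the example in the comment rather than from any real setting, and `validateWindow` and `deriveLimit` are hypothetical names. `deriveLimit` shows the alternative floated earlier in the thread, where the limit is computed instead of exposed.

```typescript
// Hypothetical cap, taken from the "7d" example in the comment above.
const MAX_LIMIT_MS = 7 * 24 * 60 * 60 * 1000;

// Return a list of human-readable problems; an empty list means the
// combination of duration and limit is usable.
function validateWindow(durationMs: number, limitMs: number): string[] {
  const errors: string[] = [];
  if (limitMs < durationMs) {
    errors.push("limit must be at least as long as duration");
  }
  if (limitMs > MAX_LIMIT_MS) {
    errors.push("limit must not exceed 7 days");
  }
  return errors;
}

// Alternative: derive the limit from the duration instead of asking the user.
function deriveLimit(durationMs: number): number {
  return durationMs * 1.25;
}
```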

igoristic · 2020-09-30T05:06:34Z

@chrisronline Thanks for adding the changes! Looking a lot better 👍

Though, I still can't get it to trigger with a simple "xpack.monitoring.collection.enabled": false (via cluster settings). My threshold is at 5 min and I waited 10 min. I checked my node's detail page and confirmed that the data did indeed stop.

Also, my PR is now using getSafeForExternalLink to add current state (which can either be single or ccs) to the link:
8693ed7#diff-64a93554c0926988a7c616494814ac6bR62 This will break some of the links, after my PR is merged

chrisronline · 2020-09-30T18:49:39Z

@igoristic Awesome find, I found the reason for your testing issues. Ready for another round!

igoristic

Yeyr! Working pretty good now. Awesome job! 🏆

Since you still need to fix the type check, can you please also:

Remove all the globalState ?_g from links, since it's added in the UI now
And correct any translation ids that don't relate to the context: eg: ...missingData.ui.nextSteps.hotThreads

chrisronline · 2020-10-01T13:11:53Z

> Remove all the globalState ?_g from links, since it's added in the UI now

FWIW, this shouldn't apply to the action usage of global state in the URL because that is delivered through the notification provider (slack, email) and that needs to contain the contextual link

* WIP for alert
* Surface alert most places
* Fix up alert placement
* Fix tests
* Type fix
* Update copy
* Add alert presence to APM in the UI
* Fetch data a little differently
* We don't need moment
* Add tests
* PR feedback
* Update copy
* Fix up bug around grabbing old data
* PR feedback
* PR feedback
* Fix tests

# Conflicts:
#	x-pack/plugins/monitoring/public/components/apm/instance/instance.js
#	x-pack/plugins/monitoring/public/components/beats/beat/beat.js

chrisronline · 2020-10-01T18:13:48Z

Backport:

7.x: 45c215d

kibanamachine · 2020-12-15T15:08:05Z

💔 Build Failed

continuous-integration/kibana-ci/pull-request
Commit: 08374e1
Pipeline Steps (look for red circles / failed steps)
Interpreting CI Failures

Failed CI Steps

Metrics [docs]

‼️ ERROR: no builds found for mergeBase sha [4fe7625]

History

💚 Build #78748 succeeded 08374e1
💔 Build #78737 failed c262c6a
💔 Build #78450 failed f4c9b67
💚 Build #78035 succeeded 9c70e11

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

chrisronline added 12 commits September 21, 2020 16:01

WIP for alert

627afe3

Surface alert most places

2f496f1

Merge in master

ea5d19a

Fix up alert placement

2190c2a

Fix tests

b244269

Type fix

8c24540

Update copy

740ecf7

Merge remote-tracking branch 'elastic/master' into monitoring/missing…

468da3b

…_data_alert

Add alert presence to APM in the UI

8a69629

Fetch data a little differently

35c77f5

We don't need moment

0b2db89

Add tests

1ffd812

chrisronline marked this pull request as ready for review September 24, 2020 19:22

chrisronline requested a review from a team September 24, 2020 19:22

chrisronline self-assigned this Sep 24, 2020

chrisronline added release_note:enhancement review Team:Monitoring Stack Monitoring team v7.10.0 v8.0.0 labels Sep 24, 2020

igoristic reviewed Sep 27, 2020

igoristic suggested changes Sep 27, 2020

Merge remote-tracking branch 'elastic/master' into monitoring/missing…

0c39b98

…_data_alert

chrisronline added 2 commits September 28, 2020 15:38

Merge remote-tracking branch 'elastic/master' into monitoring/missing…

ce6d465

…_data_alert

Update copy

6514604

chrisronline added 2 commits September 29, 2020 12:26

Merge remote-tracking branch 'elastic/master' into monitoring/missing…

1fc2ce3

…_data_alert

Fix up bug around grabbing old data

9c70e11

chrisronline requested a review from igoristic September 29, 2020 18:43

chrisronline added 3 commits September 30, 2020 12:38

Merge remote-tracking branch 'elastic/master' into monitoring/missing…

a5ba085

…_data_alert

Merge remote-tracking branch 'elastic/master' into monitoring/missing…

2679131

…_data_alert

PR feedback

f4c9b67

igoristic approved these changes Sep 30, 2020

Merge remote-tracking branch 'elastic/master' into monitoring/missing…

490dac0

…_data_alert

chrisronline added 2 commits October 1, 2020 09:22

PR feedback

c262c6a

Fix tests

08374e1

chrisronline merged commit a61f4d4 into elastic:master Oct 1, 2020

chrisronline deleted the monitoring/missing_data_alert branch October 1, 2020 16:28

chrisronline mentioned this pull request Oct 1, 2020

[7.x] [Monitoring] Missing data alert (#78208) #79163

Merged

chrisronline mentioned this pull request Dec 14, 2020

[Stack Monitoring] [Test Scenario] Out of the box alerting #85841

Closed

23 tasks

chrisronline changed the title ~~[Monitoring] Missing data alert~~ [Monitoring] Missing monitoring data alert Dec 15, 2020

chrisronline mentioned this pull request Jan 11, 2021

[Montoring] Use fetchClustersRange #87882

Merged

chrisronline mentioned this pull request Mar 1, 2021

[Stack Monitoring] [Test Scenario] Out of the box alerting #93072

Closed

24 tasks

simianhacker mentioned this pull request Apr 29, 2021

[Stack Monitoring] [Test Scenario] Out of the box alerting #98765

Closed

24 tasks
