
[Stack Monitoring] Add stale status reporting for Kibana #132613

Merged

Conversation

miltonhultgren
Contributor

@miltonhultgren miltonhultgren commented May 20, 2022

Summary

Fixes #126386

This PR adds visual warnings in the Stack Monitoring UI when one or more Kibana instances have a delay in their stats reporting. The delay can be configured with a kibana.yml setting and defaults to 120 seconds.

Cluster overview:
Screenshot 2022-06-02 at 14 18 58

Kibana overview:
Screenshot 2022-06-02 at 14 25 44

Kibana instances:
Screenshot 2022-06-02 at 14 25 58

Kibana instances table row:

Screenshot 2022-06-02 at 14 27 30

Kibana instance details:
Screenshot 2022-06-02 at 14 36 23

How to test

  1. Set monitoring.ui.kibana.reporting.stale_status_threshold_seconds to something low (like 10) in your kibana.yml (see the example after this list)
  2. Ingest some Stack Monitoring data for Kibana, either with Internal Collection or Metricbeat (easier)
  3. Stop collection and wait for reports to become stale
  4. Verify that the UI reports the stale status information
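
For step 1, a minimal kibana.yml sketch (the value is just for testing; the setting defaults to 120 seconds):

# kibana.yml
monitoring.ui.kibana.reporting.stale_status_threshold_seconds: 10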

To do

  • Hide Stale tag in Setup mode
  • Update tooltips to describe a partial state (some stale, some active)

Checklist

@miltonhultgren miltonhultgren force-pushed the 126386-add-last-seen-reporting branch from 5b23bbe to ec54671 Compare May 20, 2022 17:35
@miltonhultgren miltonhultgren force-pushed the 126386-add-last-seen-reporting branch from 5d7199d to 011a04f Compare May 31, 2022 18:55
@miltonhultgren miltonhultgren changed the title [Stack Monitoring] Add stale status reporting to Kibana endpoints (#1… [Stack Monitoring] Add stale status reporting for Kibana May 31, 2022
@miltonhultgren miltonhultgren added the Team:Infra Monitoring UI, Feature:Stack Monitoring, release_note:feature, and v8.4.0 labels May 31, 2022
@miltonhultgren miltonhultgren marked this pull request as ready for review May 31, 2022 19:09
@miltonhultgren miltonhultgren requested a review from a team as a code owner May 31, 2022 19:09
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@miltonhultgren
Contributor Author

@elastic/observability-design Since @katefarrar is out, I would love some feedback on this improvised design change!

@matschaffer
Contributor

Wondering. Does this end up firing in the event of a kibana instance replacement as well? Could probably launch it on ESS and scale kibana up/down to test.

@formgeist
Contributor

@miltonhultgren I would like to propose the following design changes to the indication 👍

Cluster overview: Screenshot 2022-05-31 at 21 02 46

Kibana overview: Screenshot 2022-05-31 at 21 03 08

Kibana instances: Screenshot 2022-05-31 at 21 03 22

I'd propose replacing the EuiIcon glyph+text combination with an EuiBadge instance, like so:

CleanShot 2022-06-01 at 11 37 43@2x

<EuiBadge iconType="alert" color="warning">
  Stale
</EuiBadge>

Kibana instances table row: Screenshot 2022-05-31 at 21 03 40

You can continue to use the EuiIcon implementation here in the table column.

Kibana instance details: Screenshot 2022-05-31 at 21 03 53

I'd convert this to the EuiBadge example as mentioned above.

@miltonhultgren
Contributor Author

@formgeist Thanks for the swift feedback, will implement!

@smith
Contributor

smith commented Jun 1, 2022

"since we heard" sounds like it's missing a helping verb. "since we have heard" or "since we've heard" sounds better.

@miltonhultgren
Contributor Author

@formgeist @smith Applied your feedback, thanks!

@miltonhultgren miltonhultgren requested a review from a team as a code owner June 3, 2022 11:54
@neptunian neptunian self-requested a review June 3, 2022 18:02
@@ -130,6 +130,7 @@ export default function ({ getService }: PluginFunctionalProviderContext) {
'monitoring.kibana.collection.enabled (boolean)',
'monitoring.kibana.collection.interval (number)',
'monitoring.ui.ccs.enabled (boolean)',
'monitoring.ui.kibana.reporting.stale_status_threshold_seconds (number)',
Contributor


Platform Security review:

expectedExposedConfigKeys integration test change LGTM

@miltonhultgren
Contributor Author

@matschaffer I ran a test on Cloud: when the instances stop reporting, we mark them as stale as they rotate out.

Scaled from 1 to 3 (original is killed and marked as stale, 3 new ones are green):
Screenshot 2022-06-03 at 21 43 50

Scale back down to 1 and change time range (original is rotated out, one green, two stale):
Screenshot 2022-06-03 at 21 44 02

Change time range (two stale ones rotate out, one green, aggregate status back to green):
Screenshot 2022-06-03 at 21 44 14

I'm curious why scaling up gives me 3 new IDs but scaling down only kills 2 IDs and keeps 1 ID. I'd expect the scale-up to keep the original instance running and simply add 2, but that's the Kubernetes way, so it might not apply here.

@matschaffer
Contributor

I'm curious why scaling up gives me 3 new IDs but scaling down only kills 2 IDs and keeps 1 ID. I'd expect the scale up to keep the original instance running and simply add 2 but that's the Kubernetes way so it might not apply.

There are a lot of variables to account for there. For example, the VM your first kibana was on may have been marked for removal by the cloud provider.

So given the behavior, I'm a little concerned about what this will look like.

image

Since any migration of the kibana instance (even typical cloud maintenance migrations) will reflect as "stale", it'd be good if we could clarify that we have a mix of stale/good, I think.

@neptunian
Contributor

neptunian commented Jun 6, 2022

In the scenario where I have intentionally decided not to collect Kibana metrics on one or more instances, should this appear? Or does this only address the scenario where something went wrong and the user would want to be alerted/notified like this? I noticed this scenario after I enabled the kibana module in metricbeat, then disabled it, and got the stale badge.

Perhaps I'm not understanding something, but I notice that after the default 15 minute time window elapses, I no longer see the Kibana instance with the stale badge. Not sure how useful this is if it's just going to disappear outside of the time window anyway. The user would probably never see the kibana instance with the stale badge unless they made the time frame longer.

@neptunian
Contributor

In setup mode there is always the stale badge even if I've never collected metrics on Kibana yet.
Screen Shot 2022-06-06 at 2 54 25 PM

@miltonhultgren
Contributor Author

@elasticmachine merge upstream

@miltonhultgren
Contributor Author

There are a lot of variables to account for there. For example if the VM your first kibana was on got marked for removal by the cloud provider.

So given the behavior, I'm a little concerned about what this will look like.

image

Since any migration of the kibana instance (even typical cloud maintenance migrations) will reflect as "stale". It'd be good if we can clarify that we have a mix of stale/good I think.

Showing the warning feels correct; that instance we used to hear from is no longer reporting.
I'm not sure if we, from within Stack Monitoring, can know why that is the case (expected vs unexpected, hello Health/Topology API).

@matschaffer We could add to that text something like "(1 of 4)"?
@formgeist Is copy text like this something you can give feedback on too? :)

@matschaffer
Contributor

We could add to that text something like "(1 of 4)"?

Yeah, that would make more sense to me. Not sure how to represent it visually (maybe @formgeist can help) but as an operator if 3 are green and 1 is stale, I'd like to see that distinction called out. If I just see "stale" I'd presume all instances are stale.

@miltonhultgren
Contributor Author

miltonhultgren commented Jun 7, 2022

In the scenario I have intentionally decided to not collect Kibana metrics on one or more instances should this appear? Or does this only address the scenario where something went wrong and the user would want to be alerted/notified like this? I noticed this scenario after I enabled the kibana module in metricbeat and then disabled the kibana module in metricbeat and got the stale badge.

I'm not sure I understand. If you have never started to collect metrics, then we won't be aware of that instance at all, right?
If, however, you start to collect with Metricbeat and later decide to stop, then yes, it will show this warning, because we only measure the time since the last timestamp. I don't think we have a way to distinguish a collection outage from an intentional collection stop. (I wonder if Fleet/Agent could help, since we could see that the Agent has been removed from such a policy; this also relates to the Health/Topology API ideas.)
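
(A minimal sketch of that check, using hypothetical names rather than the actual Stack Monitoring code: an instance is flagged as stale once its latest document's timestamp is older than the configured threshold.)

// Hypothetical helper, not the PR's actual implementation.
const DEFAULT_STALE_THRESHOLD_SECONDS = 120; // default mentioned in the summary

function isReportingStale(
  lastTimestamp: string,
  thresholdSeconds: number = DEFAULT_STALE_THRESHOLD_SECONDS
): boolean {
  const ageMs = Date.now() - new Date(lastTimestamp).getTime();
  return ageMs > thresholdSeconds * 1000;
}

// Example: with the default threshold, a document last seen 10 minutes ago reads as stale.
isReportingStale(new Date(Date.now() - 10 * 60 * 1000).toISOString()); // => true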

Perhaps I'm not understanding something, but I notice after the default 15 minute time window elapses that I no longer see the Kibana instance with the stale badge. Not sure how useful this is if its just going to disappear outside of the time window anyway. The user would probably never see the kibana instance with the stale badge unless they made the time frame longer.

That's 100% true, though the same happens for Elasticsearch and the whole cluster. If you turn off Metricbeat and wait for the 15-minute window to move, you get the no-data/couldn't-find-cluster screen.

I don't know what to do about this. It's the same problem as the Entity Model for Infra Metrics.
We would have to make two queries: one to first grab a list of all the instances we've ever seen, then a second query for the metrics for those instances in the last 15 minutes. If we did that we could keep the stale instances in scope, but that list might become big. So I wonder if we need some kind of lookback window, like some alerts use, to limit the scope? The last 24 hours?
@jasonrhodes This feels like a bigger problem than originally scoped, and it needs thought.
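
(A rough sketch of the two-query idea above, for illustration only; the index pattern and field names are assumptions rather than the actual Stack Monitoring queries, using the Elasticsearch JS client.)

import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Query 1: every Kibana instance UUID seen within a fixed lookback window (e.g. the last 24h).
async function listInstanceUuids(lookback: string = 'now-24h'): Promise<string[]> {
  const res = await es.search({
    index: '.monitoring-kibana-*', // assumed index pattern
    size: 0,
    query: { range: { timestamp: { gte: lookback } } },
    aggs: { instances: { terms: { field: 'kibana_stats.kibana.uuid', size: 1000 } } }, // assumed field
  });
  const buckets = (res.aggregations as any)?.instances?.buckets ?? [];
  return buckets.map((b: any) => b.key);
}

// Query 2: metrics for those instances limited to the selected time range, so instances
// with no recent documents can still be listed (and flagged as stale) instead of vanishing.
async function fetchMetricsForInstances(uuids: string[], from: string = 'now-15m') {
  return es.search({
    index: '.monitoring-kibana-*',
    query: {
      bool: {
        filter: [
          { terms: { 'kibana_stats.kibana.uuid': uuids } },
          { range: { timestamp: { gte: from } } },
        ],
      },
    },
  });
}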

In setup mode there is always the stale badge even if I've never collected metrics on Kibana yet.

I'll fix that.

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
monitoring 503 504 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
monitoring 471.5KB 476.5KB +5.0KB

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
monitoring 23.7KB 23.8KB +85.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@neptunian
Contributor

In the scenario I have intentionally decided to not collect Kibana metrics on one or more instances should this appear? Or does this only address the scenario where something went wrong and the user would want to be alerted/notified like this? I noticed this scenario after I enabled the kibana module in metricbeat and then disabled the kibana module in metricbeat and got the stale badge.

I'm not sure I understand. If you have never started to collect metrics, then we won't be aware of that instance at all, right? If however, you start to collect with Metricbeat and then later on decide to stop, then yes, it will show this warning because we only measure the time from the last timestamp. I don't think we have a way to distinguish a collection outage from a collection stop. (I wonder if Fleet/Agent could help since we could see that the Agent has been removed from such a policy perhaps, also related to the Health/Topology API ideas)

Sorry, yes, I meant what you described: trying to distinguish between intentionally turning off metrics and something going wrong. It seems a bit noisy and excessive to have the stale badges and warning icons if nothing is actually wrong, especially with the new "last seen" columns. I kind of feel like we're trying to replace the job of an alert notification here without the user opting in to it. Also, the fact that we're only doing it for Kibana and not the other products will probably cause confusion.

@jasonrhodes
Member

Just a reminder: the user problem we really need to solve is reporting green when an instance is down. Whatever solution we choose has to fix this problem, because it's a very embarrassing and, imo, indefensible state to find ourselves in for a customer during an outage.

as an operator if 3 are green and 1 is stale, I'd like to see that distinction called out. If I just see "stale" I'd presume all instances are stale.

I understand how it looks this way, but we don't handle any of the aggregate statuses at this level of granularity. I believe if you have 4 instances and 3 are green and 1 is red, we will show "Status: Red", is that right? We should match that functionality in this ticket and revisit holistically if we don't like it, but I think aggregate statuses should show the worst and entice you to dig in to see what the problem is.
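
(Illustrative only, not code from this PR: a "worst status wins" reduction over per-instance statuses could look roughly like this hypothetical helper.)

type InstanceStatus = 'green' | 'yellow' | 'red';
const severity: Record<InstanceStatus, number> = { green: 0, yellow: 1, red: 2 };

// The aggregate status is simply the worst status reported by any instance.
function aggregateStatus(statuses: InstanceStatus[]): InstanceStatus {
  return statuses.reduce(
    (worst, s) => (severity[s] > severity[worst] ? s : worst),
    'green' as InstanceStatus
  );
}

aggregateStatus(['green', 'green', 'green', 'red']); // => 'red'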

Yes, I meant what you described as trying to distinguish between intentionally turning off metrics vs something went wrong. Seems a bit noisy and excessive having the stale badges and warning icons if nothing is actually wrong, especially with the new columns of "last seen". Kind of feel like we're trying to replace the job of an alert notification here without the user opting for it.

Noisy is a potential problem, I agree, but it's the flip side of the situation where we don't notify the user at all and then they think things are great during an outage. Stale isn't itself a warning state (this is why we originally left the status in place and applied this extra notification on top of it, because it isn't really a full-fledged status). It's a tip that we haven't heard from at least 1 instance in a given time range, which is something you may or may not be able to ignore. If we can find a better solution that solves the main problem, I'm definitely open, but absent that I think we should move forward with this one for now.

Perhaps I'm not understanding something, but I notice after the default 15 minute time window elapses that I no longer see the Kibana instance with the stale badge. Not sure how useful this is if its just going to disappear outside of the time window anyway. The user would probably never see the kibana instance with the stale badge unless they made the time frame longer.

This is true and also okay, because of the problem we're solving. If the instance has disappeared from the window under investigation, we don't know it exists, so we don't show it. But if a user has a graph pinned to "Last 48 Hours" for some reason and a Kibana node goes down, but reports as "Green: Healthy" for 47 hours and 59 minutes, that's a scenario we can't defend.

Also the fact we're only doing it for Kibana and not the other products will probably cause confusion.

It might be a good idea to log tickets for other components and try to implement the same logic before a big customer has an ES outage and asks us why the Stack Monitoring page was reporting their ES nodes as green/healthy for hours during an outage. If we find a better solution to Kibana, we can apply it across the board as well.

@neptunian
Contributor

Perhaps I'm not understanding something, but I notice after the default 15 minute time window elapses that I no longer see the Kibana instance with the stale badge. Not sure how useful this is if its just going to disappear outside of the time window anyway. The user would probably never see the kibana instance with the stale badge unless they made the time frame longer.

This is true and also okay, because of the problem we're solving. If the instance has disappeared from the window under investigation, we don't know it exists, so we don't show it. But if a user has a graph pinned to "Last 48 Hours" for some reason and a Kibana node goes down, but reports as "Green: Healthy" for 47 hours and 59 minutes, that's a scenario we can't defend.

Makes sense. @miltonhultgren had said in a comment that this was due to a bug and the status should actually have been grey if the last Kibana document is more than 10 minutes old. I kind of like this because grey feels like there is no current status. I was thinking fixing this could suffice without the extra UI stuff, but if design is happy, I'm not fussed.

@miltonhultgren miltonhultgren merged commit ee7d9b0 into elastic:main Jun 8, 2022
@kibanamachine kibanamachine added the backport:skip (This commit does not require backporting) label Jun 8, 2022
Labels
backport:skip (This commit does not require backporting), Feature:Stack Monitoring, release_note:feature (Makes this part of the condensed release notes), Team:Infra Monitoring UI - DEPRECATED (use Team:obs-ux-infra_services), v8.4.0
Development

Successfully merging this pull request may close these issues.

[Stack Monitoring] Kibana should not report healthy when recent data is missing