
[Stack Monitoring] Add stale status reporting for Kibana #132613

Merged

Conversation

miltonhultgren
Contributor

@miltonhultgren miltonhultgren commented May 20, 2022

Summary

Fixes #126386

This PR adds visual warnings in the Stack Monitoring UI when one or more Kibana instances have a delay in their stats reporting. The delay can be configured with a kibana.yml setting and defaults to 120 seconds.

Cluster overview:
Screenshot 2022-06-02 at 14 18 58

Kibana overview:
Screenshot 2022-06-02 at 14 25 44

Kibana instances:
Screenshot 2022-06-02 at 14 25 58

Kibana instances table row:

Screenshot 2022-06-02 at 14 27 30

Kibana instance details:
Screenshot 2022-06-02 at 14 36 23

How to test

  1. Set monitoring.ui.kibana.reporting.stale_status_threshold_seconds to something low (like 10) in your kibana.yml (see the example after this list)
  2. Ingest some Stack Monitoring data for Kibana, either with Internal Collection or Metricbeat (easier)
  3. Stop collection and wait for reports to become stale
  4. Verify that the UI reports the stale status information
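
For step 1, a minimal kibana.yml sketch (the value is just for testing; the setting defaults to 120 seconds):

# kibana.yml
monitoring.ui.kibana.reporting.stale_status_threshold_seconds: 10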

To do

  • Hide Stale tag in Setup mode
  • Update tooltips to describe a partial state (some stale, some active)

Checklist

@miltonhultgren miltonhultgren force-pushed the 126386-add-last-seen-reporting branch from 5b23bbe to ec54671 Compare May 20, 2022 17:35
@miltonhultgren miltonhultgren force-pushed the 126386-add-last-seen-reporting branch from 5d7199d to 011a04f Compare May 31, 2022 18:55
@miltonhultgren miltonhultgren changed the title [Stack Monitoring] Add stale status reporting to Kibana endpoints (#1… [Stack Monitoring] Add stale status reporting for Kibana May 31, 2022
@miltonhultgren miltonhultgren added the Team:Infra Monitoring UI, Feature:Stack Monitoring, release_note:feature, and v8.4.0 labels May 31, 2022
@miltonhultgren miltonhultgren marked this pull request as ready for review May 31, 2022 19:09
@miltonhultgren miltonhultgren requested a review from a team as a code owner May 31, 2022 19:09
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@miltonhultgren
Contributor Author

@elastic/observability-design Since @katefarrar is out, I would love some feedback on this improvised design change!

@matschaffer
Contributor

Wondering. Does this end up firing in the event of a kibana instance replacement as well? Could probably launch it on ESS and scale kibana up/down to test.

@formgeist
Contributor

@miltonhultgren I would like to propose the following design changes to the indication 👍

Cluster overview: Screenshot 2022-05-31 at 21 02 46

Kibana overview: Screenshot 2022-05-31 at 21 03 08

Kibana instances: Screenshot 2022-05-31 at 21 03 22

I'd propose replacing the EuiIcon glyph+text combination with an EuiBadge instance, like so:

CleanShot 2022-06-01 at 11 37 43@2x

<EuiBadge iconType="alert" color="warning">
  Stale
</EuiBadge>

Kibana instances table row: Screenshot 2022-05-31 at 21 03 40

You can continue to use the EuiIcon implementation here in the table column.

Kibana instance details: Screenshot 2022-05-31 at 21 03 53

I'd convert this to the EuiBadge example as mentioned above.

@miltonhultgren
Contributor Author

@formgeist Thanks for the swift feedback, will implement!

@smith
Contributor

smith commented Jun 1, 2022

"since we heard" sounds like it's missing a helping verb. "since we have heard" or "since we've heard" sounds better.

@miltonhultgren
Contributor Author

@formgeist @smith Applied your feedback, thanks!

@miltonhultgren miltonhultgren requested a review from a team as a code owner June 3, 2022 11:54
@neptunian neptunian self-requested a review June 3, 2022 18:02
@@ -130,6 +130,7 @@ export default function ({ getService }: PluginFunctionalProviderContext) {
'monitoring.kibana.collection.enabled (boolean)',
'monitoring.kibana.collection.interval (number)',
'monitoring.ui.ccs.enabled (boolean)',
'monitoring.ui.kibana.reporting.stale_status_threshold_seconds (number)',
Contributor


Platform Security review:

expectedExposedConfigKeys integration test change LGTM

@miltonhultgren
Contributor Author

@matschaffer I ran a test on Cloud: when the instances stop reporting, we mark them as stale as they rotate out.

Scaled from 1 to 3 (original is killed and marked as stale, 3 new ones are green):
Screenshot 2022-06-03 at 21 43 50

Scale back down to 1 and change time range (original is rotated out, one green, two stale):
Screenshot 2022-06-03 at 21 44 02

Change time range (two stale ones rotate out, one green, aggregate status back to green):
Screenshot 2022-06-03 at 21 44 14

I'm curious why scaling up gives me 3 new IDs but scaling down only kills 2 IDs and keeps 1 ID. I'd expect the scale-up to keep the original instance running and simply add 2, but that's the Kubernetes way, so it might not apply here.

@matschaffer
Contributor

I'm curious why scaling up gives me 3 new IDs but scaling down only kills 2 IDs and keeps 1 ID. I'd expect the scale up to keep the original instance running and simply add 2 but that's the Kubernetes way so it might not apply.

There are a lot of variables to account for there. For example, the VM your first kibana was on may have been marked for removal by the cloud provider.

So given the behavior, I'm a little concerned about what this will look like.

image

Since any migration of the kibana instance (even typical cloud maintenance migrations) will reflect as "stale", it'd be good if we could clarify that we have a mix of stale/good, I think.

@neptunian
Contributor

neptunian commented Jun 6, 2022

In the scenario where I have intentionally decided not to collect Kibana metrics on one or more instances, should this appear? Or does this only address the scenario where something went wrong and the user would want to be alerted/notified like this? I noticed this scenario after I enabled the kibana module in metricbeat, then disabled it, and got the stale badge.

Perhaps I'm not understanding something, but I notice that after the default 15 minute time window elapses, I no longer see the Kibana instance with the stale badge. Not sure how useful this is if it's just going to disappear outside of the time window anyway. The user would probably never see the kibana instance with the stale badge unless they made the time frame longer.

@neptunian
Contributor

In setup mode there is always the stale badge even if I've never collected metrics on Kibana yet.
Screen Shot 2022-06-06 at 2 54 25 PM

@miltonhultgren
Contributor Author

@elasticmachine merge upstream

@miltonhultgren
Contributor Author

There are a lot of variables to account for there. For example if the VM your first kibana was on got marked for removal by the cloud provider.

So given the behavior, I'm a little concerned about what this will look like.

image

Since any migration of the kibana instance (even typical cloud maintenance migrations) will reflect as "stale". It'd be good if we can clarify that we have a mix of stale/good I think.

Showing the warning feels correct; that instance we used to hear from is no longer reporting.
I'm not sure if we, from within Stack Monitoring, can know why that is the case (expected vs unexpected, hello Health/Topology API).

@matschaffer We could add to that text something like "(1 of 4)"?
@formgeist Is copy text like this something you can give feedback on too? :)

@matschaffer
Contributor

We could add to that text something like "(1 of 4)"?

Yeah, that would make more sense to me. Not sure how to represent it visually (maybe @formgeist can help) but as an operator if 3 are green and 1 is stale, I'd like to see that distinction called out. If I just see "stale" I'd presume all instances are stale.

@miltonhultgren
Contributor Author

miltonhultgren commented Jun 7, 2022

In the scenario I have intentionally decided to not collect Kibana metrics on one or more instances should this appear? Or does this only address the scenario where something went wrong and the user would want to be alerted/notified like this? I noticed this scenario after I enabled the kibana module in metricbeat and then disabled the kibana module in metricbeat and got the stale badge.

I'm not sure I understand. If you have never started to collect metrics, then we won't be aware of that instance at all, right?
If, however, you start to collect with Metricbeat and later decide to stop, then yes, it will show this warning, because we only measure the time since the last timestamp. I don't think we have a way to distinguish a collection outage from an intentional collection stop. (I wonder if Fleet/Agent could help, since we could see that the Agent has been removed from such a policy; this also relates to the Health/Topology API ideas.)
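
(A minimal sketch of that check, using hypothetical names rather than the actual Stack Monitoring code: an instance is flagged as stale once its latest document's timestamp is older than the configured threshold.)

// Hypothetical helper, not the PR's actual implementation.
const DEFAULT_STALE_THRESHOLD_SECONDS = 120; // default mentioned in the summary

function isReportingStale(
  lastTimestamp: string,
  thresholdSeconds: number = DEFAULT_STALE_THRESHOLD_SECONDS
): boolean {
  const ageMs = Date.now() - new Date(lastTimestamp).getTime();
  return ageMs > thresholdSeconds * 1000;
}

// Example: with the default threshold, a document last seen 10 minutes ago reads as stale.
isReportingStale(new Date(Date.now() - 10 * 60 * 1000).toISOString()); // => true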

Perhaps I'm not understanding something, but I notice after the default 15 minute time window elapses that I no longer see the Kibana instance with the stale badge. Not sure how useful this is if its just going to disappear outside of the time window anyway. The user would probably never see the kibana instance with the stale badge unless they made the time frame longer.

That's 100% true, though the same happens for Elasticsearch and the whole cluster. If you turn off Metricbeat and wait for the 15-minute window to move, you get the no-data/couldn't-find-cluster screen.

I don't know what to do about this. It's the same problem as the Entity Model for Infra Metrics.
We would have to make two queries: one to first grab a list of all the instances we've ever seen, then a second query for the metrics for those instances in the last 15 minutes. If we did that we could keep the stale instances in scope, but that list might become big. So I wonder if we need some kind of lookback window, like some alerts use, to limit the scope? The last 24 hours?
@jasonrhodes This feels like a bigger problem than originally scoped, and it needs thought.
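
(A rough sketch of the two-query idea above, for illustration only; the index pattern and field names are assumptions rather than the actual Stack Monitoring queries, using the Elasticsearch JS client.)

import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Query 1: every Kibana instance UUID seen within a fixed lookback window (e.g. the last 24h).
async function listInstanceUuids(lookback: string = 'now-24h'): Promise<string[]> {
  const res = await es.search({
    index: '.monitoring-kibana-*', // assumed index pattern
    size: 0,
    query: { range: { timestamp: { gte: lookback } } },
    aggs: { instances: { terms: { field: 'kibana_stats.kibana.uuid', size: 1000 } } }, // assumed field
  });
  const buckets = (res.aggregations as any)?.instances?.buckets ?? [];
  return buckets.map((b: any) => b.key);
}

// Query 2: metrics for those instances limited to the selected time range, so instances
// with no recent documents can still be listed (and flagged as stale) instead of vanishing.
async function fetchMetricsForInstances(uuids: string[], from: string = 'now-15m') {
  return es.search({
    index: '.monitoring-kibana-*',
    query: {
      bool: {
        filter: [
          { terms: { 'kibana_stats.kibana.uuid': uuids } },
          { range: { timestamp: { gte: from } } },
        ],
      },
    },
  });
}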

In setup mode there is always the stale badge even if I've never collected metrics on Kibana yet.

I'll fix that.

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
monitoring 503 504 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
monitoring 471.5KB 476.5KB +5.0KB

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
monitoring 23.7KB 23.8KB +85.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@neptunian
Contributor

In the scenario I have intentionally decided to not collect Kibana metrics on one or more instances should this appear? Or does this only address the scenario where something went wrong and the user would want to be alerted/notified like this? I noticed this scenario after I enabled the kibana module in metricbeat and then disabled the kibana module in metricbeat and got the stale badge.

I'm not sure I understand. If you have never started to collect metrics, then we won't be aware of that instance at all, right? If however, you start to collect with Metricbeat and then later on decide to stop, then yes, it will show this warning because we only measure the time from the last timestamp. I don't think we have a way to distinguish a collection outage from a collection stop. (I wonder if Fleet/Agent could help since we could see that the Agent has been removed from such a policy perhaps, also related to the Health/Topology API ideas)

Sorry, yes, I meant what you described: trying to distinguish between intentionally turning off metrics and something going wrong. It seems a bit noisy and excessive to have the stale badges and warning icons if nothing is actually wrong, especially with the new "last seen" columns. I kind of feel like we're trying to replace the job of an alert notification here without the user opting in to it. Also, the fact that we're only doing it for Kibana and not the other products will probably cause confusion.

@jasonrhodes
Member

Just a reminder: the user problem we really need to solve is reporting green when an instance is down. Whatever solution we choose has to fix this problem, because it's a very embarrassing and, imo, indefensible state to find ourselves in for a customer during an outage.

as an operator if 3 are green and 1 is stale, I'd like to see that distinction called out. If I just see "stale" I'd presume all instances are stale.

I understand how it looks this way, but we don't handle any of the aggregate statuses at this level of granularity. I believe if you have 4 instances and 3 are green and 1 is red, we will show "Status: Red", is that right? We should match that functionality in this ticket and revisit holistically if we don't like it, but I think aggregate statuses should show the worst and entice you to dig in to see what the problem is.
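
(Illustrative only, not code from this PR: a "worst status wins" reduction over per-instance statuses could look roughly like this hypothetical helper.)

type InstanceStatus = 'green' | 'yellow' | 'red';
const severity: Record<InstanceStatus, number> = { green: 0, yellow: 1, red: 2 };

// The aggregate status is simply the worst status reported by any instance.
function aggregateStatus(statuses: InstanceStatus[]): InstanceStatus {
  return statuses.reduce(
    (worst, s) => (severity[s] > severity[worst] ? s : worst),
    'green' as InstanceStatus
  );
}

aggregateStatus(['green', 'green', 'green', 'red']); // => 'red'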

Yes, I meant what you described as trying to distinguish between intentionally turning off metrics vs something went wrong. Seems a bit noisy and excessive having the stale badges and warning icons if nothing is actually wrong, especially with the new columns of "last seen". Kind of feel like we're trying to replace the job of an alert notification here without the user opting for it.

Noisy is a potential problem, I agree, but it's the flip side of the situation where we don't notify the user at all and then they think things are great during an outage. Stale isn't itself a warning state (this is why we originally left the status in place and applied this extra notification on top of it, because it isn't really a full-fledged status). It's a tip that we haven't heard from at least 1 instance in a given time range, which is something you may or may not be able to ignore. If we can find a better solution that solves the main problem, I'm definitely open, but absent that I think we should move forward with this one for now.

Perhaps I'm not understanding something, but I notice after the default 15 minute time window elapses that I no longer see the Kibana instance with the stale badge. Not sure how useful this is if its just going to disappear outside of the time window anyway. The user would probably never see the kibana instance with the stale badge unless they made the time frame longer.

This is true and also okay, because of the problem we're solving. If the instance has disappeared from the window under investigation, we don't know it exists, so we don't show it. But if a user has a graph pinned to "Last 48 Hours" for some reason and a Kibana node goes down, but reports as "Green: Healthy" for 47 hours and 59 minutes, that's a scenario we can't defend.

Also the fact we're only doing it for Kibana and not the other products will probably cause confusion.

It might be a good idea to log tickets for other components and try to implement the same logic before a big customer has an ES outage and asks us why the Stack Monitoring page was reporting their ES nodes as green/healthy for hours during an outage. If we find a better solution to Kibana, we can apply it across the board as well.

@neptunian
Contributor

Perhaps I'm not understanding something, but I notice after the default 15 minute time window elapses that I no longer see the Kibana instance with the stale badge. Not sure how useful this is if its just going to disappear outside of the time window anyway. The user would probably never see the kibana instance with the stale badge unless they made the time frame longer.

This is true and also okay, because of the problem we're solving. If the instance has disappeared from the window under investigation, we don't know it exists, so we don't show it. But if a user has a graph pinned to "Last 48 Hours" for some reason and a Kibana node goes down, but reports as "Green: Healthy" for 47 hours and 59 minutes, that's a scenario we can't defend.

Makes sense. @miltonhultgren had said in a comment that this was due to a bug and the status should actually have been grey if the last Kibana document is more than 10 minutes old. I kind of like this because grey feels like there is no current status. I was thinking fixing this could suffice without the extra UI stuff, but if design is happy, I'm not fussed.

@miltonhultgren miltonhultgren merged commit ee7d9b0 into elastic:main Jun 8, 2022
@kibanamachine kibanamachine added the backport:skip (This commit does not require backporting) label Jun 8, 2022
Labels
backport:skip (This commit does not require backporting), Feature:Stack Monitoring, release_note:feature (Makes this part of the condensed release notes), Team:Infra Monitoring UI - DEPRECATED (use Team:obs-ux-infra_services), v8.4.0
Development

Successfully merging this pull request may close these issues.

[Stack Monitoring] Kibana should not report healthy when recent data is missing