Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Metrics UI] Add support for severity levels to threshold alerts #88591

Closed
sorantis opened this issue Jan 18, 2021 · 12 comments · Fixed by #90070
Closed

[Metrics UI] Add support for severity levels to threshold alerts #88591

sorantis opened this issue Jan 18, 2021 · 12 comments · Fixed by #90070
Assignees
Labels
enhancement New value added to drive a business result Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services

Comments

@sorantis
Copy link

User story
As a DevOps engineer I want to be able to set multiple threshold values based on severity levels, so that I could channel notifications based on the severity levels.

Describe the feature:
Today our threshold alerts support only one default severity level. Customers are asking to introduce multiple severity levels, e.g. warning, critical. The Kibana alerting team has been working on some design mockups for multi-level notifications, which can be found here.

In 7.11 the alerting team will has added generic UI for the definition of conditions for Action Groups. We will need to incorporate multiple severity levels in our alerts. Note, this can (and probably will) be applicable to other observability alert types.

Currently the following thresholds are interesting from the Inventory/Metric alerting perspective: warning threshold, alert/critical threshold. However other solutions have different assumptions. This needs to be aligned.

cc @cyrille-leclerc, @mukeshelastic.

@sorantis sorantis added enhancement New value added to drive a business result Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services labels Jan 18, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@sgrodzicki sgrodzicki added this to the Metrics UI 7.12 milestone Jan 18, 2021
@Zacqary Zacqary self-assigned this Jan 19, 2021
@Zacqary Zacqary removed their assignment Jan 21, 2021
@Zacqary
Copy link
Contributor

Zacqary commented Jan 26, 2021

The design mockups don't easily translate to Metrics alerts since they're using the full-width layout:

Screen Shot 2021-01-26 at 4 06 15 PM

And we're still debating how to incorporate that layout into the multi-conditional alert types that we use in Observability, seems like we're targeting 7.13 for it: #69233

I can throw something together that works with the way we do multiple conditions right now, probably by including a Critical/Warning/Minor/etc. threshold component within each conditional expression. I don't expect it to look pretty, though. Is that a blocker, or do we just want to get this feature out even if the UI looks kind of janky?

@Zacqary
Copy link
Contributor

Zacqary commented Jan 29, 2021

Here's what I've got implemented so far:

Screen Shot 2021-01-29 at 1 04 22 PM

@katefarrar Thoughts?

Not sure I like the look of the Add warning threshold button in particular, but I'm not sure what to tweak about it.

@Zacqary
Copy link
Contributor

Zacqary commented Jan 29, 2021

Is there any reason why we should make it possible to select a different Comparator for the Critical vs. Warning threshold? Or should we have a single Comparator value and lock it to both of them?

It's currently a bit of a hassle to get the <ThresholdExpression> component to update in response to an outside param change rather than user input, so before I do the work to fix that I'd like to know if it's necessary.

@Zacqary
Copy link
Contributor

Zacqary commented Jan 29, 2021

If we're going to be switching to using action groups to differentiate between threshold states, we might want to remove the Alert on No Data checkbox:

Screen Shot 2021-01-29 at 3 20 02 PM

and instead handle No Data states with action groups.

This change would break backwards compatibility with existing No Data alerts so I'm not sure how we should approach releasing it? I think Alerting is out of beta now, but I really don't think we should wait for a Breaking release to make this kind of change. Especially if Warning thresholds are going to use their own separate action group.

@Zacqary
Copy link
Contributor

Zacqary commented Jan 29, 2021

Currently the following thresholds are interesting from the Inventory/Metric alerting perspective: warning threshold, alert/critical threshold

I feel like I want to call it the "Critical" threshold if the Warning threshold is present, but this would create a weird discrepancy with action group naming. Assuming we want to change the name of the Fired action group, it could be jarring for users without a Warning threshold configured to see this:

Screen Shot 2021-01-29 at 3 29 33 PM

So maybe Alert and Warning are better choices

Screen Shot 2021-01-29 at 3 29 37 PM

@Zacqary
Copy link
Contributor

Zacqary commented Jan 29, 2021

Oof hmm I don't like "Run when Alert"

Screen Shot 2021-01-29 at 3 33 13 PM

Currently the action group is called "Fired" so it says "Run when Fired" and "Run when Recovered," but I don't think we want to do:

Screen Shot 2021-01-29 at 3 36 01 PM

"Run when Critical" sounds fine, but see above for why I don't think it's a good idea to do that either.

I'm checking with the Alerting team to see if it's possible to dynamically change the contents of the Run when menu depending on the configured Alert parameters; if so, we could just call that action group Fired when no Warning threshold is configured, and switch it to Critical when it is. But barring that solution, what should we do about this linguistic problem?

EDIT: We do not have that option yet. Tracking it here: #89898

@Zacqary
Copy link
Contributor

Zacqary commented Jan 29, 2021

Screen Shot 2021-01-29 at 5 09 34 PM

I remember we did a lot of work figuring out the correct language for alert previews, but now I'm not sure what to do about multiple severity thresholds.

There were 5 critical and 2 warning instances that satisfied the conditions of this alert in the last hour

Seems closest to the current language but it feels off. "Satisfied the conditions of this alert" feels redundant when you enumerate the severity thresholds.

There were 5 instances that satisfied the critical conditions, and 2 instances that satisfied the warning conditions, of this alert in the last hour

Feels better but also needlessly verbose.

@Zacqary
Copy link
Contributor

Zacqary commented Feb 1, 2021

Screen Shot 2021-02-01 at 12 55 32 PM

With the current logic, this Warning threshold would never fire, because ALL conditions need to have their thresholds met. Should we even support this use case, in which one condition can have a Warning but not others, or should we force all conditions to have a Warning threshold if you enable it on one of them?

@katefarrar
Copy link
Contributor

Here's what I've got implemented so far:

Screen Shot 2021-01-29 at 1 04 22 PM

@katefarrar Thoughts?

Not sure I like the look of the Add warning threshold button in particular, but I'm not sure what to tweak about it.

we're going to add plusInCircleFilled before Add warning threshold to make it a little more clear

@katefarrar
Copy link
Contributor

katefarrar commented Feb 1, 2021

For the alerting language, I think this works well...

There were 5 instances that satisfied the critical conditions, and 2 instances that satisfied the warning conditions of this alert in the last hour.

I'd rather err on the side of verbose than unclear 🙂

@Zacqary
Copy link
Contributor

Zacqary commented Feb 1, 2021

For Group By alerts we currently say:

There were 5 instances across 3 hosts that satisfied the conditions...

What do we think of switching this to:

Across 3 hosts, there were 5 instances that satisfied the critical conditions, and 2 instances that satisfied the warning conditions...

I feel like that'd be less of a headache than continuing to put the number of hosts afterward

EDIT: Actually this would be a huge i18n burden to correctly handle the conditional capital letter at the beginning, so I think we should stick with the host number in the middle

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New value added to drive a business result Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants