[Alerting] Add a possibility to define a custom status for the alert instances #78981

Closed
YulNaumenko opened this issue Sep 30, 2020 · 18 comments
Labels
discuss Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@YulNaumenko
Contributor

Currently an alert instance can be in one of two statuses: Active and OK.
For example, the Maps team noticed that it would be useful to have a special status for an object contained inside a polygon.
When an object crosses the polygon border, the alert is triggered and becomes Active for just a moment; after the next execution it moves to OK and then disappears.
A 'Contained' alert instance status would help the user understand that the object has crossed the polygon border and is currently still inside.

@YulNaumenko YulNaumenko added discuss Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Sep 30, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote
Contributor

mikecote commented Oct 1, 2020

We can discuss this from another angle as well. Once we add UI support for multiple action groups, they should be reflected in the status somehow. As a discussion starting point, I had in mind the following:

  • Alert with two action groups: Warning and Severe
  • Instance statuses could then be: OK, Active (Warning) and Active (Severe)

This way it can keep a standard status across different types of alerts but allow a bit more information which would be tied to the action group. Using the issue description, usage of a Contained action group would work with this idea.
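As a rough illustration of this idea, the display status could be derived from the instance's current action group while the underlying status set stays fixed. This is a minimal sketch; `ActionGroup`, `InstanceState`, and `displayStatus` are hypothetical names for illustration, not part of Kibana's alerting API.

```typescript
// Hypothetical types for illustration only; not Kibana's actual alerting types.
interface ActionGroup {
  id: string;
  name: string;
}

interface InstanceState {
  // Undefined means the instance is not scheduled in any action group.
  actionGroupId?: string;
}

// Keep the standard OK/Active status and append the action group
// name as extra detail when the instance is active.
function displayStatus(instance: InstanceState, groups: ActionGroup[]): string {
  const id = instance.actionGroupId;
  if (id === undefined) {
    return "OK";
  }
  // Fall back to the raw id if the group is not registered.
  let label = id;
  for (const g of groups) {
    if (g.id === id) {
      label = g.name;
    }
  }
  return `Active (${label})`;
}

const exampleGroups: ActionGroup[] = [
  { id: "warning", name: "Warning" },
  { id: "severe", name: "Severe" },
];
```

With these example groups, an instance in the `severe` group would render as "Active (Severe)", and a `Contained` group from the issue description would fit the same pattern.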

@pmuellr
Member

pmuellr commented Oct 6, 2020

From the triage meeting, we left this as discuss as:

  • it seems like this is a case for multiple action groups
  • I don't think we should be allowing custom statuses, at least at the moment:
    • we have code that calculates alert status by the statuses of instances, so not clear how that calculation would change with new types of instance status
    • we have UI affordances (totals bar, "nice" version of the status values, etc) for the current fixed set, not clear how we'd handle new types of instance status for these

So, we'll need to check back with Maps to see if the multiple action groups will handle their needs, and if not, get a better understanding of the problem. It also makes me wonder, given the note about the instance status disappearing, if there is something about their alert instance strings that needs to be tweaked. IIRC, they embed a location in the alert instance string, which could lead to instances seeming to come and go.

@kindsun
Contributor

kindsun commented Nov 12, 2020

So, we'll need to check back with Maps to see if the multiple action groups will handle their needs, and if not, get a better understanding of the problem

@pmuellr Maybe, but the geo user experience might be better if we could explicitly assign the top-level states. It sounds like using action groups, we could have an action group called tracking_contained and one called tracking_outside. The active column would therefore contain a mix of tracking_contained alerts which are contained and tracking_outside alerts which are outside. At the state level it's telling the user more about the truthy state of the alert rather than the geo status of being contained vs. not. Maybe this is fine and what we're looking for is out-of-scope for an alerting dashboard, but I'd like to keep this issue open for now pending further discussion

@pmuellr
Member

pmuellr commented Nov 16, 2020

At the state level it's telling the user more about the truthy state of the alert rather than the geo status of being contained vs. not.

We recently merged PR #82275, which adds the action group to the alert details view, per row. So customers would see "active (tracking_contained)" or "active (tracking_outside)".

BTW, I purposefully named the default action group for the index threshold to "met threshold" (with a space!) because I figured we'd eventually want to make these "human readable", and so wanted to make sure we weren't "surprised" if someone else wanted to use spaces in their action groups. And there may be i18n stuff going on besides that to do further translations anyway.

Not clear if you knew that was coming or not - does that suit your needs? Or perhaps I'm misunderstanding the comment.

@kindsun
Contributor

kindsun commented Nov 19, 2020

@pmuellr After chatting a little further about it with folks from Maps and Alerting, what we ideally need is something dynamic and more granular. This would be roughly the ideal workflow:

  1. User creates tracking threshold alert
  2. Server executes shape query (one time on alert first run or re-enablement) using user-defined filters to determine shapes we're monitoring for point containment or non-containment
  3. Alert runs query using shapes from previous query as point filters, returning results that show which shapes contain which points
  4. This is where we might be able to use your help. If a point changed shapes, it's still contained and therefore active but its status has changed and is newly alertable. If it were throttled to 1, we'd still want a new single alert on shape change. The way we can hack the system a little bit is to create alert instances that have a name like: pointId (shapeId) so that it's treated as a separate alertable instance. In a perfect world, we might be able to do custom statuses that track and indicate "active variants", different states of active that in our case would be different shapes/regions.

We're still determining how best to use action groups, but I don't think they'll help us here since the categories we use are dynamically determined (in step 2 above). Also since action groups are more user facing, I can't imagine a user explicitly defining the conditions for each of the 50 states of the USA (as an example).
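The instance naming workaround from step 4 could look roughly like the sketch below. The `Containment` type and both helper functions are hypothetical and only illustrate the `pointId (shapeId)` convention; nothing here is an existing Kibana API.

```typescript
// Hypothetical shape of one containment result from the query in step 3.
interface Containment {
  pointId: string;
  shapeId: string;
}

// Compose the instance id so that each (point, shape) pair
// is treated as a separate alertable instance.
function toInstanceId(c: Containment): string {
  return `${c.pointId} (${c.shapeId})`;
}

// Recover the parts of an instance id; returns undefined when the
// id does not follow the "pointId (shapeId)" convention.
function parseInstanceId(id: string): Containment | undefined {
  const match = /^(.+) \((.+)\)$/.exec(id);
  const pointId = match?.[1];
  const shapeId = match?.[2];
  if (pointId === undefined || shapeId === undefined) {
    return undefined;
  }
  return { pointId, shapeId };
}
```

Under this convention, a point that moves from one shape to another produces a new instance id, so the old instance goes away and the new one is alertable even when throttled.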

@pmuellr
Member

pmuellr commented Nov 19, 2020

Thanks for the additional detail, this is great! More use cases like "notify when a user moves into a different US state" are good!

pointId (shapeId) could work, but then you wouldn't be able to do any historic tracking via the event log on just the point, just when the point was in a particular shape. Entirely possible this is fine for you, I assume the points are being tracked in the app somewhere else anyway. Also, seems entirely possible that we could allow some "wildcard-y" searches through the event log, if the instanceId is shaped in a consistent way for particular events. Instances this way are actually kind of nice, in that you'd get a "resolved" capability when they LEFT a shape.

What happens if shapes overlap?

My first thought is that the instanceId should basically be pointId (although, you want some human readable version of it, if at all possible, for use in the generic alerting tools). The shape(s) could be part of the context variables, so they could be available during action executions (leftShapes: [...], enteredShape: [...]). This isn't enough of a change to break a throttle though, so it doesn't meet your requirements.
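A minimal sketch of how those context variables might be computed, assuming the rule executor knows the shapes that contained the point on the previous run. The `ShapeContext` type and `diffShapes` function are hypothetical; only the `leftShapes` and `enteredShape` names come from the comment above.

```typescript
// Context variables for one point, using the names suggested above.
interface ShapeContext {
  enteredShape: string[];
  leftShapes: string[];
}

// Compare the shapes containing a point on the previous run with the
// shapes containing it now. Overlapping shapes are handled naturally,
// because each side is a list.
function diffShapes(previous: string[], current: string[]): ShapeContext {
  return {
    enteredShape: current.filter((s) => previous.indexOf(s) === -1),
    leftShapes: previous.filter((s) => current.indexOf(s) === -1),
  };
}
```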

On the face of it, I'd prefer a new alert instance service function breakThrottle() to custom statuses, if it came down to it.

Is there any software in this area, doing this kind of stuff, already used in the field?

@kindsun
Contributor

kindsun commented Nov 19, 2020

@pmuellr Good questions/comments!

pointId (shapeId) could work, but then you wouldn't be able to do any historic tracking via the event log on just the point, just when the point was in a particular shape.

Correct. Not ideal for us, just our best idea for a workaround in the current code. We'd prefer to be able to do historic tracking of where the point was prior.

I assume the points are being tracked in the app somewhere else anyway.

We can add a layer in the Maps app that displays a point tracking layer containing all records of where a point has been and then overlay an indexed alert layer. It's not ideal for filtering though if the alert instance ID (123-someShape) is different from the normal ID (123).

What happens if shapes overlap?

In the proposed setup this would create different alert instances. The user would be alerted of newly active 123-someShape containment and also 123-someOtherShape. I think this is fine for now but might be worth some more thought.

My first thought is that the instanceId should basically be pointId (although, you want some human readable version of it, if at all possible, for use in the generic alerting tools).

Agreed!

The shape(s) could be part of the context variables, so they could be available during action executions (leftShapes: [...], enteredShape: [...]). This isn't enough of a change to break a throttle though, so it doesn't meet your requirements.

This is interesting. Can we include context variable data in the dash?

On the face of it, I'd prefer a new alert instance service function breakThrottle() to custom statuses, if it came down to it.

A function like breakThrottle could work, but it would put the onus on us to maintain a record of where a point was before to know that the current shape it finds itself in is actually a new one. We could do this, but since Alerting is already tracking status changes, custom statuses feels like a better fit (open to pushback here).
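To make that trade-off concrete, here is a sketch of the bookkeeping a rule type would have to carry if `breakThrottle()` existed. Everything here is hypothetical: `breakThrottle()` is only a proposal from this thread, and `InstanceHandle`, `LastShapeByPoint`, and `trackPoint` are invented for illustration.

```typescript
// Hypothetical instance handle; breakThrottle() is only a proposal in this thread.
interface InstanceHandle {
  scheduleActions(actionGroup: string): void;
  breakThrottle(): void;
}

// State the rule type would have to persist between runs itself.
type LastShapeByPoint = Record<string, string>;

// Schedule the point as contained and break the throttle only when
// the containing shape differs from the one recorded on the last run.
function trackPoint(
  pointId: string,
  shapeId: string,
  previous: LastShapeByPoint,
  instance: InstanceHandle
): LastShapeByPoint {
  instance.scheduleActions("contained");
  const last: string | undefined = previous[pointId];
  if (last !== undefined && last !== shapeId) {
    instance.breakThrottle();
  }
  // Return the updated state for the next run.
  return { ...previous, [pointId]: shapeId };
}
```

The returned record is the "where was this point before" state that the rule type, rather than the alerting framework, would need to maintain.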

Is there any software in this area, doing this kind of stuff, already used in the field?

I was just thinking of having a setStatusCategories function but I'm very likely oversimplifying it since I don't know what the underlying code looks like.

@pmuellr
Member

pmuellr commented Nov 19, 2020

pointId (shapeId) could work, but then you wouldn't be able to do any historic tracking via the event log on just the point, just when the point was in a particular shape.

Correct. Not ideal for us, just our best idea for a workaround in the current code. We'd prefer to be able to do historic tracking of where the point was prior.

The event log does not currently store any alert-specific information, so there's no opportunity to store the locations. We'll probably stick to this line for as long as we can, otherwise we'll have more mapping complexities, potential for alerts storing huge documents, etc. We tend to think of alerting storage as "just alerting-related bits" and not "app-related data", so we'd prefer to keep app-related data linked somehow (via saved object references, instance ids, etc), instead of duplicated.

We can add a layer in the Maps app that displays a point tracking layer containing all records of where a point has been and then overlay an indexed alert layer. It's not ideal for filtering though if the alert instance ID (123-someShape) is different from the normal ID (123).

So again, those "records" won't be event log records, they would need to be something you store in separate SO's, or encode in the instanceId. Basically, you shouldn't consider the event log to be your primary store.

This is interesting. Can we include context variable data in the dash?

Currently no, IIUC. The mustache templating variables provided by an alert that can be used in action params are transient, we don't store those anywhere. We render the action params with the variables provided, and store the results of those renderings, which are the action params themselves, and even those are transient - they exist within task manager documents, but will be deleted once the action successfully executes.

We could do this, but since Alerting is already tracking status changes, custom statuses feels like a better fit (open to pushback here). ... I was just thinking of having a setStatusCategories() function but I'm very likely oversimplifying it since I don't know what the underlying code looks like.

I think having the statuses open-ended, extendable by alerts, is going to be a bit of a complexity nightmare. Maybe it makes sense to have a "side car" property (string(s)) on the status that could be alert-specific (and completely ignored by alerting itself), that could also be displayed in UIs. @mikecote?
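One way to picture the "side car" idea: the fixed status set stays as it is, the aggregate calculation ignores the extra string, and only the UI reads it. The types and function names below are hypothetical and do not reflect Kibana's actual alerting model.

```typescript
// Hypothetical types for illustration; not Kibana's actual alerting model.
type InstanceStatus = "OK" | "Active";

interface InstanceSummary {
  status: InstanceStatus;
  // Rule-type-specific "side car" text, e.g. the containing shape.
  // The framework never interprets it.
  detail?: string;
}

// The aggregate status is computed from the fixed status set only,
// so this calculation does not change when details are attached.
function aggregateStatus(instances: InstanceSummary[]): InstanceStatus {
  return instances.some((i) => i.status === "Active") ? "Active" : "OK";
}

// UIs may show the detail next to the standard status.
function renderStatus(summary: InstanceSummary): string {
  return summary.detail === undefined
    ? summary.status
    : `${summary.status} (${summary.detail})`;
}
```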

@mikecote
Contributor

I think having the statuses open-ended, extendable by alerts, is going to be a bit of a complexity nightmare. Maybe it makes sense to have a "side car" property (string(s)) on the status that could be alert-specific (and completely ignored by alerting itself), that could also be displayed in UIs. @mikecote?

Agreed. @aaronjcaldwell I'm trying to understand how the "shapes" are defined by the user and maybe the alert creation flow will help me better understand what we can propose here. This could be a case for dynamic action groups 🙈 but I may be wrong.

@kindsun
Contributor

kindsun commented Nov 19, 2020

@mikecote @pmuellr

I think having the statuses open-ended, extendable by alerts, is going to be a bit of a complexity nightmare.

We're open to other options. We just need to accomplish the following: When a point moves to a different shape, let's say it's a car that goes from Alabama to Georgia, an alert is triggered. It was active before and it's active now, but its containing State has changed.

I'm trying to understand how the "shapes" are defined by the user and maybe the alert creation flow will help me better understand what we can propose here.

This comment gives the high-level version, but I can do a deeper dive if needed! Upon creation of a single alert, a user selects an index and filters to select a number of shapes for containment tracking. These could really be any shapes, a single square in the middle of the ocean or every zip code in the USA. It's pretty open ended.

This could be a case for dynamic action groups but I may be wrong.

This possibly is a case for dynamic action groups. The only reason we aren't considering action groups currently is because they're static and declared upfront when the alert type is registered. If they were dynamic, we could have an action group per shape. I'm not entirely sure how this would play out in the UI though.

@pmuellr
Member

pmuellr commented Nov 20, 2020

How would dynamic action groups work from a UI view? You certainly don't want to have to add the same actions to 50 action groups in the UI!

@kindsun
Contributor

kindsun commented Nov 20, 2020

I'd also have questions about the interplay of the current static action groups with dynamic action groups. We're already thinking static action groups will be useful to us to represent contained and resolved (no longer contained). We don't want to add to these action groups, but rather alert on subsets of contained.

@gmmorris
Contributor

gmmorris commented Dec 3, 2020

Hi @aaronjcaldwell ,
I've picked up Patrick's issue as per the above discussion, and in this PR we introduce a concept of "Action Subgroup", which is basically a way for you to specify that an instance has changed state in a manner that schedules actions but is still part of the same action group.

In this way, you could have Contained as an action group, and then schedule action as Contained with a dynamic Subgroup which is the shape in which the instance is contained.

Would this address your need?
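The subgroup idea can be pictured with the sketch below, which decides whether a throttled instance should fire again. It is not the code from the PR: `ScheduledState` and `shouldNotify` are hypothetical, and treating an action group change as throttle-breaking is an assumption made for this example.

```typescript
// Hypothetical record of what was last scheduled for an instance.
interface ScheduledState {
  actionGroup: string;
  subgroup?: string;
  lastNotifiedAt: number; // epoch milliseconds
}

// Decide whether to fire actions now. A subgroup change counts as a
// state change even though the action group is the same.
function shouldNotify(
  previous: ScheduledState | undefined,
  nextGroup: string,
  nextSubgroup: string | undefined,
  now: number,
  throttleMs: number
): boolean {
  if (previous === undefined) {
    return true; // newly active instance
  }
  if (previous.actionGroup !== nextGroup) {
    return true; // assumption: a group change bypasses the throttle
  }
  if (previous.subgroup !== nextSubgroup) {
    return true; // same group, different subgroup (e.g. a new shape)
  }
  // Nothing changed: fire only once the throttle window has elapsed.
  return now - previous.lastNotifiedAt >= throttleMs;
}
```

In the car example from earlier in the thread, moving from Alabama to Georgia keeps the instance in `Contained`, but the subgroup changes, so the sketch returns `true` even inside the throttle window.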

@kindsun
Contributor

kindsun commented Dec 3, 2020

@gmmorris I believe this would fit our needs perfectly! A few questions:

  • What would be the interplay of this with Ability to throttle alert instances until action group changes #50077?
  • Does resolution of an action group automatically resolve all its sub-actions?
  • If a change resolves both an action group and a sub-action group, does it fire two actions? In our case imagine a car tracked in the USA driving north from Minnesota to Canada, it both leaves the USA (contained action group) and Minnesota (contained sub-action group)

@gmmorris
Contributor

gmmorris commented Dec 3, 2020

I need to sync with @ymao1 but I'd expect it to be the same as with normal action groups.

  • Does resolution of an action group automatically resolve all its sub-actions?

Subgroups are just a way of identifying when you might want to fire fresh actions even though the instance is still in the same action group as before. This means recovery should be unaffected, as the instance hasn't recovered in that case, but rather remained active within another group.

  • If a change resolves both an action group and a sub-action group, does it fire two actions? In our case imagine a car tracked in the USA driving north from Minnesota to Canada, it both leaves the USA (contained action group) and Minnesota (contained sub-action group)

It's impossible to recover from a subgroup, only from a normal action group, so it's the same as before.
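A small sketch of the semantics described in these answers, assuming recovery is determined by the action group alone. The `Scheduled` and `Transition` types and the `classify` function are hypothetical names for illustration.

```typescript
// Hypothetical snapshot of an instance on one run; undefined means
// the instance was not scheduled at all on that run.
interface Scheduled {
  actionGroup: string;
  subgroup?: string;
}

type Transition =
  | "new"
  | "recovered"
  | "group-changed"
  | "subgroup-changed"
  | "unchanged";

// Classify what happened between two consecutive runs. Recovery only
// depends on whether the instance is scheduled at all; a subgroup can
// change but cannot be "recovered from" separately.
function classify(
  before: Scheduled | undefined,
  after: Scheduled | undefined
): Transition {
  if (after === undefined) {
    return before === undefined ? "unchanged" : "recovered";
  }
  if (before === undefined) {
    return "new";
  }
  if (before.actionGroup !== after.actionGroup) {
    return "group-changed";
  }
  if (before.subgroup !== after.subgroup) {
    return "subgroup-changed";
  }
  return "unchanged";
}
```

For the car driving from Minnesota into Canada, `before` is `{ actionGroup: "contained", subgroup: "Minnesota" }` and `after` is `undefined`, so the sketch reports a single "recovered" transition rather than two.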

@gmmorris gmmorris added the Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework label Jul 1, 2021
@gmmorris
Contributor

gmmorris commented Jul 6, 2021

After chatting with @aaronjcaldwell it sounds like #84751 addresses this need for the Maps team.

Do we think there's more to this beyond what we've already delivered?

@gmmorris
Contributor

After chatting with @aaronjcaldwell it sounds like #84751 addresses this need for the Maps team.

Do we think there's more to this beyond what we've already delivered?

After chatting with @mikecote I'm going to close this. It seems like our SubActionGroups addressed the core need, and we feel that's sufficient.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022

7 participants