[Alerting] Telemetry for potential rule execution guardrails #122535

mikecote · 2022-01-10T13:50:51Z

Before implementing guardrails and limits on alerting rules, we should gather data to validate where guardrails are necessary during rule execution. The following would be interesting to gather, along with others as we think of them.

Copied from #60315, we can start with the following:

| Telemetry question | Why | PR |
| --- | --- | --- |
| What is the maximum number of alerts a single rule execution created? | Knowing this value will indicate what scale of alerts we are dealing with and help decide if we need to guardrail against such large values. | #130479 |
| What is the maximum number of actions a rule has scheduled during a single execution? | Knowing this value will indicate what scale of actions we are dealing with and help decide if we need to guardrail against such large values. Note that the number of alerts and the number of actions are related, but the relationship varies: some alerts could be muted or throttled, and an alert can have more than one action defined. | #128891 |

elasticmachine · 2022-01-10T13:50:53Z

Pinging @elastic/response-ops (Team:ResponseOps)

mikecote · 2022-01-10T13:55:23Z

@pmuellr I created this telemetry issue for 8.1 to collect data on potential guardrails. If you feel there are other areas we should collect, can you add them to the table above?

This issue overlaps a bit with #113465 but covers scope beyond long-running rules.

mikecote · 2022-01-10T14:04:33Z

Prior conversation on capturing the max number of alerts created during execution => #116047.

pmuellr · 2022-01-19T21:25:26Z

Side discussion with Mike: we're thinking of capturing percentiles rather than min/max. For instance, we can easily compare p50 vs p90 and determine whether the value at p90 is an outlier: if it's much larger than p50, it probably is. If they're similar, the p90 is probably not an outlier. We're less interested in outliers than in things that are more consistently "not good".
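To make the p50 vs p90 comparison concrete, here is a minimal sketch. It is not how the telemetry was implemented: the `percentile` and `p90IsOutlier` helpers, the sample numbers, and the ratio threshold of 5 are all assumptions made up for illustration, and it uses a simple nearest-rank percentile.

```typescript
// Nearest-rank percentile over a list of per-execution counts.
// Hypothetical helper, not an existing Kibana function.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error("percentile of an empty list is undefined");
  }
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Treat p90 as an outlier when it is much larger than p50.
// The default ratio of 5 is an arbitrary assumption.
function p90IsOutlier(values: number[], ratio = 5): boolean {
  const p50 = percentile(values, 50);
  const p90 = percentile(values, 90);
  return p90 > p50 * ratio;
}

// Made-up alert counts per rule execution.
const steady = [10, 12, 11, 13, 12, 11, 10, 12, 13, 14];
const spiky = [10, 12, 11, 13, 12, 11, 10, 12, 900, 1000];

console.log(p90IsOutlier(steady)); // false: p50 = 12, p90 = 13
console.log(p90IsOutlier(spiky)); // true: p50 = 12, p90 = 900
```

A max-only metric would flag `spiky` but say nothing about whether large values are typical; comparing two percentiles separates the occasional spike from a consistently high count.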

gmmorris · 2022-01-20T09:40:38Z

stretch: How much event loop delay was occurring during a rule execution?

I'm wondering if we can split this issue into several deliverables (no need for multiple issues, just multiple PRs).
I'm worried about the stretch causing us to miss the 8.1 FF date.
If in doubt, prioritise making incremental progress that makes the FF date.

mikecote · 2022-03-03T17:09:35Z

We should split this into smaller deliverables. There may already be an issue for "How much time was spent in Elasticsearch searches?".

mikecote · 2022-03-07T14:16:51Z

Removing "How much time was spent in Elasticsearch searches?" (moved to #125967) and "How much event loop delay was occurring during a rule execution?" (moved to #124366).

mikecote · 2022-03-07T14:18:06Z

I will leave it up to whoever picks up this issue to decide whether to do both telemetry questions at once, or to split them up and create a separate issue for one of them.

mikecote · 2022-03-07T15:27:36Z

Might be worth using percentiles and maybe breaking down by rule type.
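A rough sketch of what a per-rule-type breakdown could look like, assuming each execution yields a record with a rule type ID and a scheduled-action count. The `ExecutionSample` shape, the `example.*` rule type IDs, and both helper functions are hypothetical and are not taken from the alerting plugin.

```typescript
// Hypothetical shape of one rule execution sample.
interface ExecutionSample {
  ruleTypeId: string;
  scheduledActions: number;
}

// Nearest-rank percentile; hypothetical helper.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error("percentile of an empty list is undefined");
  }
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Group samples by rule type, then compute the requested percentiles per group.
function percentilesByRuleType(
  samples: ExecutionSample[],
  ps: number[] = [50, 90, 99]
): Record<string, Record<string, number>> {
  const grouped = new Map<string, number[]>();
  for (const s of samples) {
    const values = grouped.get(s.ruleTypeId) ?? [];
    values.push(s.scheduledActions);
    grouped.set(s.ruleTypeId, values);
  }
  const result: Record<string, Record<string, number>> = {};
  grouped.forEach((values, ruleTypeId) => {
    const row: Record<string, number> = {};
    for (const p of ps) {
      row[`p${p}`] = percentile(values, p);
    }
    result[ruleTypeId] = row;
  });
  return result;
}

// Made-up samples: ten executions of one rule type, two of another.
const samples: ExecutionSample[] = [
  ...[1, 2, 3, 4, 5, 6, 7, 8, 9, 10].map((n) => ({
    ruleTypeId: "example.ruleA",
    scheduledActions: n,
  })),
  { ruleTypeId: "example.ruleB", scheduledActions: 100 },
  { ruleTypeId: "example.ruleB", scheduledActions: 200 },
];

const breakdown = percentilesByRuleType(samples);
console.log(breakdown);
```

Breaking the numbers down this way keeps one noisy rule type from hiding the behaviour of the others in a single global percentile.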

mikecote mentioned this issue Jan 10, 2022

More alerting services telemetry #60315

Closed

3 tasks

mikecote mentioned this issue Jan 10, 2022

[Alerting] There is no efficient way to retrieve info about number of alert instances created on rule execution. #116047

Closed

pmuellr self-assigned this Jan 19, 2022

mikecote mentioned this issue Jan 24, 2022

[ResponseOps] further rate-limit the number of action executions #122544

Open

kobelb added the needs-team label Jan 31, 2022

botelastic bot removed the needs-team label Jan 31, 2022

This was referenced Feb 7, 2022

Research how to add a circuit breaker for max number of active alerts per rule #124870

Closed

Add circuit breaker for max number of actions a rule can schedule per execution #124871

Closed

pmuellr removed their assignment Mar 3, 2022

ymao1 self-assigned this Mar 28, 2022

ymao1 mentioned this issue Mar 31, 2022

[Alerting] Add telemetry for number of scheduled actions during rule execution #128891

Merged

1 task

ymao1 mentioned this issue Apr 19, 2022

[Alerting] Tracking number of alerts in event log execute document and adding telemetry for it. #130479

Merged

1 task

ymao1 closed this as completed in #130479 Apr 26, 2022
