[Alerting] Telemetry for potential rule execution guardrails #122535

Closed
mikecote opened this issue Jan 10, 2022 · 9 comments · Fixed by #130479
Labels
estimate:medium (Medium Estimated Level of Effort), Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework), Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams), telemetry (Issues related to the addition of telemetry to a feature)

Comments

@mikecote
Contributor

mikecote commented Jan 10, 2022

Before implementing guardrails and limitations to the alerting rules, we should gather data to validate where guardrails are necessary in relation to the rule execution. The following would be interesting to gather, and potentially others as we think of them.

Copied from #60315, we can start with the following:

| Telemetry questions | Why | PR |
| --- | --- | --- |
| What is the maximum number of alerts a single rule execution created? | Knowing these values will indicate what scale of alerts we are dealing with and decide if we need to guardrail against such large values. | #130479 |
| What is the maximum number of actions a rule has scheduled during a single execution? | Knowing these values will indicate what scale of actions we are dealing with and decide if we need to guardrail against such large values. Note that there is a relationship between number of alerts and actions, but it's variable as some alerts could be muted, throttled, and have more than one action defined. | #128891 |

@mikecote mikecote added the Team:ResponseOps, Feature:Alerting/RulesFramework, telemetry, and estimate:medium labels Jan 10, 2022
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote
Contributor Author

@pmuellr I created this telemetry issue for 8.1 to collect data on potential guardrails. If you feel there are other areas we should collect, can you add them to the table above?

This issue overlaps a bit with #113465 but covers scope beyond long-running rules.

@mikecote
Contributor Author

Prior conversation on capturing the max number of alerts created during execution => #116047.

@pmuellr pmuellr self-assigned this Jan 19, 2022
@pmuellr
Member

pmuellr commented Jan 19, 2022

Side discussion with Mike: we're thinking of capturing percentiles rather than min/max. For instance, we can easily compare p50 vs p90 and determine whether the value at p90 is an outlier, i.e. much larger than p50. If they're similar, the p90 is probably not an outlier. We're less interested in outliers than in things that are more consistently "not good".
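
A minimal sketch of the p50 vs p90 comparison described above, in TypeScript. The helper names (`percentile`, `p90IsOutlier`), the nearest-rank method, the ratio threshold of 5, and the sample counts are all assumptions made for illustration; none of them come from the alerting framework or the linked PRs.

```typescript
// Nearest-rank percentile over a list of per-execution counts
// (for example, alerts created per rule execution).
// Hypothetical helper, not a Kibana API.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error('percentile of an empty sample is undefined');
  }
  const sorted = values.slice().sort((a, b) => a - b);
  // Nearest-rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Returns true when p90 is much larger than p50.
// The default ratio threshold is an arbitrary assumption for illustration.
function p90IsOutlier(values: number[], ratioThreshold: number = 5): boolean {
  const p50 = percentile(values, 50);
  const p90 = percentile(values, 90);
  // Avoid dividing by zero when the median execution created nothing.
  if (p50 === 0) {
    return p90 > 0;
  }
  return p90 / p50 >= ratioThreshold;
}

// Made-up samples: one consistently sized, one with a heavy tail.
const steady = [10, 12, 11, 9, 10, 13, 11, 10, 12, 11];
const spiky = [10, 12, 11, 9, 10, 13, 11, 10, 900, 1200];

console.log('steady p90 outlier?', p90IsOutlier(steady));
console.log('spiky p90 outlier?', p90IsOutlier(spiky));
```

When p50 itself is high, the values are consistently large rather than occasionally large, which is the case this comment says is more interesting than a one-off outlier.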

@gmmorris
Contributor

stretch: How much event loop delay was occurring during a rule execution?

I'm wondering if we can split this issue into several deliverables (no need for multiple issues, just multiple PRs).
I'm worried about the stretch causing us to miss the 8.1 FF date.
If in doubt, prioritise making incremental progress that makes the FF date.

@mikecote
Contributor Author

mikecote commented Mar 3, 2022

We should split this into smaller deliverables. There may be an issue already for "How much time was spent in elasticsearch searches?".

@mikecote
Contributor Author

mikecote commented Mar 7, 2022

Removing "How much time was spent in elasticsearch searches?" (moved to #125967) and "How much event loop delay was occurring during a rule execution?" (moved to #124366).

@mikecote
Contributor Author

mikecote commented Mar 7, 2022

I will leave it up to whoever picks up the issue to decide whether to do both telemetry questions at once, or to split them up and create a separate issue for one of them.

@mikecote
Contributor Author

mikecote commented Mar 7, 2022

Might be worth using percentiles and maybe breaking down by rule type.
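
A rough sketch of what a per-rule-type percentile breakdown could look like, in TypeScript. The `ExecutionSample` shape, the `summarizeByRuleType` helper, the rule type ids, and the counts are all hypothetical and chosen for illustration; real telemetry would more likely be computed from the stored execution data than in memory like this.

```typescript
// One observation per rule execution. Field names are assumptions.
interface ExecutionSample {
  ruleTypeId: string;
  alertCount: number;
}

interface PercentileSummary {
  p50: number;
  p90: number;
  p99: number;
}

// Nearest-rank percentile on an already sorted, non-empty array.
function nearestRank(sorted: number[], p: number): number {
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Groups samples by rule type, then summarizes each group with percentiles.
function summarizeByRuleType(
  samples: ExecutionSample[]
): Record<string, PercentileSummary> {
  const grouped: Record<string, number[]> = {};
  for (const sample of samples) {
    if (!grouped[sample.ruleTypeId]) {
      grouped[sample.ruleTypeId] = [];
    }
    grouped[sample.ruleTypeId].push(sample.alertCount);
  }

  const summary: Record<string, PercentileSummary> = {};
  for (const ruleTypeId of Object.keys(grouped)) {
    const sorted = grouped[ruleTypeId].slice().sort((a, b) => a - b);
    summary[ruleTypeId] = {
      p50: nearestRank(sorted, 50),
      p90: nearestRank(sorted, 90),
      p99: nearestRank(sorted, 99),
    };
  }
  return summary;
}

// Made-up data: ten executions of one rule type, one execution of another.
const samples: ExecutionSample[] = [];
for (let count = 1; count <= 10; count++) {
  samples.push({ ruleTypeId: 'example.typeA', alertCount: count });
}
samples.push({ ruleTypeId: 'example.typeB', alertCount: 100 });

const byRuleType = summarizeByRuleType(samples);
console.log(JSON.stringify(byRuleType, null, 2));
```

Breaking the numbers down this way keeps a single noisy rule type from hiding the behavior of the others in one global percentile.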

6 participants