
Users of alerting are not notified when the framework is failing #75042

Closed
mikecote opened this issue Aug 14, 2020 · 16 comments · Fixed by #79056
Assignees
Labels
Feature:Actions, Feature:Alerting, Team:ResponseOps (label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@mikecote
Contributor

mikecote commented Aug 14, 2020

This is a simpler version of the meta alerts issue: we need a pragmatic solution in place to notify administrators when the alerting framework is failing.

Some scenarios:

  • The rate of failures on a connector increased over the past x time (or is consistently high?)
  • The actions saved object can't be decrypted

Some of the ideas bounced around in a team discussion:

  1. Create an always firing alert. Users can use this to send emails (ex: daily) and they'll know something is wrong when the email isn't sent.
  2. Building on top of the first point, the email sent could have a summary of failures, activity, etc.
  3. Have a configuration in kibana.yml (a pre-configured connector?) that the framework can use to communicate externally (ex: send emails on failures, etc.).
  4. A health API and a health status bar in the connectors management page (failures over the past x); a rough response shape is sketched after this list.
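
A rough sketch of what the health API in idea 4 could return; every field name below is an illustrative assumption, not an agreed-upon contract:

```ts
// Hypothetical response shape for an alerting health API (idea 4 above).
type HealthLevel = 'ok' | 'warn' | 'error';

interface AlertingFrameworkHealth {
  decryptionHealth: {
    level: HealthLevel; // can alert/action saved objects be decrypted?
    timestamp: string;  // when this check last ran
  };
  executionHealth: {
    level: HealthLevel;          // derived from recent execution failures
    failuresOverPastDay: number; // the "failures over past x" from the list above
    timestamp: string;
  };
}
```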
@mikecote added the Feature:Alerting, Feature:Actions, and Team:ResponseOps labels on Aug 14, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote changed the title from "Users of alerting are not notified when the framework is completely failing" to "Users of alerting are not notified when the framework is failing" on Aug 18, 2020
@mikecote
Contributor Author

From #75271

When the Alerting framework as a whole, or individual alerts, break, we have internal methods for notifying the user, but no method for notifying them outside of the system, which means alerting might break without users knowing it.

We should find a way to notify the user without requiring them to actively log in.
Ideally we'd be able to execute an Action that would notify them (by email presumably) but it's unclear how we could do that if the source of the breakage is something like "Decryption is broken", in which case, you can't decrypt the action that would be used to notify the user.

We could, perhaps, use something like the PreConfigured Actions which don't require decryption, but it's not clear how we'd support that on cloud.

@YulNaumenko
Contributor

@elastic/kibana-alerting-services Based on the team discussion, I want to summarize a few topics under this issue:

  1. We are planning to move forward with an approach that creates two separate services: one for updating the health state of the alerting framework, and a second one for exposing this state externally, so that other systems can execute it or schedule their own execution checks as needed. I assume we should use a SO to store this info along with the last update time.
  2. Which data should be analyzed to identify the health of the whole framework? Our suggestion is to fetch the alerting execution history from the event log and compare the percentages of successes and failures; if the failure rate is close to 100%, the framework is in an unhealthy state (a minimal sketch of this calculation follows this list). We can do the same analysis for action executions and return their own health state. We're waiting on feedback from other teams about what they expect to get from the alerting health for their specific needs.
  3. Which mechanism should be used to schedule a push service for framework health updates? The first solution that came up is using Task Manager.
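
As a rough illustration of the calculation in point 2, here is a minimal sketch, assuming we can already pull success/failure counts for recent executions out of the event log (the function name and thresholds are placeholders, not settled values):

```ts
// Minimal sketch of the failure-rate health check. Only the ratio logic
// reflects the proposal above; thresholds and names are placeholders.
type HealthStatus = 'ok' | 'warn' | 'error';

interface ExecutionCounts {
  success: number;
  failure: number;
}

function computeHealth({ success, failure }: ExecutionCounts): HealthStatus {
  const total = success + failure;
  if (total === 0) return 'ok'; // nothing executed yet; assume healthy
  const failureRate = failure / total;
  if (failureRate >= 0.99) return 'error'; // failures close to 100% => unhealthy
  if (failureRate >= 0.5) return 'warn';   // illustrative intermediate threshold
  return 'ok';
}
```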

@YulNaumenko
Contributor

Update from the Platform team: they are the new owners of the Kibana Status page, and today they are planning to merge a PR which allows us to register the Alerting and Actions health status via the core.status.set API. Once this PR merges, it will be exposed on that endpoint:

/**
* API for accessing status of Core and this plugin's dependencies as well as for customizing this plugin's status.
*
* @remarks
* By default, a plugin inherits its current status from the most severe status level of any Core services and any
* plugins that it depends on. This default status is available on the
* {@link ServiceStatusSetup.derivedStatus$ | core.status.derivedStatus$} API.
*
* Plugins may customize their status calculation by calling the {@link ServiceStatusSetup.set | core.status.set} API
* with an Observable. Within this Observable, a plugin may choose to only depend on the status of some of its
* dependencies, to ignore severe status levels of particular Core services they are not concerned with, or to make its
* status dependent on other external services.
*/

It's all a pretty fresh API and Alerting would be the first plugin to use it 🥇.
This update doesn't cancel any topic from the above summary; it just gives the Alerting framework a great place to display the health info for users.
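
For illustration, a minimal sketch of how the Alerting plugin could use this API, assuming a periodic placeholder health check (isAlertingHealthy below stands in for whatever check we settle on):

```ts
// Sketch only: registers a custom plugin status via core.status.set,
// re-evaluating a placeholder health check once a minute.
import { interval } from 'rxjs';
import { map } from 'rxjs/operators';
import { CoreSetup, Plugin, ServiceStatusLevels } from 'src/core/server';

// Placeholder for whichever health check we settle on (e.g. event log failure rate).
function isAlertingHealthy(): boolean {
  return true;
}

export class AlertingPlugin implements Plugin {
  public setup(core: CoreSetup) {
    core.status.set(
      interval(60_000).pipe(
        map(() =>
          isAlertingHealthy()
            ? { level: ServiceStatusLevels.available, summary: 'Alerting framework is healthy' }
            : { level: ServiceStatusLevels.degraded, summary: 'Alerting executions are failing' }
        )
      )
    );
  }

  public start() {}
}
```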

@FrankHassanabad
Contributor

cc @aarju and @randomuserid for more feedback or corrections.

At a high level we want:

  • Immediate alerts on any existing alerts not firing, or at-a-glance indicators on a dashboard that these alerts are not firing. Users (or we) can make the dashboards if we have access to the data.

  • Any alerts that have not fired even if they are reporting good health. "Dead person's switch" type of alert: if the alert does not report a status of success within X time of its duration then it is not performing (a staleness check along these lines is sketched after this list). This could be from NodeJS block times, bugs, etc... But any time users see an alert is not firing and they didn't know about it, they become distrustful of and upset with the framework.

  • Any actions that were not delivered, failed to be delivered, or returned errors even though the rule executed. "Dead person's switch" if possible here, but with things like emails you sometimes cannot get a delivery receipt.

  • Audit trails of who added, modified, activated, or even deleted the alerts, etc... This might fall outside of this ticket, but it's still something people will want in comprehensive monitoring of alerting systems.

  • Positive feedback loops on performance: query times of alerts, execution times of alerts and actions. The alerts framework can allow us as a solution to put arbitrary domain data and metrics into the alerts themselves, so we can add our own metrics of what we are doing within alerts. Eventually users will want to "measure" our alerts through marketing terms such as MTTD (mean time to detect) and MTTR (mean time to respond). They will want these metrics to push their MTTD down, and will ideally want action delivery times reported as well, or all of this information capable of being rolled up.
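
As mentioned in the second bullet, a staleness check along those lines could be as simple as the following sketch (the shape of the status record and the field names are assumptions):

```ts
// Sketch of a "dead person's switch" check: an alert is flagged when its last
// successful execution is older than its schedule interval plus a tolerance.
interface AlertExecutionStatus {
  alertId: string;
  intervalMs: number;   // configured schedule interval
  lastSuccessAt?: Date; // last successful execution, if any
  toleranceMs: number;  // e.g. 10s for high priority, 10m for low priority
}

function isStale(status: AlertExecutionStatus, now = new Date()): boolean {
  if (!status.lastSuccessAt) return true; // never succeeded => treat as stale
  const age = now.getTime() - status.lastSuccessAt.getTime();
  return age > status.intervalMs + status.toleranceMs;
}
```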


Degraded states, such as some rules firing only sometimes rather than being 100% operational all the time, are a very bad thing in the current alerting implementations in solution spaces. From the sounds of things, even one alert not firing correctly should be bubbled up to the level of a SOC Manager, as they need alerts working all the time.

It sounds like SOC Managers want to be notified immediately of most "alert not firing" or "operational error" situations. They will track the occurrences and the whys, but for the most part they want to tune alerts to be operationally efficient all the time and to feel confident that the alerts work "all the time" unless something unexpected happens, in which case they will need to intervene.

Outside of that persona, I think the performance of rules can and will fall to the alerting-specific people, rather than that manager, to improve metrics like MTTD/MTTR.

Also, with the way we do backtracking, we will be "ok" when rules fall behind schedule up to a point, but once our second circuit breaker trips and we have "gaps" where we could have missed alerts/signals, we want some type of error state dire enough to be sent to the SOC Manager immediately.

For reference, the current metrics on the dashboard we guide users towards for both "tuning" and SOC Management roles at the moment:
[dashboard screenshot, 2020-09-17]

@aarju

aarju commented Sep 18, 2020

A lot of the requirements organizations have for logging and alerting are driven by regulatory and compliance efforts such as SOC2, PCI, HIPAA, etc. In many of these regulations you need to be able to be 'provably' secure and show that you do not have any large gaps in your logging and alerting.

To satisfy our SOC2 and other regulatory requirements we currently use a watcher that runs hourly and alerts us when there is a drop in *siem* events being written to the .kibana-event* index. We went with a simple watcher that lets us know if the number of events in the last hour drops below a threshold. Each cluster has a different average number of alert executions per hour, so we have to set the alerting threshold at a custom level on each cluster. There is likely a better way to programmatically detect a drop in the number of alerts that have run. This will alert us if there is a large problem with the alerting rules, but it doesn't help if only one rule is failing to run. For that I like the idea of looking at each enabled rule's scheduled period and seeing whether the time since the last run is longer than that (with some tolerance). Maybe that could be something users set in the alerting interface. For example, when I create a new high-priority rule I can choose to get a Slack message if it is ever later than 10s, but for a lower-priority rule I only want to be notified if it is later than 10m.
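
For illustration, a watcher along these lines could be created through the Elasticsearch JS client roughly as follows; the index pattern, filters, threshold, and the Slack action are placeholders that would need per-cluster tuning:

```ts
// Sketch of the hourly "execution volume" watcher described above.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function createAlertingVolumeWatch() {
  await client.watcher.putWatch({
    id: 'kibana-alerting-execution-volume',
    body: {
      trigger: { schedule: { interval: '1h' } },
      input: {
        search: {
          request: {
            indices: ['.kibana-event-log-*'],
            body: {
              size: 0,
              query: {
                bool: {
                  filter: [
                    { term: { 'event.provider': 'alerting' } },
                    { term: { 'event.action': 'execute' } },
                    { range: { '@timestamp': { gte: 'now-1h' } } },
                  ],
                },
              },
            },
          },
        },
      },
      // Fire when executions in the last hour drop below the per-cluster threshold.
      condition: { compare: { 'ctx.payload.hits.total': { lt: 100 } } },
      actions: {
        notify_slack: {
          slack: {
            message: { text: 'Kibana alerting execution volume dropped below threshold' },
          },
        },
      },
    },
  });
}

createAlertingVolumeWatch().catch(console.error);
```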

My personal preference for this is to use Watchers to monitor Kibana alerting, since they operate in ES and not in Kibana. If our Kibana nodes are broken and the alerts are not firing, the watchers will still alert us. We could probably include capabilities to configure watchers with email, Slack, and PagerDuty alerting directly from the Kibana Alerting interface to alert when alerting is failing.

Another advantage of this is that watchers are part of x-pack and not included in the basic license, so these capabilities would be a good feature for the sales team to point to as a reason why an enterprise looking to achieve certification should purchase a higher license level or move to the cloud. They could be greyed out in the basic version with a button to offer a 14-day free trial, like we do with ML jobs.

@aarju

aarju commented Sep 18, 2020

The auditing of any changes to alerts or exception lists is also a concern of multiple regulatory frameworks. Auditors and SOC managers often want to know the who, what, when, and why of any rule changes. For example, we currently keep a copy of all of our rules and exceptions in GitHub to track who made the changes and when, but that doesn't help in the case of a rogue SOC analyst making changes to the alerting without documenting it. If a workstation is filtered from all new malware alerts there needs to be an 'immutable' record of who made those changes and when.

@mikecote
Contributor Author

@YulNaumenko, @bmcconaghy mentioned we could use the HTTP 207 status whenever there are checks that are unhealthy but don't warrant an overall unhealthy status.

@gmmorris
Contributor

@YulNaumenko, @bmcconaghy mentioned we could use the HTTP 207 status whenever there are checks that are unhealthy but don't warrant an overall unhealthy status.

I have to admit I've never seen an HTTP 207 in the wild... glad to finally come across a valid use case 😃

@YulNaumenko
Contributor

As a basic approach we will use information about encryption/decryption failures; this should be enough for the first implementation, and later we can add more information about framework failures. Thanks all for the great thoughts and proposals! We will definitely include all of them in our road map as part of the Meta Alert.

@pmuellr
Member

pmuellr commented Sep 28, 2020

I try to avoid using unusual HTTP error codes, like 207. Occasionally, proxies/gateways will run into issues with them, because ... they're unusual. And sometimes client libraries don't quite know how to handle these either, and do the wrong thing. They will also show up as outliers when looking at http status aggregation reports from the server logs - then everyone's going to wonder, "what's a 207!?!?".

My preference would be to always return a 200, with individual status values in the response, maybe some kind of "overall" status as well. Leave all non-200 responses to the usual problems - auth, validation, etc.
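
A minimal sketch of what that could look like in a Kibana route handler; the individual checks are placeholders, and the only point is the "always 200, statuses in the body" shape:

```ts
// Sketch: a health route that always responds 200 and reports individual
// check results plus an overall roll-up in the body. Checks are placeholders.
import { IRouter } from 'src/core/server';

type Level = 'ok' | 'degraded' | 'unavailable';

export function registerHealthRoute(router: IRouter) {
  router.get(
    { path: '/api/alerts/_health', validate: false },
    async (context, request, response) => {
      // Placeholder checks; real ones would look at the event log, encryption, etc.
      const checks: Record<string, Level> = {
        decryption: 'ok',
        rule_executions: 'ok',
        connector_executions: 'degraded',
      };
      const levels = Object.values(checks);
      const overall: Level = levels.includes('unavailable')
        ? 'unavailable'
        : levels.includes('degraded')
        ? 'degraded'
        : 'ok';

      // Always a 200; non-200 responses stay reserved for auth, validation, etc.
      return response.ok({ body: { overall, checks } });
    }
  );
}
```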

@mikecote
Contributor Author

mikecote commented Oct 2, 2020

As a basic approach we will use information about encryption/decryption failures; this should be enough for the first implementation, and later we can add more information about framework failures. Thanks all for the great thoughts and proposals! We will definitely include all of them in our road map as part of the Meta Alert.

I was chatting with @gmmorris and wondering if we should implement something for connector failures? A basic metric could be a 100% failure rate based on the past day or the past x executions. This should capture decryption failures of actions as well as any configuration that is no longer working.
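
One way to get at that metric: a sketch of an aggregation over the event log for action executions in the past day, assuming the event.provider / event.action / event.outcome fields the event log writes (the query details are illustrative, not a committed design):

```ts
// Sketch: compute the connector failure rate over the past day from the event log.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function connectorFailureRatePastDay(): Promise<number> {
  const { body } = await client.search({
    index: '.kibana-event-log-*',
    size: 0,
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'event.provider': 'actions' } },
            { term: { 'event.action': 'execute' } },
            { range: { '@timestamp': { gte: 'now-1d' } } },
          ],
        },
      },
      aggs: { outcomes: { terms: { field: 'event.outcome' } } },
    },
  });

  const buckets: Array<{ key: string; doc_count: number }> =
    body.aggregations.outcomes.buckets;
  const total = buckets.reduce((sum, b) => sum + b.doc_count, 0);
  const failures = buckets.find((b) => b.key === 'failure')?.doc_count ?? 0;
  return total === 0 ? 0 : failures / total; // 1.0 means every execution failed
}
```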

@pmuellr
Member

pmuellr commented Oct 2, 2020

wondering if we should implement something for connector failures?

YES, since connector failures are much more common than alert failures, as far as I can tell.

These are tough though. The failures really need to be tracked back to their source - today that's just alerts' action invocations and direct invocations via HTTP. The HTTP ones we don't need to worry about, as the error is returned in the response.

For the alerts one, or any future action invoked on behalf of another SO, we'd like to only make the errors a connector has available to those users who can see the alert. E.g., we can't just let "anyone" view the action events in the event log - they could see an action invocation from an alert they aren't authorized to see.

We have enough stuff set up in the event log to make this work, but we will need to thread the alert SO id through the action execution, so when the action writes its event, it can reference the alert SO. Then we can provide a way to view all the action invocations across all the alerts the user is authorized to read.

That "threading" wasn't there when the event log code was first written, and it wasn't clear exactly how to do it, and wasn't a huge priority at the time so I deferred doing anything, assuming we'd get more requirements on issues like this.

I think that "threading" work has already started - maybe as part of RBAC? I remember seeing a PR that passed an alert SO into an action execution somehow, which is probably most or all of what we need to finish getting the core plumbing to work.

Likely, we'll need more elaborate event log APIs - eg, one that can take multiple SO references and return events for all of them at the same time. There are already some issues open for that.

We probably want to create a meta-issue for "provide action status from the alerts UI" - we can start to noodle on a design, collect all the currently blocking issues, add new ones, etc.

@mikecote
Contributor Author

mikecote commented Oct 2, 2020

Definitely +1 on starting a meta issue / discussion around this. There was some chat about this a while ago when the event log was being designed (saved object array vs. single object). I know there was discussion on how we "link" these events together: parent id, child id, bi-directional, event uuids, etc. I'm sure there are pros and cons to each approach and it would be good to see them written down somewhere! 🙂

@pmuellr
Member

pmuellr commented Oct 2, 2020

Ya, I think we're using nested objects for this, and that we already do write both the action and alert SO to an event doc when the action is scheduled by the alert. We just haven't done that when the action is actually executed.
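
For illustration, an event doc along those lines might reference both saved objects in a nested array like this; the exact field layout is an assumption based on the event log's existing saved_objects references, not the current implementation:

```ts
// Hypothetical event log document written when an action executes, referencing
// both the action SO and the alert SO that scheduled it.
const actionExecuteEvent = {
  '@timestamp': '2020-10-02T12:00:00.000Z',
  event: { provider: 'actions', action: 'execute', outcome: 'success' },
  kibana: {
    saved_objects: [
      { rel: 'primary', type: 'action', id: 'action-so-id' },
      // The alert threaded through the action execution:
      { type: 'alert', id: 'alert-so-id' },
    ],
  },
  message: 'action executed: .email:action-so-id',
};
```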

@YulNaumenko
Contributor

cc @aarju and @randomuserid for more feedback or corrections.

At a high level we want: […]

Move this to the meta alerts issue

@kobelb added the needs-team label on Jan 31, 2022
@botelastic bot removed the needs-team label on Jan 31, 2022