Users of alerting are not notified when the framework is failing #75042
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
From #75271
@elastic/kibana-alerting-services Based on the team discussion, I want to summarize a few topics under this issue:
Update from the Platform team: they are the new owners of the Kibana Status page, and today they are planning to merge a PR which allows us to register Alerting and Actions health status via kibana/src/core/server/status/types.ts (lines 118 to 129 in 043ef5e).
This update doesn't cancel any topic from the above summary; it just gives the Alerting framework a good place to display health info for users.
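For illustration, here is a minimal sketch of what registering alerting health through that core status service could look like, assuming the `status.set` API introduced by that PR. The polling interval, the health check, and the import path are hypothetical placeholders, not the actual alerting plugin code:

```ts
import { interval, Observable } from 'rxjs';
import { map, switchMap } from 'rxjs/operators';
// import path may differ depending on how the plugin consumes core types
import { CoreSetup, ServiceStatus, ServiceStatusLevels } from 'src/core/server';

// isHealthy is a hypothetical check, e.g. "did recent rule runs / decryptions succeed?"
export function registerAlertingStatus(
  core: CoreSetup,
  isHealthy: () => Promise<boolean>
) {
  const status$: Observable<ServiceStatus> = interval(60_000).pipe(
    switchMap(() => isHealthy()),
    map(
      (healthy): ServiceStatus =>
        healthy
          ? {
              level: ServiceStatusLevels.available,
              summary: 'Alerting framework is operational',
            }
          : {
              level: ServiceStatusLevels.degraded,
              summary: 'Some alerting rules are failing (e.g. decryption errors)',
            }
    )
  );

  // Register the plugin's own status with core so it rolls up into the
  // Kibana status page / status API alongside other plugins.
  core.status.set(status$);
}
```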
cc @aarju and @randomuserid for more feedback or corrections. At a high level we want:
Degraded states, such as some rules firing only some of the time rather than 100% of the time, are a very bad thing in the current alerting implementations in solution spaces. Even one alert not firing correctly should be bubbled up to the level of a SOC Manager, since they need alerts working all the time. SOC Managers want to be notified immediately of most "alert not firing" or operational errors; they will track the occurrences and the reasons why, but for the most part they want rules tuned to be operationally efficient and want to feel confident that alerts work all the time unless something unexpected happens, in which case they will intervene. Outside of that persona, the performance of rules can and will fall to the alerting-specific people, who work on improving metrics like MTTD/MTTR.

Also, because of the way we do backtracking, we will be "ok" when rules fall behind schedule up to a point, but once our second circuit breaker trips and we have gaps where we could have missed alerts/signals, we want some type of error state dire enough to be sent to the SOC Manager immediately. For reference, here are the current metrics on our dashboard, which we guide users towards for both tuning and SOC Management roles:
A lot of the requirements organizations have for logging and alerting are driven by regulatory and compliance efforts such as SOC2, PCI, HIPAA, etc. Under many of these regulations you need to be 'provably' secure and show that you do not have any large gaps in your logging and alerting. To satisfy our SOC2 and other regulatory requirements, we currently use a watcher that runs hourly and alerts us when there is a drop in

My personal preference for this is to use Watchers to monitor Kibana alerting, since they run in ES and not in Kibana. If our Kibana nodes are broken and the alerts are not firing, the watchers will still alert us. We could probably include capabilities to configure watchers with email, Slack, and PagerDuty alerting directly from the Kibana Alerting interface, to alert when alerting itself is failing. Another advantage is that watchers are part of x-pack and not included in the basic license, so these capabilities would be a good feature for the sales team to point to as a reason why an enterprise looking to achieve certification should purchase a higher license level or move to the cloud. They could be greyed out in the basic version with a button to offer a 14-day free trial, like we do with ML jobs.
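For reference, here is a rough sketch of the kind of watcher described above, written with the Elasticsearch JS client. The index pattern, event log fields, schedule, threshold, and recipient are illustrative assumptions, not the actual watch in use:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://elasticsearch:9200' });

export async function installAlertingHeartbeatWatch() {
  await client.watcher.putWatch({
    id: 'kibana-alerting-heartbeat',
    body: {
      // run once an hour
      trigger: { schedule: { interval: '1h' } },
      // count rule executions recorded in the event log over the last hour
      input: {
        search: {
          request: {
            indices: ['.kibana-event-log-*'],
            body: {
              size: 0,
              query: {
                bool: {
                  filter: [
                    { term: { 'event.provider': 'alerting' } },
                    { term: { 'event.action': 'execute' } },
                    { range: { '@timestamp': { gte: 'now-1h' } } },
                  ],
                },
              },
            },
          },
        },
      },
      // fire when rule executions stop showing up entirely, i.e. a drop to zero
      condition: { compare: { 'ctx.payload.hits.total': { lte: 0 } } },
      actions: {
        notify_soc: {
          email: {
            to: ['soc-manager@example.com'],
            subject: 'Kibana alerting appears to be down',
            body: 'No rule executions were recorded in the last hour.',
          },
        },
      },
    },
  });
}
```

Because the watch runs inside Elasticsearch, it keeps firing even when Kibana itself is unavailable, which is the main point of the suggestion above.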
The auditing of any changes to alerts or exception lists is also a concern of multiple regulatory frameworks. Auditors and SOC managers often want to know the who, what, when, and why of any rule changes. For example, we currently keep a copy of all of our rules and exceptions in GitHub to track who made changes and when, but that doesn't help in the case of a rogue SOC analyst making changes to the alerting without documenting it. If a workstation is filtered from all new malware alerts, there needs to be an 'immutable' record of who made those changes and when.
@YulNaumenko, @bmcconaghy mentioned we could use the HTTP 207 status whenever there are some checks that are unhealthy but don't warrant an overall unhealthy status.
I have to admit I've never seen an HTTP 207 in the wild... glad to finally come across a valid use case 😃
As a basic approach we will use information about encryption/decryption failures; that should be enough for the first implementation, and later we can come up with more information about framework failures. Thanks all for the great thoughts and proposals! We will definitely include all of them in our road map as part of the Meta Alert issue.
I try to avoid using unusual HTTP status codes, like 207. Occasionally, proxies/gateways will run into issues with them, because ... they're unusual. And sometimes client libraries don't quite know how to handle them either, and do the wrong thing. They will also show up as outliers when looking at HTTP status aggregation reports from the server logs, and then everyone's going to wonder, "what's a 207!?". My preference would be to always return a 200, with individual status values in the response, and maybe some kind of "overall" status as well. Leave all non-200 responses to the usual problems - auth, validation, etc.
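A rough sketch of that approach as a Kibana HTTP route follows. The path, check names, and levels are made up for illustration; the point is that the endpoint always responds 200 and puts per-check detail plus an overall status in the body:

```ts
import { CoreSetup } from 'src/core/server';

// Hypothetical check results; names and levels are illustrative.
interface HealthCheck {
  name: string;
  level: 'ok' | 'warn' | 'error';
  message?: string;
}

export function registerHealthRoute(
  core: CoreSetup,
  runChecks: () => Promise<HealthCheck[]>
) {
  const router = core.http.createRouter();

  router.get(
    // hypothetical path, not the real alerting health endpoint
    { path: '/internal/alerting/example_health', validate: false },
    async (context, request, response) => {
      const checks = await runChecks();
      const overall = checks.some((c) => c.level === 'error')
        ? 'error'
        : checks.some((c) => c.level === 'warn')
        ? 'warn'
        : 'ok';

      // Always a 200: partial failures live in the body instead of an unusual
      // status code like 207; non-200s stay reserved for auth, validation, etc.
      return response.ok({ body: { overall, checks } });
    }
  );
}
```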
I was chatting with @gmmorris and wondering if we should implement something for connector failures? A basic metric could be a 100% failure rate over the past day or the past x executions. This should capture decryption failures of actions as well as any configuration that is no longer working.
YES, since connector failures are much more common than alert failures, as far as I can tell. These are tough though. The failures really need to be tracked back to their source - today that's just alerts' action invocations and direct invocations via HTTP. The HTTP ones we don't need to worry about, since the error is returned in the response.

For the alerts case, or any future action invoked on behalf of another SO, we'd like to make a connector's errors available only to users who can see the alert. E.g., we can't just let "anyone" view the action events in the event log; they could see an action invocation from an alert they aren't authorized to see. We have enough set up in the event log to make this work, but we will need to thread the alert SO id through the action execution, so when the action writes its event, it can reference the alert SO. Then we can provide a way to view all the action invocations across all the alerts the user is authorized to read. That "threading" wasn't there when the event log code was first written; it wasn't clear exactly how to do it and wasn't a huge priority at the time, so I deferred doing anything, assuming we'd get more requirements from issues like this. I think that "threading" work has already started - maybe as part of RBAC? I remember seeing a PR that passed an alert SO into an action execution somehow, which is probably most or all of what we need to finish getting the core plumbing to work.

Likely, we'll need more elaborate event log APIs - e.g., one that can take multiple SO references and return events for all of them at the same time. There are already some issues open for that. We probably want to create a meta-issue for "provide action status from the alerts UI" - we can start to noodle on a design, collect all the currently blocking issues, add new ones, etc.
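As a starting point for that noodling, here is a rough sketch of the "100% failure rate over the past day" metric proposed above, assuming the event log's event.provider/event.action/event.outcome fields; the index pattern and time window are illustrative:

```ts
import { Client } from '@elastic/elasticsearch';

// Index pattern and event log fields are assumptions based on the event log
// plugin; adjust to whatever the framework actually records.
const EVENT_LOG_INDEX = '.kibana-event-log-*';

function executionsQuery(extraFilters: object[] = []) {
  return {
    bool: {
      filter: [
        { term: { 'event.provider': 'actions' } },
        { term: { 'event.action': 'execute' } },
        { range: { '@timestamp': { gte: 'now-1d' } } },
        ...extraFilters,
      ],
    },
  };
}

// Fraction of action (connector) executions that failed over the past day.
// A sustained value of 1.0 is the "100% failure rate" signal discussed above.
export async function actionFailureRate(client: Client): Promise<number> {
  const [total, failed] = await Promise.all([
    client.count({ index: EVENT_LOG_INDEX, body: { query: executionsQuery() } }),
    client.count({
      index: EVENT_LOG_INDEX,
      body: { query: executionsQuery([{ term: { 'event.outcome': 'failure' } }]) },
    }),
  ]);
  return total.body.count > 0 ? failed.body.count / total.body.count : 0;
}
```

Per-connector breakdowns and the authorization filtering described above would come later, once the alert SO is threaded through the action execution.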
Definitely +1 on starting a meta issue / discussion around this. There was some chat around this a while ago when the event log was being designed (saved object array vs. single object). I know there was discussion on how we "link" these events together: parent id, child id, bi-directional, event uuids, etc. I'm sure there are pros and cons to each approach, and it would be good to see them written down somewhere! 🙂
Ya, I think we're using nested objects for this, and we already write both the action and alert SO to an event doc when the action is scheduled by the alert. We just haven't done that when the action is actually executed.
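For illustration, this is roughly the doc shape being described; field names follow the event log mappings, but the values and the exact saved_objects entries are assumptions:

```ts
// Illustrative event log doc for an action execution that also references the
// alert that scheduled it (the piece that is missing today).
const exampleActionExecuteEvent = {
  '@timestamp': '2020-08-14T12:00:00.000Z',
  event: { provider: 'actions', action: 'execute', outcome: 'failure' },
  error: { message: 'connector request failed' },
  kibana: {
    saved_objects: [
      { type: 'action', id: 'abc123' }, // the connector that executed
      { type: 'alert', id: 'def456' },  // the alert that scheduled it
    ],
  },
  message: 'action executed: .slack:abc123',
};
```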
Move this to the meta alerts issue
As a simpler version of the meta alerts issue, we need a pragmatic solution in place to indicate to administrators when the alerting framework is failing.
Some scenarios:
Some of the ideas bounced around in a team discussion: