[ResponseOps] retry rule runs when retryable errors occur #138124
Labels
Feature:Alerting/RulesFramework
Issues related to the Alerting Rules Framework
R&D
Research and development ticket (not meant to produce code, but to make a decision)
resilience
Issues related to Platform resilience in terms of scale, performance & backwards compatibility
Team:ResponseOps
Label for the ResponseOps team (formerly the Cases and Alerting teams)
Currently, when a rule executor throws an error, we don't retry the execution. However, there are cases where retrying would probably be worth enabling. An example is a transient networking error such as a socket hang up. If we know the rule is "safe" to run again and we detect a transient error, a retry after a short delay has a good chance of executing successfully.
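To make the idea concrete, here is a rough sketch of what a retry wrapper around an executor might look like. Everything in it is an assumption for discussion, not existing framework code: the names `isTransientError`, `runWithRetry`, and `RetryOptions`, the `safeToRetry` flag, and the list of error codes counted as transient are all hypothetical.

```typescript
// Hypothetical set of error codes treated as transient; the real list
// would be one of the decisions this issue needs to make.
const TRANSIENT_CODES = new Set(['ECONNRESET', 'ETIMEDOUT', 'EPIPE']);

// Returns true when the error looks like a transient networking failure.
function isTransientError(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  const code = (err as Error & { code?: string }).code;
  if (code !== undefined && TRANSIENT_CODES.has(code)) return true;
  // Node's http client reports a dropped connection as "socket hang up".
  return /socket hang up/i.test(err.message);
}

// Hypothetical options; `safeToRetry` stands in for however we end up
// deciding that a rule is "safe" to run again.
interface RetryOptions {
  safeToRetry: boolean;
  maxAttempts: number;
  delayMs: number;
}

// Runs the executor and retries after a short delay, but only when the rule
// is marked safe, the error looks transient, and attempts remain.
async function runWithRetry<T>(
  executor: () => Promise<T>,
  opts: RetryOptions
): Promise<T> {
  let attempt = 0;
  while (true) {
    attempt++;
    try {
      return await executor();
    } catch (err) {
      const canRetry =
        opts.safeToRetry && attempt < opts.maxAttempts && isTransientError(err);
      if (!canRetry) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, opts.delayMs));
    }
  }
}
```

In this sketch a non-transient error, or any error from a rule not marked safe, is rethrown immediately, so the current behavior is unchanged unless a rule opts in.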
Lots of things to consider:
This probably also applies to the ES index connector, but it's harder there: by definition it is writing data, and we don't want to write it twice! We could also instrument the ES index connector itself for retry support, and not have to worry about it at the "connector level".