[Response Ops] RuleDataClient initialization fails if any alerts indices are snapshots #139969

Closed
marshallmain opened this issue Sep 1, 2022 · 2 comments · Fixed by #140778
Labels: impact:high (addressing this issue will have a high level of impact on the quality/strength of our product), sdh-linked, Team:ResponseOps (label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments


marshallmain commented Sep 1, 2022

The first time a rule runs for a namespace and attempts to write an alert, the RuleDataClient creates the index template for that namespace and tries to apply the mappings from that template to any existing alerts indices for the namespace. Since the template we create references component templates instead of explicitly specifying all the mappings, we gather the names of the existing indices and simulate the mappings that would be applied to those index names after installing the new template.

However, if alerts indices have moved to snapshots via ILM, then the name that comes back when we fetch the existing indices will have either restored- or partial- as a prefix. When we pass these index names to the simulateIndexTemplate API, the prefix causes the name not to match the installed index template, and the mappings come back empty. The empty mappings are then passed to putMapping, which fails and throws an error, and that error disables writing new alerts.
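The mismatch can be illustrated with a small sketch. It assumes the template's index pattern is the same wildcard pattern used in the request below, and it uses a hypothetical glob helper (patternToRegExp) rather than Elasticsearch's own matching code. The -000007 index name is made up for the example.

```typescript
// Hypothetical helper: turns an index pattern with `*` wildcards into a RegExp.
// This only approximates how an index template pattern is matched.
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

function matchesTemplate(indexName: string, indexPattern: string): boolean {
  return patternToRegExp(indexPattern).test(indexName);
}

// Assumed pattern, taken from the request shown in this issue.
const indexPattern = ".internal.alerts-observability.metrics.alerts-default-*";
// Hypothetical live index name.
const liveIndex = ".internal.alerts-observability.metrics.alerts-default-000007";
// Snapshot-backed index name as reported in this issue.
const snapshotIndex = "partial-.internal.alerts-observability.metrics.alerts-default-000006";

console.log(matchesTemplate(liveIndex, indexPattern)); // true
console.log(matchesTemplate(snapshotIndex, indexPattern)); // false
```

Because the pattern is anchored at the start of the name, any prefix added in front of .internal is enough to make the name fall outside the template.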

Example of the unexpected name behavior: the request below should restrict the response to index names that match .internal.alerts-observability.metrics.alerts-default-*, but it also returns partial-.internal.alerts-observability.metrics.alerts-default-000006:

GET .internal.alerts-observability.metrics.alerts-default-*/_alias/.alerts-observability.metrics.alerts-default

"partial-.internal.alerts-observability.metrics.alerts-default-000006" : {
    "aliases" : {
      ".alerts-observability.metrics.alerts-default" : { },
      ".internal.alerts-observability.metrics.alerts-default-000006" : { }
    }
 },

The alerts ILM policy that ships with Kibana keeps alerts indices in the hot phase indefinitely; however, in some customer systems the ILM policies have been modified to move alerts to snapshots. This seems to work for customers until Kibana restarts and the RuleDataClient has to re-initialize (e.g. when they upgrade stack versions), at which point initialization fails and it appears that the upgrade broke their system.

Possible Fixes

While we don't support users making changes to the built-in alerts ILM policy, a delayed failure that results in alerts not being written while the problem is debugged is a particularly bad failure mode. RuleDataClient initialization should not fail even in the presence of snapshotted alerts indices.

We don't really need to apply new mappings to old indices at the moment, since we don't have any runtime mappings that would actually affect indices that aren't the write index. So we could restrict the mapping update logic to the write index, which should never be a snapshot (hopefully?).
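A rough sketch of the write-index option, assuming the get-alias response marks the write index with is_write_index: true. The helper name findWriteIndices and the sample response (including the -000007 index) are hypothetical and only illustrate the filtering step.

```typescript
// Shape of a get-alias response, reduced to the fields used here.
type AliasResponse = Record<
  string,
  { aliases: Record<string, { is_write_index?: boolean }> }
>;

// Hypothetical helper: returns the indices flagged as the write index for `alias`.
function findWriteIndices(response: AliasResponse, alias: string): string[] {
  return Object.entries(response)
    .filter(([, info]) => info.aliases[alias]?.is_write_index === true)
    .map(([name]) => name);
}

const alertsAlias = ".alerts-observability.metrics.alerts-default";

// Made-up response: one snapshot-backed index and one live write index.
const aliasResponse: AliasResponse = {
  "partial-.internal.alerts-observability.metrics.alerts-default-000006": {
    aliases: {
      [alertsAlias]: {},
      ".internal.alerts-observability.metrics.alerts-default-000006": {},
    },
  },
  ".internal.alerts-observability.metrics.alerts-default-000007": {
    aliases: { [alertsAlias]: { is_write_index: true } },
  },
};

// Only the -000007 index is selected for the mapping update.
console.log(findWriteIndices(aliasResponse, alertsAlias));
```

With this filter the partial- index never reaches the simulate or putMapping steps, at the cost of leaving older indices on their existing mappings.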

We could also try simulating the mappings only once using an index name we know will match the template rather than simulating each concrete index and having some of them fail. Then we could apply the mappings to every concrete index.
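The simulate-once option could look roughly like the sketch below. MappingClient is a hypothetical minimal interface standing in for the real Elasticsearch client, and the -000000 name is an assumed placeholder for "a name we know will match the template". Whether putMapping actually succeeds against a snapshot-backed index is not something this sketch establishes.

```typescript
type Mappings = Record<string, unknown>;

// Hypothetical minimal client; not the real Elasticsearch client API.
interface MappingClient {
  simulateMappings(indexName: string): Promise<Mappings | undefined>;
  putMapping(indexName: string, mappings: Mappings): Promise<void>;
}

// Simulates once with a name known to match, then applies the result everywhere.
async function applyTemplateMappings(
  client: MappingClient,
  knownMatchingName: string,
  concreteIndices: string[]
): Promise<string[]> {
  const mappings = await client.simulateMappings(knownMatchingName);
  if (mappings === undefined || Object.keys(mappings).length === 0) {
    throw new Error(`No mappings simulated for ${knownMatchingName}`);
  }
  const updated: string[] = [];
  for (const index of concreteIndices) {
    await client.putMapping(index, mappings);
    updated.push(index);
  }
  return updated;
}

// Fake client for demonstration: prefixed names simulate to empty mappings.
const putCalls: string[] = [];
const fakeClient: MappingClient = {
  simulateMappings: async (name) =>
    name.indexOf("partial-") === 0 ? {} : { properties: { "@timestamp": { type: "date" } } },
  putMapping: async (name) => {
    putCalls.push(name);
  },
};

const done = applyTemplateMappings(
  fakeClient,
  ".internal.alerts-observability.metrics.alerts-default-000000", // assumed name
  [
    "partial-.internal.alerts-observability.metrics.alerts-default-000006",
    ".internal.alerts-observability.metrics.alerts-default-000007",
  ]
);
done.then((updated) => console.log(updated));
```

The prefixed name is never passed to the simulate call, so the empty-mappings case from the per-index approach does not arise here.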

marshallmain added the impact:high, Team:ResponseOps, and sdh-linked labels Sep 1, 2022
@elasticmachine

Pinging @elastic/response-ops (Team:ResponseOps)


kobelb commented Sep 14, 2022

If users have modified their alerts ILM policy to move alerts into the cold or frozen tiers, not being able to create new alerts is only the tip of the iceberg. The system was designed with the assumption that alerts would always be in the hot tier, so various functionality will behave erratically if alerts are moved out of this tier.

The .alerts-ilm-policy should be modified to set _meta.managed: true to cause the following warning to be displayed in the ILM policy management UI and minimize the likelihood of this occurring in the future:

[Screenshot, 2022-09-14: warning shown in the ILM policy management UI]
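For illustration, a policy body with the flag set might look like the sketch below. Only _meta.managed comes from this discussion; the rollover thresholds are placeholder values and not necessarily what Kibana ships.

```typescript
// Illustrative ILM policy body for .alerts-ilm-policy.
// Assumption: the rollover conditions below are placeholders.
const alertsIlmPolicy = {
  policy: {
    // Flags the policy as managed so the UI can warn before edits.
    _meta: { managed: true },
    phases: {
      // Alerts stay in the hot phase; no cold or frozen phase is defined.
      hot: {
        actions: {
          rollover: { max_age: "90d", max_size: "50gb" },
        },
      },
    },
  },
};

console.log(JSON.stringify(alertsIlmPolicy, null, 2));
```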

With regard to what we should do with users who have already modified their ILM policy, I don't know what the best course of action is here. We need to prompt our users that something is awry and have them fix it. While we don't need to modify the mappings of the old alerts indices at the moment, we might want to in the future, so just allowing this problem to remain un-remedied and making everything look like it's working correctly will just be kicking the can down the road.

@pmuellr pmuellr moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Sep 15, 2022
pmuellr added a commit that referenced this issue Sep 20, 2022
…0778)

resolves #139969

Changes the ResourceInstaller to ignore cases when the elasticsearch 
simulateIndexTemplate() API returns an error or empty mappings, logging an 
error instead. This will hopefully allow initialization to continue to set 
up the alerts-as-data indices and backing resources for future indexing.

Also adds _meta: { managed: true } to the ILM policy, which should show a
warning in Kibana UX when attempting to make changes to the policy. Such
changes were the reason simulateIndexTemplate() could return empty mappings.
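
The tolerant behavior described in this commit message can be sketched as below. This is not the actual ResourceInstaller code: SimulationResult, usableMappings, and the logError callback are hypothetical names used to show the "log and skip" decision for an error or an empty result.

```typescript
type Mappings = Record<string, unknown>;

// Hypothetical wrapper for the outcome of a simulate call.
type SimulationResult =
  | { status: "ok"; mappings: Mappings | undefined }
  | { status: "error"; error: Error };

// Returns mappings that are safe to apply, or null when the index should be skipped.
// Errors and empty mappings are logged instead of thrown, so initialization can continue.
function usableMappings(
  indexName: string,
  result: SimulationResult,
  logError: (message: string) => void
): Mappings | null {
  if (result.status === "error") {
    logError(`Ignored error simulating mappings for ${indexName}: ${result.error.message}`);
    return null;
  }
  if (result.mappings === undefined || Object.keys(result.mappings).length === 0) {
    logError(`Ignored empty mappings simulated for ${indexName}`);
    return null;
  }
  return result.mappings;
}

const logged: string[] = [];
const skipped = usableMappings(
  "partial-.internal.alerts-observability.metrics.alerts-default-000006",
  { status: "ok", mappings: {} },
  (message) => logged.push(message)
);
console.log(skipped); // null
console.log(logged.length); // 1
```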
Repository owner moved this from In Review to Done in AppEx: ResponseOps - Execution & Connectors Sep 20, 2022
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Sep 20, 2022
…stic#140778)

(cherry picked from commit 01daf31)
kibanamachine added a commit that referenced this issue Sep 20, 2022
…0778) (#141058)

(cherry picked from commit 01daf31)

Co-authored-by: Patrick Mueller <patrick.mueller@elastic.co>
pmuellr added a commit to pmuellr/kibana that referenced this issue Sep 20, 2022
…stic#140778)

(cherry picked from commit 01daf31)

# Conflicts:
#	x-pack/plugins/rule_registry/server/rule_data_plugin_service/resource_installer.test.ts
#	x-pack/plugins/rule_registry/server/rule_data_plugin_service/resource_installer.ts
pmuellr added a commit that referenced this issue Sep 20, 2022
…0778) (#141097)

(cherry picked from commit 01daf31)

# Conflicts:
#	x-pack/plugins/rule_registry/server/rule_data_plugin_service/resource_installer.test.ts
#	x-pack/plugins/rule_registry/server/rule_data_plugin_service/resource_installer.ts
pmuellrgitoff pushed a commit to pmuellrgitoff/kibana that referenced this issue Oct 5, 2024