
[RAC] [Observability] Alert documents are not updated as expected when the write index changes #110519

Closed
Tracked by #101016
weltenwort opened this issue Aug 30, 2021 · 12 comments · Fixed by #110788

@weltenwort
Member

weltenwort commented Aug 30, 2021

🐞 Problem description

The lifecycle-aware observability rule executors use the RuleDataClient's bulk() method to both create and update alert documents. Currently, any update targeting an existing alert incorrectly duplicates the alert document once the write index of the alerts alias has changed (e.g. during a rollover).

The reason is that the rule data client sets the bulk operation's target index to the alias, which resolves to the current write index. Re-indexing a document whose id already exists in a non-write index doesn't update that document; instead, a new document with the same id is created in the write index.

Depending on whether we expect rollovers to happen within 7.15, this could be a serious bug. If we are confident rollovers won't happen, this could be deferred to 7.15.1.

💡 Possible solution

If the bulk() method of the rule data client removed the require_alias setting, the lifecycle helper could specify bulk update operations that target the respective concrete indices of the previously loaded alerts.
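To illustrate the idea, here is a minimal sketch of such bulk update operations. The TrackedAlert shape and its index property are hypothetical names for "an alert plus the concrete index it was loaded from"; this is not the actual rule registry code:

// Sketch only: assumes each previously loaded alert remembers the concrete
// backing index it was read from (the `index` property here is hypothetical).
interface TrackedAlert {
  uuid: string;
  index: string; // e.g. '.internal.alerts-observability.apm.alerts-default-000001'
  fields: Record<string, unknown>;
}

const buildUpdateBody = (alerts: TrackedAlert[]) =>
  alerts.flatMap((alert) => [
    // Target the concrete index instead of the alias, so the update reaches
    // the document even after the write index has rolled over.
    { update: { _id: alert.uuid, _index: alert.index } },
    { doc: alert.fields },
  ]);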

@weltenwort added labels: bug, Team:Infra Monitoring UI - DEPRECATED, Theme: rac, v7.15.0 — Aug 30, 2021
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@banderror
Contributor

Depending on whether we expect rollovers to happen within 7.15, this could be a serious bug. If we are confident rollovers won't happen, this could be deferred to 7.15.1.

I think we can’t know for sure when the first rollover will happen after a user upgrades Kibana to 7.15.0 and new RAC indices are created.

This is the default ILM policy that’s currently used for all the indices:

export const defaultLifecyclePolicy = {
  policy: {
    phases: {
      hot: {
        actions: {
          rollover: {
            max_age: '90d',
            max_size: '50gb',
          },
        },
      },
      delete: {
        actions: {
          delete: {},
        },
      },
    },
  },
};

Depending on the volume of alerts being written, this 50gb threshold could be reached before we release 7.15.1.

I’d personally not take this risk and I feel like this issue should be addressed asap and shipped in 7.15.0.

@banderror
Contributor

Just for additional context:

The reason is that the rule data client sets the bulk operation's target index to the alias, which resolves to the current write index. Re-indexing a document whose id already exists in a non-write index doesn't update that document; instead, a new document with the same id is created in the write index.

You can simulate this behavior in Kibana DevTools (queries provided by @weltenwort):

PUT test-000001
{
  "aliases": {
    "test": {"is_write_index": true}
  }, 
  "mappings": {
    "dynamic_templates": [
      {
        "strings": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    ],
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "message": {
        "type": "text"
      }
    }
  }
}

PUT test/_doc/1
{
  "@timestamp": "2021-08-30T12:00:00.000Z",
  "message": "message 1"
}

GET test/_search

PUT test-000001/_alias/test
{
  "is_write_index": false
}


PUT test-000002
{
  "aliases": {
    "test": {"is_write_index": true}
  }, 
  "mappings": {
    "dynamic_templates": [
      {
        "strings": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    ],
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "message": {
        "type": "text"
      }
    }
  }
}

PUT test/_doc/1
{
  "@timestamp": "2021-08-30T12:00:00.000Z",
  "message": "message 2"
}

GET test/_search

PUT test-000002/_alias/test
{
  "is_write_index": false
}

PUT test-000003
{
  "aliases": {
    "test": {"is_write_index": true}
  }, 
  "mappings": {
    "dynamic_templates": [
      {
        "strings": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    ],
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "message": {
        "type": "text"
      }
    }
  }
}

POST test/_update/1
{
  "doc": {
    "@timestamp": "2021-08-30T12:00:00.000Z",
    "message": "message 3"
  }
}

The second PUT (with "message 2") creates a duplicate instead of updating the existing document with id "1". The POST test/_update/1 fails with an error from Elasticsearch because the document with id "1" is no longer in the write index.

The call to ruleDataClient.bulk() in the lifecycle executor in fact does the equivalent of the PUT (it indexes a document, which can create a duplicate); see it here:

body: eventsToIndex.flatMap((event) => [{ index: { _id: event[ALERT_UUID]! } }, event]),
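For clarity, each pair this flatMap emits is a bulk index action plus the document to index, roughly like this (the values are invented for illustration):

// Illustrative expansion of one flatMap pair (values are invented):
const pair = [
  { index: { _id: '043cf432-a900-4cac-bbab-a337dab33552' } }, // index action; id is the alert uuid, index resolved via the alias
  { '@timestamp': '2021-08-30T12:00:00.000Z', 'kibana.alert.status': 'active' }, // the alert document itself
];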

@banderror
Contributor

@weltenwort has an idea for a hotfix which makes sense to me:

1. Remove require_alias: true from bulk()

Remove require_alias: true from the bulk() method of RuleDataWriter.

See x-pack/plugins/rule_registry/server/rule_data_client/rule_data_client.ts.

2. Pass the concrete index as _index in each document's bulk header

createLifecycleExecutor needs to be adjusted. It combines existing alerts with new ones in a single array eventsToIndex, but they should be separated. The new ones should be written as they are now:

body: eventsToIndex.flatMap((event) => [{ index: { _id: event[ALERT_UUID]! } }, event]),

and the existing ones should be updated via

{ update: { _id: event[ALERT_UUID]!, _index: event._index } }
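Put together, a sketch of the separated bulk body could look like the following. This is a fragment in the style of the snippets above, not final code; newEvents and trackedEvents are hypothetical names for the two groups:

// Sketch: new alerts are indexed through the alias, existing alerts are
// updated in the concrete indices they were loaded from.
const body = [
  ...newEvents.flatMap((event) => [{ index: { _id: event[ALERT_UUID]! } }, event]),
  ...trackedEvents.flatMap((event) => [
    { update: { _id: event[ALERT_UUID]!, _index: event._index } },
    { doc: event },
  ]),
];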

@banderror
Contributor

We also think that exposing a raw bulk method is not a good API for RuleDataWriter. There need to be more specific methods for writing new alerts and updating existing ones. I will create a ticket for that improvement.

@jasonrhodes
Member

jasonrhodes commented Aug 31, 2021

Let me make sure I understand this with an example:

  • 09:00 - Indices bootstrapped, "alerts-001" is the backing index for write alias "alerts"
  • 09:45 - Alert triggered and created, written through the "alerts" alias into "alerts-001" with ID "A100"
  • Every 5 minutes from 09:45 to 11:00, Alert A100 is still active and the document is updated inside the alerts-001 index
  • 11:02 - For whatever reason, a rollover is triggered, "alerts-002" is created (perhaps with updated mappings) and the "alerts" write alias is updated to point at "alerts-002"
  • 11:05 - Alert A100 is still active, but when we try to update this alert document, the write will use the "alerts" alias, which now points to "alerts-002". That index doesn't have a document with ID "A100", so a new one is created. Now there are two alert documents with ID A100: one in alerts-001 and one in alerts-002

Is this roughly correct?

If so, is this an accurate description of the proposed solution?

  • 11:05 - Alert A100 is still active, but when we update this alert document, we explicitly tell ES it is in alerts-001 so that it updates the document there. No new alert document will be created in alerts-002 for this ongoing alert event.

That would mean that an ongoing alert won't adopt the new mappings of an updated index while the alert continues to be active. I think that's ok (in fact, I assume this is exactly what we want, right?).

If this is all pretty much accurate, this sounds like a great solution. And a HUGE +1 on abstracting some of this away in a slightly better API for the RuleDataService.

@mgiota
Contributor

mgiota commented Sep 1, 2021

@jasonrhodes I was about to comment on this line https://github.com/elastic/kibana/blob/master/x-pack/plugins/rule_registry/server/rule_data_client/rule_data_client.ts#L124, where we explicitly specify the alias (which would lead to problems), and then I saw your message, which I think confirms my thoughts.

11:05 - Alert A100 is still active, but when we try to update this alert document, the write will use the "alerts" alias,

Indeed, if the write uses the alerts alias, I think what you described will happen: we will end up with two alert documents with the same id in different indices. The solution is not to write to the alerts alias but to the concrete index, as you suggested:

we explicitly tell ES it is in alerts-001 so that it updates the document there

Can somebody else confirm that @jasonrhodes's proposal to write to a concrete index instead of the alias is the right approach?

@mgiota
Contributor

mgiota commented Sep 1, 2021

Before jumping into the implementation I did a bit of testing in Kibana Dev Tools with the suggested queries. I deliberately tested the scenario where multiple indices are writable at the same time, and as expected I got this error.

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_state_exception",
        "reason" : "alias [test] has more than one write index [test-000002,test-000003]"
      }
    ],
    "type" : "illegal_state_exception",
    "reason" : "alias [test] has more than one write index [test-000002,test-000003]"
  },
  "status" : 500
}

I don't know much about the internal implementation, so I was wondering if we already prevent having multiple write indices at the same time. My guess is that either the rollover mechanism or the rule registry upgrade mechanism already takes care of that, and we shouldn't worry about the above error happening to our users. Am I right?

@weltenwort
Member Author

@jasonrhodes yes, that's correct

@mgiota
Contributor

mgiota commented Sep 2, 2021

When I removed require_alias: true from bulk(), I noticed that .keyword was appended to the field names. Also, the index name was .alerts-observability.apm.alerts-default, whereas with the require_alias option the index was .internal.alerts-observability.apm.alerts-default-000001.

{
        "_index" : ".alerts-observability.apm.alerts-default",
        "_id" : "043cf432-a900-4cac-bbab-a337dab33552",
        "_score" : null,
        "fields" : {
          "kibana.alert.rule.category.keyword" : [
            "Error count threshold"
          ],
          "kibana.alert.id.keyword" : [
            "apm.error_rate_elastic-co-frontend_https://www.elastic.co"
          ],
          "kibana.alert.status" : [
            "active"
          ],
          "kibana.alert.reason.keyword" : [
            "Error count is greater than 0 (current value is 6) for elastic-co-frontend"
          ],
          "kibana.alert.rule.producer.keyword" : [
            "apm"
          ],
          "service.name.keyword" : [
            "elastic-co-frontend"
          ],
          "kibana.alert.rule.producer" : [
            "apm"
          ],
          "kibana.alert.rule.name.keyword" : [
            "apm_error_count_stack"
          ],
          "kibana.alert.rule.rule_type_id" : [
            "apm.error_rate"
          ],
          "kibana.alert.evaluation.value" : [
            6
          ],
          "processor.event" : [
            "error"
          ],
          "event.kind.keyword" : [
            "signal"
          ],
          "kibana.alert.rule.name" : [
            "apm_error_count_stack"
          ],
          "kibana.alert.id" : [
            "apm.error_rate_elastic-co-frontend_https://www.elastic.co"
          ],
          "event.action.keyword" : [
            "open"
          ],
          "event.kind" : [
            "signal"
          ],
          "kibana.version.keyword" : [
            "8.0.0"
          ],
          "kibana.space_ids.keyword" : [
            "default"
          ],
          "service.environment" : [
            "https://www.elastic.co"
          ],
          "kibana.alert.workflow_status" : [
            "open"
          ],
          "kibana.alert.workflow_status.keyword" : [
            "open"
          ],
          "service.name" : [
            "elastic-co-frontend"
          ],
          "kibana.alert.rule.uuid" : [
            "8c867ac0-0b77-11ec-9120-b707eb0435b6"
          ],
          "processor.event.keyword" : [
            "error"
          ],
          "kibana.alert.reason" : [
            "Error count is greater than 0 (current value is 6) for elastic-co-frontend"
          ],
          "kibana.alert.rule.consumer" : [
            "alerts"
          ],
          "kibana.alert.rule.uuid.keyword" : [
            "8c867ac0-0b77-11ec-9120-b707eb0435b6"
          ],
          "kibana.alert.rule.category" : [
            "Error count threshold"
          ],
          "kibana.alert.start" : [
            "2021-09-01T22:54:27.817Z"
          ],
          "event.action" : [
            "open"
          ],
          "kibana.alert.duration.us" : [
            0
          ],
          "@timestamp" : [
            "2021-09-01T22:54:27.817Z"
          ],
          "kibana.alert.status.keyword" : [
            "active"
          ],
          "service.environment.keyword" : [
            "https://www.elastic.co"
          ],
          "kibana.alert.uuid.keyword" : [
            "043cf432-a900-4cac-bbab-a337dab33552"
          ],
          "kibana.alert.rule.rule_type_id.keyword" : [
            "apm.error_rate"
          ],
          "kibana.alert.rule.consumer.keyword" : [
            "alerts"
          ],
          "kibana.alert.uuid" : [
            "043cf432-a900-4cac-bbab-a337dab33552"
          ],
          "kibana.space_ids" : [
            "default"
          ],
          "kibana.version" : [
            "8.0.0"
          ],
          "kibana.alert.evaluation.threshold" : [
            0
          ]
        },

The Alerts table doesn't load any data in this case. Here's the error I get:

 "reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [kibana.alert.rule.producer] in order to load field data by uninverting the inverted index. Note that this can use significant memory."

@banderror @weltenwort any thoughts on that? One thing that is not clear to me is whether I need to have this configuration: # xpack.ruleRegistry.unsafe.indexUpgrade.enabled: true

@mgiota
Contributor

mgiota commented Sep 2, 2021

@banderror We had a look together with @weltenwort and got it working now. The root of the problem was indeed the removal of require_alias: true. With this flag removed, the mappings were not installed correctly, because we install the resources only when the bulk request returns errors: https://github.com/elastic/kibana/blob/master/x-pack/plugins/rule_registry/server/rule_data_client/rule_data_client.ts#L139.

As a result, when data was first written, the resources were not installed, and the Alerts table did not render. The fix was to:

  • keep require_alias: true
  • and disable it only when updating a document:

await ruleDataClient.getWriter().bulk({
  body: allEventsToIndex.flatMap(({ event, indexName }) => [
    indexName
      ? // Existing alert: target its concrete backing index and skip the alias check.
        { index: { _id: event[ALERT_UUID]!, _index: indexName, require_alias: false } }
      : // New alert: index through the alias, which resolves to the current write index.
        { index: { _id: event[ALERT_UUID]! } },
    event,
  ]),
});

@banderror
Contributor

Before jumping into the implementation I did a bit of testing in Kibana Dev Tools with the suggested queries. I deliberately tested the scenario where multiple indices are writable at the same time, and as expected I got this error.

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_state_exception",
        "reason" : "alias [test] has more than one write index [test-000002,test-000003]"
      }
    ],
    "type" : "illegal_state_exception",
    "reason" : "alias [test] has more than one write index [test-000002,test-000003]"
  },
  "status" : 500
}

I don't know much about the internal implementation, so I was wondering if we already prevent having multiple write indices at the same time. My guess is that either the rollover mechanism or the rule registry upgrade mechanism already takes care of that, and we shouldn't worry about the above error happening to our users. Am I right?

@mgiota Sorry for the late reply. Yes, this is taken care of implicitly. There are two cases where concrete indices can be created: initial index creation and rollover. In both cases we use deterministic names (we explicitly specify a name for the initial index and a name for the rollover target index), and in both cases race conditions are possible but handled in the code, so it's not possible for, say, two parallel rollovers to create two different write indices.
