
[RAC] [Observability] Alert documents are not updated as expected when the write index changes #110519

Closed
Tracked by #101016
weltenwort opened this issue Aug 30, 2021 · 12 comments · Fixed by #110788

@weltenwort
Member

weltenwort commented Aug 30, 2021

🐞 Problem description

The lifecycle-aware observability rule executors use the RuleDataClient's bulk() method to both create and update alert documents. Currently, any update targeting an existing alert incorrectly duplicates the alert document once the write index of the alerts alias has changed (e.g. during a rollover).

The reason is that the rule data client sets the bulk operation's target index to the alias, which resolves to the current write index. Re-indexing a document whose id already exists in a non-write index doesn't update that document; instead, a new document with the same id is created in the write index.

Depending on whether we expect rollovers to happen within 7.15, this could be a serious bug. If we are confident rollovers won't happen, this could be deferred to 7.15.1.

💡 Possible solution

If the bulk() method of the rule data client removed the require_alias setting, the lifecycle helper could specify bulk update operations that target the respective concrete indices of the previously loaded alerts.
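To illustrate the idea, here is a minimal sketch of such bulk update operations. The TrackedAlert shape and its index property are hypothetical names for "an alert plus the concrete index it was loaded from"; this is not the actual rule registry code:

// Sketch only: assumes each previously loaded alert remembers the concrete
// backing index it was read from (the `index` property here is hypothetical).
interface TrackedAlert {
  uuid: string;
  index: string; // e.g. '.internal.alerts-observability.apm.alerts-default-000001'
  fields: Record<string, unknown>;
}

const buildUpdateBody = (alerts: TrackedAlert[]) =>
  alerts.flatMap((alert) => [
    // Target the concrete index instead of the alias, so the update reaches
    // the document even after the write index has rolled over.
    { update: { _id: alert.uuid, _index: alert.index } },
    { doc: alert.fields },
  ]);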

@weltenwort added labels: bug, Team:Infra Monitoring UI - DEPRECATED, Theme: rac, v7.15.0 — Aug 30, 2021
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@banderror
Contributor

Depending on whether we expect rollovers to happen within 7.15, this could be a serious bug. If we are confident rollovers won't happen, this could be deferred to 7.15.1.

I think we can’t know for sure when the first rollover will happen after a user upgrades Kibana to 7.15.0 and new RAC indices are created.

This is the default ILM policy that’s currently used for all the indices:

export const defaultLifecyclePolicy = {
  policy: {
    phases: {
      hot: {
        actions: {
          rollover: {
            max_age: '90d',
            max_size: '50gb',
          },
        },
      },
      delete: {
        actions: {
          delete: {},
        },
      },
    },
  },
};

Depending on the volume of alerts being written, this 50gb threshold could be reached before we release 7.15.1.

I’d personally not take this risk and I feel like this issue should be addressed asap and shipped in 7.15.0.

@banderror
Contributor

Just for additional context:

The reason is that the rule data client sets the bulk operation's target index to the alias, which resolves to the current write index. Re-indexing a document whose id already exists in a non-write index doesn't update that document; instead, a new document with the same id is created in the write index.

You can simulate this behavior in Kibana DevTools (queries provided by @weltenwort):

PUT test-000001
{
  "aliases": {
    "test": {"is_write_index": true}
  }, 
  "mappings": {
    "dynamic_templates": [
      {
        "strings": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    ],
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "message": {
        "type": "text"
      }
    }
  }
}

PUT test/_doc/1
{
  "@timestamp": "2021-08-30T12:00:00.000Z",
  "message": "message 1"
}

GET test/_search

PUT test-000001/_alias/test
{
  "is_write_index": false
}


PUT test-000002
{
  "aliases": {
    "test": {"is_write_index": true}
  }, 
  "mappings": {
    "dynamic_templates": [
      {
        "strings": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    ],
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "message": {
        "type": "text"
      }
    }
  }
}

PUT test/_doc/1
{
  "@timestamp": "2021-08-30T12:00:00.000Z",
  "message": "message 2"
}

GET test/_search

PUT test-000002/_alias/test
{
  "is_write_index": false
}

PUT test-000003
{
  "aliases": {
    "test": {"is_write_index": true}
  }, 
  "mappings": {
    "dynamic_templates": [
      {
        "strings": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    ],
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "message": {
        "type": "text"
      }
    }
  }
}

POST test/_update/1
{
  "doc": {
    "@timestamp": "2021-08-30T12:00:00.000Z",
    "message": "message 3"
  }
}

The second PUT (with "message 2") creates a duplicate instead of updating the existing document with id "1". The POST test/_update/1 fails with an error from Elasticsearch because the document with id "1" is no longer in the write index.

The call to ruleDataClient.bulk() in the lifecycle executor in fact does the equivalent of the PUT (it indexes a document, which can create a duplicate); see it here:

body: eventsToIndex.flatMap((event) => [{ index: { _id: event[ALERT_UUID]! } }, event]),
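For clarity, each pair this flatMap emits is a bulk index action plus the document to index, roughly like this (the values are invented for illustration):

// Illustrative expansion of one flatMap pair (values are invented):
const pair = [
  { index: { _id: '043cf432-a900-4cac-bbab-a337dab33552' } }, // index action; id is the alert uuid, index resolved via the alias
  { '@timestamp': '2021-08-30T12:00:00.000Z', 'kibana.alert.status': 'active' }, // the alert document itself
];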

@banderror
Contributor

@weltenwort has an idea for a hotfix which makes sense to me:

1. Remove require_alias: true from bulk()

Remove require_alias: true from the bulk() method of RuleDataWriter.

See x-pack/plugins/rule_registry/server/rule_data_client/rule_data_client.ts.

2. Pass the concrete index as _index in each document's bulk header

createLifecycleExecutor needs to be adjusted. It combines existing alerts with new ones in a single array eventsToIndex, but they should be separated. The new ones should be written as they are now:

body: eventsToIndex.flatMap((event) => [{ index: { _id: event[ALERT_UUID]! } }, event]),

and the existing ones should be updated via

{ update: { _id: event[ALERT_UUID]!, _index: event._index } }
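Put together, a sketch of the separated bulk body could look like the following. This is a fragment in the style of the snippets above, not final code; newEvents and trackedEvents are hypothetical names for the two groups:

// Sketch: new alerts are indexed through the alias, existing alerts are
// updated in the concrete indices they were loaded from.
const body = [
  ...newEvents.flatMap((event) => [{ index: { _id: event[ALERT_UUID]! } }, event]),
  ...trackedEvents.flatMap((event) => [
    { update: { _id: event[ALERT_UUID]!, _index: event._index } },
    { doc: event },
  ]),
];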

@banderror
Contributor

We also think that exposing a raw bulk method is not a good API for RuleDataWriter. There need to be more specific methods for writing new alerts and updating existing ones. I will create a ticket for that improvement.

@jasonrhodes
Member

jasonrhodes commented Aug 31, 2021

Let me make sure I understand this with an example:

  • 09:00 - Indices bootstrapped, "alerts-001" is the backing index for write alias "alerts"
  • 09:45 - Alert triggered and created, written through the "alerts" alias into "alerts-001" with ID "A100"
  • Every 5 minutes from 09:45 to 11:00, Alert A100 is still active and the document is updated inside the alerts-001 index
  • 11:02 - For whatever reason, a rollover is triggered, "alerts-002" is created (perhaps with updated mappings) and the "alerts" write alias is updated to point at "alerts-002"
  • 11:05 - Alert A100 is still active, but when we try to update this alert document, the write will use the "alerts" alias, which now points to "alerts-002". That index doesn't have a document with ID "A100", so a new one is created. Now there are two alert documents with ID A100: one in alerts-001 and one in alerts-002

Is this roughly correct?

If so, is this an accurate description of the proposed solution?

  • 11:05 - Alert A100 is still active, but when we update this alert document, we explicitly tell ES it is in alerts-001 so that it updates the document there. No new alert document will be created in alerts-002 for this ongoing alert event.

That would mean that an ongoing alert won't adopt the new mappings of an updated index while the alert continues to be active. I think that's ok (in fact, I assume this is exactly what we want, right?).

If this is all pretty much accurate, this sounds like a great solution. And a HUGE +1 on abstracting some of this away in a slightly better API for the RuleDataService.

@mgiota
Contributor

mgiota commented Sep 1, 2021

@jasonrhodes I was about to comment on this line https://github.com/elastic/kibana/blob/master/x-pack/plugins/rule_registry/server/rule_data_client/rule_data_client.ts#L124, where we explicitly specify the alias (which would lead to problems), and then I saw your message, which I think confirms my thoughts.

11:05 - Alert A100 is still active, but when we try to update this alert document, the write will use the "alerts" alias,

Indeed, if the write uses the alerts alias, I think what you described will happen: we will end up with two alert documents with the same id in different indices. The solution is not to write to the alerts alias but to the concrete index, as you suggested:

we explicitly tell ES it is in alerts-001 so that it updates the document there

Can somebody else confirm that @jasonrhodes's proposal to write to a concrete index instead of the alias is the right approach?

@mgiota
Contributor

mgiota commented Sep 1, 2021

Before jumping into the implementation I did a bit of testing in Kibana Dev Tools with the suggested queries. I deliberately tested the scenario where multiple indices are writable at the same time, and as expected I got this error.

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_state_exception",
        "reason" : "alias [test] has more than one write index [test-000002,test-000003]"
      }
    ],
    "type" : "illegal_state_exception",
    "reason" : "alias [test] has more than one write index [test-000002,test-000003]"
  },
  "status" : 500
}

I don't know much about the internal implementation, so I was wondering if we already prevent having multiple write indices at the same time. My guess is that either the rollover mechanism or the rule registry upgrade mechanism already takes care of that, and we shouldn't worry about the above error happening to our users. Am I right?

@weltenwort
Member Author

@jasonrhodes yes, that's correct

@mgiota
Contributor

mgiota commented Sep 2, 2021

When I removed require_alias: true from bulk(), I noticed that .keyword was appended to the field names. Also, the index name was .alerts-observability.apm.alerts-default, whereas with the require_alias option the index was .internal.alerts-observability.apm.alerts-default-000001.

{
        "_index" : ".alerts-observability.apm.alerts-default",
        "_id" : "043cf432-a900-4cac-bbab-a337dab33552",
        "_score" : null,
        "fields" : {
          "kibana.alert.rule.category.keyword" : [
            "Error count threshold"
          ],
          "kibana.alert.id.keyword" : [
            "apm.error_rate_elastic-co-frontend_https://www.elastic.co"
          ],
          "kibana.alert.status" : [
            "active"
          ],
          "kibana.alert.reason.keyword" : [
            "Error count is greater than 0 (current value is 6) for elastic-co-frontend"
          ],
          "kibana.alert.rule.producer.keyword" : [
            "apm"
          ],
          "service.name.keyword" : [
            "elastic-co-frontend"
          ],
          "kibana.alert.rule.producer" : [
            "apm"
          ],
          "kibana.alert.rule.name.keyword" : [
            "apm_error_count_stack"
          ],
          "kibana.alert.rule.rule_type_id" : [
            "apm.error_rate"
          ],
          "kibana.alert.evaluation.value" : [
            6
          ],
          "processor.event" : [
            "error"
          ],
          "event.kind.keyword" : [
            "signal"
          ],
          "kibana.alert.rule.name" : [
            "apm_error_count_stack"
          ],
          "kibana.alert.id" : [
            "apm.error_rate_elastic-co-frontend_https://www.elastic.co"
          ],
          "event.action.keyword" : [
            "open"
          ],
          "event.kind" : [
            "signal"
          ],
          "kibana.version.keyword" : [
            "8.0.0"
          ],
          "kibana.space_ids.keyword" : [
            "default"
          ],
          "service.environment" : [
            "https://www.elastic.co"
          ],
          "kibana.alert.workflow_status" : [
            "open"
          ],
          "kibana.alert.workflow_status.keyword" : [
            "open"
          ],
          "service.name" : [
            "elastic-co-frontend"
          ],
          "kibana.alert.rule.uuid" : [
            "8c867ac0-0b77-11ec-9120-b707eb0435b6"
          ],
          "processor.event.keyword" : [
            "error"
          ],
          "kibana.alert.reason" : [
            "Error count is greater than 0 (current value is 6) for elastic-co-frontend"
          ],
          "kibana.alert.rule.consumer" : [
            "alerts"
          ],
          "kibana.alert.rule.uuid.keyword" : [
            "8c867ac0-0b77-11ec-9120-b707eb0435b6"
          ],
          "kibana.alert.rule.category" : [
            "Error count threshold"
          ],
          "kibana.alert.start" : [
            "2021-09-01T22:54:27.817Z"
          ],
          "event.action" : [
            "open"
          ],
          "kibana.alert.duration.us" : [
            0
          ],
          "@timestamp" : [
            "2021-09-01T22:54:27.817Z"
          ],
          "kibana.alert.status.keyword" : [
            "active"
          ],
          "service.environment.keyword" : [
            "https://www.elastic.co"
          ],
          "kibana.alert.uuid.keyword" : [
            "043cf432-a900-4cac-bbab-a337dab33552"
          ],
          "kibana.alert.rule.rule_type_id.keyword" : [
            "apm.error_rate"
          ],
          "kibana.alert.rule.consumer.keyword" : [
            "alerts"
          ],
          "kibana.alert.uuid" : [
            "043cf432-a900-4cac-bbab-a337dab33552"
          ],
          "kibana.space_ids" : [
            "default"
          ],
          "kibana.version" : [
            "8.0.0"
          ],
          "kibana.alert.evaluation.threshold" : [
            0
          ]
        },

The Alerts table doesn't load any data in this case. Here's the error I get:

 "reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [kibana.alert.rule.producer] in order to load field data by uninverting the inverted index. Note that this can use significant memory."

@banderror @weltenwort any thoughts on that? One thing that is not clear to me is whether I need to have this configuration: # xpack.ruleRegistry.unsafe.indexUpgrade.enabled: true

@mgiota
Contributor

mgiota commented Sep 2, 2021

@banderror We had a look together with @weltenwort and got it working now. The root of the problem was indeed the removal of require_alias: true. With this flag removed, the mappings were not installed correctly, because we install the resources only when the bulk request returns errors: https://github.com/elastic/kibana/blob/master/x-pack/plugins/rule_registry/server/rule_data_client/rule_data_client.ts#L139.

As a result, when data was first written, the resources were not installed, and the Alerts table did not render. The fix was to:

  • keep require_alias: true
  • and disable it only when updating a document:

await ruleDataClient.getWriter().bulk({
  body: allEventsToIndex.flatMap(({ event, indexName }) => [
    indexName
      ? // Existing alert: target its concrete backing index and skip the alias check.
        { index: { _id: event[ALERT_UUID]!, _index: indexName, require_alias: false } }
      : // New alert: index through the alias, which resolves to the current write index.
        { index: { _id: event[ALERT_UUID]! } },
    event,
  ]),
});

@banderror
Contributor

Before jumping into the implementation I did a bit of testing in Kibana Dev Tools with the suggested queries. I deliberately tested the scenario where multiple indices are writable at the same time, and as expected I got this error.

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_state_exception",
        "reason" : "alias [test] has more than one write index [test-000002,test-000003]"
      }
    ],
    "type" : "illegal_state_exception",
    "reason" : "alias [test] has more than one write index [test-000002,test-000003]"
  },
  "status" : 500
}

I don't know much about the internal implementation, so I was wondering if we already prevent having multiple write indices at the same time. My guess is that either the rollover mechanism or the rule registry upgrade mechanism already takes care of that, and we shouldn't worry about the above error happening to our users. Am I right?

@mgiota Sorry for the late reply. Yes, this is taken care of implicitly. There are two cases where concrete indices can be created: initial index creation and rollover. In both cases we use deterministic names (we explicitly specify a name for the initial index and a name for the rollover target index), and in both cases race conditions are possible but handled in the code, so it's not possible for, say, two parallel rollovers to create two different write indices.
