
Improve migrations performance by using distinct indices per SO type #104081

Closed
pgayvallet opened this issue Jul 1, 2021 · 7 comments · Fixed by #154888
Assignees
Labels
discuss Epic:ScaleMigrations Scale upgrade migrations to millions of saved objects Feature:Saved Objects loe:x-large Extra Large Level of Effort Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@pgayvallet
Contributor

pgayvallet commented Jul 1, 2021

This is a follow-up on https://elasticco.atlassian.net/browse/KBNA-7838

In this phase we propose to split some saved object types out of the .kibana index into separate indices.

While prior work can reduce downtime for some upgrades, if just one saved object type defines a migration, all saved objects still need to be migrated. E.g. there might be no cases migrations defined, but a dashboard migration would still require us to migrate all 1M cases. With this change, we would only migrate the 1M cases if there is a cases migration defined. While this reduces the average downtime per upgrade, it does introduce unpredictability for users, where some upgrades are fast and others cause 10 minutes of downtime.

We already thought a lot about splitting the kibana index into multiple ones, either one index per type, or potentially one index per solution (#90817). If we were to go in that direction, we could potentially increase the migration speed significantly, by only performing the migration process on the indices containing documents that need to be migrated.

As an example, let’s say we have 3 SO types, foo, bar and dolly, living respectively in the .kibana_foo_8_0_0, .kibana_bar_8_0_0 and .kibana_dolly_8_0_0 indices.

During the migration from v8.0.0 to v8.2.1, we detect that we only have migrations to apply to the foo type. We could then only apply the migration to the .kibana_foo_8_0_0 index, and just clone the other indices to their new version.

We could even go further: If we keep inside an internal index the current version (name) of each index, we could potentially even avoid the cloning of the .kibana_bar_8_0_0 and .kibana_dolly_8_0_0 indices, and just have the next version use the exact same indices.

This could drastically reduce the average migration time, especially if we choose to have one index per type (as opposed to one index per grouping of types).
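
To make the idea concrete, here's a minimal, hypothetical sketch (names invented for illustration, not actual Kibana code) of how a per-type plan could be derived: indices with pending migrations get the full migration, the rest are cloned, or simply reused if we track the current index name in an internal index.

```ts
// Hypothetical sketch, not actual Kibana code: derive a per-type plan,
// assuming one index per SO type and knowledge of pending migrations.
interface TypeIndexState {
  type: string; // e.g. 'foo'
  currentIndex: string; // e.g. '.kibana_foo_8_0_0'
  hasPendingMigrations: boolean; // any migration registered between source and target version
}

type MigrationPlan =
  | { action: 'migrate'; source: string; target: string }
  | { action: 'clone'; source: string; target: string }
  | { action: 'reuse'; index: string };

function planMigration(
  state: TypeIndexState,
  targetVersion: string, // e.g. '8.2.1'
  versionTrackedInternally: boolean // true if an internal index records each type's current index
): MigrationPlan {
  const target = `.kibana_${state.type}_${targetVersion.replace(/\./g, '_')}`;
  if (state.hasPendingMigrations) {
    // Only indices holding documents that actually need transforming go
    // through the full migration process.
    return { action: 'migrate', source: state.currentIndex, target };
  }
  if (versionTrackedInternally) {
    // If we track the current index name per type, the next version can keep
    // using the exact same index and skip the clone entirely.
    return { action: 'reuse', index: state.currentIndex };
  }
  // Otherwise, cheaply clone the index to its new versioned name.
  return { action: 'clone', source: state.currentIndex, target };
}
```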

Note that this would have the additional advantage of addressing the problem of the ever-increasing number of registered fields (and dodging once and for all the Damocles sword of hitting the field limit), see #70471

@pgayvallet pgayvallet added discuss Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:Saved Objects project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient labels Jul 1, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@rudolf
Contributor

rudolf commented Jul 11, 2022

We should remember that the core usage collector for saved objects currently only collects saved objects per type for the "default saved objects index" https://github.com/elastic/kibana/blob/main/src/plugins/kibana_usage_collection/server/plugin.ts#L126

So we would need to refactor this collector to check the saved objects registry during start to find all the indices of all the types.
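
For illustration only (names are invented, not the actual collector code), the registry-driven lookup could look something like this sketch: walk the registered types at `start` and aggregate the set of indices in use instead of hard-coding the default one.

```ts
// Illustrative sketch: find every saved-objects index in use by walking the
// type registry, rather than counting documents only in the default index.
interface RegisteredType {
  name: string;
  indexPattern?: string; // unset means the type lives in the default SO index
}

function collectTargetIndices(types: RegisteredType[], defaultIndex: string): string[] {
  const indices = new Set<string>([defaultIndex]);
  for (const type of types) {
    indices.add(type.indexPattern ?? defaultIndex);
  }
  return [...indices];
}

// Usage (hypothetical): something like
//   collectTargetIndices(typeRegistry.getAllTypes(), '.kibana')
// would yield all indices the collector needs to query, depending on how
// types are registered.
```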

@rudolf rudolf added the loe:x-large Extra Large Level of Effort label Oct 26, 2022
@exalate-issue-sync exalate-issue-sync bot changed the title [SO Migration] Improve performance by using distinct indices per type migrations: Improve performance by using distinct indices per type Nov 10, 2022
@exalate-issue-sync exalate-issue-sync bot changed the title migrations: Improve performance by using distinct indices per type Improve performance by using distinct indices per type Nov 11, 2022
@exalate-issue-sync exalate-issue-sync bot changed the title Improve performance by using distinct indices per type Design: Improve performance by using distinct indices per type Nov 11, 2022
@rudolf
Contributor

rudolf commented Dec 21, 2022

Some initial thoughts around the design:
(Design sketches attached as screenshots in the original comment.)

There are several options to accomplish this. Some things could easily be done in one state machine by having the action target all indices at once, like the REINDEX_* steps. But this becomes complex when different indices require different actions, e.g. some indices need to be created for the first time, others need to be migrated from a legacy version, and yet others might already have an index with compatible mappings.

So it feels simplest to keep separate state machines per index but allow each state machine to block before progressing to the next step until all other state machines are past a certain step.
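
As a rough illustration of that blocking behaviour (a hypothetical helper, not the real migrator code), each per-index state machine could report its progress to a shared barrier that only unblocks once every machine has reached a given step:

```ts
// Hypothetical barrier between per-index state machines within one Kibana
// instance; step names are illustrative.
type Step = 'INIT' | 'CREATE_TEMP' | 'REINDEX' | 'MARK_READY' | 'DONE';
const STEP_ORDER: Step[] = ['INIT', 'CREATE_TEMP', 'REINDEX', 'MARK_READY', 'DONE'];

class MigrationBarrier {
  private readonly current = new Map<string, Step>();

  constructor(private readonly indices: string[]) {}

  // Each state machine reports its current step as it progresses.
  report(index: string, step: Step): void {
    this.current.set(index, step);
  }

  // True once every tracked index has reached at least `step`; a state
  // machine would block on this before progressing past the barrier.
  allReached(step: Step): boolean {
    const threshold = STEP_ORDER.indexOf(step);
    return this.indices.every(
      (index) => STEP_ORDER.indexOf(this.current.get(index) ?? 'INIT') >= threshold
    );
  }
}
```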

@pgayvallet
Contributor Author

pgayvallet commented Dec 22, 2022

Following our discussions from yesterday, especially what @gsoldevila proposed, here's my brainstorm on the kind of trade-offs and optimizations we could make:

If we look at the use cases, I think we have two main situations:

  1. we're migrating all types of a given index to the same index
  2. we're migrating some types of a given index to the same index and some other types from the said index to a non-existing index

Now, the main question I have is: Do we want/need to allow doing more than that?

E.g would we want to support something like

  • during 8.8, I move case-foo from .kibana to .kibana-cases (which is a new index I would be creating during this 8.8 migration)
  • during 8.9, I move case-bar from .kibana to .kibana-cases (which already exists)

My point being, if we define that:

  1. the action of creating a new index and moving types to it is 'atomic' and bound to a release (so basically you can't move types to an index that already exists)
  2. you can only move from one source index to a new index (e.g we can't move types from both .kibana and .kibana_task_manager to .kibana_foo)

I think it would significantly lower the needs in terms of synchronization, as I think we could, as @gsoldevila suggested, have most of the work be performed in 'isolation' by a single migrator.

I'm not speaking in technical terms here (the migration could instantiate other migrators or something similar) but in terms of responsibilities / concerns, as each migrator wouldn't need to sync or interact with the others.

wdyt all? does that make sense?

@gsoldevila
Contributor

gsoldevila commented Dec 22, 2022

If we look at the use cases, I think we have two main situations:

  1. we're migrating all types of a given index to the same index
  2. we're migrating some types of a given index to the same index and some other types from the said index to a non-existing index

If we think about how all that fits into serverless, neither of these will take place anymore in the near future.
We won't reindex, and we won't move SO types around to different indices either. So in the end, we'll have a much simpler state machine once serverless is widespread.

I was imagining the "after-split" functioning mode as a sort of v3 migrator that would bring us very close to serverless: a simplified one that only knows about updating mappings in place.

With that in mind, I define 2 main scenarios:

  1. We have a new deployment, or one that is already in the v3 (split) state. In this scenario we start one v3 migrator per "solution" index (or whatever SO type grouping we decide on).
  2. We have a v1 or v2 deployment. In this scenario, I'd tweak the v2 migrator to perform the split, using a single migrator in 'isolation'. This migrator would create all indices of the split and fill each of them with the corresponding documents. This would be a one-off, expensive operation. After the migration to v3 is finished, consecutive restarts will always use v3 logic, and no further breaking changes or splitting would be allowed.

This way, we can use the "splitting into separate indices" as an opportunity to take us a step closer to serverless. We'd also isolate the v3 logic in a separate "component", easier to maintain and update.

I think this also fits well with Pierre's suggestion of having moves bound to a release. After giving it some thought, Pierre's proposal is more flexible, as it would allow different moves between indices in the future, which we could turn to if we face issues like mapping explosions. Nevertheless, that would come at the cost of some downtime, and we would also be keeping some (a priori) unnecessary complexity in the code.
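
To summarise the two scenarios above as a purely illustrative selection function (the state and strategy names are invented, not real code):

```ts
// Illustrative only: pick the migration path based on the deployment state.
type DeploymentState = 'new' | 'v1' | 'v2' | 'v3-split';

function selectMigrationStrategy(state: DeploymentState): 'v3-per-index' | 'v2-with-split' {
  if (state === 'new' || state === 'v3-split') {
    // Scenario 1: run one simplified v3 migrator per "solution" index,
    // which only knows how to update mappings in place.
    return 'v3-per-index';
  }
  // Scenario 2: a one-off, more expensive v2 migration that also performs
  // the split into separate indices; subsequent restarts use v3 logic.
  return 'v2-with-split';
}
```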

@rudolf
Contributor

rudolf commented Dec 23, 2022

Spoke to @gsoldevila and @jloleysens. We agreed that reducing the code complexity of the migrator (especially the model) would be good. At a high level we could split up v1/v2/v3 migrations into their own state machines. It would also be useful to have a state machine of state machines so that the INIT and CHECK_MIGRATION_VERSION could benefit from all the retry logic and logging of the state action machine. It could also help to split out some parts of e.g. a v2 state machine into sub-state machines.

(Illustration from @gsoldevila, attached as a screenshot.)

When we do such a refactoring is debatable. If type splitting doesn't require a lot of risky, complexity-adding changes to the state machine, it feels like we should push for reducing downtime first and then refactor once that's released.

When it comes to what a v3 migrator might look like, this is a basic way to visualise the proposal: a single migrator instance that performs its actions for all the indices it manages, e.g. creates 3 temp indices, clones 3 temp indices, performs the MARK_VERSION_INDEX_READY for 3 indices, etc.

(Diagram attached as a screenshot.)

Let's call the independent-migrators approach proposal (1) and this one proposal (2). Both approaches require injecting the typeRegistry so that REINDEX_SOURCE_TO_TEMP_INDEX_BULK can create bulkWrite operations with the correct destination index for each document.

In (1) the coordination happens between the migrators; in (2) it happens within a migrator. But in both cases the coordination is only necessary inside a single Kibana instance; we're not talking about distributed synchronisation across instances.

I think (1) would require far fewer code changes though. We would leave all actions untouched but introduce two new steps, READY_TO_REINDEX_SYNC and DONE_REINDEXING_SYNC. We would also have to inject a function into the migrators that allows them to query the controlState of the other migrators; this would require some changes, but since it doesn't impact the algorithm they are low-risk. Then, in the *_SYNC steps, we would wait indefinitely for all migrators to reach this state. To increase visibility we would want to model this as retries every 60s, like READY_TO_REINDEX_SYNC -> READY_TO_REINDEX_SYNC, so that we can see the log activity instead of it looking like nothing is happening.
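
A sketch of what such a sync step could look like (state shapes are simplified and illustrative; `getOtherControlStates` stands in for the injected function mentioned above, and the "next" step name is just a placeholder):

```ts
// Illustrative sketch: a READY_TO_REINDEX_SYNC-style step that re-enters
// itself with a 60s retry delay until every other migrator has reached it.
interface SyncStepState {
  controlState: 'READY_TO_REINDEX_SYNC' | 'REINDEX_SOURCE_TO_TEMP';
  retryDelayMs: number;
}

function modelReadyToReindexSync(
  getOtherControlStates: () => string[]
): SyncStepState {
  const others = getOtherControlStates();
  // Simplified check: every other migrator is currently waiting at the barrier.
  const allAtBarrier = others.every((s) => s === 'READY_TO_REINDEX_SYNC');
  if (!allAtBarrier) {
    // Stay in the sync step and retry in 60s, so the logs keep showing
    // activity instead of the migration appearing to hang.
    return { controlState: 'READY_TO_REINDEX_SYNC', retryDelayMs: 60_000 };
  }
  // All migrators reached the barrier: progress to the next step.
  return { controlState: 'REINDEX_SOURCE_TO_TEMP', retryDelayMs: 0 };
}
```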

The previous illustrations were kinda mixing the state machine controlState and the writing of documents into indices, which made them harder to understand, so I updated that to hopefully make it clearer.
(Updated diagram attached as a screenshot.)

I also agree with the previous comments that we don't need to design for a flexible splitting strategy that can change if we decide to split types differently in the future. This is out of scope, and if we can simplify the implementation without it, that'd be fine. But I think it's possible without any further change (illustrated here with the orange arrow, which moves a document out of the cases index and back to the .kibana index).

@rudolf
Contributor

rudolf commented Dec 23, 2022

We also discussed why a clone is necessary even for a newly split index: it avoids lost deletes when there are multiple instances.

(Diagram attached as a screenshot.)

@rudolf rudolf changed the title Design: Improve performance by using distinct indices per type Improve performance by using distinct indices per type Feb 6, 2023
@rudolf rudolf reopened this Feb 6, 2023
@exalate-issue-sync exalate-issue-sync bot changed the title Improve performance by using distinct indices per type Improve migrations performance by using distinct indices per SO type Feb 10, 2023
@gsoldevila gsoldevila self-assigned this Mar 7, 2023
@rudolf rudolf removed the project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient label Apr 17, 2023
@rudolf rudolf added the Epic:ScaleMigrations Scale upgrade migrations to millions of saved objects label Apr 17, 2023
gsoldevila added a commit that referenced this issue Apr 25, 2023
## Description 

Fix #104081

This PR moves some of the SO types from the `.kibana` index into the
following ones:
- `.kibana_alerting_cases`
- `.kibana_analytics`
- `.kibana_security_solution`
- `.kibana_ingest`

This split/reallocation will occur during the `8.8.0` Kibana upgrade
(*meaning: from any version older than `8.8.0` to any version greater or
equal to `8.8.0`*)

**This PR's main changes are:**
- implement the changes required in the SO migration algorithm to
support this reallocation
- update the FTR tools (looking at you esArchiver) to support these new
indices
- update hardcoded references to `.kibana` and usage of the
`core.savedObjects.getKibanaIndex()` to use new APIs to target the
correct index/indices
- update FTR datasets, tests and utilities accordingly

## To reviewers

**Overall estimated risk of regressions: low**

But, still, please take the time to review changes in your code. The
parts of the production code that were the most impacted are the
telemetry collectors, as most of them were performing direct requests
against the `.kibana` index, so we had to adapt them. Most other
contributor-owned changes are in FTR tests and datasets.

If you think a type is misplaced (either we missed some types that
should be moved to a specific index, or some types were moved and
shouldn't have been) please tell us, and we'll fix the reallocation
either in this PR or in a follow-up.

## .Kibana split

The following new indices are introduced by this PR, with the SO types listed
below being moved to them (any SO type not listed here stays in its current
index).

Note: The complete **_type => index_** breakdown is available in [this
spreadsheet](https://docs.google.com/spreadsheets/d/1b_MG_E_aBksZ4Vkd9cVayij1oBpdhvH4XC8NVlChiio/edit#gid=145920788).

#### `.kibana_alerting_cases`
- action
- action_task_params
- alert
- api_key_pending_invalidation
- cases
- cases-comments
- cases-configure
- cases-connector-mappings
- cases-telemetry
- cases-user-actions
- connector_token
- rules-settings
- maintenance-window

#### `.kibana_security_solution`
- csp-rule-template
- endpoint:user-artifact
- endpoint:user-artifact-manifest
- exception-list
- exception-list-agnostic
- osquery-manager-usage-metric
- osquery-pack
- osquery-pack-asset
- osquery-saved-query
- security-rule
- security-solution-signals-migration
- siem-detection-engine-rule-actions
- siem-ui-timeline
- siem-ui-timeline-note
- siem-ui-timeline-pinned-event

#### `.kibana_analytics`

- canvas-element
- canvas-workpad-template
- canvas-workpad
- dashboard
- graph-workspace
- index-pattern
- kql-telemetry
- lens
- lens-ui-telemetry
- map
- search
- search-session
- search-telemetry
- visualization

#### `.kibana_ingest`

- epm-packages
- epm-packages-assets
- fleet-fleet-server-host
- fleet-message-signing-keys
- fleet-preconfiguration-deletion-record
- fleet-proxy
- ingest_manager_settings
- ingest-agent-policies
- ingest-download-sources
- ingest-outputs
- ingest-package-policies

## Tasks / PRs

### Sub-PRs

**Implementation**
- 🟣 #154846
- 🟣 #154892
- 🟣 #154882
- 🟣 #154884
- 🟣 #155155

**Individual index split**
- 🟣 #154897
- 🟣 #155129
- 🟣 #155140
- 🟣 #155130

### Improvements / follow-ups 

- 👷🏼 Extract logic into
[runV2Migration](#154151 (comment))
@gsoldevila
- Make `getCurrentIndexTypesMap` resilient to intermittent failures
#154151 (comment)
- 🚧 Build a more structured
[MigratorSynchronizer](#154151 (comment))
- 🟣 #155035
- 🟣 #155116
- 🟣 #155366
## Reallocation tweaks

Tweaks to the reallocation can be done after the initial merge, as long
as they are done before the public release of 8.8.

- `url` should get back to `.kibana` (see
[comment](#154888 (comment)))

## Release Note

For performance purposes, Kibana is now using more system indices to
store its internal data.

The following system indices will be created when upgrading to `8.8.0`:

- `.kibana_alerting_cases`
- `.kibana_analytics`
- `.kibana_security_solution`
- `.kibana_ingest`

---------

Co-authored-by: pgayvallet <pierre.gayvallet@elastic.co>
Co-authored-by: Christos Nasikas <christos.nasikas@elastic.co>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Georgii Gorbachev <georgii.gorbachev@elastic.co>
rudolf pushed a commit that referenced this issue Jun 5, 2023
…locating SO documents (#158940)

Fixes #158733

The goal of this modification is to force the migrators of all indices
involved in a relocation (e.g. as part of the [dot kibana
split](#104081)) to create the
index aliases in the same `updateAliases()` call.

This way, either:
* all the indices involved in the [dot kibana
split](#104081) relocation will
be completely upgraded (with the appropriate aliases).
* or none of them will.
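
For illustration, a sketch of what such a single atomic call could look like with the Elasticsearch JS client (index and alias names are examples of the convention, not the exact production values):

```ts
// Illustrative sketch: create the version aliases for every relocated index
// in one updateAliases request, which Elasticsearch applies atomically.
import { Client } from '@elastic/elasticsearch';

async function markRelocatedIndicesReady(client: Client): Promise<void> {
  // Either every index involved in the relocation gets its aliases, or none does.
  await client.indices.updateAliases({
    actions: [
      { add: { index: '.kibana_8.8.0_001', alias: '.kibana' } },
      { add: { index: '.kibana_8.8.0_001', alias: '.kibana_8.8.0' } },
      { add: { index: '.kibana_alerting_cases_8.8.0_001', alias: '.kibana_alerting_cases' } },
      { add: { index: '.kibana_alerting_cases_8.8.0_001', alias: '.kibana_alerting_cases_8.8.0' } },
      // ...one pair of add actions per index involved in the relocation
    ],
  });
}
```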