
Refuse to start with unknown saved object types in 8.0 #107678

Closed
joshdover opened this issue Aug 4, 2021 · 6 comments · Fixed by #118300
Labels
Feature:Saved Objects impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. loe:small Small Level of Effort project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@joshdover
Contributor

joshdover commented Aug 4, 2021

Update: to protect users' data, we have decided to prevent upgrades from succeeding if unknown saved object types are encountered.

Context on this discussion (newest to oldest):

In order to improve the reliability of the Saved Object migration system, we'd need to eliminate situations that can cause data integrity issues. One of those situations is with how we currently handle objects of unknown types. The current behavior will retain these documents in the Kibana index, unmodified from their previous state. Whenever the type becomes known again (by re-enabling a plugin), Kibana will migrate this document to the most recent schema.

Supporting this is becoming ever more challenging as we expand our ability to do different types of migrations. For example, this behavior will result in corrupted state if there are any other objects that reference objects of a disabled type during the multi-namespace migration and then that type is enabled again later.

Data integrity is important for the long-term maintenance of a Kibana installation and is especially important to long-term, wide-scale users of Kibana. If we want to eliminate this type of data corruption, then we need to take some action here to prevent such scenarios.

The scenario where the migration system encounters objects of unknown types can occur in the following situations:

First-party plugins

Third-party plugins

  • When a 3rd-party plugin is disabled or uninstalled after storing some data
    • This may be especially common for 3rd-party plugins because we require plugins to be built specifically for each Kibana version. A user may decide to upgrade their cluster without some custom plugins installed and then reinstall them once they've been updated. In this case, I think users expect their data to be intact.
  • When a 3rd-party plugin stops using a Saved Object type
    • With the current architecture, there's no easy way to distinguish this scenario from the one above, so the options are essentially the same.

Solutions

For users who only use 1st-party plugins, the solution seems quite straightforward. Given that we've already made progress on mitigating the ways this issue could happen, we can either:

  1. Refuse to start Kibana if any objects of unknown types exist in the index
  2. Automatically filter for only documents of known types, and either leave the unknown documents in the previous index or move them to a special index (for example, .kibana-orphaned).
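A minimal TypeScript sketch of option (2) follows. The `SavedObjectDoc` shape and the `partitionByKnownType` helper are hypothetical simplifications for illustration, not Kibana's actual migration API; the `orphaned` half is what would be left in the previous index or moved to something like `.kibana-orphaned`.

```typescript
// Hypothetical, simplified shape of a raw saved object document (not Kibana's real type).
interface SavedObjectDoc {
  id: string;
  type: string;
  attributes: Record<string, unknown>;
}

interface PartitionResult {
  known: SavedObjectDoc[];    // documents that would be migrated as usual
  orphaned: SavedObjectDoc[]; // documents that option (2) would leave behind or move aside
}

// Split documents by whether any enabled plugin registers their type.
function partitionByKnownType(
  docs: SavedObjectDoc[],
  knownTypes: ReadonlySet<string>
): PartitionResult {
  const known: SavedObjectDoc[] = [];
  const orphaned: SavedObjectDoc[] = [];
  for (const doc of docs) {
    if (knownTypes.has(doc.type)) {
      known.push(doc);
    } else {
      orphaned.push(doc);
    }
  }
  return { known, orphaned };
}

// Example: "my-plugin-widget" belongs to a (hypothetical) uninstalled 3rd-party plugin.
const docs: SavedObjectDoc[] = [
  { id: "1", type: "dashboard", attributes: {} },
  { id: "2", type: "my-plugin-widget", attributes: {} },
];
const { known, orphaned } = partitionByKnownType(docs, new Set(["dashboard", "visualization"]));
console.log(known.map((d) => d.id), orphaned.map((d) => d.id)); // [ '1' ] [ '2' ]
```

Note that this partition looks only at each document's own type; it ignores references between documents.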

(2) is preferable from a user perspective since it unblocks users from upgrading Kibana quickly, but has the drawback of 'silently' excluding documents.

The challenge is how to handle the scenario where 3rd-party plugins may not be installed during an upgrade. I propose we go with solution (2) and add a mechanism for importing and migrating documents from the .kibana-orphaned index whose types are now known to a plugin. This keeps the default, happy path as safe as possible while giving more advanced users a way to recover their data, even if that recovery may not preserve full data integrity. This mechanism could be exposed via either a config option or a UI prompt to import these objects after Kibana has started (which has some drawbacks but may work for most 3rd-party plugins).
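The proposed recovery mechanism could look roughly like the sketch below. It assumes a hypothetical registry that maps each now-registered type to a function migrating a document to that type's latest schema; `recoverOrphans` and `MigrateFn` are illustrative names only.

```typescript
// Hypothetical, simplified saved object document.
interface SavedObjectDoc {
  id: string;
  type: string;
  attributes: Record<string, unknown>;
}

// Migrates a document of one type to that type's latest schema (hypothetical).
type MigrateFn = (doc: SavedObjectDoc) => SavedObjectDoc;

interface RecoveryResult {
  imported: SavedObjectDoc[];      // now-known documents, migrated and ready to import
  stillOrphaned: SavedObjectDoc[]; // documents whose type is still unknown stay quarantined
}

// Re-check quarantined documents, e.g. after a 3rd-party plugin has been reinstalled.
function recoverOrphans(
  orphaned: SavedObjectDoc[],
  migrators: ReadonlyMap<string, MigrateFn>
): RecoveryResult {
  const imported: SavedObjectDoc[] = [];
  const stillOrphaned: SavedObjectDoc[] = [];
  for (const doc of orphaned) {
    const migrate = migrators.get(doc.type);
    if (migrate) {
      imported.push(migrate(doc));
    } else {
      stillOrphaned.push(doc);
    }
  }
  return { imported, stillOrphaned };
}

// Example: the plugin owning "my-plugin-widget" is back and renames a field in its latest schema.
const migrators = new Map<string, MigrateFn>([
  ["my-plugin-widget", (doc) => ({ ...doc, attributes: { title: doc.attributes["name"] } })],
]);
const result = recoverOrphans(
  [
    { id: "2", type: "my-plugin-widget", attributes: { name: "Latency" } },
    { id: "3", type: "other-plugin-thing", attributes: {} },
  ],
  migrators
);
```

The sketch migrates each document in isolation and does not check references, which is exactly where the integrity concerns raised in this thread come in.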

For customers that absolutely need 100% data integrity, we can recommend they ensure that all 3rd-party plugins are installed during their upgrade rather than afterwards. For all others, we have an escape hatch that will probably work most of the time, but is not guaranteed and therefore not enabled by default.

@joshdover joshdover added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:Saved Objects project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient labels Aug 4, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@joshdover
Contributor Author

joshdover commented Aug 5, 2021

A decision on this blocks progress on #105272 and #107740

@lukeelmers
Member

Discussed this with @rudolf and @pgayvallet. The main concern with solution (2) is how to handle SO references:

  1. If there are any inbound references to an SO of an unknown type, they will break when we move that SO to an orphan index
    • Do we orphan the referring SOs as well to prevent breaking references? Depending on the number of references, this could mean removing many extra objects, which may surprise the user (especially if the only way to fix it is to manually re-import them later)
  2. As part of the sharing saved objects effort, we'll be regenerating IDs on SOs. This means that if we "quarantine" the unknown SO, outbound references from it will break once the IDs of other objects are changed.
    • As long as SO aliases are around and used to resolve references, then maybe this could actually still work?
    • If the IDs were stable, this would no longer be a concern. So if we wait until we no longer need to worry about regenerated IDs (late 8.x?), this problem goes away
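To make the first concern concrete, the sketch below computes which additional objects would have to be orphaned so that no remaining object references a quarantined one. Orphaning is transitive: an object that references a newly orphaned object must be orphaned too. The document shape and the `orphanWithReferrers` helper are hypothetical.

```typescript
// Hypothetical, simplified saved object shapes.
interface Reference {
  type: string;
  id: string;
  name: string;
}
interface SavedObjectDoc {
  id: string;
  type: string;
  references: Reference[];
}

const keyOf = (type: string, id: string): string => `${type}:${id}`;

// Starting from the orphaned set, repeatedly move any kept document that
// references an orphaned one into the orphaned set, until nothing changes.
function orphanWithReferrers(
  known: SavedObjectDoc[],
  orphaned: SavedObjectDoc[]
): { kept: SavedObjectDoc[]; orphaned: SavedObjectDoc[] } {
  const orphanKeys = new Set(orphaned.map((d) => keyOf(d.type, d.id)));
  const allOrphaned = orphaned.slice();
  let kept = known.slice();
  let changed = true;
  while (changed) {
    changed = false;
    const next: SavedObjectDoc[] = [];
    for (const doc of kept) {
      if (doc.references.some((r) => orphanKeys.has(keyOf(r.type, r.id)))) {
        orphanKeys.add(keyOf(doc.type, doc.id));
        allOrphaned.push(doc);
        changed = true;
      } else {
        next.push(doc);
      }
    }
    kept = next;
  }
  return { kept, orphaned: allOrphaned };
}

// Example: a report references a dashboard, which embeds an unknown-type widget.
const res = orphanWithReferrers(
  [
    { id: "r1", type: "report", references: [{ type: "dashboard", id: "d1", name: "source" }] },
    { id: "d1", type: "dashboard", references: [{ type: "my-plugin-widget", id: "w1", name: "panel_0" }] },
    { id: "t1", type: "tag", references: [] },
  ],
  [{ id: "w1", type: "my-plugin-widget", references: [] }]
);
console.log(res.orphaned.map((d) => d.id)); // [ 'w1', 'd1', 'r1' ]
```

One unknown widget ends up pulling two otherwise healthy objects out of the index, which is the "surprise" described above.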

@rudolf
Contributor

rudolf commented Aug 17, 2021

I think it's actually the inbound references to the quarantined object that would break. When a type is unknown, we don't know whether we should regenerate its IDs (it might be a single-namespace type, which doesn't require regeneration), so inbound references are left intact. When this type later becomes known, we might regenerate its ID, but we wouldn't regenerate the inbound references.
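A small sketch of this failure mode: a hypothetical dashboard references an object of a previously unknown type; when that type becomes known and the object's ID is regenerated, the inbound reference is not rewritten and now dangles. The shapes, IDs, and the `findDanglingReferences` helper are illustrative assumptions, not Kibana code.

```typescript
// Hypothetical, simplified saved object shapes.
interface Reference {
  type: string;
  id: string;
  name: string;
}
interface SavedObjectDoc {
  id: string;
  type: string;
  references: Reference[];
}

// List every reference whose target (type, id) does not exist in the index.
function findDanglingReferences(docs: SavedObjectDoc[]): string[] {
  const existing = new Set(docs.map((d) => `${d.type}:${d.id}`));
  const dangling: string[] = [];
  for (const doc of docs) {
    for (const ref of doc.references) {
      const target = `${ref.type}:${ref.id}`;
      if (!existing.has(target)) {
        dangling.push(`${doc.type}:${doc.id} -> ${target}`);
      }
    }
  }
  return dangling;
}

// While "my-plugin-widget" was unknown, w1 and the reference to it were left untouched.
const before: SavedObjectDoc[] = [
  { id: "d1", type: "dashboard", references: [{ type: "my-plugin-widget", id: "w1", name: "panel_0" }] },
  { id: "w1", type: "my-plugin-widget", references: [] },
];

// Later the type becomes known and w1 gets a new ID, but the dashboard's reference is not updated.
const after: SavedObjectDoc[] = before.map((d) =>
  d.type === "my-plugin-widget" ? { ...d, id: "w1-regenerated" } : d
);

console.log(findDanglingReferences(before)); // []
console.log(findDanglingReferences(after)); // [ 'dashboard:d1 -> my-plugin-widget:w1' ]
```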

> As long as SO aliases are around and used to resolve references, then maybe this could actually still work?

For performance reasons, we don't want to resolve IDs on every operation (e.g. an update or get), so resolve only happens in plugin code where that plugin expects to be handling user input (e.g. a URL containing an ID).
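The trade-off can be sketched with a hypothetical in-memory store: ordinary reads do a single exact lookup, while a separate `resolve` path, used only where user input such as an ID from a URL is handled, also consults a table of aliases recorded when IDs were regenerated. The class, method names, and outcome strings are illustrative assumptions, loosely modeled on the idea of legacy-ID aliases rather than on Kibana's actual client API.

```typescript
// Hypothetical, simplified saved object document.
interface SavedObjectDoc {
  id: string;
  type: string;
}

type ResolveOutcome = "exactMatch" | "aliasMatch" | "notFound";

class InMemoryStore {
  private docs = new Map<string, SavedObjectDoc>();
  // "type:oldId" -> newId, recorded when an object's ID is regenerated.
  private aliases = new Map<string, string>();

  put(doc: SavedObjectDoc): void {
    this.docs.set(`${doc.type}:${doc.id}`, doc);
  }

  addAlias(type: string, oldId: string, newId: string): void {
    this.aliases.set(`${type}:${oldId}`, newId);
  }

  // Cheap path for ordinary operations: exact lookup only, aliases are never consulted.
  get(type: string, id: string): SavedObjectDoc | undefined {
    return this.docs.get(`${type}:${id}`);
  }

  // Extra lookup, reserved for places that handle user input.
  resolve(type: string, id: string): { outcome: ResolveOutcome; doc?: SavedObjectDoc } {
    const exact = this.get(type, id);
    if (exact) return { outcome: "exactMatch", doc: exact };
    const newId = this.aliases.get(`${type}:${id}`);
    const aliased = newId !== undefined ? this.get(type, newId) : undefined;
    return aliased ? { outcome: "aliasMatch", doc: aliased } : { outcome: "notFound" };
  }
}

const store = new InMemoryStore();
store.put({ id: "new-id", type: "dashboard" });
store.addAlias("dashboard", "old-id", "new-id");
console.log(store.get("dashboard", "old-id")); // undefined
console.log(store.resolve("dashboard", "old-id").outcome); // aliasMatch
```

In this sketch, an inbound reference still stored as `old-id` and followed through `get` misses; only code paths that call `resolve` benefit from the alias.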

I think we have a few options that we could explore to solve this problem, but all of them mean changes to a complex system that's quite hard to reason about and change. So perhaps we first need to establish whether this is a high enough priority from a product perspective?

@rudolf
Contributor

rudolf commented Nov 5, 2021

We have reached the following decision, to be implemented in 8.0:

Doing nothing in 8.0 is problematically lenient: if we don't implement a way to handle these cases, we risk causing data loss or corruption for users. Rather than allowing this, we'd prefer to fail fast with a clear message outlining the problem.
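A minimal sketch of such a fail-fast check, assuming it runs over the index's documents before any migration begins. `assertNoUnknownTypes` and the error text are hypothetical; the actual check and message in Kibana may differ.

```typescript
// Hypothetical, simplified saved object document.
interface SavedObjectDoc {
  id: string;
  type: string;
}

// Throw before migrating if any document's type is not registered by an enabled plugin.
function assertNoUnknownTypes(docs: SavedObjectDoc[], knownTypes: ReadonlySet<string>): void {
  const counts = new Map<string, number>();
  for (const doc of docs) {
    if (!knownTypes.has(doc.type)) {
      counts.set(doc.type, (counts.get(doc.type) ?? 0) + 1);
    }
  }
  if (counts.size > 0) {
    // Summarize each unknown type with its document count, e.g. "my-plugin-widget (2 documents)".
    const summary = Array.from(counts.entries())
      .map(([type, count]) => `${type} (${count} document${count === 1 ? "" : "s"})`)
      .join(", ");
    throw new Error(
      `Cannot upgrade: the Kibana index contains saved objects of unknown types: ${summary}. ` +
        "Enable the plugins that register these types and try again."
    );
  }
}

try {
  assertNoUnknownTypes(
    [
      { id: "1", type: "dashboard" },
      { id: "2", type: "my-plugin-widget" },
      { id: "3", type: "my-plugin-widget" },
    ],
    new Set(["dashboard"])
  );
} catch (err) {
  console.log((err as Error).message);
}
```

In this sketch the check runs before any document is written, so a failed upgrade leaves the data as it was and the user can re-enable the missing plugin and retry.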

@rudolf rudolf changed the title Handling unknown saved object types in 8.0 Refuse to start with unknown saved object types in 8.0 Nov 5, 2021
@exalate-issue-sync exalate-issue-sync bot added impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:small Small Level of Effort impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. and removed impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. labels Nov 9, 2021
@pgayvallet
Contributor

FWIW, this is what we were doing before #105213, so implementing this should in theory just be a revert of the linked PR.


5 participants