
migrations v2: Retry tasks that timeout #95305

Merged
merged 14 commits into elastic:master from migrationsv2-fix-task-timeout on Apr 2, 2021

Conversation

rudolf
Contributor

@rudolf rudolf commented Mar 24, 2021

Summary

Closes #95321

v2 migrations perform reindex and update_by_query operations as background tasks and then wait for these tasks to complete. However, if a task didn't complete within 60s, the migration would fail instead of retrying to give the operation more time. With this change, the migration keeps polling the Elasticsearch tasks API until the task either fails or succeeds.
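The retry-until-terminal behaviour can be sketched as follows. `pollTask` and the response shape are illustrative stand-ins for the Elasticsearch `_tasks/<task_id>?wait_for_completion=true` call, not the actual Kibana migration actions:

```typescript
// Sketch of the behaviour described above: keep polling the tasks API until
// the task reaches a terminal state, rather than giving up after one 60s wait.
// TaskResponse and pollTask are illustrative, not the real Kibana types.
type TaskResponse =
  | { completed: true; failures: string[] }
  | { completed: false };

async function waitForTask(
  pollTask: () => Promise<TaskResponse>,
  pollIntervalMs = 0
): Promise<'succeeded' | 'failed'> {
  for (;;) {
    const res = await pollTask();
    if (res.completed) {
      // Terminal state reached: report failure if the task recorded failures.
      return res.failures.length > 0 ? 'failed' : 'succeeded';
    }
    // Task still running: wait briefly, then ask again instead of failing.
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
  }
}
```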

Release notes

Fixes a bug which would cause saved objects upgrade migrations to fail if there are a large number of saved objects in the .kibana or .kibana_task_manager indices.


@rudolf rudolf added bug Fixes for quality problems that affect the customer experience Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient v7.12.1 labels Mar 24, 2021
@rudolf rudolf requested a review from a team as a code owner March 24, 2021 13:30
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

e.body?.error?.type === 'receive_timeout_transport_exception'
) {
return Either.left({
type: 'retryable_es_client_error' as const,
Contributor

shouldn't we place this logic in catchRetryableEsClientErrors then?

Contributor Author

I started there, but then thought, this isn't a generic retryable error like a 503 or a socket timeout, these error codes are specific to the _tasks API, so I thought it makes sense to include this logic here. WDYT?

Contributor Author

Oh I see it now, the error type name retryable_es_client_error is tied to errors originating from the ES client. I think we should rather have a specific error type so that the signature of waitForTask shows that callers should expect this "your task didn't complete within the timeout" response.

Then the model could handle this and explicitly call delayRetryState. We can change the signature of delayRetryState to accept a maxRetryAttempts parameter. That way we can maybe retry waitForTask steps forever (or maybe just a really large limit equivalent to 24h).

'retryable_es_client_error' could be any generic problem like a cluster behind a proxy being offline, in such a case I don't think we want to keep retrying forever, we'd rather fail relatively fast so there's an error log users can investigate.

But if we're waiting for a task, and we can successfully get the task status from ES and ES is just saying that it's still busy, then everything is working, even if it's slow, so we can keep on waiting for the task to complete even if it takes hours.
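A minimal sketch of the split being described, using the `wait_for_task_completion_timeout` name introduced by this PR; the type shapes and the retry limits here are illustrative, not the exact migrationsv2 types:

```typescript
// Illustrative shapes only; the real migrationsv2 Either/left types differ.
interface WaitForTaskCompletionTimeout {
  type: 'wait_for_task_completion_timeout';
  message: string;
}
interface RetryableEsClientError {
  type: 'retryable_es_client_error';
  message: string;
}
type Left<E> = { tag: 'left'; error: E };

// waitForTask returns a dedicated left type, so callers can see from the
// signature that "your task didn't complete within the timeout" is expected.
const taskTimedOut = (message: string): Left<WaitForTaskCompletionTimeout> => ({
  tag: 'left',
  error: { type: 'wait_for_task_completion_timeout', message },
});

// The model can then retry the two cases with different limits: a generic
// client error fails relatively fast so users get an actionable error log,
// while a task that is still running is retried (near-)indefinitely because
// ES is reachable and healthy, just slow.
function maxRetryAttemptsFor(
  error: WaitForTaskCompletionTimeout | RetryableEsClientError
): number {
  return error.type === 'wait_for_task_completion_timeout'
    ? Number.MAX_SAFE_INTEGER // keep waiting as long as the task is running
    : 15; // illustrative fail-fast limit for generic client errors
}
```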

Contributor Author

we also need to add an integration test for the cloneIndex operation if waitForIndexStatusYellow hits the 30s timeout

Integration test: assert waitForPickupUpdatedMappingsTask waitForReindexTask returns retryable error when task has not completed within the timeout
@rudolf rudolf requested a review from mshustov March 29, 2021 22:20
@kibanamachine
Contributor

⏳ Build in-progress, with failures

Failed CI Steps


To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Member

@Bamieh Bamieh left a comment

does it make sense to make the retry count configurable?
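The retry count did end up configurable (see the "Make v2 migrations retryAttempts configurable" commit in the merge list below). A kibana.yml sketch, assuming the setting lives under the `migrations` config path shown in this PR's diff; the `retryAttempts` default of 15 is an assumption here, only `batchSize: 1000` appears in the diff:

```yaml
# kibana.yml — illustrative values. batchSize: 1000 is the new default from
# this PR; the retryAttempts key and its default are assumptions.
migrations:
  batchSize: 1000
  retryAttempts: 15
```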

@joshdover
Copy link
Contributor

This PR closes #95321, correct?

@rudolf rudolf requested a review from Bamieh March 30, 2021 18:16
@@ -13,12 +13,14 @@ export type SavedObjectsMigrationConfigType = TypeOf<typeof savedObjectsMigratio
export const savedObjectsMigrationConfig = {
path: 'migrations',
schema: schema.object({
batchSize: schema.number({ defaultValue: 100 }),
batchSize: schema.number({ defaultValue: 1000 }),
Contributor Author

I've also increased the default v1 migrations batchSize; in all cases, using 1000-document batches has resulted in faster migrations without any adverse impact. Having this number higher makes it a better fallback in case v2 migrations don't work.

Contributor

Have we tested how much memory impact it has?

Contributor Author

We discussed this offline, but I only tested the Elasticsearch heap with `watch curl -s -X GET "elastic:changeme@localhost:9200/_cat/nodes?h=heap*\&v" -H 'Content-Type: application/json'` and didn't see a difference between 100 and 1000. Kibana isn't responding to any requests at this point, so memory should be stable, but I'm not sure how close we might be to exceeding the default heap.

Member

@Bamieh Bamieh left a comment

LGTM

'0s'
)();

await cloneIndexPromise.then((res) => {
Contributor

optional nit: we don't usually use a chain of promises in Kibana code:

const res = await cloneIndexPromise;
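The suggested rewrite, sketched with a stand-in promise (`cloneIndexPromise` here is a local placeholder, not the real cloneIndex action):

```typescript
// Stand-in for the real cloneIndex action's promise.
const cloneIndexPromise: Promise<{ acknowledged: boolean }> = Promise.resolve({
  acknowledged: true,
});

// Instead of chaining:
//   await cloneIndexPromise.then((res) => { /* ... */ });
// await into a local variable, as the review comment suggests:
async function handleCloneResult(): Promise<boolean> {
  const res = await cloneIndexPromise;
  return res.acknowledged;
}
```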

src/core/server/saved_objects/migrationsv2/model.ts (outdated, resolved)
@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged


@rudolf rudolf added the auto-backport Deprecated - use backport:version if exact versions are needed label Apr 2, 2021
@rudolf rudolf merged commit bffded3 into elastic:master Apr 2, 2021
@rudolf rudolf deleted the migrationsv2-fix-task-timeout branch April 2, 2021 08:24
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Apr 2, 2021
* Retry tasks that timeout with timeout_exception or receive_timeout_transport_exception

* Integration test: assert waitForPickupUpdatedMappingsTask waitForReindexTask returns retryable error when task has not completed within the timeout

* stateActionMachine: remove infinite loop failsafe

* Introduce wait_for_task_completion_timeout and keep on retrying *_WAIT_FOR_TASK steps until the ES task completes

* cloneIndex integration test if clone target exists but does not turn yellow within timeout

* Try to stabilize waitForReindexTask test

* Fix types

* Make v2 migrations retryAttempts configurable

* Improve type safety by narrowing left res types

* Fix test description

* Fix tests
@kibanamachine
Copy link
Contributor

💚 Backport successful

7.12 / #96118

This backport PR will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Apr 2, 2021
mshustov pushed a commit to mshustov/kibana that referenced this pull request Apr 6, 2021
mshustov added a commit that referenced this pull request Apr 6, 2021
Labels
auto-backport Deprecated - use backport:version if exact versions are needed bug Fixes for quality problems that affect the customer experience project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient release_note:fix Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.12.1
Development

Successfully merging this pull request may close these issues.

7.12.0 upgrade migrations fail with timeout_exception or receive_timeout_transport_exception
7 participants