
migrations v2: Retry tasks that timeout #95305

Merged
merged 14 commits into elastic:master from migrationsv2-fix-task-timeout on Apr 2, 2021

Conversation

rudolf
Contributor

@rudolf rudolf commented Mar 24, 2021

Summary

Closes #95321

v2 migrations perform reindex and update_by_query operations as background tasks and then wait for these tasks to complete. However, if a task didn't complete within 60s, the migration would fail instead of retrying to give the operation more time. With this change, the migration keeps polling the Elasticsearch tasks API until the task either fails or succeeds.
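The retry-until-terminal behaviour can be sketched as follows. `pollTask` and the response shape are illustrative stand-ins for the Elasticsearch `_tasks/<task_id>?wait_for_completion=true` call, not the actual Kibana migration actions:

```typescript
// Sketch of the behaviour described above: keep polling the tasks API until
// the task reaches a terminal state, rather than giving up after one 60s wait.
// TaskResponse and pollTask are illustrative, not the real Kibana types.
type TaskResponse =
  | { completed: true; failures: string[] }
  | { completed: false };

async function waitForTask(
  pollTask: () => Promise<TaskResponse>,
  pollIntervalMs = 0
): Promise<'succeeded' | 'failed'> {
  for (;;) {
    const res = await pollTask();
    if (res.completed) {
      // Terminal state reached: report failure if the task recorded failures.
      return res.failures.length > 0 ? 'failed' : 'succeeded';
    }
    // Task still running: wait briefly, then ask again instead of failing.
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
  }
}
```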

Release notes

Fixes a bug which would cause saved objects upgrade migrations to fail if there are a large number of saved objects in the .kibana or .kibana_task_manager indices.


@rudolf rudolf added bug Fixes for quality problems that affect the customer experience Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient v7.12.1 labels Mar 24, 2021
@rudolf rudolf requested a review from a team as a code owner March 24, 2021 13:30
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

e.body?.error?.type === 'receive_timeout_transport_exception'
) {
return Either.left({
type: 'retryable_es_client_error' as const,
Contributor

shouldn't we place this logic in catchRetryableEsClientErrors then?

Contributor Author

I started there, but then thought, this isn't a generic retryable error like a 503 or a socket timeout, these error codes are specific to the _tasks API, so I thought it makes sense to include this logic here. WDYT?

Contributor Author

Oh I see it now, the error type name retryable_es_client_error is tied to errors originating from the ES client. I think we should rather have a specific error type so that the signature of waitForTask shows that callers should expect this "your task didn't complete within the timeout" response.

Then the model could handle this and explicitly call delayRetryState. We can change the signature of delayRetryState to accept a maxRetryAttempts parameter. That way we can maybe retry waitForTask steps forever (or maybe just a really large limit equivalent to 24h).

'retryable_es_client_error' could be any generic problem like a cluster behind a proxy being offline, in such a case I don't think we want to keep retrying forever, we'd rather fail relatively fast so there's an error log users can investigate.

But if we're waiting for a task, and we can successfully get the task status from ES and ES is just saying that it's still busy, then everything is working, even if it's slow, so we can keep on waiting for the task to complete even if it takes hours.
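A minimal sketch of the split being described, using the `wait_for_task_completion_timeout` name introduced by this PR; the type shapes and the retry limits here are illustrative, not the exact migrationsv2 types:

```typescript
// Illustrative shapes only; the real migrationsv2 Either/left types differ.
interface WaitForTaskCompletionTimeout {
  type: 'wait_for_task_completion_timeout';
  message: string;
}
interface RetryableEsClientError {
  type: 'retryable_es_client_error';
  message: string;
}
type Left<E> = { tag: 'left'; error: E };

// waitForTask returns a dedicated left type, so callers can see from the
// signature that "your task didn't complete within the timeout" is expected.
const taskTimedOut = (message: string): Left<WaitForTaskCompletionTimeout> => ({
  tag: 'left',
  error: { type: 'wait_for_task_completion_timeout', message },
});

// The model can then retry the two cases with different limits: a generic
// client error fails relatively fast so users get an actionable error log,
// while a task that is still running is retried (near-)indefinitely because
// ES is reachable and healthy, just slow.
function maxRetryAttemptsFor(
  error: WaitForTaskCompletionTimeout | RetryableEsClientError
): number {
  return error.type === 'wait_for_task_completion_timeout'
    ? Number.MAX_SAFE_INTEGER // keep waiting as long as the task is running
    : 15; // illustrative fail-fast limit for generic client errors
}
```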

Contributor Author

we also need to add an integration test for the cloneIndex operation if waitForIndexStatusYellow hits the 30s timeout

Integration test: assert waitForPickupUpdatedMappingsTask waitForReindexTask returns retryable error when task has not completed within the timeout
@rudolf rudolf requested a review from mshustov March 29, 2021 22:20
@kibanamachine
Contributor

⏳ Build in-progress, with failures

Failed CI Steps


To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Member

@Bamieh Bamieh left a comment

does it make sense to make the retry count configurable?
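The retry count did end up configurable (see the "Make v2 migrations retryAttempts configurable" commit in the merge list below). A kibana.yml sketch, assuming the setting lives under the `migrations` config path shown in this PR's diff; the `retryAttempts` default of 15 is an assumption here, only `batchSize: 1000` appears in the diff:

```yaml
# kibana.yml — illustrative values. batchSize: 1000 is the new default from
# this PR; the retryAttempts key and its default are assumptions.
migrations:
  batchSize: 1000
  retryAttempts: 15
```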

@joshdover
Copy link
Contributor

This PR closes #95321, correct?

@rudolf rudolf requested a review from Bamieh March 30, 2021 18:16
@@ -13,12 +13,14 @@ export type SavedObjectsMigrationConfigType = TypeOf<typeof savedObjectsMigratio
export const savedObjectsMigrationConfig = {
path: 'migrations',
schema: schema.object({
batchSize: schema.number({ defaultValue: 100 }),
batchSize: schema.number({ defaultValue: 1000 }),
Contributor Author

I've also increased the default v1 migrations batchSize; in all cases, using 1000-document batches has resulted in faster migrations without any adverse impact. Having this number higher makes it a better fallback in case v2 migrations don't work.

Contributor

Have we tested how much memory impact it has?

Contributor Author

We discussed this offline, but I only tested the Elasticsearch heap with `watch curl -s -X GET "elastic:changeme@localhost:9200/_cat/nodes?h=heap*\&v" -H 'Content-Type: application/json'` and didn't see a difference between 100 and 1000. Kibana isn't responding to any requests at this point, so memory should be stable, but I'm not sure how close we might be to exceeding the default heap.

Member

@Bamieh Bamieh left a comment

LGTM

'0s'
)();

await cloneIndexPromise.then((res) => {
Contributor

optional nit: we don't usually use a chain of promises in Kibana code:

const res = await cloneIndexPromise;
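The suggested rewrite, sketched with a stand-in promise (`cloneIndexPromise` here is a local placeholder, not the real cloneIndex action):

```typescript
// Stand-in for the real cloneIndex action's promise.
const cloneIndexPromise: Promise<{ acknowledged: boolean }> = Promise.resolve({
  acknowledged: true,
});

// Instead of chaining:
//   await cloneIndexPromise.then((res) => { /* ... */ });
// await into a local variable, as the review comment suggests:
async function handleCloneResult(): Promise<boolean> {
  const res = await cloneIndexPromise;
  return res.acknowledged;
}
```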

src/core/server/saved_objects/migrationsv2/model.ts (outdated, resolved)
@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged


@rudolf rudolf added the auto-backport Deprecated - use backport:version if exact versions are needed label Apr 2, 2021
@rudolf rudolf merged commit bffded3 into elastic:master Apr 2, 2021
@rudolf rudolf deleted the migrationsv2-fix-task-timeout branch April 2, 2021 08:24
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Apr 2, 2021
* Retry tasks that timeout with timeout_exception or receive_timeout_transport_exception

* Integration test: assert waitForPickupUpdatedMappingsTask waitForReindexTask returns retryable error when task has not completed within the timeout

* stateActionMachine: remove infinite loop failsafe

* Introduce wait_for_task_completion_timeout and keep on retrying *_WAIT_FOR_TASK steps until the ES task completes

* cloneIndex integration test if clone target exists but does not turn yellow within timeout

* Try to stabilize waitForReindexTask test

* Fix types

* Make v2 migrations retryAttempts configurable

* Improve type safety by narrowing left res types

* Fix test description

* Fix tests
@kibanamachine
Copy link
Contributor

💚 Backport successful

7.12 / #96118

This backport PR will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Apr 2, 2021
mshustov pushed a commit to mshustov/kibana that referenced this pull request Apr 6, 2021
mshustov added a commit that referenced this pull request Apr 6, 2021
Labels
auto-backport Deprecated - use backport:version if exact versions are needed bug Fixes for quality problems that affect the customer experience project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient release_note:fix Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.12.1
Development

Successfully merging this pull request may close these issues.

7.12.0 upgrade migrations fail with timeout_exception or receive_timeout_transport_exception
7 participants