migrations v2: Retry tasks that timeout #95305
Conversation
Retry tasks that timeout with timeout_exception or receive_timeout_transport_exception
Pinging @elastic/kibana-core (Team:Core)
      e.body?.error?.type === 'receive_timeout_transport_exception'
    ) {
      return Either.left({
        type: 'retryable_es_client_error' as const,
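For context, a minimal sketch of the kind of catch handler the fragment above is extending, treating the two _tasks API timeout error types as retryable. The helper name, the fp-ts import, and the response shape are illustrative assumptions, not the exact Kibana source:

import * as Either from 'fp-ts/lib/Either';

interface RetryableEsClientError {
  type: 'retryable_es_client_error';
  message: string;
  error: Error;
}

// Sketch: map _tasks-API-specific timeouts to a retryable left response and
// rethrow anything else so generic error handling can deal with it.
const catchWaitForTaskTimeouts = (e: any): Either.Either<RetryableEsClientError, never> => {
  const errorType = e?.body?.error?.type;
  if (
    errorType === 'timeout_exception' ||
    errorType === 'receive_timeout_transport_exception'
  ) {
    return Either.left({
      type: 'retryable_es_client_error' as const,
      message: e?.body?.error?.reason ?? e.message,
      error: e,
    });
  }
  throw e;
};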
shouldn't we place this logic in catchRetryableEsClientErrors then?
I started there, but then thought: this isn't a generic retryable error like a 503 or a socket timeout; these error codes are specific to the _tasks API, so I thought it makes sense to include this logic here. WDYT?
Oh I see it now: the error type name retryable_es_client_error is tied to errors originating from the ES client. I think we should rather have a specific error type so that the signature of waitForTask shows that callers should expect this "your task didn't complete within the timeout" response. Then the model could handle this and explicitly call delayRetryState. We can change the signature of delayRetryState to accept a maxRetryAttempts parameter. That way we can maybe retry waitForTask steps forever (or maybe just a really large limit equivalent to 24h).
'retryable_es_client_error' could be any generic problem, like a cluster behind a proxy being offline. In such a case I don't think we want to keep retrying forever; we'd rather fail relatively fast so there's an error log users can investigate.
But if we're waiting for a task, and we can successfully get the task status from ES and ES is just saying that it's still busy, then everything is working, even if it's slow, so we can keep on waiting for the task to complete even if it takes hours.
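As a rough illustration of the idea discussed above (a dedicated timeout error type, plus delayRetryState taking a maxRetryAttempts parameter), something along these lines. The names wait_for_task_completion_timeout and retryAttempts come from this PR's commits, but the shapes, backoff, and defaults here are assumptions, not the actual implementation:

// Dedicated error type so waitForTask's signature tells callers to expect a
// "your task didn't complete within the timeout" response.
interface WaitForTaskCompletionTimeout {
  type: 'wait_for_task_completion_timeout';
  message: string;
}

interface RetryableState {
  retryCount: number;
  retryDelay: number; // milliseconds before the next attempt
}

type Next =
  | { kind: 'retry'; state: RetryableState }
  | { kind: 'fatal'; reason: string };

// Sketch of delayRetryState with a maxRetryAttempts parameter: generic client
// errors can pass a small limit and fail fast, while *_WAIT_FOR_TASK steps can
// pass a very large limit so they effectively retry until the ES task completes.
const delayRetryState = (
  state: RetryableState,
  errorMessage: string,
  maxRetryAttempts: number
): Next => {
  if (state.retryCount >= maxRetryAttempts) {
    return {
      kind: 'fatal',
      reason: `Unable to complete the action after ${maxRetryAttempts} attempts: ${errorMessage}`,
    };
  }
  return {
    kind: 'retry',
    state: {
      ...state,
      retryCount: state.retryCount + 1,
      // Exponential backoff between attempts, capped at 64 seconds.
      retryDelay: Math.min(1000 * 2 ** state.retryCount, 64000),
    },
  };
};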
we also need to add an integration test for the cloneIndex operation if waitForIndexStatusYellow hits the 30s timeout
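A minimal sketch of what such an integration test could look like, assuming a cloneIndex action that takes a timeout for the yellow-status wait and resolves to an Either. The import paths, helper, index names, and assertion are illustrative only, not the test added in this PR:

import * as Either from 'fp-ts/lib/Either';
// Hypothetical imports; the real test would use the migration actions and an
// ES client provided by the integration test setup.
import { cloneIndex } from './actions';
import { createTestClient } from './test_helpers';

describe('cloneIndex', () => {
  it('resolves left when the clone target exists but does not turn yellow within the timeout', async () => {
    const client = createTestClient();
    // '0s' forces waitForIndexStatusYellow to hit its timeout immediately.
    const cloneIndexPromise = cloneIndex(
      client,
      'existing_index_with_write_block',
      'clone_target_1',
      '0s'
    )();
    const res = await cloneIndexPromise;
    expect(Either.isLeft(res)).toBe(true);
  });
});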
Integration test: assert waitForPickupUpdatedMappingsTask waitForReindexTask returns retryable error when task has not completed within the timeout
Introduce wait_for_task_completion_timeout and keep on retrying *_WAIT_FOR_TASK steps until the ES task completes
cloneIndex integration test if clone target exists but does not turn yellow within timeout
⏳ Build in-progress, with failures
does it make sense to make the retry count configurable?
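If the retry count were made configurable, a sketch of how a retryAttempts setting could sit alongside the existing migrations config follows; the setting name comes from this PR's commits, but the default value here is an assumption:

import { schema, TypeOf } from '@kbn/config-schema';

// Sketch: add a configurable retry limit next to batchSize; other existing
// options are omitted for brevity and the default value is illustrative.
export const savedObjectsMigrationConfig = {
  path: 'migrations',
  schema: schema.object({
    batchSize: schema.number({ defaultValue: 1000 }),
    retryAttempts: schema.number({ defaultValue: 15 }),
  }),
};

export type SavedObjectsMigrationConfigType = TypeOf<typeof savedObjectsMigrationConfig.schema>;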
This PR closes #95321, correct?
@@ -13,12 +13,14 @@ export type SavedObjectsMigrationConfigType = TypeOf<typeof savedObjectsMigratio
 export const savedObjectsMigrationConfig = {
   path: 'migrations',
   schema: schema.object({
-    batchSize: schema.number({ defaultValue: 100 }),
+    batchSize: schema.number({ defaultValue: 1000 }),
I've also increased the default v1 migrations batchSize. In all cases, using a 1000-document batch has resulted in faster migrations without any adverse impact, so the higher number makes v1 a better fallback in case v2 migrations don't work.
Have we tested how much memory impact it has?
We discussed this offline, but I only tested the Elasticsearch heap with watch curl -s -X GET "elastic:changeme@localhost:9200/_cat/nodes?h=heap*\&v" -H 'Content-Type: application/json' and didn't see a difference between 100 / 1000. Kibana isn't responding to any requests at this point, so memory should be stable, but I'm not sure how close we might be to exceeding the default heap.
LGTM
  '0s'
)();

await cloneIndexPromise.then((res) => {
optional nit: we don't usually use a chain of promises in Kibana code:
const res = await cloneIndexPromise;
💚 Build Succeeded
* Retry tasks that timeout with timeout_exception or receive_timeout_transport_exception
* Integration test: assert waitForPickupUpdatedMappingsTask waitForReindexTask returns retryable error when task has not completed within the timeout
* stateActionMachine: remove infinite loop failsafe
* Introduce wait_for_task_completion_timeout and keep on retrying *_WAIT_FOR_TASK steps until the ES task completes
* cloneIndex integration test if clone target exists but does not turn yellow within timeout
* Try to stabilize waitForReindexTask test
* Fix types
* Make v2 migrations retryAttempts configurable
* Improve type safety by narrowing left res types
* Fix test description
* Fix tests

Co-authored-by: Rudolf Meijering <skaapgif@gmail.com>
Summary
Closes #95321
v2 migrations perform reindex and update_by_query operations as background tasks and then wait for these tasks to complete. However, if a task didn't complete within 60s, the migration would fail instead of retrying to give the operation another 60s to complete. With this change, the migration keeps polling the Elasticsearch tasks API until the task fails or succeeds.
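In rough outline, the polling amounts to something like the following sketch. This is an illustration of the behaviour described above, not the actual migration state-machine code; the function name and error classification are assumptions:

import { Client } from '@elastic/elasticsearch';

// Sketch: keep asking the tasks API, waiting up to 60s per request, until the
// task reports completion or a real (non-timeout) error occurs.
async function waitForTaskToComplete(client: Client, taskId: string): Promise<void> {
  while (true) {
    try {
      const { body } = await client.tasks.get({
        task_id: taskId,
        wait_for_completion: true,
        timeout: '60s',
      });
      if (body.completed) {
        if (body.error) {
          throw new Error(`Task ${taskId} failed: ${JSON.stringify(body.error)}`);
        }
        return; // task succeeded
      }
    } catch (e: any) {
      const errorType = e?.body?.error?.type;
      const stillRunning =
        errorType === 'timeout_exception' ||
        errorType === 'receive_timeout_transport_exception';
      if (!stillRunning) {
        throw e; // genuine failure, surface it
      }
      // Otherwise the task simply hasn't finished yet; loop and wait again.
    }
  }
}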
Release notes
Fixes a bug which would cause saved objects upgrade migrations to fail if there are a large number of saved objects in the .kibana or .kibana_task_manager indices.
Checklist
Delete any items that are not applicable to this PR.
For maintainers