Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

migrations v2: Retry tasks that timeout #95305

Merged
merged 14 commits into from
Apr 2, 2021
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions src/core/server/saved_objects/migrationsv2/actions/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -319,6 +319,20 @@ const waitForTask = (
description: body.task.description,
});
})
.catch((e) => {
rudolf marked this conversation as resolved.
Show resolved Hide resolved
if (
e.body?.error?.type === 'timeout_exception' ||
e.body?.error?.type === 'receive_timeout_transport_exception'
) {
return Either.left({
type: 'retryable_es_client_error' as const,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we place this logic in catchRetryableEsClientErrors then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started there, but then thought, this isn't a generic retryable error like a 503 or a socket timeout, these error codes are specific to the _tasks API, so I thought it makes sense to include this logic here. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see it now, the error type name retryable_es_client_error is tied to errors originating from the ES client. I think we should rather have a specific error type so that the signature of waitForTask shows that callers should expect this "your task didn't complete within the timeout" response.

Then the model could handle this and explicitly call delayRetryState. We can change the signature of delayRetryState to accept a maxRetryAttempts parameter. That way we can maybe retry waitForTask steps forever (or maybe just a really large limit equivalent to 24h).

'retryable_es_client_error' could be any generic problem like a cluster behind a proxy being offline, in such a case I don't think we want to keep retrying forever, we'd rather fail relatively fast so there's an error log users can investigate.

But if we're waiting for a task, and we can successfully get the task status from ES and ES is just saying that it's still busy, then everything is working, even if it's slow, so we can keep on waiting for the task to complete even if it takes hours.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also need to add an integration test for the cloneIndex operation if waitForIndexStatusYellow hits the 30s timeout

message: `[${e.body.error.type}] ${e.body.error.reason}`,
error: e,
});
} else {
throw e;
}
})
.catch(catchRetryableEsClientErrors);
};

Expand Down