migrations v2: Retry tasks that timeout #95305
Shouldn't we place this logic in `catchRetryableEsClientErrors` then?
I started there, but then I thought: this isn't a generic retryable error like a 503 or a socket timeout. These error codes are specific to the `_tasks` API, so it makes sense to include the logic here. WDYT?
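For illustration, a minimal sketch of what handling those `_tasks`-specific codes inside `waitForTask` could look like; the listed error types and the helper name are assumptions, not necessarily the PR's actual code:

```ts
// Sketch only: example timeout errors the `_tasks` API can report back,
// checked locally rather than in the generic catchRetryableEsClientErrors.
const TASKS_API_TIMEOUT_TYPES = [
  'timeout_exception', // task did not complete within wait_for_completion
  'receive_timeout_transport_exception', // fetching the task status timed out
];

const isTasksApiTimeout = (error: {
  body?: { error?: { type?: string } };
}): boolean => TASKS_API_TIMEOUT_TYPES.includes(error.body?.error?.type ?? '');
```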
Oh, I see it now: the error type name `retryable_es_client_error` is tied to errors originating from the ES client. I think we should instead have a specific error type, so that the signature of `waitForTask` shows callers should expect this "your task didn't complete within the timeout" response. The model could then handle it and explicitly call `delayRetryState`. We can change the signature of `delayRetryState` to accept a `maxRetryAttempts` parameter; that way we could retry `waitForTask` steps forever (or with a really large limit, equivalent to 24h).

A `retryable_es_client_error` could be any generic problem, like a cluster behind a proxy being offline. In such a case we don't want to keep retrying forever; we'd rather fail relatively fast so there's an error log users can investigate. But if we're waiting for a task and we can successfully get its status from ES, and ES is just saying it's still busy, then everything is working, even if it's slow, so we can keep waiting for the task to complete even if it takes hours.
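A rough sketch of that suggestion, assuming a tagged error type and a plain-object state model; all names and the backoff numbers are illustrative:

```ts
// A dedicated error type makes the timeout case visible in waitForTask's
// signature instead of hiding behind the generic client error.
interface WaitForTaskCompletionTimeout {
  readonly type: 'wait_for_task_completion_timeout';
  readonly message: string;
}

interface RetryableEsClientError {
  readonly type: 'retryable_es_client_error';
  readonly message: string;
}

// waitForTask would then advertise both cases to callers, e.g.
// () => Promise<Either<WaitForTaskError, WaitForTaskResponse>>.
type WaitForTaskError = RetryableEsClientError | WaitForTaskCompletionTimeout;

interface RetryableState {
  readonly controlState: string;
  readonly retryCount: number;
  readonly retryDelay: number;
}

// delayRetryState takes an explicit maxRetryAttempts so the model can pass a
// small limit for generic client errors and a very large one (e.g. the
// equivalent of ~24h) when ES is reachable but the task is still running.
const delayRetryState = <S extends RetryableState>(
  state: S,
  errorMessage: string,
  maxRetryAttempts: number
): S | { controlState: 'FATAL'; reason: string } =>
  state.retryCount >= maxRetryAttempts
    ? {
        controlState: 'FATAL',
        reason: `Unable to complete the action after ${maxRetryAttempts} attempts: ${errorMessage}`,
      }
    : {
        ...state,
        retryCount: state.retryCount + 1,
        // exponential backoff, capped at 64 seconds
        retryDelay: Math.min(1000 * 2 ** state.retryCount, 64_000),
      };
```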
We also need to add an integration test for the `cloneIndex` operation if `waitForIndexStatusYellow` hits the 30s timeout.
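A Jest-style sketch of such a test, assuming `cloneIndex` is an fp-ts style action invoked as a thunk and accepts a timeout argument; the client setup, import path, and signature are assumptions, not the real test:

```ts
import * as Either from 'fp-ts/lib/Either';
import type { Client } from '@elastic/elasticsearch';
// Assumed import path for the action under test.
import { cloneIndex } from './actions';

declare const client: Client; // assumed: provided by the integration test harness

describe('cloneIndex', () => {
  it('resolves left instead of hanging when waitForIndexStatusYellow times out', async () => {
    // Assumption: the target cannot reach yellow status, e.g. because an
    // impossible allocation filter was set on the source index (a clone
    // inherits the source's settings), so the status wait hits its timeout.
    const res = await cloneIndex(client, 'source_index', 'clone_target', '0s')();

    expect(Either.isLeft(res)).toBe(true);
  });
});
```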