fix(k8s): automatic retry for failed API requests #2386
Conversation
Note: I didn't use …
I think @edvald is best suited to review this one. I'll take a look but un-assign myself.
Couple of small comments there, otherwise seems solid!
core/src/plugins/kubernetes/api.ts (Outdated)
```ts
} catch (err) {
  if (shouldRetry(err) && usedRetries < maxRetries) {
    const sleepMsec = minTimeoutMs + usedRetries * minTimeoutMs
    if (log) {
```
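To put the snippet above in context, here is a minimal, self-contained sketch of a retry wrapper of this shape. The `shouldRetry` predicate, `minTimeoutMs`, `maxRetries`, and `requestWithRetry` signature are assumptions reconstructed from the fragments in this conversation, not the actual implementation:

```typescript
// Hypothetical sketch of the retry wrapper discussed in this PR.
// Values and names are assumptions based on the diff fragments above.
const minTimeoutMs = 500
const maxRetries = 5

function shouldRetry(err: any): boolean {
  // Placeholder predicate; the real logic checks status codes and messages.
  return /getaddrinfo ENOTFOUND|getaddrinfo EAI_AGAIN/.test(String(err && err.message))
}

async function requestWithRetry<T>(description: string, req: () => Promise<T>): Promise<T> {
  let usedRetries = 0
  while (true) {
    try {
      return await req()
    } catch (err) {
      if (shouldRetry(err) && usedRetries < maxRetries) {
        // Linear backoff, as in the snippet above.
        const sleepMsec = minTimeoutMs + usedRetries * minTimeoutMs
        console.warn(`Kubernetes API: ${description} failed, retrying in ${sleepMsec}ms`)
        await new Promise((resolve) => setTimeout(resolve, sleepMsec))
        usedRetries++
      } else {
        throw err
      }
    }
  }
}
```

A call site would wrap each API request, e.g. `requestWithRetry("listNamespace", () => api.listNamespace())`.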
I'd prefer to go directly for the root logger as a fallback. Also, I think this should be a warn log. This is a bit too hidden atm.
I refactored this so that the wrapped API has access to the log entry passed to the `KubeApi.factory` method.
I also added a `description` parameter to the `requestWithRetry` function for including in the warning messages, which should add a bit of clarity to the log output.
core/src/plugins/kubernetes/api.ts (Outdated)
```ts
} else {
  if (usedRetries === maxRetries) {
    if (log) {
      log.debug(`Kubernetes API: Maximum retry count exceeded, throwing error`)
```
"throwing error" is maybe a bit odd for a user-facing message :)
Haha yeah, good point—I've removed that text.
```ts
  httpStatusCodes.SERVICE_UNAVAILABLE,
]

const errorMessageRegexesForRetry = [/getaddrinfo ENOTFOUND/, /getaddrinfo EAI_AGAIN/]
```
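For illustration, a hedged sketch of how a `shouldRetry` predicate might combine the status-code list and the message regexes above. The error shape and the exact set of status codes here are assumptions (the diff only shows the tail of the array):

```typescript
// Hypothetical predicate combining HTTP status codes and error-message
// patterns. The numeric codes stand in for the httpStatusCodes constants.
const statusCodesForRetry: number[] = [
  408, // REQUEST_TIMEOUT
  429, // TOO_MANY_REQUESTS
  500, // INTERNAL_SERVER_ERROR
  503, // SERVICE_UNAVAILABLE
]

const errorMessageRegexesForRetry = [/getaddrinfo ENOTFOUND/, /getaddrinfo EAI_AGAIN/]

function shouldRetry(err: { code?: number; message?: string }): boolean {
  if (err.code !== undefined && statusCodesForRetry.includes(err.code)) {
    return true
  }
  const msg = err.message || ""
  return errorMessageRegexesForRetry.some((regex) => regex.test(msg))
}
```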
Isn't this available on the error object in a more structured form?
It might be, but the `wrapError` helper we're using doesn't include those fields (since they're not part of the `KubernetesError` class).
I thought matching against the error message was neater than adding several optional fields to the `KubernetesError` class, since we'd probably end up adding more and more of those fields going forward as we handle more of these cases.
I haven't reproduced it on my end, but here's an example of these error objects (from here):
```js
{ [Error: getaddrinfo ENOTFOUND host]
  code: 'ENOTFOUND',
  errno: 'ENOTFOUND',
  syscall: 'getaddrinfo',
  hostname: 'host' }
```
That said, I'm totally open to adding more fields to the wrapped error (or maybe putting them all in an `inheritedFields` field with type `any`).
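If the structured fields were preferred over message matching, the check could look roughly like the sketch below. The field names (`code`, `syscall`) come from the example error object above; the function name and the exact set of retryable codes are hypothetical:

```typescript
// Node.js system errors (like DNS lookup failures) expose structured
// `code` and `syscall` fields, as in the example object above.
const dnsErrorCodesForRetry = ["ENOTFOUND", "EAI_AGAIN"]

function shouldRetryBySyscallCode(err: { code?: string; syscall?: string }): boolean {
  return err.syscall === "getaddrinfo" && dnsErrorCodesForRetry.includes(err.code || "")
}
```

The trade-off discussed here is that this only works if the wrapped `KubernetesError` preserves those fields, which `wrapError` currently does not.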
Let's not overthink it, no worries. I take it you tested this to make sure it works?
Yeah, I did—I had the logic throw errors and verified that the logging/retrying worked as expected.
The logging could be a bit nicer, but I think addressing that would require a bit of a refactor.
Here's a brief example snippet (where I hardcoded a `throw` with a `getaddrinfo ENOTFOUND` error, which matches the retry criteria):
```
✔ backend → Syncing module sources (2 files)... → Done (took 0 sec)
⠦ frontend → Getting build status for v-a57012fdb5...
⠼ backend → Getting build status for v-6fbe386fc0...
Kubernetes API: listNamespace failed with error test error: getaddrinfo ENOTFOUND, sleeping for 4000ms and retrying (#1/5)
Kubernetes API: /api/v1 failed with error test error: getaddrinfo ENOTFOUND, sleeping for 4000ms and retrying (#1/5)
Kubernetes API: /api/v1/namespaces/demo-project-testing-ths/secrets/garden-docker-auth-41e2da failed with error test error: getaddrinfo ENOTFOUND, sleeping for 4000ms and retrying (#1/5)
Kubernetes API: patchNamespace failed with error test error: getaddrinfo ENOTFOUND, sleeping for 4000ms and retrying (#1/5)
```
The log lines get updated during retries (so there's only one log line per retry sequence).
Ok cool. I'd suggest "sleeping for 4000ms and retrying" -> "retrying in 4000ms", otherwise looks good.
Also, is 4000ms the value for the first retry? Seems a touch high, no?
> Ok cool. I'd suggest "sleeping for 4000ms and retrying" -> "retrying in 4000ms", otherwise looks good.

Yeah, that's better, will make that change.

> Also, is 4000ms the value for the first retry? Seems a touch high, no?

It's 2000ms for the first retry, actually (`2000 + 2000 * retries`), so the log line above should say "#2/5".
Should we go for `500 + 500 * retries` instead? [edit: Just made that change.]
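To make the revised linear backoff concrete, here is the sleep schedule it produces, assuming `usedRetries` starts at 0 and `maxRetries` is 5 as in the log output above:

```typescript
// Linear backoff: sleepMsec = minTimeoutMs + usedRetries * minTimeoutMs
const minTimeoutMs = 500

const schedule = [0, 1, 2, 3, 4].map((usedRetries) => minTimeoutMs + usedRetries * minTimeoutMs)
console.log(schedule) // [ 500, 1000, 1500, 2000, 2500 ]
```

So the worst case adds 7.5 seconds of sleep across five retries, versus 30 seconds under the original `2000 + 2000 * retries` formula.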
What this PR does / why we need it:
We now automatically retry failed Kubernetes API requests if the reason for the failure matches certain conditions.
For example, timeouts or DNS-related errors will result in retries, but not 404/not found errors (and so forth), which will be thrown without retrying.
We can easily add more error codes and/or conditions to this logic if we discover further error cases that should result in retries.
Which issue(s) this PR fixes:
Fixes #2255.
Should also fix the sporadic API timeouts that some users have been encountering in day-to-day usage.
Special notes for your reviewer:
Do we want to include any more status codes and/or error messages in the `shouldRetry` helper? Are there other kinds of conditions for which we want to retry?
Also, feel free to comment on the general approach here.