
fix(k8s): automatic retry for failed API requests #2386

Merged: 1 commit into master from k8s-api-retry, May 17, 2021

Conversation

@thsig (Collaborator) commented May 12, 2021

What this PR does / why we need it:

We now automatically retry failed Kubernetes API requests if the reason for the failure matches certain conditions.

For example, timeouts or DNS-related errors will result in retries, but not 404/not found errors (and so forth), which will be thrown without retrying.

We can easily add more error codes and/or conditions to this logic if we discover further error cases that should result in retries.
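
To give a rough idea of the approach, here is a minimal sketch (not the exact code in this PR; the helper names mirror those discussed below, but the status-code list beyond 503 and the use of `console.warn` in place of Garden's log entry are assumptions):

```ts
// Minimal sketch of the retry-on-transient-failure approach described above.
// Only SERVICE_UNAVAILABLE (503) is confirmed by the diff; the other codes are illustrative.
const statusCodesForRetry = [408, 500, 502, 503, 504]
const errorMessageRegexesForRetry = [/getaddrinfo ENOTFOUND/, /getaddrinfo EAI_AGAIN/]

function shouldRetry(err: any): boolean {
  const message: string = err.message || ""
  return (
    statusCodesForRetry.includes(err.statusCode) ||
    errorMessageRegexesForRetry.some((regex) => regex.test(message))
  )
}

async function requestWithRetry<T>(
  description: string,
  req: () => Promise<T>,
  maxRetries = 5,
  minTimeoutMs = 500
): Promise<T> {
  let usedRetries = 0
  while (true) {
    try {
      return await req()
    } catch (err: any) {
      if (shouldRetry(err) && usedRetries < maxRetries) {
        // Linear backoff, matching the formula shape in the diff below.
        const sleepMsec = minTimeoutMs + usedRetries * minTimeoutMs
        console.warn(
          `Kubernetes API: ${description} failed with error ${err.message}, retrying in ${sleepMsec}ms (#${usedRetries + 1}/${maxRetries})`
        )
        await new Promise((resolve) => setTimeout(resolve, sleepMsec))
        usedRetries++
      } else {
        // Non-retryable errors (e.g. 404) and exhausted retry budgets are thrown as before.
        throw err
      }
    }
  }
}
```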

Which issue(s) this PR fixes:

Fixes #2255.

Should also fix the sporadic API timeouts that some users have been encountering in day-to-day usage.

Special notes for your reviewer:

Do we want to include any more status codes and/or error messages in the shouldRetry helper? Are there other kinds of conditions for which we want to retry?

Also, feel free to comment on the general approach here.

@thsig (Collaborator Author) commented May 12, 2021

Note: I didn't use p-retry here, since it doesn't play nice with our custom error classes (which include extra fields in addition to the message).
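
For context, "custom error classes with extra fields" means something along these lines (an illustrative sketch only, not the actual GardenError/KubernetesError definition):

```ts
// Sketch of an error class carrying extra fields beyond `message`
// (field names here are assumptions).
class KubernetesError extends Error {
  constructor(
    message: string,
    public readonly statusCode?: number,
    public readonly detail?: Record<string, unknown>
  ) {
    super(message)
    this.name = "KubernetesError"
  }
}
```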

@eysi09 (Collaborator) commented May 13, 2021

I think @edvald is best suited to review this one. I'll take a look but un-assign myself.

@eysi09 removed their assignment May 13, 2021

@edvald (Collaborator) left a comment


Couple of small comments there, otherwise seems solid!

```ts
} catch (err) {
  if (shouldRetry(err) && usedRetries < maxRetries) {
    const sleepMsec = minTimeoutMs + usedRetries * minTimeoutMs
    if (log) {
```
Collaborator

I'd prefer to go directly for the root logger as a fallback. Also, I think this should be a warn log. This is a bit too hidden atm.

Collaborator Author

I refactored this so that the wrapped API has access to the log entry passed to the KubeApi.factory method.

I also added a description parameter to the requestWithRetry function that gets included in the warning messages, which should add a bit of clarity to the log output.

```ts
} else {
  if (usedRetries === maxRetries) {
    if (log) {
      log.debug(`Kubernetes API: Maximum retry count exceeded, throwing error`)
```
Collaborator

"throwing error" is maybe a bit odd for a user-facing message :)

Collaborator Author

Haha yeah, good point—I've removed that text.

```ts
  httpStatusCodes.SERVICE_UNAVAILABLE,
]

const errorMessageRegexesForRetry = [/getaddrinfo ENOTFOUND/, /getaddrinfo EAI_AGAIN/]
```
Collaborator

Isn't this available on the error object in a more structured form?

Collaborator Author

It might be, but the wrapError helper we're using doesn't include those fields (since they're not part of the KubernetesError class).

I thought matching against the error message was neater than adding several optional fields to the KubernetesError class, since we'd probably end up adding more and more of those fields going forward as we handle more of these cases.

I haven't reproduced it on my end, but here's an example of these error objects (from here):

```
{ [Error: getaddrinfo ENOTFOUND host]
  code: 'ENOTFOUND',
  errno: 'ENOTFOUND',
  syscall: 'getaddrinfo',
  hostname: 'host' }
```

That said, I'm totally open to adding more fields to the wrapped error (or maybe putting them all in an inheritedFields field with type any).
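
If the wrapped error did preserve those structured fields, the check under discussion might look something like this (a sketch under that assumption; per the comment above, the current wrapError helper does not carry these fields through):

```ts
// Sketch: retrying based on the structured `code` field instead of matching the message.
// Assumes the wrapped error still carries the original Node.js fields (code, errno, syscall).
const errorCodesForRetry = ["ENOTFOUND", "EAI_AGAIN"]

function shouldRetryByCode(err: { code?: string }): boolean {
  return !!err.code && errorCodesForRetry.includes(err.code)
}
```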

Collaborator

Let's not overthink it, no worries. I take it you tested this to make sure it works?

@thsig (Collaborator Author) commented May 17, 2021

Yeah, I did—I had the logic throw errors and verified that the logging/retrying worked as expected.

The logging could be a bit nicer, but I think addressing that would require a bit of a refactor.

Here's a brief example snippet (where I hardcoded a throw with a getaddrinfo ENOTFOUND error, which matches the retry criteria):

```
✔ backend                   → Syncing module sources (2 files)... → Done (took 0 sec)
⠦ frontend                  → Getting build status for v-a57012fdb5...
⠼ backend                   → Getting build status for v-6fbe386fc0...
Kubernetes API: listNamespace failed with error test error: getaddrinfo ENOTFOUND, sleeping for 4000ms and retrying (#1/5)
Kubernetes API: /api/v1 failed with error test error: getaddrinfo ENOTFOUND, sleeping for 4000ms and retrying (#1/5)
Kubernetes API: /api/v1/namespaces/demo-project-testing-ths/secrets/garden-docker-auth-41e2da failed with error test error: getaddrinfo ENOTFOUND, sleeping for 4000ms and retrying (#1/5)
Kubernetes API: patchNamespace failed with error test error: getaddrinfo ENOTFOUND, sleeping for 4000ms and retrying (#1/5)
```

The log lines get updated during retries (so there's only one log line per retry sequence).

Collaborator

Ok cool. I'd suggest "sleeping for 4000ms and retrying" -> "retrying in 4000ms", otherwise looks good.

Also, is 4000ms the value for the first retry? Seems a touch high, no?

@thsig (Collaborator Author) commented May 17, 2021

> Ok cool. I'd suggest "sleeping for 4000ms and retrying" -> "retrying in 4000ms", otherwise looks good.

Yeah, that's better—will make that change.

> Also, is 4000ms the value for the first retry? Seems a touch high, no?

It's 2000ms for the first retry, actually (2000 + 2000 * retries)—this should say "#2/5".

Should we go for 500 + 500 * retries instead? [edit: Just made that change.]
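
For reference, assuming the same formula shape, 500 + 500 * retries gives delays of 500ms, 1000ms, 1500ms, 2000ms and 2500ms across the five retries (about 7.5s of added wait in total, versus 30s with the 2000ms base).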

@thsig merged commit 72165da into master May 17, 2021
@thsig deleted the k8s-api-retry branch May 17, 2021 16:45

Successfully merging this pull request may close these issues.

Intermittent failure of DNS resolution when running garden against domain-based k8s
3 participants