
fix(k8s): automatic retry for failed API requests #2386

Merged: 1 commit into master from k8s-api-retry, May 17, 2021

Conversation

@thsig (Collaborator) commented May 12, 2021

What this PR does / why we need it:

We now automatically retry failed Kubernetes API requests if the reason for the failure matches certain conditions.

For example, timeouts or DNS-related errors will result in retries, but not 404/not found errors (and so forth), which will be thrown without retrying.

We can easily add more error codes and/or conditions to this logic if we discover further error cases that should result in retries.
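
To give a rough idea of the approach, here is a minimal sketch (not the exact code in this PR; the helper names mirror those discussed below, but the status-code list beyond 503 and the use of `console.warn` in place of Garden's log entry are assumptions):

```ts
// Minimal sketch of the retry-on-transient-failure approach described above.
// Only SERVICE_UNAVAILABLE (503) is confirmed by the diff; the other codes are illustrative.
const statusCodesForRetry = [408, 500, 502, 503, 504]
const errorMessageRegexesForRetry = [/getaddrinfo ENOTFOUND/, /getaddrinfo EAI_AGAIN/]

function shouldRetry(err: any): boolean {
  const message: string = err.message || ""
  return (
    statusCodesForRetry.includes(err.statusCode) ||
    errorMessageRegexesForRetry.some((regex) => regex.test(message))
  )
}

async function requestWithRetry<T>(
  description: string,
  req: () => Promise<T>,
  maxRetries = 5,
  minTimeoutMs = 500
): Promise<T> {
  let usedRetries = 0
  while (true) {
    try {
      return await req()
    } catch (err: any) {
      if (shouldRetry(err) && usedRetries < maxRetries) {
        // Linear backoff, matching the formula shape in the diff below.
        const sleepMsec = minTimeoutMs + usedRetries * minTimeoutMs
        console.warn(
          `Kubernetes API: ${description} failed with error ${err.message}, retrying in ${sleepMsec}ms (#${usedRetries + 1}/${maxRetries})`
        )
        await new Promise((resolve) => setTimeout(resolve, sleepMsec))
        usedRetries++
      } else {
        // Non-retryable errors (e.g. 404) and exhausted retry budgets are thrown as before.
        throw err
      }
    }
  }
}
```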

Which issue(s) this PR fixes:

Fixes #2255.

Should also fix the sporadic API timeouts that some users have been encountering in day-to-day usage.

Special notes for your reviewer:

Do we want to include any more status codes and/or error messages in the shouldRetry helper? Are there other kinds of conditions for which we want to retry?

Also, feel free to comment on the general approach here.

@thsig (Collaborator Author) commented May 12, 2021

Note: I didn't use p-retry here, since it doesn't play nice with our custom error classes (which include extra fields in addition to the message).
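
For context, "custom error classes with extra fields" means something along these lines (an illustrative sketch only, not the actual GardenError/KubernetesError definition):

```ts
// Sketch of an error class carrying extra fields beyond `message`
// (field names here are assumptions).
class KubernetesError extends Error {
  constructor(
    message: string,
    public readonly statusCode?: number,
    public readonly detail?: Record<string, unknown>
  ) {
    super(message)
    this.name = "KubernetesError"
  }
}
```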

@eysi09 (Collaborator) commented May 13, 2021

I think @edvald is best suited to review this one. I'll take a look but un-assign myself.

@eysi09 removed their assignment May 13, 2021

@edvald (Collaborator) left a comment


Couple of small comments there, otherwise seems solid!

```ts
} catch (err) {
  if (shouldRetry(err) && usedRetries < maxRetries) {
    const sleepMsec = minTimeoutMs + usedRetries * minTimeoutMs
    if (log) {
```
Collaborator

I'd prefer to go directly for the root logger as a fallback. Also, I think this should be a warn log. This is a bit too hidden atm.

Collaborator Author

I refactored this so that the wrapped API has access to the log entry passed to the KubeApi.factory method.

I also added a description parameter to the requestWithRetry function that gets included in the warning messages, which should add a bit of clarity to the log output.

```ts
} else {
  if (usedRetries === maxRetries) {
    if (log) {
      log.debug(`Kubernetes API: Maximum retry count exceeded, throwing error`)
```
Collaborator

"throwing error" is maybe a bit odd for a user-facing message :)

Collaborator Author

Haha yeah, good point—I've removed that text.

```ts
  httpStatusCodes.SERVICE_UNAVAILABLE,
]

const errorMessageRegexesForRetry = [/getaddrinfo ENOTFOUND/, /getaddrinfo EAI_AGAIN/]
```
Collaborator

Isn't this available on the error object in a more structured form?

Collaborator Author

It might be, but the wrapError helper we're using doesn't include those fields (since they're not part of the KubernetesError class).

I thought matching against the error message was neater than adding several optional fields to the KubernetesError class, since we'd probably end up adding more and more of those fields going forward as we handle more of these cases.

I haven't reproduced it on my end, but here's an example of these error objects (from here):

```
{ [Error: getaddrinfo ENOTFOUND host]
  code: 'ENOTFOUND',
  errno: 'ENOTFOUND',
  syscall: 'getaddrinfo',
  hostname: 'host' }
```

That said, I'm totally open to adding more fields to the wrapped error (or maybe putting them all in an inheritedFields field with type any).
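
If the wrapped error did preserve those structured fields, the check under discussion might look something like this (a sketch under that assumption; per the comment above, the current wrapError helper does not carry these fields through):

```ts
// Sketch: retrying based on the structured `code` field instead of matching the message.
// Assumes the wrapped error still carries the original Node.js fields (code, errno, syscall).
const errorCodesForRetry = ["ENOTFOUND", "EAI_AGAIN"]

function shouldRetryByCode(err: { code?: string }): boolean {
  return !!err.code && errorCodesForRetry.includes(err.code)
}
```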

Collaborator

Let's not overthink it, no worries. I take it you tested this to make sure it works?

@thsig (Collaborator Author) commented May 17, 2021

Yeah, I did—I had the logic throw errors and verified that the logging/retrying worked as expected.

The logging could be a bit nicer, but I think addressing that would require a bit of a refactor.

Here's a brief example snippet (where I hardcoded a throw with a getaddrinfo ENOTFOUND error, which matches the retry criteria):

```
✔ backend                   → Syncing module sources (2 files)... → Done (took 0 sec)
⠦ frontend                  → Getting build status for v-a57012fdb5...
⠼ backend                   → Getting build status for v-6fbe386fc0...
Kubernetes API: listNamespace failed with error test error: getaddrinfo ENOTFOUND, sleeping for 4000ms and retrying (#1/5)
Kubernetes API: /api/v1 failed with error test error: getaddrinfo ENOTFOUND, sleeping for 4000ms and retrying (#1/5)
Kubernetes API: /api/v1/namespaces/demo-project-testing-ths/secrets/garden-docker-auth-41e2da failed with error test error: getaddrinfo ENOTFOUND, sleeping for 4000ms and retrying (#1/5)
Kubernetes API: patchNamespace failed with error test error: getaddrinfo ENOTFOUND, sleeping for 4000ms and retrying (#1/5)
```

The log lines get updated during retries (so there's only one log line per retry sequence).

Collaborator

Ok cool. I'd suggest "sleeping for 4000ms and retrying" -> "retrying in 4000ms", otherwise looks good.

Also, is 4000ms the value for the first retry? Seems a touch high, no?

@thsig (Collaborator Author) commented May 17, 2021

> Ok cool. I'd suggest "sleeping for 4000ms and retrying" -> "retrying in 4000ms", otherwise looks good.

Yeah, that's better—will make that change.

> Also, is 4000ms the value for the first retry? Seems a touch high, no?

It's 2000ms for the first retry, actually (2000 + 2000 * retries)—this should say "#2/5".

Should we go for 500 + 500 * retries instead? [edit: Just made that change.]
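
For reference, assuming the same formula shape, 500 + 500 * retries gives delays of 500ms, 1000ms, 1500ms, 2000ms and 2500ms across the five retries (about 7.5s of added wait in total, versus 30s with the 2000ms base).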

@thsig merged commit 72165da into master May 17, 2021
@thsig deleted the k8s-api-retry branch May 17, 2021 16:45

Successfully merging this pull request may close these issues.

Intermittent failure of DNS resolution when running garden against domain-based k8s
3 participants