Change CRD retry-backoff #788

Merged 1 commit into main, Oct 18, 2021
Conversation

@lkysow (Member) commented Oct 18, 2021

Previously, if a custom resource failed to sync with Consul, we used the
default retry backoff. This was an exponential backoff starting at 5ms
and maxing out at 1000s (16m).

This backoff was a poor UX for our common error case where one config
entry cannot be applied due to a prerequisite; for example, an ingress
gateway entry cannot be applied until the protocol is set to http. The
usual workflow to resolve this is to look up the error, figure out the
correct ServiceDefaults/ProxyDefaults, and then apply that resource.
Once applied, the user needs to wait for the ingress gateway (in this
example) resource to be retried. With the default backoff config,
because the user will have taken on the order of minutes to figure out
the correct config, the exponential backoff will by then be upwards of
five minutes, so the user has to wait a long time for the ingress
gateway resource to be retried.
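
To make that growth concrete, here is a small standalone sketch (illustrative only, not code from this PR) that prints the delay schedule of client-go's default exponential failure rate limiter:

// backoff_growth.go: illustrative only. Prints the per-failure retry delay
// of the default limiter (base 5ms, cap 1000s) for a single item.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)
	var total time.Duration
	for failures := 1; failures <= 18; failures++ {
		delay := limiter.When("ingress-gateway") // doubles on each failure of the same item
		total += delay
		fmt.Printf("failure %2d: next retry in %v (cumulative wait %v)\n", failures, delay, total)
	}
	// By the 17th failure, the next delay alone is 5ms * 2^16, roughly 5.5
	// minutes, and the cumulative wait is already about 5.5 minutes as well.
}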

This PR changes the backoff to start at 200ms and max out at 5s.
This fits our use case better: the user will have to wait at most 5s;
if there's an error, retrying within milliseconds rarely helps, so
waiting 200ms before the first retry is fine; and Consul servers can
accept tens of thousands of writes per second, so even if a ton of
resources are retrying every 5s, it won't be an issue.

Fixes #587
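
For reference, here is a minimal sketch of what the change can look like when registering a controller with controller-runtime (the function and parameter names here are hypothetical; the PR's actual diff in configentry_controller.go may differ):

// Hypothetical wiring sketch: swap the default workqueue rate limiter
// (5ms base, 1000s cap) for one starting at 200ms and capped at 5s.
package controllers

import (
	"time"

	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// setupWithBackoff registers a config-entry controller for obj (e.g. the
// IngressGateway CRD type) with the new retry backoff.
func setupWithBackoff(mgr ctrl.Manager, r reconcile.Reconciler, obj client.Object) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(obj).
		WithOptions(controller.Options{
			RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(
				200*time.Millisecond, // first retry after 200ms
				5*time.Second,        // delays double but never exceed 5s
			),
		}).
		Complete(r)
}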

How I've tested this PR:

# ingress-gateway.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: IngressGateway
metadata:
  name: ingress-gateway
spec:
  listeners:
    - port: 8080
      protocol: http
      services:
        - name: frontend
          hosts: ["localhost"]
# proxy-defaults.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  config:
    protocol: http

Apply the ingress gateway, then the defaults with a small gap in between, then wait and see how long it takes to retry:

kubectl apply -f ./ingress-gateway.yaml && sleep 30 && kubectl apply -f ./proxy-defaults.yaml && kubectl get ingressgateway ingress-gateway -w
NAME              SYNCED   LAST SYNCED   AGE
ingress-gateway   False                  33s
ingress-gateway   True     0s            112s

See that it takes about 80 seconds to re-sync. If we had slept for longer, the retry delay would have been even longer, up to 16 minutes!

Now, using ghcr.io/lkysow/consul-k8s-dev:oct17 (see the image-swap sketch after the output below), retry the same steps:

NAME              SYNCED   LAST SYNCED   AGE
ingress-gateway   False                  30s
ingress-gateway   True     0s            35s

It re-syncs in 5s!
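
To swap in the dev build for that run, a Helm upgrade along these lines should work (assuming a release named consul installed from the hashicorp/consul chart; global.imageK8S is the chart value for the consul-k8s control-plane image):

# keep existing values, only override the control-plane image
helm upgrade consul hashicorp/consul \
  --set global.imageK8S=ghcr.io/lkysow/consul-k8s-dev:oct17 \
  --reuse-values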

How I expect reviewers to test this PR:

  • can try out the steps above
  • I didn't add tests because it is essentially a configuration change. I could possibly add acceptance tests?

Checklist:

  • CHANGELOG entry added

    HashiCorp engineers only, community PRs should not add a changelog entry.
    Entries should use present tense (e.g. Add support for...)

@lkysow lkysow requested review from a team, ndhanushkodi and ishustava and removed request for a team October 18, 2021 05:19
@ndhanushkodi (Contributor) left a comment

Thanks for being so detailed with the reasoning for the backoff parameters you picked in the code comments and PR comment!

I didn't run all the test steps; I just read through and reviewed the code.

@ishustava (Contributor) left a comment

Looks great! +1 to Nitya's comment, thank you so much for writing such a detailed description!!

Just one minor comment from me but not blocking.

control-plane/controller/configentry_controller.go (review comment: outdated, resolved)
@lkysow lkysow force-pushed the lkysow/crd-backoff branch 2 times, most recently from 14acdc0 to 09f4e45 Compare October 18, 2021 17:24
@lkysow lkysow merged commit 55d497d into main Oct 18, 2021
@lkysow lkysow deleted the lkysow/crd-backoff branch October 18, 2021 21:59
ottaviohartman pushed a commit to ottaviohartman/consul-k8s that referenced this pull request Nov 3, 2021
Linked issue that may be closed by merging this pull request: Consider tweaking CRD controller backoff (#587)

3 participants