Change CRD retry-backoff #788

Merged 1 commit into main, Oct 18, 2021
Conversation

@lkysow (Member) commented Oct 18, 2021

Previously, if a custom resource failed to sync with Consul, we used the
default retry backoff. This was an exponential backoff starting at 5ms
and maxing out at 1000s (16m).

This backoff was a poor UX for our common error case where one config
entry cannot be applied due to a prerequisite; for example, an ingress
gateway entry cannot be applied until the protocol is set to http. The
usual workflow to resolve this is to look up the error, figure out the
correct ServiceDefaults/ProxyDefaults, and then apply that resource.
Once applied, the user needs to wait for the ingress gateway (in this
example) resource to be retried. With the default backoff config,
because the user will have taken on the order of minutes to figure out
the correct config, the exponential backoff will by then be upwards of
five minutes, so the user has to wait a long time for the ingress
gateway resource to be retried.
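
To make that growth concrete, here is a small standalone sketch (illustrative only, not code from this PR) that prints the delay schedule of client-go's default exponential failure rate limiter:

// backoff_growth.go: illustrative only. Prints the per-failure retry delay
// of the default limiter (base 5ms, cap 1000s) for a single item.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)
	var total time.Duration
	for failures := 1; failures <= 18; failures++ {
		delay := limiter.When("ingress-gateway") // doubles on each failure of the same item
		total += delay
		fmt.Printf("failure %2d: next retry in %v (cumulative wait %v)\n", failures, delay, total)
	}
	// By the 17th failure, the next delay alone is 5ms * 2^16, roughly 5.5
	// minutes, and the cumulative wait is already about 5.5 minutes as well.
}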

This PR changes the backoff to start at 200ms and max out at 5s.
This fits our use case better: the user will have to wait at most 5s;
if there's an error, retrying within milliseconds rarely helps, so
waiting 200ms before the first retry is fine; and Consul servers can
accept tens of thousands of writes per second, so even if a ton of
resources are retrying every 5s, it won't be an issue.

Fixes #587
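
For reference, here is a minimal sketch of what the change can look like when registering a controller with controller-runtime (the function and parameter names here are hypothetical; the PR's actual diff in configentry_controller.go may differ):

// Hypothetical wiring sketch: swap the default workqueue rate limiter
// (5ms base, 1000s cap) for one starting at 200ms and capped at 5s.
package controllers

import (
	"time"

	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// setupWithBackoff registers a config-entry controller for obj (e.g. the
// IngressGateway CRD type) with the new retry backoff.
func setupWithBackoff(mgr ctrl.Manager, r reconcile.Reconciler, obj client.Object) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(obj).
		WithOptions(controller.Options{
			RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(
				200*time.Millisecond, // first retry after 200ms
				5*time.Second,        // delays double but never exceed 5s
			),
		}).
		Complete(r)
}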

How I've tested this PR:

# ingress-gateway.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: IngressGateway
metadata:
  name: ingress-gateway
spec:
  listeners:
    - port: 8080
      protocol: http
      services:
        - name: frontend
          hosts: ["localhost"]
# proxy-defaults.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  config:
    protocol: http

Apply the ingress gateway, then the defaults with a small gap in between, then wait and see how long it takes to retry:

kubectl apply -f ./ingress-gateway.yaml && sleep 30 && kubectl apply -f ./proxy-defaults.yaml && kubectl get ingressgateway ingress-gateway -w
NAME              SYNCED   LAST SYNCED   AGE
ingress-gateway   False                  33s
ingress-gateway   True     0s            112s

See that it takes about 80 seconds to re-sync. If we had slept for longer, the retry delay would have been even longer, up to 16 minutes!

Now, using ghcr.io/lkysow/consul-k8s-dev:oct17 (see the image-swap sketch after the output below), retry the same steps:

NAME              SYNCED   LAST SYNCED   AGE
ingress-gateway   False                  30s
ingress-gateway   True     0s            35s

It re-syncs in 5s!
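
To swap in the dev build for that run, a Helm upgrade along these lines should work (assuming a release named consul installed from the hashicorp/consul chart; global.imageK8S is the chart value for the consul-k8s control-plane image):

# keep existing values, only override the control-plane image
helm upgrade consul hashicorp/consul \
  --set global.imageK8S=ghcr.io/lkysow/consul-k8s-dev:oct17 \
  --reuse-values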

How I expect reviewers to test this PR:

  • can try out the steps above
  • I didn't add tests because it is essentially a configuration change. I could possibly add acceptance tests?

Checklist:

  • CHANGELOG entry added

    HashiCorp engineers only, community PRs should not add a changelog entry.
    Entries should use present tense (e.g. Add support for...)

@lkysow lkysow requested review from a team, ndhanushkodi and ishustava and removed request for a team October 18, 2021 05:19
@ndhanushkodi (Contributor) left a comment

Thanks for being so detailed with the reasoning for the backoff parameters you picked in the code comments and PR comment!

I didn't run all the test steps; I just read through and reviewed the code.

@ishustava (Contributor) left a comment

Looks great! +1 to Nitya's comment, thank you so much for writing such a detailed description!!

Just one minor comment from me but not blocking.

control-plane/controller/configentry_controller.go (review comment: outdated, resolved)
@lkysow lkysow force-pushed the lkysow/crd-backoff branch 2 times, most recently from 14acdc0 to 09f4e45 Compare October 18, 2021 17:24
@lkysow lkysow merged commit 55d497d into main Oct 18, 2021
@lkysow lkysow deleted the lkysow/crd-backoff branch October 18, 2021 21:59
ottaviohartman pushed a commit to ottaviohartman/consul-k8s that referenced this pull request Nov 3, 2021
Linked issue that may be closed by merging this pull request: Consider tweaking CRD controller backoff (#587)

3 participants