
external-dns pod keeps restarting with aws route53 Throttling: Rate exceeded error #4067

Closed
shreyas-3 opened this issue Nov 27, 2023 · 8 comments · Fixed by #4166 or #4886
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@shreyas-3
Contributor

What happened:
After deploying version v0.13.6, we observed the external-dns pod repeatedly restarting with CrashLoopBackOff.
Whenever there is an AWS throttling error, the pod goes into the CrashLoopBackOff state.

Error in the log:
time="2023-11-27T05:49:33Z" level=fatal msg="records retrieval failed: failed to list hosted zones: Throttling: Rate exceeded\n\tstatus code: 400

Pod state when the error was observed:
bash-4.4# kubectl get pod -n kube-system | grep -i ext
external-dns-456d8799b-1xcvv 0/1 CrashLoopBackOff 41 (71s ago) 3h48m

When there is no error in the log, the pod state is Running.

This behaviour was not observed when external-dns:v0.11.0 was deployed, even though R53 throttling was present then as well.

What you expected to happen:
The pod should not restart or go into the CrashLoopBackOff state.
The pod should stay in the Running state even if there is an error due to the AWS R53 throttling rate being exceeded.

How to reproduce it (as minimally and precisely as possible):
Pre-req: an AWS R53 "Rate exceeded" throttling error must be occurring. It can be generated by making multiple calls to R53 in a short time (see the sketch after these steps).
Step: Deploy external-dns v0.13.6 and check the pod state.
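For illustration, one way to provoke the throttling error outside of external-dns is to call ListHostedZones in a tight loop. This is only a sketch using aws-sdk-go v1 (the loop count is arbitrary, and the SDK's default retryer will absorb a few throttles before the error surfaces):

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/route53"
)

func main() {
	// Uses the default credential chain / shared config, as external-dns would.
	sess := session.Must(session.NewSessionWithOptions(session.Options{
		SharedConfigState: session.SharedConfigEnable,
	}))
	r53 := route53.New(sess)

	// Route53's request quota is low (on the order of a few requests per
	// second per account), so a tight loop surfaces
	// "Throttling: Rate exceeded" fairly quickly.
	for i := 0; i < 500; i++ {
		if _, err := r53.ListHostedZones(&route53.ListHostedZonesInput{}); err != nil {
			fmt.Println(err) // e.g. "Throttling: Rate exceeded\n\tstatus code: 400"
		}
	}
}
```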

Anything else we need to know?:
The restarts were observed whenever the below error appeared in the logs:
time="2023-11-27T05:49:33Z" level=fatal msg="records retrieval failed: failed to list hosted zones: Throttling: Rate exceeded\n\tstatus code: 400
The pod came back to the normal Running state once the throttling error was gone.
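For context, the level=fatal in that log line is what drives the CrashLoopBackOff: a fatal log exits the process, and the kubelet restarts the container with backoff. The following is only a minimal sketch of the two behaviours, not the actual external-dns controller code; reconcile is a stand-in for the controller's sync loop:

```go
package main

import (
	"errors"
	"time"

	log "github.com/sirupsen/logrus"
)

// reconcile stands in for one reconciliation pass: list hosted zones,
// compute the plan, apply changes. A provider error such as Route53
// "Throttling: Rate exceeded" is returned to the caller.
func reconcile() error {
	return errors.New("records retrieval failed: failed to list hosted zones: Throttling: Rate exceeded")
}

func main() {
	for {
		if err := reconcile(); err != nil {
			// log.Fatal(err) would exit the process with a non-zero code,
			// so Kubernetes restarts the pod -> CrashLoopBackOff.
			// Logging at error level keeps the process alive so the next
			// interval simply retries, which is the behaviour requested here.
			log.Error(err)
		}
		time.Sleep(time.Minute)
	}
}
```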

Environment:

  • External-DNS version (use external-dns --version): 0.13.6
  • DNS provider: AWS Route53
  • Others: EKS cluster
shreyas-3 added the kind/bug label Nov 27, 2023
@Jayd603

Jayd603 commented Nov 27, 2023

I'm seeing a lot of restarts on DigitalOcean too with 0.14.0. Some error happens and the pod restarts; sometimes it goes 12 or more hours without a restart. This could be related to external-dns updating records every single time even though no records need to be updated (#3977). Is your pod updating records every minute, or does it say ~"All records up to date"?

@shreyas-3
Contributor Author

shreyas-3 commented Nov 27, 2023

I'm seeing a lot of restarts on DigitalOcean too with 0.14.0. Some error happens and the pod restarts; sometimes it goes 12 or more hours without a restart. This could be related to external-dns updating records every single time even though no records need to be updated (#3977). Is your pod updating records every minute, or does it say ~"All records up to date"?

Nope, the pod is not updating records every minute.
It just says all records are up to date when there is no "aws route53 throttling rate limit exceeded" error,
and the pod restarts when it observes the rate limit exceeded error, which is frequent.

@shreyas-3
Contributor Author

My suggestion: can we not crash the pod even if there is a throttling error?

@matthewbyrne

@shreyas-3 This has been caused by this change:
#3009

From the comments on that change, you are not alone.

We've attempted to reduce our Route53 checks and to update on events, but we still regularly see pods restarting.

@matthewbyrne

Looks like they've reverted the change for v0.14.0
https://github.com/olemarkus/external-dns/blob/master/controller/controller.go#L194

@gregsidelinger
Contributor

Looks like they've reverted the change for v0.14.0 https://github.com/olemarkus/external-dns/blob/master/controller/controller.go#L194

A PR reverting this was never submitted. You are looking at a branch of an old fork from before the patch was submitted.

Granted, maybe someone wants to submit a PR to either revert this or add an option to let the user decide whether this should be treated as fatal. Getting rate limited by AWS should never cause a restart, as far as I'm concerned.
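One possible shape for such an option (purely a sketch; the flag name --provider-errors-fatal and the wiring are hypothetical and not part of external-dns):

```go
package main

import (
	"flag"

	log "github.com/sirupsen/logrus"
)

// reconcile is a placeholder for the controller's sync pass.
func reconcile() error { return nil }

func main() {
	// Hypothetical flag: when set, a failed sync kills the pod (the current
	// behaviour); when unset, the error is logged and retried next interval.
	fatalOnError := flag.Bool("provider-errors-fatal", false,
		"treat provider errors such as Route53 throttling as fatal")
	flag.Parse()

	if err := reconcile(); err != nil {
		if *fatalOnError {
			log.Fatal(err)
		} else {
			log.Error(err)
		}
	}
}
```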

@BaudouinH

Hello, we are seeing the same issue on GCP, due not to throttling but to transient authentication errors:
{"level":"fatal","msg":"googleapi: Error 503: Authentication backend unavailable., backendError","time":"(...)"}

These are transient errors and are bound to happen. It is a good thing that external-dns logs them, but I do not think it should crash over them.

szuecs added a commit that referenced this issue Jan 9, 2024
… error and not fatal

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
@szuecs
Contributor

szuecs commented Jan 11, 2024

Hello, we are seeing the same issue on GCP, due not to throttling but to transient authentication errors: {"level":"fatal","msg":"googleapi: Error 503: Authentication backend unavailable., backendError","time":"(...)"}

These are transient errors and are bound to happen. It is a good thing that external-dns logs them, but I do not think it should crash over them.

Yes, maybe 503 and 429 are good cases not to fail on here. I am not sure if we get the status code from the SDKs; likely we only get errors that we may be able to check.
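For what it's worth, both SDKs expose enough to classify these cases without raw status codes. A sketch (assuming aws-sdk-go v1 and the google.golang.org/api/googleapi package; isRetryable is a hypothetical helper, not an existing external-dns function):

```go
package main

import (
	"errors"
	"fmt"
	"net/http"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/request"
	"google.golang.org/api/googleapi"
)

// isRetryable reports whether a provider error looks transient (throttling
// or an upstream 429/5xx), i.e. something to log and retry rather than
// treat as fatal.
func isRetryable(err error) bool {
	// AWS: request.IsErrorThrottle matches throttling-style error codes
	// such as "Throttling" and "ThrottlingException".
	if request.IsErrorThrottle(err) {
		return true
	}
	// Google: transient backend/auth failures surface as *googleapi.Error
	// carrying the HTTP status code, e.g. the 503 reported above.
	var gerr *googleapi.Error
	if errors.As(err, &gerr) {
		return gerr.Code == http.StatusTooManyRequests || gerr.Code >= 500
	}
	return false
}

func main() {
	throttled := awserr.New("Throttling", "Rate exceeded", nil)
	backend := &googleapi.Error{Code: 503, Message: "Authentication backend unavailable."}
	fmt.Println(isRetryable(throttled), isRetryable(backend)) // true true
}
```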
