external-dns stops updating route53 on error #421

Open
lemaral opened this issue Dec 13, 2017 · 48 comments · Fixed by #1209

Comments

@lemaral

lemaral commented Dec 13, 2017

Issue: an (obviously broken) service hostname annotation is creating the following error on route53:
"InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.sudbdomain.domain.com'
status code: 400, request id: 115b8642-e047-11e7-83a0-a9e81aaef2a3"
and apparently this is preventing any further modification of the route53 zone by external-dns.
If this is the expected behaviour (any error in the batch means no further modifications can be processed), it does not seem very robust (it's also difficult to diagnose).
I'm interested in confirmation and workarounds :)
Thanks

@ideahitme

ideahitme commented Dec 14, 2017

I see it the opposite way: allowing only partial updates is a way to end up in a scenario where the state of the records is unknown and hence unmanageable. External-DNS has to guarantee the safety of updates, e.g. not overwriting records that belong to other parties and were created via other means. If we do not do batch/atomic updates, this is no longer guaranteed. We could introduce an insecure mode, but with that kind of mode you are on your own, and with a misbehaving system diagnosis would become really difficult (as opposed to checking the logs and deleting the resource with the apparently invalid hostname).

@srflaxu40

Yeah, I am seeing only the initially created records updated, not the existing ones.

@linki
Member

linki commented Jan 3, 2018

@lemaral That is correct and thanks for bringing it up.

I kind of disagree with @ideahitme here: a single broken Endpoint should not lead to the overall malfunction of ExternalDNS for literally all other modifications. There's no inherent reason why one broken Endpoint should affect other, completely unrelated Endpoints.

However, there's a strong connection between an Endpoint's A/CNAME/ALIAS record and its corresponding TXT ownership record, so these two must be modified in a transaction.
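
To make that coupling concrete, here is a minimal sketch in Go (not the actual external-dns code; the helper names, the TTL and the TXT value format are assumptions) of building an endpoint's CNAME record and its TXT ownership record as a pair inside one Route 53 change batch, using the aws-sdk-go route53 types:

```go
// Package sketch is an illustrative example, not external-dns source code.
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/route53"
)

// changesForEndpoint is a hypothetical helper: it returns the two changes
// (traffic-carrying record + TXT ownership record) that have to be applied
// together for a single endpoint.
func changesForEndpoint(name, target, ownerID string) []*route53.Change {
	upsert := func(recordType, value string) *route53.Change {
		return &route53.Change{
			Action: aws.String(route53.ChangeActionUpsert),
			ResourceRecordSet: &route53.ResourceRecordSet{
				Name:            aws.String(name),
				Type:            aws.String(recordType),
				TTL:             aws.Int64(300),
				ResourceRecords: []*route53.ResourceRecord{{Value: aws.String(value)}},
			},
		}
	}
	return []*route53.Change{
		// The record that actually routes traffic for the endpoint.
		upsert(route53.RRTypeCname, target),
		// The ownership marker; the value format mirrors what external-dns
		// writes, but treat it as an assumption in this sketch.
		upsert(route53.RRTypeTxt, `"heritage=external-dns,external-dns/owner=`+ownerID+`"`),
	}
}

// buildBatch wraps any number of such pairs into a single atomic request.
func buildBatch(zoneID string, changes []*route53.Change) *route53.ChangeResourceRecordSetsInput {
	return &route53.ChangeResourceRecordSetsInput{
		HostedZoneId: aws.String(zoneID),
		ChangeBatch:  &route53.ChangeBatch{Changes: changes},
	}
}
```

Whatever error-handling strategy gets adopted, the two changes produced for one endpoint would have to stay in the same ChangeBatch so they still succeed or fail together.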

@lemaral
Author

lemaral commented Jan 3, 2018

@linki, thanks for the follow-up. Indeed, the A/CNAME/ALIAS record and the corresponding TXT record must stay in the same batch to keep operations atomic. My concern (as you highlighted) is definitely about one failing endpoint stopping all further modifications.
I understand the purpose of stacking multiple endpoints in the same batch for optimisation; however, in case of an error the valid endpoints should still be processed correctly, IMHO. If there is agreement on this and no available bandwidth among the current contributors, I'm willing to take this on and see how it can be done.

@linki
Member

linki commented Jan 8, 2018

@szuecs @Raffo @ideahitme @mikkeloscar Have a look at @lemaral's comment and let's see if we can agree on this.

@szuecs
Contributor

szuecs commented Jan 8, 2018

@lemaral @linki the problem is more complicated. What if you want to switch a record from an old, working application to a new one that is somehow broken, or the change happens within one transaction?

Then I would also like to discuss how the implementation should look, because we have to be efficient with AWS calls (the fewer AWS calls the better, because of rate limiting). I guess the same is true for other cloud providers. I thought about splitting the changeset into two, applying both halves (binary search), and iterating on the failing half, splitting it again until you find the broken record and can mark the resource in Kubernetes as an "external-dns broken record"; see the sketch after this comment.

@lemaral I would be happy to get PRs from you for this change, but please try to describe here how this would work, so that we can agree on it and spot any design issues early.
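
As an aside, a rough sketch of that split-and-retry (binary search) idea, with hypothetical names (Change, applyBatch) rather than external-dns's real plan and provider types:

```go
package sketch

import "fmt"

// Change is a stand-in for one provider change (external-dns's real plan and
// provider types are richer); applyBatch stands in for one atomic API call.
type Change struct {
	Name string
	Type string
}

// applyResilient tries the whole batch in one call; on error it splits the
// batch in half and recurses (binary search), so one broken endpoint only
// sinks its own minimal sub-batch instead of the whole sync loop.
func applyResilient(changes []Change, applyBatch func([]Change) error) []error {
	if len(changes) == 0 {
		return nil
	}
	err := applyBatch(changes)
	if err == nil {
		return nil
	}
	if len(changes) == 1 {
		// The smallest unit still fails: report it and move on. In external-dns
		// the smallest unit would really be "record + TXT ownership record".
		return []error{fmt.Errorf("skipping broken change %s %s: %w", changes[0].Type, changes[0].Name, err)}
	}
	mid := len(changes) / 2
	errs := applyResilient(changes[:mid], applyBatch)
	return append(errs, applyResilient(changes[mid:], applyBatch)...)
}
```

The recursion is bounded because the slice shrinks at every level, and with k broken records among n changes the extra API calls are on the order of k·log n, which keeps the rate-limiting cost low.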

@mikkeloscar
Contributor

What if you want to switch a record from an old, working application to a new one that is somehow broken, or the change happens within one transaction?

I might be misunderstanding the point you are making, but switching a record from one application to another would just mean changing the hostnames of two resources (e.g. two ingresses), and external-dns would pick this up and eventually arrive at the new desired state, which in this case might be a broken state if the hostname defined by the user is invalid.

@lemaral
Author

lemaral commented Jan 29, 2018

@szuecs, indeed binary search seems the most efficient approach with respect to API rate limiting; we only need to ensure we're not looping forever.
As for the question you brought up, can you elaborate? I don't see how adding this error handling could make things worse.

@szuecs
Contributor

szuecs commented Jan 30, 2018

@lemaral Think about a changeset that moves a record: 1 delete, 1 create. If we apply the delete, then get rate limited, and only afterwards do the create, we will have created downtime for our users.

@gtie

gtie commented Mar 15, 2018

This issue has bitten us in an unexpected way. While a binary search would indeed be optimal with respect to API call usage, a simpler solution would be to implement it as a fallback strategy: try a batch update first, and switch to individual pairs of updates if the batch fails.
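
For comparison, a minimal sketch of that fallback strategy, reusing the hypothetical Change type and applyBatch function from the earlier sketch; pairs groups each endpoint's record together with its TXT ownership record:

```go
// applyWithFallback keeps the happy path at a single API call for the whole
// batch; only when that fails does it retry one endpoint pair at a time, so
// each record stays together with its TXT ownership record.
func applyWithFallback(pairs [][]Change, applyBatch func([]Change) error) []error {
	var all []Change
	for _, pair := range pairs {
		all = append(all, pair...)
	}
	if err := applyBatch(all); err == nil {
		return nil // common case: nothing broken, one API call
	}
	var errs []error
	for _, pair := range pairs {
		if err := applyBatch(pair); err != nil {
			errs = append(errs, err) // log and skip only the broken endpoint
		}
	}
	return errs
}
```

Compared to the binary search this costs one call per endpoint whenever a batch fails, but it is simpler and leaves the failure-free path untouched.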

@zioalex

zioalex commented Aug 2, 2018

Hi guys, I have a similar problem:
time="2018-08-02T13:50:08Z" level=error msg="InvalidChangeBatch: The request contains an invalid set of changes for a resource record set 'A test.example..com.' status code: 400, request id: f7a6fff3-965a-11e8-8239-47abe2865262"
Though I do not see any real problem with the changeset, and no other updates can be made. Is there any update on this?

@szuecs
Contributor

szuecs commented Aug 3, 2018

@zioalex which version are you using, and can you post some logs from when external-dns starts?

@zioalex

zioalex commented Aug 6, 2018

Hi @szuecs, here are some logs:
time="2018-08-02T13:15:19Z" level=info msg="config: &{Master: KubeConfig: Sources:[service ingress] Namespace: FQDNTemplate: Compatibility: Provider:aws GoogleProject: DomainFilter:[] AzureConfigFile:/etc/kubernetes/azure.json AzureResourceGroup: Policy:upsert-only Registry: txt TXTOwnerID:my_domain.com TXTPrefix: Interval:1m0s Once:false DryRun:false LogFormat:text MetricsAddress::7979 Debug:false}" time="2018-08-02T13:15:19Z" level=info msg="Connected to cluster at https://100.64.0.1:443" time="2018-08-02T13:15:21Z" level=info msg="Changing records: CREATE { Action: "CREATE",

The version in use is v0.4.3.

@cristim

cristim commented Aug 27, 2018

Since the error message given by the AWS provider contains the offending DNS record, I think we could just parse it, drop that record from the batch, and run another Upsert API call with the rest of the records from the initial batch.
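
A hedged sketch of that parse-and-retry idea, again reusing the hypothetical Change type from the earlier sketch; the regular expression is an assumption based on the error text quoted earlier in this thread, not a stable AWS contract, which is also the main weakness of the approach:

```go
package sketch

import (
	"regexp"
	"strings"
)

// offendingName tries to pull the record name out of an InvalidChangeBatch
// error such as "... encountered with '.subdomain.domain.com'". The error
// text is not a stable API, so this is best-effort only.
var offendingName = regexp.MustCompile(`encountered with '([^']+)'`)

// filterOffender drops every change whose name matches the one parsed from
// the batch error, so the remaining changes can be resubmitted.
func filterOffender(changes []Change, batchErr error) []Change {
	m := offendingName.FindStringSubmatch(batchErr.Error())
	if m == nil {
		return changes // offender not identifiable; give up filtering
	}
	bad := strings.TrimSuffix(m[1], ".")
	var kept []Change
	for _, c := range changes {
		if strings.TrimSuffix(c.Name, ".") != bad {
			kept = append(kept, c)
		}
	}
	return kept
}
```

Note that the TXT ownership record only shares the same name when no TXT prefix is configured; with a prefix, the filter would have to account for that as well so the pair is skipped together.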

@cnuber

cnuber commented Jan 2, 2019

We are hitting this issue and currently have 57 ingresses that use the FQDNs we monitor with external-dns. On smaller clusters we typically have 10-15 monitored ingresses and do not hit this issue.

@szuecs
Contributor

szuecs commented Jan 2, 2019

@cnuber thanks for adding more information here!

@linki @njuettner we should work on a fix; the numbers are quite low, so I bet many people hit this issue.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2019
@szuecs
Contributor

szuecs commented Apr 28, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 27, 2019
@szuecs
Contributor

szuecs commented Jul 27, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 27, 2019
@peimanja

Can we reopen this?! It is still a valid issue.

@szuecs
Contributor

szuecs commented Nov 22, 2022

/reopen

@k8s-ci-robot
Contributor

@szuecs: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 22, 2022
@szuecs
Contributor

szuecs commented Nov 22, 2022

/important-soon

@nitrocode
Contributor

nitrocode commented Aug 7, 2023

Can we reopen this issue please and exclude it from the bot?

It would be nice to have a flag such as --early-exit=true which defaults to true and can be overridden to false.

Edit: I see the implementation from #1209, where failed changes are separated from the successful ones and rerun in a separate batch. Very clever! I'll try the latest version, 0.13.5, again and see if it resolves my issue. Thank you.

@linki
Member

linki commented Aug 8, 2023

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Aug 8, 2023
@k8s-ci-robot
Contributor

@linki: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 25, 2024
@nitrocode
Contributor

Is there a way to exclude this ticket from getting auto-closed? It's a notable bug.

@nitrocode
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 25, 2024
@alfredkrohmer
Contributor

I think this might be solved with #4166, no? At least it seems to be implemented already for AWS, OCI and Azure.

@Liz4v

Liz4v commented May 6, 2024

@alfredkrohmer Well, it's more of a workaround than a fix. Now it just loops over these spurious records, but there is still no explanation as to why it decided we need them. I have no ingress or service with an empty name, so why is it trying to do that?

time="2024-05-06T20:33:16Z" level=info msg="Applying provider record filter for domains: [green.staging.example.com. .green.staging.example.com.]"
time="2024-05-06T20:33:16Z" level=info msg="Desired change: CREATE .green.staging.example.com A [Id: /hostedzone/ZABCABCABC]"
time="2024-05-06T20:33:16Z" level=info msg="Desired change: CREATE .green.staging.example.com TXT [Id: /hostedzone/ZABCABCABC]"
time="2024-05-06T20:33:16Z" level=info msg="Desired change: CREATE cname-.green.staging.example.com TXT [Id: /hostedzone/ZABCABCABC]"
time="2024-05-06T20:33:16Z" level=error msg="Failure in zone green.staging.example.com. [Id: /hostedzone/ZABCABCABC] when submitting change batch: InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.green.staging.example.com'\n\tstatus code: 400, request id: 3065fc1c-06cc-4550-b230-652e7934d00f"
time="2024-05-06T20:33:17Z" level=error msg="Failed to do run once: soft error\nfailed to submit all changes for the following zones: [/hostedzone/ZABCABCABC]"

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 4, 2024
@nitrocode
Contributor

/remove-lifecycle rotten

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 4, 2024
@nitrocode
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 4, 2024
@alfredkrohmer
Contributor

@Liz4v This issue is about external-dns ceasing to update Route 53 when it encounters an error. I think that bug is fixed for good by now; do you agree? The problem you have (external-dns trying to create a record with an empty name) seems unrelated to this bug (the only connection I see is the original issue description, where the bug was triggered by an endpoint with an empty label), and without seeing your external-dns configuration and all possible endpoint sources in your cluster it's not possible to debug where that endpoint is coming from.

@nitrocode What's your reason for keeping the ticket open?

I think this might be solved with #4166, no? At least it seems to be implemented already for AWS, OCI and Azure.

I think this actually does fix the issue. Have you updated external-dns to a recent version and verified this?
