external-dns stops updating route53 on error #421

Open
lemaral opened this issue Dec 13, 2017 · 48 comments · Fixed by #1209

Comments

@lemaral

lemaral commented Dec 13, 2017

Issue: an (obviously broken) service hostname annotation is creating the following error on route53:
"InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.sudbdomain.domain.com'
status code: 400, request id: 115b8642-e047-11e7-83a0-a9e81aaef2a3"
and apparently this is preventing any further modification of the route53 zone by external-dns.
If this is the expected behaviour (any error in the batch means no further modifications can be processed), it does not seem very robust (it's also difficult to diagnose).
I'm interested in confirmation and workarounds :)
Thanks

@ideahitme

ideahitme commented Dec 14, 2017

I see it the opposite way: allowing only partial updates is a way to end up in a scenario where the state of the records is unknown and hence unmanageable. External-DNS has to guarantee the safety of updates, e.g. not overwriting records that belong to other parties and were created via other means. If we do not do batch/atomic updates, this is no longer guaranteed. We could introduce an insecure mode, but with that kind of mode you are on your own, and with a misbehaving system diagnosis would become really difficult (as opposed to checking the logs and deleting the resource with the apparently invalid hostname).

@srflaxu40

Yeah, I am seeing only the initially created records updated, not the existing ones.

@linki
Member

linki commented Jan 3, 2018

@lemaral That is correct and thanks for bringing it up.

I kind of disagree with @ideahitme here: a single broken Endpoint should not lead to the overall malfunction of ExternalDNS for literally all other modifications. There's no inherent reason why one broken Endpoint should affect other, completely unrelated Endpoints.

However, there's a strong connection between an Endpoint's A/CNAME/ALIAS record and its corresponding TXT ownership record, so these two must be modified in a transaction.
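
To make that coupling concrete, here is a minimal sketch in Go (not the actual external-dns code; the helper names, the TTL and the TXT value format are assumptions) of building an endpoint's CNAME record and its TXT ownership record as a pair inside one Route 53 change batch, using the aws-sdk-go route53 types:

```go
// Package sketch is an illustrative example, not external-dns source code.
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/route53"
)

// changesForEndpoint is a hypothetical helper: it returns the two changes
// (traffic-carrying record + TXT ownership record) that have to be applied
// together for a single endpoint.
func changesForEndpoint(name, target, ownerID string) []*route53.Change {
	upsert := func(recordType, value string) *route53.Change {
		return &route53.Change{
			Action: aws.String(route53.ChangeActionUpsert),
			ResourceRecordSet: &route53.ResourceRecordSet{
				Name:            aws.String(name),
				Type:            aws.String(recordType),
				TTL:             aws.Int64(300),
				ResourceRecords: []*route53.ResourceRecord{{Value: aws.String(value)}},
			},
		}
	}
	return []*route53.Change{
		// The record that actually routes traffic for the endpoint.
		upsert(route53.RRTypeCname, target),
		// The ownership marker; the value format mirrors what external-dns
		// writes, but treat it as an assumption in this sketch.
		upsert(route53.RRTypeTxt, `"heritage=external-dns,external-dns/owner=`+ownerID+`"`),
	}
}

// buildBatch wraps any number of such pairs into a single atomic request.
func buildBatch(zoneID string, changes []*route53.Change) *route53.ChangeResourceRecordSetsInput {
	return &route53.ChangeResourceRecordSetsInput{
		HostedZoneId: aws.String(zoneID),
		ChangeBatch:  &route53.ChangeBatch{Changes: changes},
	}
}
```

Whatever error-handling strategy gets adopted, the two changes produced for one endpoint would have to stay in the same ChangeBatch so they still succeed or fail together.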

@lemaral
Author

lemaral commented Jan 3, 2018

@linki, thanks for the follow-up. Indeed, the A/CNAME/ALIAS record and the corresponding TXT record must stay in the same batch to keep operations atomic. My concern (as you highlighted) is definitely about one failing endpoint stopping all further modifications.
I understand the purpose of stacking multiple endpoints in the same batch for optimisation; however, in case of an error the valid endpoints should still be processed correctly, IMHO. If there is agreement on this and no available bandwidth among the current contributors, I'm willing to take this on and see how it can be done.

@linki
Member

linki commented Jan 8, 2018

@szuecs @Raffo @ideahitme @mikkeloscar Have a look at @lemaral's comment and let's see if we can agree on this.

@szuecs
Contributor

szuecs commented Jan 8, 2018

@lemaral @linki the problem is more complicated. What if you want to switch a record from an old, working application to a new one that is somehow broken, or the change happens within one transaction?

Then I would also like to discuss how the implementation should look, because we have to be efficient with AWS calls (the fewer AWS calls the better, because of rate limiting). I guess the same is true for other cloud providers. I thought about splitting the changeset into two, applying both halves (binary search), and iterating on the failing half, splitting it again until you find the broken record and can mark the resource in Kubernetes as an "external-dns broken record"; see the sketch after this comment.

@lemaral I would be happy to get PRs from you for this change, but please try to describe here how this would work, so that we can agree on it and spot any design issues early.
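
As an aside, a rough sketch of that split-and-retry (binary search) idea, with hypothetical names (Change, applyBatch) rather than external-dns's real plan and provider types:

```go
package sketch

import "fmt"

// Change is a stand-in for one provider change (external-dns's real plan and
// provider types are richer); applyBatch stands in for one atomic API call.
type Change struct {
	Name string
	Type string
}

// applyResilient tries the whole batch in one call; on error it splits the
// batch in half and recurses (binary search), so one broken endpoint only
// sinks its own minimal sub-batch instead of the whole sync loop.
func applyResilient(changes []Change, applyBatch func([]Change) error) []error {
	if len(changes) == 0 {
		return nil
	}
	err := applyBatch(changes)
	if err == nil {
		return nil
	}
	if len(changes) == 1 {
		// The smallest unit still fails: report it and move on. In external-dns
		// the smallest unit would really be "record + TXT ownership record".
		return []error{fmt.Errorf("skipping broken change %s %s: %w", changes[0].Type, changes[0].Name, err)}
	}
	mid := len(changes) / 2
	errs := applyResilient(changes[:mid], applyBatch)
	return append(errs, applyResilient(changes[mid:], applyBatch)...)
}
```

The recursion is bounded because the slice shrinks at every level, and with k broken records among n changes the extra API calls are on the order of k·log n, which keeps the rate-limiting cost low.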

@mikkeloscar
Contributor

What if you want to switch a record from an old, working application to a new one that is somehow broken, or the change happens within one transaction?

I might be misunderstanding the point you are making, but switching a record from one application to another would just mean changing the hostnames of two resources (e.g. two ingresses), and external-dns would pick this up and eventually arrive at the new desired state, which in this case might be a broken state if the hostname defined by the user is invalid.

@lemaral
Author

lemaral commented Jan 29, 2018

@szuecs, indeed binary search seems the most efficient approach with respect to API rate limiting; we only need to ensure we're not looping forever.
As for the question you brought up, can you elaborate? I don't see how adding this error handling could make things worse.

@szuecs
Contributor

szuecs commented Jan 30, 2018

@lemaral Think about a changeset that moves a record: 1 delete, 1 create. If we apply the delete, then get rate limited, and only afterwards do the create, we will have created downtime for our users.

@gtie

gtie commented Mar 15, 2018

This issue has bitten us in an unexpected way. While a binary search would indeed be optimal with respect to API call usage, a simpler solution would be to implement it as a fallback strategy: try a batch update first, and switch to individual pairs of updates if the batch fails.
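
For comparison, a minimal sketch of that fallback strategy, reusing the hypothetical Change type and applyBatch function from the earlier sketch; pairs groups each endpoint's record together with its TXT ownership record:

```go
// applyWithFallback keeps the happy path at a single API call for the whole
// batch; only when that fails does it retry one endpoint pair at a time, so
// each record stays together with its TXT ownership record.
func applyWithFallback(pairs [][]Change, applyBatch func([]Change) error) []error {
	var all []Change
	for _, pair := range pairs {
		all = append(all, pair...)
	}
	if err := applyBatch(all); err == nil {
		return nil // common case: nothing broken, one API call
	}
	var errs []error
	for _, pair := range pairs {
		if err := applyBatch(pair); err != nil {
			errs = append(errs, err) // log and skip only the broken endpoint
		}
	}
	return errs
}
```

Compared to the binary search this costs one call per endpoint whenever a batch fails, but it is simpler and leaves the failure-free path untouched.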

@zioalex

zioalex commented Aug 2, 2018

Hi guys, I have a similar problem:
time="2018-08-02T13:50:08Z" level=error msg="InvalidChangeBatch: The request contains an invalid set of changes for a resource record set 'A test.example..com.' status code: 400, request id: f7a6fff3-965a-11e8-8239-47abe2865262"
Though I do not see any real problem with the changeset, and no other updates can be made. Is there any update on this?

@szuecs
Contributor

szuecs commented Aug 3, 2018

@zioalex which version are you using, and can you post some logs from when external-dns starts?

@zioalex

zioalex commented Aug 6, 2018

Hi @szuecs, here are some logs:
time="2018-08-02T13:15:19Z" level=info msg="config: &{Master: KubeConfig: Sources:[service ingress] Namespace: FQDNTemplate: Compatibility: Provider:aws GoogleProject: DomainFilter:[] AzureConfigFile:/etc/kubernetes/azure.json AzureResourceGroup: Policy:upsert-only Registry: txt TXTOwnerID:my_domain.com TXTPrefix: Interval:1m0s Once:false DryRun:false LogFormat:text MetricsAddress::7979 Debug:false}" time="2018-08-02T13:15:19Z" level=info msg="Connected to cluster at https://100.64.0.1:443" time="2018-08-02T13:15:21Z" level=info msg="Changing records: CREATE { Action: "CREATE",

The version in use is v0.4.3.

@cristim

cristim commented Aug 27, 2018

Since the error message given by the AWS provider contains the offending DNS record, I think we could just parse it, drop that record from the batch, and run another Upsert API call with the rest of the records from the initial batch.
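
A hedged sketch of that parse-and-retry idea, again reusing the hypothetical Change type from the earlier sketch; the regular expression is an assumption based on the error text quoted earlier in this thread, not a stable AWS contract, which is also the main weakness of the approach:

```go
package sketch

import (
	"regexp"
	"strings"
)

// offendingName tries to pull the record name out of an InvalidChangeBatch
// error such as "... encountered with '.subdomain.domain.com'". The error
// text is not a stable API, so this is best-effort only.
var offendingName = regexp.MustCompile(`encountered with '([^']+)'`)

// filterOffender drops every change whose name matches the one parsed from
// the batch error, so the remaining changes can be resubmitted.
func filterOffender(changes []Change, batchErr error) []Change {
	m := offendingName.FindStringSubmatch(batchErr.Error())
	if m == nil {
		return changes // offender not identifiable; give up filtering
	}
	bad := strings.TrimSuffix(m[1], ".")
	var kept []Change
	for _, c := range changes {
		if strings.TrimSuffix(c.Name, ".") != bad {
			kept = append(kept, c)
		}
	}
	return kept
}
```

Note that the TXT ownership record only shares the same name when no TXT prefix is configured; with a prefix, the filter would have to account for that as well so the pair is skipped together.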

@cnuber

cnuber commented Jan 2, 2019

We are hitting this issue and currently have 57 ingresses that use the FQDNs we monitor with external-dns. On smaller clusters we typically have 10-15 monitored ingresses and do not hit this issue.

@szuecs
Contributor

szuecs commented Jan 2, 2019

@cnuber thanks for adding more information here!

@linki @njuettner we should work on a fix; the numbers are quite low, so I bet many people hit this issue.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2019
@szuecs
Contributor

szuecs commented Apr 28, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 27, 2019
@szuecs
Contributor

szuecs commented Jul 27, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 27, 2019
@peimanja

Can we reopen this?! It is still a valid issue.

@szuecs
Contributor

szuecs commented Nov 22, 2022

/reopen

@k8s-ci-robot
Contributor

@szuecs: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 22, 2022
@szuecs
Contributor

szuecs commented Nov 22, 2022

/important-soon

@nitrocode
Contributor

nitrocode commented Aug 7, 2023

Can we reopen this issue please and exclude it from the bot?

It would be nice to have a flag such as --early-exit=true which defaults to true and can be overridden to false.

Edit: I see the implementation from #1209, where failed changes are separated from the successful ones and rerun in a separate batch. Very clever! I'll try the latest version, 0.13.5, again and see if it resolves my issue. Thank you.

@linki
Member

linki commented Aug 8, 2023

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Aug 8, 2023
@k8s-ci-robot
Contributor

@linki: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 25, 2024
@nitrocode
Contributor

Is there a way to exclude this ticket from getting auto-closed? It's a notable bug.

@nitrocode
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 25, 2024
@alfredkrohmer
Contributor

I think this might be solved with #4166, no? At least it seems to be implemented already for AWS, OCI and Azure.

@Liz4v

Liz4v commented May 6, 2024

@alfredkrohmer Well, it's more of a workaround than a fix. Now it just loops over these spurious records, but there is still no explanation as to why it decided we need them. I have no ingress or service with an empty name, so why is it trying to do that?

time="2024-05-06T20:33:16Z" level=info msg="Applying provider record filter for domains: [green.staging.example.com. .green.staging.example.com.]"
time="2024-05-06T20:33:16Z" level=info msg="Desired change: CREATE .green.staging.example.com A [Id: /hostedzone/ZABCABCABC]"
time="2024-05-06T20:33:16Z" level=info msg="Desired change: CREATE .green.staging.example.com TXT [Id: /hostedzone/ZABCABCABC]"
time="2024-05-06T20:33:16Z" level=info msg="Desired change: CREATE cname-.green.staging.example.com TXT [Id: /hostedzone/ZABCABCABC]"
time="2024-05-06T20:33:16Z" level=error msg="Failure in zone green.staging.example.com. [Id: /hostedzone/ZABCABCABC] when submitting change batch: InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.green.staging.example.com'\n\tstatus code: 400, request id: 3065fc1c-06cc-4550-b230-652e7934d00f"
time="2024-05-06T20:33:17Z" level=error msg="Failed to do run once: soft error\nfailed to submit all changes for the following zones: [/hostedzone/ZABCABCABC]"

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 4, 2024
@nitrocode
Contributor

/remove-lifecycle rotten

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 4, 2024
@nitrocode
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 4, 2024
@alfredkrohmer
Contributor

@Liz4v This issue is about external-dns ceasing to update Route 53 when it encounters an error. I think that bug is fixed for good by now; do you agree? The problem you have (external-dns trying to create a record with an empty name) seems unrelated to this bug (the only connection I see is the original issue description, where the bug was triggered by an endpoint with an empty label), and without seeing your external-dns configuration and all possible endpoint sources in your cluster it's not possible to debug where that endpoint is coming from.

@nitrocode What's your reason for keeping the ticket open?

I think this might be solved with #4166, no? At least it seems to be implemented already for AWS, OCI and Azure.

I think this actually does fix the issue. Have you updated external-dns to a recent version and verified this?
