external-dns stops updating route53 on error #421
Comments
I see it the opposite way: allowing only partial updates is a way to enter a scenario where the state of records is unknown and hence unmanageable. External-DNS has to guarantee the safety of updates, e.g. not overwriting records belonging to other parties and created via other means. If we do not do batch/atomic updates, this is no longer guaranteed. We could introduce an insecure mode, but with that kind of mode you are on your own, and with a misbehaving system, diagnosing problems would become really difficult (as opposed to checking the logs and deleting a resource with an apparently invalid hostname).
Yeah, I am seeing only initially created records updated, and not existing ones.
@lemaral That is correct, and thanks for bringing it up. I kind of disagree with @ideahitme here: a single broken endpoint should not block updates for all the other, valid endpoints. However, there's a strong connection between an A/CNAME/ALIAS record and its corresponding TXT ownership record, so those do have to stay in the same batch.
@linki, thanks for the follow-up. Indeed, an A/CNAME/ALIAS record and the corresponding TXT record must stay in the same batch to keep operations atomic. My concern (as you highlighted it) is definitely about one failing endpoint stopping all further modifications.
@szuecs @Raffo @ideahitme @mikkeloscar Have a look at @lemaral's comment and let's see if we can agree on this.
@lemaral @linki the problem is more complicated. What if you want to switch a record from an old, working application to a new one which is somehow broken? Is that change within one transaction? Then I would also like to discuss how the implementation should look, because we have to be efficient in AWS calls (the fewer AWS calls the better, because of rate limiting). I guess the same is true for other cloud providers. I thought about splitting the changeset into two, applying both halves (binary search), and iterating on the half that errors, splitting again until the broken record is found so that the resource in Kubernetes can be marked as an "external-dns broken record". @lemaral I would be happy to get PRs from you for this change, but please try to describe here how this would work, so that we can agree on it and spot any design issue early.
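For illustration, here is a minimal Go sketch of that binary-search idea. The `Change` type and `applyBatch` function are hypothetical stand-ins, not external-dns's actual types or the real Route 53 call; a real implementation would also have to keep related changes (an A/CNAME/ALIAS record plus its TXT ownership record, or the delete/create pair of a move) in the same half, so the safety guarantees discussed above still hold.

```go
package main

import "fmt"

// Change is a stand-in for one record change (create/update/delete).
// external-dns's real types differ; this only illustrates the splitting idea.
type Change struct {
	Action string
	Name   string
}

// applyBatch stands in for a single, atomic provider call (e.g. one Route 53
// ChangeResourceRecordSets request): either the whole batch succeeds or none of it does.
func applyBatch(changes []Change) error {
	// ... real provider call would go here ...
	return nil
}

// applyWithIsolation tries the whole batch first. On failure it splits the
// batch in half and recurses, so a single broken change costs O(log n) extra
// calls to isolate instead of blocking every other change.
func applyWithIsolation(changes []Change, broken *[]Change) {
	if len(changes) == 0 {
		return
	}
	if err := applyBatch(changes); err == nil {
		return
	}
	if len(changes) == 1 {
		// Cannot split further: this is an offending change. It could now be
		// surfaced on the Kubernetes resource (event/annotation) as suggested above.
		*broken = append(*broken, changes[0])
		return
	}
	mid := len(changes) / 2
	applyWithIsolation(changes[:mid], broken)
	applyWithIsolation(changes[mid:], broken)
}

func main() {
	changes := []Change{
		{Action: "UPSERT", Name: "good.example.org"},
		{Action: "UPSERT", Name: ".broken.example.org"}, // empty first label
	}
	var broken []Change
	applyWithIsolation(changes, &broken)
	fmt.Println("changes that could not be applied:", broken)
}
```

Isolating each broken change costs O(log n) extra API calls, which matters for rate limiting but is bounded, so the recursion terminates.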
I might misunderstand the point you are making, but switching a record from one application to another would just mean changing the hostname of two resources (e.g. two ingresses), and external-dns would just pick this up and eventually arrive at the new desired state, which in this case might be a broken state if the hostname defined by the user is invalid.
@szuecs, indeed, binary search seems the most efficient with respect to API rate limiting; we only need to ensure we're not looping forever.
@lemaral Think about a changeset that moves a record: one delete, one create. If we apply the delete, then get rate limited, and only later do the create, we will have created downtime for our users.
This issue has bitten us in an unexpected way. While a binary search would indeed be optimal with respect to API call usage, a simpler solution would be to implement it as a fallback strategy: try a batch update first, and switch to individual pairs of updates if the batch one fails.
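A rough sketch of that fallback strategy, again with hypothetical `Change`/`applyBatch` stand-ins rather than the real provider types: one batch call is attempted first, and only when it fails does each change get its own call.

```go
package provider

import "fmt"

// Change and applyBatch are the same kind of hypothetical stand-ins as in the
// earlier sketch: one applyBatch call represents one atomic provider request.
type Change struct {
	Action string
	Name   string
}

func applyBatch(changes []Change) error {
	// ... real provider call would go here ...
	return nil
}

// applyWithFallback tries the cheap path first (one call for the whole batch).
// Only if that fails does it fall back to one call per change, so valid
// records are still applied and only the broken ones are reported back.
func applyWithFallback(changes []Change) []error {
	if err := applyBatch(changes); err == nil {
		return nil
	}
	var failures []error
	for _, c := range changes {
		if err := applyBatch([]Change{c}); err != nil {
			failures = append(failures, fmt.Errorf("%s %s: %w", c.Action, c.Name, err))
		}
	}
	return failures
}
```

As discussed earlier in the thread, changes that belong together (an A/CNAME/ALIAS record plus its TXT ownership record, or a delete/create pair) would need to be grouped into one sub-batch rather than applied strictly one by one.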
Hi guys, I have a similar problem:
@zioalex which version are you using, and can you post some logs from when external-dns starts?
Hi @szuecs, some logs follow. The current version is v0.4.3.
Since the error message given by the AWS provider contains the offending DNS record, I think we could just parse it, skip that record from the batch, and run another Upsert API call with the rest of the records in the initial batch.
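A hedged sketch of what that could look like: the regular expression below relies on the wording of the Route 53 InvalidChangeBatch message quoted in this issue, which is not a stable API, and the `Change` type is again a made-up stand-in. It only shows how the offending record name could be extracted and filtered out before retrying the rest of the batch.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// offenderRe pulls the record name out of a Route 53 InvalidChangeBatch message like:
//   InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty)
//   encountered with '.sudbdomain.domain.com'
// The message wording is not a stable API, so this is a best-effort heuristic.
var offenderRe = regexp.MustCompile(`encountered with '([^']+)'`)

// Change is a hypothetical stand-in for one record change in the batch.
type Change struct {
	Action string
	Name   string
}

// withoutOffenders drops every change whose name appears in the error message,
// so the remainder of the batch can be retried in a second call.
func withoutOffenders(changes []Change, err error) []Change {
	offenders := map[string]bool{}
	for _, m := range offenderRe.FindAllStringSubmatch(err.Error(), -1) {
		offenders[m[1]] = true
	}
	var kept []Change
	for _, c := range changes {
		if !offenders[c.Name] {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	err := errors.New("InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.sudbdomain.domain.com'")
	batch := []Change{
		{Action: "UPSERT", Name: "good.domain.com"},
		{Action: "UPSERT", Name: ".sudbdomain.domain.com"},
	}
	fmt.Println("retrying with:", withoutOffenders(batch, err))
}
```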
We are hitting this issue and currently have 57 ingresses that use the FQDNs we are monitoring with external-dns. On smaller clusters we typically have 10-15 ingresses monitored and do not hit this issue.
@cnuber thanks for adding more information here! @linki @njuettner we should work on a fix; the numbers are quite low, so I bet many people hit this issue.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Can we reopen this? It is still a valid issue.
/reopen
@szuecs: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/important-soon
Can we reopen this issue please and exclude it from the bot? It would be nice to have a flag for this. Edit: I see the implementation from #1209, which separates failed changes from successful changes and reruns the failed changes in a separate batch. Very clever! I'll try the latest version, 0.13.5, again and see if it resolves my issue. Thank you.
/reopen
@linki: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules: after 90d of inactivity the issue is marked stale, after a further 30d of inactivity it is marked rotten, and after another 30d of inactivity it is closed.
You can mark this issue as fresh with /remove-lifecycle stale, close it with /close, or offer to help out with issue triage.
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules: after 90d of inactivity the issue is marked stale, after a further 30d of inactivity it is marked rotten, and after another 30d of inactivity it is closed.
You can mark this issue as fresh with /remove-lifecycle rotten, close it with /close, or offer to help out with issue triage.
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
Is there a way to exclude this ticket from getting auto-closed? It's a notable bug.
/remove-lifecycle rotten
I think this might be solved with #4166, no? At least it seems to be implemented already for AWS, OCI and Azure.
@alfredkrohmer Well, it's more of a workaround than a fix. Now it just loops on these spurious records, but there is still no explanation as to why it decided we need them. I have no
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules: after 90d of inactivity the issue is marked stale, after a further 30d of inactivity it is marked rotten, and after another 30d of inactivity it is closed.
You can mark this issue as fresh with /remove-lifecycle stale, close it with /close, or offer to help out with issue triage.
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules: after 90d of inactivity the issue is marked stale, after a further 30d of inactivity it is marked rotten, and after another 30d of inactivity it is closed.
You can mark this issue as fresh with /remove-lifecycle rotten, close it with /close, or offer to help out with issue triage.
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten
@Liz4v This issue is about external-dns stopping to update Route 53 when it encounters an error. I think this bug is fixed for good by now - do you agree? The problem you have (external-dns trying to create a record with an empty name) is, I think, unrelated to this bug; the only connection I see is the initial issue description, where the bug was triggered by an endpoint with an empty name. Without seeing your external-dns configuration and all possible endpoint sources in your cluster, it is also not possible to debug where this endpoint is coming from. @nitrocode What's your reason for keeping the ticket open?
I think that this actually does fix the issue. Have you updated external-dns to a recent version and verified this?
Issue: an (obviously broken) service hostname annotation is creating the following error on route53:
"InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.sudbdomain.domain.com'
status code: 400, request id: 115b8642-e047-11e7-83a0-a9e81aaef2a3"
and apparently this is preventing any further modification of the route53 zone by external-dns.
If this is the expected behaviour (any error in the batch means no further modification can be processed), it does not seem very robust as it is (it's also difficult to diagnose).
I'm interested in confirmation and workarounds :)
Thanks