
[Feature] Node local DNS #1492

Closed
palma21 opened this issue Mar 11, 2020 · 25 comments
Labels: stale Stale issue
palma21 (Member) commented Mar 11, 2020

Re-architect azure CNI for more resilient DNS

@palma21 palma21 self-assigned this Mar 11, 2020
timja commented Jun 8, 2020

Any update on this?

We deployed nodelocaldns and it made a massive difference for us (#1326 (comment)).

palma21 (Member, Author) commented Jun 10, 2020

Good to know, we are designing it as we speak

jstewart612 commented

@palma21 one thing you may want to take back to your folks: the currently proposed nodelocal DaemonSet does not tolerate enough taints, so it can inadvertently fail to schedule on nodes that have custom taints.

I used these tolerations and they resolved that issue for us:

```yaml
tolerations:
  - operator: Exists
    effect: NoExecute
  - operator: Exists
    effect: NoSchedule
```

JennyLJY commented Aug 6, 2020

Any updates on this feature? What is the ETA?
@palma21

palma21 (Member, Author) commented Aug 7, 2020

It's currently in the committed items for this semester and under design review. We will have a more concrete ETA by the end of the month.

curtdept commented

Hopefully this is high on the priority list; it would be amazing to have. I get DNS latency bursts constantly.

bergerx commented Sep 11, 2020

Any update on the ETA? If there is a rough timeline now, we may prefer to wait a bit longer rather than invest engineering time into something that would be obsolete soon after.

curtdept commented

@bergerx this will make you happy :)

https://github.com/curtdept/aks_nodelocaldns/blob/main/nodelocaldns.yaml

MiyKh commented Oct 19, 2020

My AKS cluster is running with kubenet; the DNS service IP is something like 172.17.80.10, but I can't figure out what the <node-local-address> parameter is. Is it an IP from the service CIDR? The pod CIDR?

PSanetra commented

@MiyKh the <node-local-address> is just an arbitrary, unused IP in the 169.254.20.0/16 subnet. It is configured on each node and is not reachable from outside the node.
See https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/#configuration

> The local listen IP address for NodeLocal DNSCache can be any IP in the 169.254.20.0/16 space or any other IP address that can be guaranteed to not collide with any existing IP. This document uses 169.254.20.10 as an example.
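Per the upstream docs linked above, installation boils down to substituting placeholders into the published nodelocaldns.yaml manifest. A minimal sketch, using the example values from this thread and a two-line stand-in template rather than the real manifest:

```shell
# Example values (the kube-dns ClusterIP 172.17.80.10 comes from this thread;
# check yours with: kubectl get svc -n kube-system kube-dns)
kubedns=172.17.80.10
domain=cluster.local
localdns=169.254.20.10   # link-local listen address, must not collide with anything in use

# Stand-in for the real upstream template, which contains many such placeholders:
printf 'bind __PILLAR__LOCAL__DNS__\nforward . __PILLAR__DNS__SERVER__\n' > nodelocaldns.yaml

# Substitute the placeholders, as described in the NodeLocal DNSCache docs:
sed "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" \
    nodelocaldns.yaml > nodelocaldns-final.yaml
cat nodelocaldns-final.yaml
# Then apply the substituted manifest: kubectl create -f nodelocaldns-final.yaml
```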

MiyKh commented Oct 27, 2020

@PSanetra Thanks for the clarification. The node-local DNS DaemonSet is deployed into the kube-system namespace, but it gets deleted by Azure's sync system. How can we force AKS to skip deletion of this component?

EDIT: This can be done by removing these labels from the DaemonSet:

```yaml
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: Reconcile
```
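The label removal can be done with kubectl; a sketch, assuming the DaemonSet is named node-local-dns (verify the name in your cluster first):

```shell
# Stop the addon-manager from reconciling (and deleting) the DaemonSet by
# removing the two labels. A trailing "-" after a label key removes that label.
# The DaemonSet name "node-local-dns" is an assumption; verify with:
#   kubectl -n kube-system get daemonsets
kubectl -n kube-system label daemonset node-local-dns \
  kubernetes.io/cluster-service- \
  addonmanager.kubernetes.io/mode-
```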

PSanetra commented
@MiyKh interesting. We still have those labels set on the daemonset, but it is not getting deleted. We are running AKS 1.17.11.

curtdept commented

> @MiyKh interesting. We still have those labels set on the daemonset, but it is not getting deleted. We are running AKS 1.17.11.

Same here on 1.19

MiyKh commented Oct 27, 2020

I see the same behavior as in issue #1435.

curtdept commented Oct 27, 2020

> I see the same behavior as in issue #1435.

What AKS version?

MiyKh commented Oct 27, 2020

> > I see the same behavior as in issue #1435.
>
> What AKS version?

This is a 1.18.8 cluster with kubenet networking. Removing the labels was the solution, but I guess I would still need to redeploy it after a cluster upgrade.

4c74356b41 commented

@curtdept @palma21 hey folks, do you have any ETA for us? Will this be backported to 1.17.x? Thanks!

djsly (Contributor) commented Dec 11, 2020

Any news on an ETA?

joaguas commented Feb 5, 2021

@4c74356b41 @djsly
Azure CNI now defaults to transparent mode, which uses layer-3 routing. Pod-to-pod traffic therefore no longer relies on conntrack and is not affected by its race condition, which should mitigate the 5s delays on DNS lookups.

Upgrading a cluster that is still using bridge mode will switch it to transparent mode.

djsly (Contributor) commented Feb 5, 2021

@joaguas thanks! My question was about querying the status: how can we check whether our clusters have already picked up this update, given that we have performed a few upgrades?

4c74356b41 commented

You can just run `az aks nodepool upgrade --node-image-only`; this rolls out the latest OS base image containing the fix (no Kubernetes version upgrade). If you already have it, the command simply returns success in about 30 seconds.

djsly (Contributor) commented Feb 5, 2021

OK, so CNI transparent mode is baked into the node image; good to know. I will try to find out which base image contains the fix, then.

thanks @4c74356b41

joaguas commented Feb 5, 2021

Hi @djsly, an easy way is to get a shell on one of the nodes and check either the interfaces or the route table.
If there is an interface called azure0, the node is still using a bridge.

You can also check the route table (`ip route show`): if you see multiple routes for the veth interfaces (azvXXXXXXXX), the node is already using L3 routing (transparent mode).
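The two checks above can be combined into a quick sketch to run from a shell on the node (interface and route naming as described in this comment; not an official detection method):

```shell
# Rough check of the Azure CNI mode on an AKS node
# (azure0/azv* names as described above).
if ip link show azure0 >/dev/null 2>&1; then
  echo "bridge mode: azure0 bridge interface present"
elif ip route show | grep -q 'dev azv'; then
  echo "transparent mode: per-pod routes via azv* veth interfaces"
else
  echo "could not determine CNI mode"
fi
```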

Thanks for the tip @4c74356b41

@ghost ghost added the stale Stale issue label Apr 7, 2021
ghost commented Apr 7, 2021

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ghost ghost closed this as completed Apr 22, 2021
ghost commented Apr 22, 2021

This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. palma21, feel free to comment again within the next 7 days to reopen, or open a new issue after that time if you still have a question, issue, or suggestion.

@ghost ghost locked as resolved and limited conversation to collaborators May 22, 2021