Node Local DNS Cache #828

Closed
Tracked by #988
alex-dabija opened this issue Feb 17, 2022 · 21 comments
Assignees
Labels
area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service kind/story provider/aws Related to cloud provider Amazon AWS provider/aws-china Related to Amazon AWS in China provider/azure Related to cloud provider Microsoft Azure target-release/17.3.0 team/phoenix Team Phoenix

Comments

@alex-dabija

alex-dabija commented Feb 17, 2022

User Story

- As a cluster admin, I want to enable node local DNS cache in order to have a reliable DNS solution under high load for the applications running on the cluster.

Details, Background

Under high load, the current DNS solution cannot keep up. Applications that don't cache DNS queries (e.g. Node.js ones) put additional strain on it.

Resources

Changes

@alex-dabija alex-dabija added area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service provider/azure Related to cloud provider Microsoft Azure provider/aws Related to cloud provider Amazon AWS kind/story provider/aws-china Related to Amazon AWS in China team/phoenix Team Phoenix labels Feb 17, 2022
@whites11 whites11 self-assigned this Mar 3, 2022
@whites11

whites11 commented Mar 3, 2022

ok the app works.
We might need to adjust the network policies of any other application that has egress rules for connecting to coredns.
The label selector for the DNS service changes when the local node cache app is installed.
This applies so far to the following applications:

  • net-exporter

Unfortunately this means that installing this application to an old cluster will make net-exporter page.
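
To illustrate why the selector change matters: a NetworkPolicy egress rule that selects its destination by `matchLabels` only allows traffic to pods carrying every listed key/value pair. The sketch below is a minimal Python model of that matching, not the real Kubernetes implementation, and the label values (`k8s-app: coredns`, `k8s-app: node-local-dns`) are hypothetical placeholders rather than the labels these apps actually use.

```python
def selector_matches(match_labels: dict, pod_labels: dict) -> bool:
    """Return True if every key/value in match_labels is present in pod_labels.

    An empty match_labels matches every pod, mirroring an empty selector.
    """
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical labels, for illustration only.
egress_selector = {"k8s-app": "coredns"}      # what an existing egress rule targets
labels_before = {"k8s-app": "coredns"}        # DNS pods before the change
labels_after = {"k8s-app": "node-local-dns"}  # DNS pods after the change

print(selector_matches(egress_selector, labels_before))
print(selector_matches(egress_selector, labels_after))
```

Under these assumptions, the rule matches the first set of labels but not the second, so DNS traffic from the application would be dropped until its egress rule is updated.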

@whites11

whites11 commented Mar 4, 2022

unfortunately this does not seem to work on AWS. Still not sure why.

@whites11

whites11 commented Mar 5, 2022

Potentially related upstream issues

kubernetes/dns#480
kubernetes

@whites11

whites11 commented Mar 7, 2022

so, as soon as I install the local dns thingie on an AWS cluster, kube-state-metrics starts to fail with

E0307 10:14:05.453909       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.23.0/tools/cache/reflector.go:167: Failed to watch *v1.ValidatingWebhookConfiguration: failed to list *v1.ValidatingWebhookConfiguration: Get "https://172.31.0.1:443/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations?resourceVersion=26524": dial tcp 172.31.0.1:443: i/o timeout

and that causes a cascade of errors

@whites11

whites11 commented Mar 7, 2022

I see martian source errors in the destination machine.
I feel like for some reason packets are using the wrong network interface to leave the node when the local cache thingie is in place.
Not sure why node to node works though

@whites11

whites11 commented Mar 7, 2022

so it seems that while the app is running, no traffic from pods to the nodes works.
No ping, no TCP, no UDP.
That also applies to resources in the VPC such as the AWS DNS endpoint and the instance metadata endpoint.
Deleting the app is not enough to fix it; a reboot is also needed. Still unsure why (i.e. which resources are recreated or fixed by a reboot).
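
For anyone wanting to reproduce the TCP part of this check from inside a pod, here is a minimal sketch using only the Python standard library. The helper name `tcp_reachable` is hypothetical; the addresses mentioned afterwards are the API service IP from the log above and the well-known EC2 instance metadata address, and would need adjusting for another cluster.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, connection refused and unreachable networks.
        return False


def probe_all(targets, timeout: float = 3.0) -> dict:
    """Probe each (host, port) pair and return a {(host, port): bool} mapping."""
    return {(host, port): tcp_reachable(host, port, timeout) for host, port in targets}
```

For example, `probe_all([("172.31.0.1", 443), ("169.254.169.254", 80)])` run from a pod would be expected to report `False` for both targets while the problem described here is present. This only covers TCP; ping and UDP need separate checks.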

@whites11

whites11 commented Mar 7, 2022

maybe this is a known bug, but I am still unsure: aws/amazon-vpc-cni-k8s#1662

@whites11

whites11 commented Mar 7, 2022

so I came to the conclusion that the iptables rules set up by the upstream component are not working with aws-cni.
I saw some references from EKS users talking about this, but no solution so far, unfortunately.
Will think about it further.

@whites11

whites11 commented Mar 8, 2022

I feel like we are hitting this limitation: https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html

see last point in the Considerations section

@whites11

whites11 commented Mar 8, 2022

I couldn't pinpoint the root cause, but I am sure this is not working with aws-cni for some reason.
Releasing to make the thing available with azure, but otherwise blocked.

@whites11

The upstream aws-cni issue is still open; the fix is expected to be released in version 1.11.0.

@whites11

whites11 commented Apr 6, 2022

I think the supposed fix was merged in AWS-CNI: aws/amazon-vpc-cni-k8s#1907

@alex-dabija

Waiting for v1.11.0 to be released.

@whites11

with 1.11.0 the node local dns cache thing seems to be working, but for some reason kiam crashes.

@alex-dabija

You gain one, you lose one :( ...

@whites11

Problem is that the node-local coredns instance can't talk to the "traditional" coredns pod

[ERROR] plugin/errors: 2 kubernetes.default.svc.cluster.local. A: dial tcp 172.31.88.218:53: connect: connection refused

@whites11

ok fixed that (network policy problem) now for some reason only full names resolve:

/ # dig kiam-server +short
/ # dig kiam-server.kube-system.svc.cluster.local +short
10.10.11.6

@whites11

> ok fixed that (network policy problem) now for some reason only full names resolve:
>
> / # dig kiam-server +short
> / # dig kiam-server.kube-system.svc.cluster.local +short
> 10.10.11.6

This is actually not a problem, but how dig works.
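
Background on why this is expected: `dig` does not apply the `search` list from `/etc/resolv.conf` unless `+search` is passed, whereas the system stub resolver used by most applications does. The sketch below approximates how a stub resolver orders its candidate names. It is a simplified model, not the actual resolver code; the search list is the usual Kubernetes default for a pod, and the `kube-system` namespace is an assumption.

```python
from typing import List


def candidate_names(name: str, search: List[str], ndots: int = 1) -> List[str]:
    """List the names a stub resolver would try, in order, for `name`."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name.rstrip(".")]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + expanded
    # Too few dots: try the search list first, the bare name last.
    return expanded + [name]


# Assumed pod resolv.conf: namespace kube-system, ndots:5.
search = ["kube-system.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in candidate_names("kiam-server", search, ndots=5):
    print(candidate)
```

Under these assumptions the first candidate is `kiam-server.kube-system.svc.cluster.local`, which is why applications can resolve the short name even though a plain `dig kiam-server` cannot.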

@whites11

whites11 commented Apr 20, 2022

This will be available on AWS from release 17.3.0 on.

@alex-dabija

> This will be available on AWS from release 11.3.0 on.

I think it should be 17.3.0 instead of 11.3.0.

@whites11

> > This will be available on AWS from release 11.3.0 on.
>
> I think it should be 17.3.0 instead of 11.3.0.

yeah, thanks fixed the comment
