DNS lookup timeouts #667
FYI, I have not tested this personally -- supposedly this will not work for Alpine-based containers, since the musl libc that Alpine uses does not support this option.
We had the same issue using the alpine image. I would recommend reading this: https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/ We changed our base image to jessie-slim and added this to the pod manifest.
More details: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/ I think it is a cleaner solution than adding a postStart hook.
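The manifest addition itself was not captured in this copy of the thread. A minimal sketch, assuming it is the resolv.conf `single-request-reopen` option discussed throughout this issue (the Deployment name and app image are placeholders, except for the jessie-slim base the commenter mentions):

```yaml
# Sketch of the pod-manifest change described above (assumed to be the
# resolv.conf single-request-reopen workaround from this thread).
# Note: this is a glibc resolver option; musl-based (Alpine) images ignore it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      dnsConfig:
        options:
          - name: single-request-reopen
      containers:
        - name: app
          image: debian:jessie-slim   # the glibc-based base image the commenter switched to
```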
I'm not sure what to think about this. Is the advice going to be "you can't run Alpine and need to customise DNS" if you expect proper DNS performance? Could someone from Microsoft weigh in here?
This also works, and it works on alpine pods: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy
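One pattern from the linked "Pod's DNS Policy" page, sketched here with placeholder values (the nameserver, search domain, and option values below are illustrative, not from this thread), is to bypass the cluster resolver entirely with `dnsPolicy: "None"`:

```yaml
# Sketch based on the linked Kubernetes DNS-policy docs; nameserver and
# search values are placeholders. Unlike single-request-reopen, this
# approach also applies to musl-based (Alpine) images.
apiVersion: v1
kind: Pod
metadata:
  name: dns-example
spec:
  dnsPolicy: "None"            # ignore cluster DNS settings entirely
  dnsConfig:
    nameservers:
      - 1.2.3.4                # placeholder resolver address
    searches:
      - ns1.svc.cluster-domain.example
    options:
      - name: ndots
        value: "2"
  containers:
    - name: app
      image: alpine:3
```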
We have the same problem; will AKS roll out a solution for us? I don't see an elegant workaround so far.
@AXington Your solution reduces the delay from 5 seconds to 2.5 :), but it still doesn't fully work.
@juan-lee your workaround didn't work perfectly for the alpine-based images; any better ideas?
My understanding is that we've rolled the fix out during this week's release. You will have to upgrade your cluster to a newer version of k8s to get the changes.
Thanks very much, @juan-lee, I will let you know how it works.
@juan-lee Upgrading to 1.11.4 solves this problem, thanks for your help. Br,
I checked the official changelog and can't find a specific change that addresses this bug, but amazingly the upgrade seems to fix this issue for our AKS cluster. I hope it will not come back on the next version bump.
The fix isn't in k8s, but rather a kernel patch that is now the default for new agent nodes. The reason you're seeing the problem go away with an upgrade is that your old agent nodes are replaced with ones that have the new kernel.
Ok I see, thanks for the explanation @juan-lee
@juan-lee is the fix only present in the new 'MobyImage' that you have to register for, or is it present in all 1.11.4 images?
Hi people! Thanks |
@weinong any info on when the second race will be addressed? |
I just hit this, and |
Hi people. Is this issue fixed with kubernetes 1.13.5? I wanted to upgrade my AKS cluster to the said version, but I am using multiple deployments with alpine-based images. |
Still seeing the issue with kernel-version 4.15.0-1063-azure |
I also am seeing this issue in kernel version 4.15.0-1060-azure |
@jemag @Vandersteen Recently I've seen DNS timeouts being caused by high resource usage on the nodes where the coredns pods reside. Could either of you check your monitoring and see how saturated your CPU, memory, and disk are?
@juan-lee I have replicated this problem under various loads. Currently, with all nodes under 18% CPU usage and memory at around 70% for each node, the problem still happens intermittently. Disk usage is extremely low.
@juan-lee I have replicated this on 2 different clusters. One has 40% CPU usage and 60% memory usage; the other has 20% CPU usage and 42% memory usage.
@jemag @Vandersteen we are looking into it. By chance, are your clusters using Azure CNI? I'm only able to reproduce the issue with Azure CNI clusters.
@juan-lee mine are indeed using Azure CNI
@juan-lee Yes, we are using Azure CNI
We are still working to get to the bottom of this issue. In the meantime, adding the following to your specs can help in most cases.
This started happening for me when we upgraded the cluster to v1.14 (from 1.13). We had the same problem a year ago, and it was fixed after the kernel was patched with the mitigations from https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts Is there a way we can run coredns in AKS as a daemonset to try to fix this (at least try)? We need Azure CNI and we can't use
Please also see this issue for intermittent NodeNotReady, DNS latency, and other crashes related to system load: #1373
I tried this, but keep getting the issue. I have a container doing a curl statement in a loop with a 2-second pause between attempts, and I hit it about once every 20 attempts. It seems to happen on 3 clusters (OT, ACC and PROD). On the last two this seems to help pretty well; on the OT cluster (1.16.9) it does not. All clusters use Azure CNI, private networking, and a VPN connection to on-prem DNS servers.
This is now affecting us. The workaround
seems to work, but obviously this is affecting lots of people (most of whom won't even realise), and it needs sorting! I've even tried creating a new v1.19 AKS cluster, and using the official
We used the dnsConfig workaround and initially it didn't work for us. It turned out that it only works with the non-alpine images.
Is there a solution/workaround for this issue for Alpine-based images? We tried Kubernetes cluster version: A newer version of the OS image (18.04) has been deployed very recently; would it help to upgrade?
@sikri-eic deploying nodelocaldns fixed it for us. |
@timja Thank you for the prompt response. I will take a look at |
@timja Did you have to do anything special to install nodelocaldns on AKS? Seems like you need to mess around with iptables or something.
It depends whether you're using kubenet or Azure CNI. Kubenet works with just the standard config; for Azure CNI, add the -setupebtables flag. Ref: example config for Azure CNI (note it took months for a release after our changes, so we're using a forked image, but it looks like upstream has released now):
Here is a version for AKS; it runs on everything except virtual nodes. https://github.com/curtdept/aks_nodelocaldns/blob/main/nodelocaldns.yaml This worked amazingly, btw; it cleaned up tons of issues I had with heartbeats and clustered services.
Marking these as duplicate/known-issue and adding the in-progress feature for it, as well as @curtdept's current workaround (thanks!) https://github.com/curtdept/aks_nodelocaldns/blob/main/nodelocaldns.yaml
This issue has been marked as duplicate and has not had any activity for 1 day. It will be closed for housekeeping purposes. |
Symptoms
Outbound requests from pods can see a 5-second delay during DNS lookups. This is known to impact containers based on the alpine image.
Root Cause
https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts
Workaround
Add the following to your impacted pod's manifest.
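The manifest snippet itself did not survive in this copy of the issue. A minimal sketch, assuming it is the `dnsConfig`/`single-request-reopen` change discussed throughout the thread (pod name and image are placeholders):

```yaml
# Sketch of the workaround manifest (the original snippet was not preserved);
# assumed to be the dnsConfig/single-request-reopen change discussed above.
# Note: this is a glibc resolver option and has no effect on Alpine (musl).
apiVersion: v1
kind: Pod
metadata:
  name: impacted-pod           # hypothetical name
spec:
  dnsConfig:
    options:
      - name: single-request-reopen
  containers:
    - name: app
      image: nginx             # placeholder image
```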
What is single-request-reopen?
Per resolv.conf(5), the glibc resolver sends the A and AAAA queries for a lookup over the same socket. If only one reply comes back (for example, because the other was dropped), the client sits waiting until it times out. With single-request-reopen set, the resolver closes the socket and opens a new one before sending the second request. Because this is a glibc option, it has no effect on musl-based images such as Alpine.