
Intermittent DNS resolution issue when using security groups for Pods and EKS AMI 1.19 which has 5.4 kernel version #1402

Closed
SaranBalaji90 opened this issue Mar 10, 2021 · 2 comments


SaranBalaji90 commented Mar 10, 2021

What happened:
On 1.19 clusters, pods using security groups see high latency when performing DNS resolution.

Attached logs (note the ~2.5 s gap before each AAAA retransmit; the first AAAA query gets no reply):

[ec2-user@ip-10-10-35-247 ~]$ sudo tcpdump -i eth2 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth2, link-type EN10MB (Ethernet), capture size 262144 bytes
00:38:52.883343 IP ip-10-10-42-165.eu-west-1.compute.internal.55758 > ip-10-10-57-24.eu-west-1.compute.internal.domain: 8261+ A? app1.test-dns.svc.cluster.local.test-dns.svc.cluster.local. (76)
00:38:52.883362 IP ip-10-10-42-165.eu-west-1.compute.internal.55758 > ip-10-10-57-24.eu-west-1.compute.internal.domain: 8609+ AAAA? app1.test-dns.svc.cluster.local.test-dns.svc.cluster.local. (76)
00:38:52.884027 IP ip-10-10-57-24.eu-west-1.compute.internal.domain > ip-10-10-42-165.eu-west-1.compute.internal.55758: 8261 NXDomain*- 0/1/0 (169)
00:38:55.385621 IP ip-10-10-42-165.eu-west-1.compute.internal.55758 > ip-10-10-57-24.eu-west-1.compute.internal.domain: 8609+ AAAA? app1.test-dns.svc.cluster.local.test-dns.svc.cluster.local. (76)
00:38:55.385977 IP ip-10-10-57-24.eu-west-1.compute.internal.domain > ip-10-10-42-165.eu-west-1.compute.internal.55758: 8609 NXDomain*- 0/1/0 (169)
00:38:55.386073 IP ip-10-10-42-165.eu-west-1.compute.internal.46259 > ip-10-10-57-24.eu-west-1.compute.internal.domain: 32938+ A? app1.test-dns.svc.cluster.local.svc.cluster.local. (67)
00:38:55.386085 IP ip-10-10-42-165.eu-west-1.compute.internal.46259 > ip-10-10-57-24.eu-west-1.compute.internal.domain: 33304+ AAAA? app1.test-dns.svc.cluster.local.svc.cluster.local. (67)
00:38:55.386274 IP ip-10-10-57-24.eu-west-1.compute.internal.domain > ip-10-10-42-165.eu-west-1.compute.internal.46259: 32938 NXDomain*- 0/1/0 (160)
00:38:57.888876 IP ip-10-10-42-165.eu-west-1.compute.internal.46259 > ip-10-10-57-24.eu-west-1.compute.internal.domain: 33304+ AAAA? app1.test-dns.svc.cluster.local.svc.cluster.local. (67)
00:38:57.889175 IP ip-10-10-57-24.eu-west-1.compute.internal.domain > ip-10-10-42-165.eu-west-1.compute.internal.46259: 33304 NXDomain*- 0/1/0 (160)
00:38:57.889262 IP ip-10-10-42-165.eu-west-1.compute.internal.33329 > ip-10-10-57-24.eu-west-1.compute.internal.domain: 45991+ A? app1.test-dns.svc.cluster.local.cluster.local. (63)
00:38:57.889274 IP ip-10-10-42-165.eu-west-1.compute.internal.33329 > ip-10-10-57-24.eu-west-1.compute.internal.domain: 46373+ AAAA? app1.test-dns.svc.cluster.local.cluster.local. (63)
00:38:57.889455 IP ip-10-10-57-24.eu-west-1.compute.internal.domain > ip-10-10-42-165.eu-west-1.compute.internal.33329: 46373 NXDomain*- 0/1/0 (156)
00:38:57.889556 IP ip-10-10-57-24.eu-west-1.compute.internal.domain > ip-10-10-42-165.eu-west-1.compute.internal.33329: 45991 NXDomain*- 0/1/0 (156)
00:38:57.889608 IP ip-10-10-42-165.eu-west-1.compute.internal.56650 > ip-10-10-57-24.eu-west-1.compute.internal.domain: 10901+ A? app1.test-dns.svc.cluster.local.eu-west-1.compute.internal. (76)
00:38:57.889619 IP ip-10-10-42-165.eu-west-1.compute.internal.56650 > ip-10-10-57-24.eu-west-1.compute.internal.domain: 11481+ AAAA? app1.test-dns.svc.cluster.local.eu-west-1.compute.internal. (76)
00:38:57.889752 IP ip-10-10-57-24.eu-west-1.compute.internal.domain > ip-10-10-42-165.eu-west-1.compute.internal.56650: 10901 NXDomain* 0/1/0 (189)
00:39:00.391787 IP ip-10-10-42-165.eu-west-1.compute.internal.56650 > ip-10-10-57-24.eu-west-1.compute.internal.domain: 11481+ AAAA? app1.test-dns.svc.cluster.local.eu-west-1.compute.internal. (76)
00:39:00.392055 IP ip-10-10-57-24.eu-west-1.compute.internal.domain > ip-10-10-42-165.eu-west-1.compute.internal.56650: 11481 NXDomain* 0/1/0 (189)
00:39:00.392145 IP ip-10-10-42-165.eu-west-1.compute.internal.54742 > ip-10-10-57-24.eu-west-1.compute.internal.domain: 6742+ A? app1.test-dns.svc.cluster.local. (49)
00:39:00.392152 IP ip-10-10-42-165.eu-west-1.compute.internal.54742 > ip-10-10-57-24.eu-west-1.compute.internal.domain: 6991+ AAAA? app1.test-dns.svc.cluster.local. (49)

What you expected to happen:
DNS resolution to complete within a few milliseconds.

How to reproduce it (as minimally and precisely as possible):
Run CoreDNS and a pod using security groups on the same node, then ping any Kubernetes service from the pod. A sketch of such a setup is included below.
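
For reference, a minimal reproduction sketch along those lines (the SecurityGroupPolicy name, namespace, labels, security group ID, and test image are placeholders, not taken from the original report):

apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: dns-test-sgp              # hypothetical name
  namespace: test-dns
spec:
  podSelector:
    matchLabels:
      app: dns-test
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0      # placeholder security group ID
---
apiVersion: v1
kind: Pod
metadata:
  name: dns-test
  namespace: test-dns
  labels:
    app: dns-test
spec:
  containers:
    - name: dns-test
      image: ubuntu:20.04         # any glibc-based image with bash and getent
      # repeatedly resolve a Service name and print how long each lookup takes
      command: ["bash", "-c", "while true; do time getent hosts app1.test-dns.svc.cluster.local; sleep 1; done"]

Schedule the test pod on the same node as a CoreDNS replica (for example with nodeName or affinity); affected lookups then take roughly 2.5 or 5 seconds instead of a few milliseconds.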

Anything else we need to know?:
This happens only when CoreDNS and pods using security groups run on the same worker node, and only on 1.19 clusters.

Conntrack stats (note that insert_failed and drop increment between the two snapshots):

[ec2-user@ip-10-10-35-247 ~]$ sudo conntrack -S
cpu=0       found=859 invalid=253 ignore=43011 insert=0 insert_failed=99 drop=99 early_drop=0 error=1 search_restart=6 
cpu=1       found=833 invalid=273 ignore=43232 insert=0 insert_failed=93 drop=93 early_drop=0 error=2 search_restart=16 
[ec2-user@ip-10-10-35-247 ~]$ sudo conntrack -S
cpu=0       found=866 invalid=253 ignore=43100 insert=0 insert_failed=100 drop=100 early_drop=0 error=1 search_restart=6 
cpu=1       found=857 invalid=274 ignore=43299 insert=0 insert_failed=97 drop=97 early_drop=0 error=2 search_restart=16 

Any workarounds?
Adding the following to the pod spec works around the issue:

dnsConfig:
  options:
    - name: single-request-reopen # or single-request
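
For completeness, a minimal sketch of where that sits in a full Pod manifest (pod and image names are placeholders). Note that single-request-reopen is a glibc resolver option, so it only helps glibc-based container images:

apiVersion: v1
kind: Pod
metadata:
  name: app1                                  # placeholder name
  namespace: test-dns
spec:
  containers:
    - name: app1
      image: registry.example.com/app1:latest # placeholder image
  dnsConfig:
    options:
      # ends up in the pod's /etc/resolv.conf as "options single-request-reopen";
      # glibc then reopens the socket and retries when one of the parallel A/AAAA
      # replies is lost, while single-request serializes the two lookups instead
      - name: single-request-reopen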

References:
https://medium.com/techmindtickle/intermittent-delays-in-kubernetes-e9de8239e2fa
https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/
https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

Environment:
[ec2-user@ip-10-10-35-247 ~]$ uname -a
Linux ip-10-10-35-247.eu-west-1.compute.internal 5.4.95-42.163.amzn2.x86_64 #1 SMP Thu Feb 4 12:50:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux


SaranBalaji90 commented Mar 10, 2021

During my testing I found that kernels <= 5.4.63-33.124 and 5.10.* don't have the problem, but kernels >= 5.4.64-33.120 do.

The fix is being tracked here: awslabs/amazon-eks-ami#357

SaranBalaji90 commented

A new AMI has been released: awslabs/amazon-eks-ami#659
