bad udp cksum when using vxlan #4865

Closed
manuelbuil opened this issue Aug 24, 2021 · 2 comments


@manuelbuil
Contributor

When using Ubuntu 20 with kernel 5.8 (after apt upgrade), we are again seeing the issue from rancher/rke2#1541. The workaround is to disable checksum offload on the vxlan.calico interface.

The issue appears when, from a node, we try to access a service that is backed by a pod on another node, e.g. the coredns service. In that case the traffic must traverse the vxlan tunnel, and on the receiving node we see in tcpdump:

06:29:06:4b:55:9a > 06:47:af:1e:bf:ca, ethertype IPv4 (0x0800), length 143: (tos 0x0, ttl 64, id 16380, offset 0, flags [none], proto UDP (17), length 129)
    10.0.10.14.7103 > 10.0.10.10.4789: [bad udp cksum 0xbd67 -> 0x4e7f!] VXLAN, flags [I] (0x08), vni 4096

Expected Behavior

When deploying calico as the cni plugin on Ubuntu 20 with kernel 5.8 (after apt upgrade), I expect to be able to access all kubernetes services (e.g. dns) from any node. When running tcpdump on the receiver node, I expect to see:

06:29:06:4b:55:9a > 06:47:af:1e:bf:ca, ethertype IPv4 (0x0800), length 143: (tos 0x0, ttl 64, id 21070, offset 0, flags [none], proto UDP (17), length 129)
    10.0.10.14.50078 > 10.0.10.10.4789: [udp sum ok] VXLAN, flags [I] (0x08), vni 4096

Current Behavior

When deploying calico as the cni plugin on Ubuntu 20 with kernel 5.8 (after apt upgrade), I can only access a service (e.g. dns) from a node if the pod implementing the service runs on that same node. As soon as the traffic must take the vxlan tunnel, it fails. When running tcpdump on the receiver node, I see:

06:29:06:4b:55:9a > 06:47:af:1e:bf:ca, ethertype IPv4 (0x0800), length 143: (tos 0x0, ttl 64, id 16380, offset 0, flags [none], proto UDP (17), length 129)
    10.0.10.14.7103 > 10.0.10.10.4789: [bad udp cksum 0xbd67 -> 0x4e7f!] VXLAN, flags [I] (0x08), vni 4096
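
A capture like the one above can be taken roughly as follows. This is only a sketch: eth0 is an assumed name for the underlay interface, and both function names are made up for illustration. Port 4789 is the VXLAN port visible in the output above.

```shell
# Capture VXLAN traffic on the underlay interface, with link-level headers (-e)
# and verbose output (-vv) so that tcpdump verifies and prints the UDP checksum.
# Assumption: the default interface name eth0 is a placeholder, pass your own.
capture_vxlan() {
  sudo tcpdump -i "${1:-eth0}" -nn -e -vv udp port 4789
}

# Count the lines that tcpdump flagged with a bad UDP checksum.
# Reads tcpdump text output from stdin and prints the count (0 if none).
count_bad_udp_cksum() {
  grep -c 'bad udp cksum' || true
}
```

For example, saving the output of capture_vxlan to a file and piping that file into count_bad_udp_cksum gives a quick yes/no on whether the receiving node sees corrupted checksums.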

Possible Solution

sudo ethtool -K vxlan.calico tx-checksum-ip-generic off
or
featureDetectOverride: "ChecksumOffloadBroken=true"

But this has a performance impact.
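
To see whether the offload is currently enabled before turning it off, something like the following could be used. It is a minimal sketch: the function names are hypothetical, the interface name is taken from the ethtool command above, and the kubectl variant assumes a FelixConfiguration resource named "default" that is reachable through kubectl, which may not hold on every install.

```shell
# Extract the state ("on" or "off") of tx-checksum-ip-generic
# from `ethtool -k <iface>` output supplied on stdin.
tx_generic_state() {
  awk '$1 == "tx-checksum-ip-generic:" { print $2 }'
}

# Disable the offload on the VXLAN interface only if it is currently on.
disable_vxlan_tx_csum() {
  iface="${1:-vxlan.calico}"
  if [ "$(ethtool -k "$iface" | tx_generic_state)" = "on" ]; then
    sudo ethtool -K "$iface" tx-checksum-ip-generic off
  fi
}

# Alternative: set the Felix override instead of touching the interface.
# Assumption: the FelixConfiguration is named "default".
set_felix_override() {
  kubectl patch felixconfiguration default --type merge \
    -p '{"spec":{"featureDetectOverride":"ChecksumOffloadBroken=true"}}'
}
```

Note that a setting applied with ethtool is generally not persistent, so it would need to be reapplied after a reboot or if the interface is re-created.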

Steps to Reproduce (for bugs)

1. Deploy Kubernetes on Ubuntu 20 with kernel 5.8 on 2 or more nodes.
2. Run dig @10.43.0.10 www.google.com on all nodes. It will only work on one of them (assuming 10.43.0.10 is the clusterIP of the dns service).
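
The dig check from the steps above can be wrapped so that it fails fast instead of waiting for the default timeouts. This is a sketch only: check_cluster_dns and the DIG variable are hypothetical names, and 10.43.0.10 is the clusterIP assumed in the steps above.

```shell
# Query the cluster DNS service once with a short timeout and report the result.
# $1: clusterIP of the dns service (default 10.43.0.10, an assumption).
# DIG: optional override of the dig binary (hypothetical hook, handy for dry runs).
check_cluster_dns() {
  dns_ip="${1:-10.43.0.10}"
  if "${DIG:-dig}" +time=2 +tries=1 +short "@${dns_ip}" www.google.com >/dev/null; then
    echo "dns ok via ${dns_ip}"
  else
    echo "dns FAILED via ${dns_ip}"
  fi
}
```

Running check_cluster_dns on every node should report a failure on each node that does not host the dns pod while the checksum problem is present.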


@caseydavenport
Member

@fasaxc recently put in a fix to automatically disable checksum offload based on the kernel version, but it sounds like perhaps there are some kernels for which that fix isn't working properly?

I think the best solution I'm aware of at the moment is to explicitly disable the offload as you suggested in your post.

@caseydavenport
Member

I'm not sure there is much else we can do on our side here - users either need to upgrade to a kernel that has the checksum fix included, or use one of the options above to turn off checksum offloading, or turn off --random-fully masquerade IIRC.
