
RKE2 Cluster running Calico seemingly losing UDP traffic when transiting through service IP to remotely located pod #1541

Closed
aiyengar2 opened this issue Aug 5, 2021 · 27 comments
Assignees
Labels
area/cni kind/bug Something isn't working kind/dev-validation Dev will be validating this issue

Comments

@aiyengar2

Environmental Info:
RKE2 Version: v1.21.3-rc3+rke2r2

Node(s) CPU architecture, OS, and Version:

Linux arvind-rke2-1 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 3 server nodes. Also reproducible on 3 etcd, 1 controlplane, and 3 worker nodes

Describe the bug:

Steps To Reproduce:

  • Installed RKE2: curl https://get.rke2.io | INSTALL_RKE2_CHANNEL=testing INSTALL_RKE2_METHOD=tar INSTALL_RKE2_VERSION=v1.21.3-rc3+rke2r2 sh -
  • Updated the config:
root@arvind-rke2-0:~# cat /etc/rancher/rke2/config.yaml
cni: calico

Other nodes use the same config plus the server and token fields (see the sketch after these steps)

  • Run dig @10.43.0.10 google.com
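
For reference, a joining node's config would look roughly like this (the server address and token are placeholders, not values from this issue):

root@arvind-rke2-1:~# cat /etc/rancher/rke2/config.yaml
cni: calico
server: https://<ip-of-first-server>:9345
token: <token-from-first-server>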

Expected behavior:

DNS resolution should succeed from all nodes

Actual behavior:

Only one node (the one that rke2-coredns is running on) can resolve DNS

Additional context / logs:

This issue was diagnosed in rancher/rancher#33052 but reproduced independently of Rancher using the above steps.

@aiyengar2 aiyengar2 changed the title RKE2 cluster seems to have DNS issues RKE2 cluster seems to have issues with pods communicating across nodes Aug 5, 2021
@aiyengar2
Author

As noted in rancher/rancher#33052 (comment), it seems this behavior regressed, was fixed, and then regressed again, possibly across different RKE2 versions

@Oats87
Contributor

Oats87 commented Aug 5, 2021

I debugged this with Arvind and we found interesting behavior: UDP DNS queries cannot be resolved when they transit the CoreDNS service IP, i.e. 10.43.0.10. If we addressed the CoreDNS pod directly, our DNS queries worked with no issue. It did not matter whether we ran the query from inside a pod or directly on the node.

The DNS service IP 10.43.0.10 did work when CoreDNS was located on the same node we were testing from.

This is when using the Calico CNI.

This only occurs on Ubuntu 20.04 in our testing. On my CentOS 7 testing boxes, we did not run into this issue.
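
For reference, the two paths compared here look roughly like this (the pod IP is only an example; the real CoreDNS pod IP can be found with kubectl get pods -n kube-system -o wide):

# Via the CoreDNS service IP: fails on nodes that do not host the CoreDNS pod
dig @10.43.0.10 google.com

# Directly against the CoreDNS pod IP (example address): works from every node
dig @10.42.182.4 google.com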

@Oats87
Contributor

Oats87 commented Aug 5, 2021

ufw was installed, but:

# ufw status
Status: inactive

For good measure, systemctl disable ufw --now && reboot did not help either.

@Oats87 Oats87 changed the title RKE2 cluster seems to have issues with pods communicating across nodes RKE2 Cluster running Calico seemingly losing UDP traffic when transiting through service IP to remotely located pod Aug 5, 2021
@brandond
Member

brandond commented Aug 5, 2021

Does it make any difference if you switch the host iptables between legacy/nftables or uninstall the host iptables+nftables so that we use the embedded ones?
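
On Ubuntu the switch would look roughly like this (a sketch, assuming iptables is managed through the alternatives system as on stock 20.04):

# Show which backend the host iptables currently points at
update-alternatives --display iptables

# Switch to the legacy backend (use /usr/sbin/iptables-nft to go the other way), then restart rke2
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo systemctl restart rke2-server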

@aiyengar2
Author

Does it make any difference if you switch the host iptables between legacy/nftables or uninstall the host iptables+nftables so that we use the embedded ones?

Not sure about this. cc: @Oats87

However, I was able to test this on a v1.21.2+rke2r1 cluster and verify that this issue still exists in that version, so #1541 (comment) is not accurate.

@aiyengar2
Author

In a v1.21.3-rc3+rke2r2 cluster with two Ubuntu 18.04 nodes (as opposed to 20.04 listed above), I was able to reproduce this same behavior on the nodes.

$ uname -a
Linux arvind-ubuntu-1804-0 4.15.0-144-generic #148-Ubuntu SMP Sat May 8 02:33:43 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

@aiyengar2
Author

Without specifying cni: calico in the RKE2 cluster (v1.21.3-rc3+rke2r2), the dig call worked perfectly fine on all nodes.

Seems like this is definitely related to Calico, as indicated in the ticket title.

@manuelbuil
Contributor

manuelbuil commented Aug 6, 2021

Editing comment: things seem to work when a pod is the client. The problem comes when the host tries to access the service. This also happens on v1.21.3+rke2r1

@manuelbuil
Contributor

manuelbuil commented Aug 6, 2021

When tracking the packet, I can see it going through the correct kube-proxy iptables rules:

-A KUBE-SERVICES -d 10.43.0.10/32 -p udp -m comment --comment "kube-system/rke2-coredns-rke2-coredns:udp-53 cluster IP" -m udp --dport 53 -j KUBE-SVC-YFPH5LFNKP7E3G4L

-A KUBE-SVC-YFPH5LFNKP7E3G4L -m comment --comment "kube-system/rke2-coredns-rke2-coredns:udp-53" -j KUBE-SEP-F54GWJZTXPXAPHRS

-A KUBE-SEP-F54GWJZTXPXAPHRS -p udp -m comment --comment "kube-system/rke2-coredns-rke2-coredns:udp-53" -m udp -j DNAT --to-destination 10.42.182.4:53

I can see the packet leaving the node:

18:48:21.980669 IP 10.0.10.14.24169 > 10.0.10.10.4789: VXLAN, flags [I] (0x08), vni 4096
IP 10.42.222.64.60066 > 10.42.182.4.53: 61063+ [1au] A? google.com. (51)

And I can see the packet reaching the other node (the one where CoreDNS runs):

IP 10.42.222.64.55933 > 10.42.182.4.53: 5083+ [1au] A? google.com. (51)
18:49:56.672731 IP 10.0.10.14.58192 > 10.0.10.10.4789: VXLAN, flags [I] (0x08), vni 4096

Then the packet disappears.
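
(For reference, a trace like the one above can be reproduced with something along these lines; the grep pattern matches the comments kube-proxy attaches to its rules:)

# Dump the NAT rules kube-proxy programmed for the CoreDNS service
iptables-save -t nat | grep rke2-coredns

# Watch the VXLAN-encapsulated query leave the client node
tcpdump -ni eth0 udp port 4789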

@manuelbuil
Contributor

Sniffing a packet targeting the service. Node with coredns, interface eth0:

19:04:16.499057 IP 10.0.10.14.49959 > 10.0.10.10.4789: VXLAN, flags [I] (0x08), vni 4096
IP 10.42.222.64.51417 > 10.42.182.4.53: 29795+ [1au] A? google.com. (51)

Sniffing a packet targeting the pod implementing the service. Node with coredns, interface eth0:

19:04:11.194801 IP 10.0.10.14.34353 > 10.0.10.10.4789: VXLAN, flags [I] (0x08), vni 4096
IP 10.42.222.64.46410 > 10.42.182.4.53: 14722+ [1au] A? google.com. (51)
19:04:11.195126 IP 10.0.10.10.51238 > 10.0.10.14.4789: VXLAN, flags [I] (0x08), vni 4096
IP 10.42.182.4.53 > 10.42.222.64.46410: 14722* 1/0/1 A 142.250.178.142 (77)

Sniffing a packet targeting the service. Node with coredns, interface vxlan.calico: nothing
Sniffing a packet targeting the pod implementing the service. Node with coredns, interface vxlan.calico:

19:03:32.327538 IP 10.42.222.64.44687 > 10.42.182.4.53: 19939+ [1au] A? google.com. (51)
19:03:32.328436 IP 10.42.182.4.53 > 10.42.222.64.44687: 19939 1/0/1 A 142.250.178.142 (65)
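
(The vxlan.calico captures above correspond roughly to a filter like this on the CoreDNS node; a sketch, not necessarily the exact invocation used:)

# Decapsulated DNS traffic on the Calico VXLAN device
tcpdump -ni vxlan.calico udp port 53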

@manuelbuil
Contributor

manuelbuil commented Aug 9, 2021

After looking at different things, I noticed that when we access the CoreDNS pod directly, we see this:

06:29:06:4b:55:9a > 06:47:af:1e:bf:ca, ethertype IPv4 (0x0800), length 143: (tos 0x0, ttl 64, id 21070, offset 0, flags [none], proto UDP (17), length 129)
    10.0.10.14.50078 > 10.0.10.10.4789: [udp sum ok] VXLAN, flags [I] (0x08), vni 4096

But if we access the service via the ClusterIP, we see this:

06:29:06:4b:55:9a > 06:47:af:1e:bf:ca, ethertype IPv4 (0x0800), length 143: (tos 0x0, ttl 64, id 16380, offset 0, flags [none], proto UDP (17), length 129)
    10.0.10.14.7103 > 10.0.10.10.4789: [bad udp cksum 0xbd67 -> 0x4e7f!] VXLAN, flags [I] (0x08), vni 4096

Note the bad udp cksum.
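
(For reference, tcpdump only prints these checksum verdicts when run verbosely, e.g.:)

tcpdump -vv -ni eth0 udp port 4789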

After investigating a bit, I read that this is a known kernel bug that was fixed in 5.7. Apparently, the kernel driver miscalculates the checksum when vxlan offloading is on and the packet is NATed, which is our case when accessing the service via the ClusterIP. CentOS and RHEL 8 have backported the fix, but Ubuntu has not, which is why we only see it on Ubuntu (note that Ubuntu 20.04 uses kernel 5.4.0). This is the kernel fix: torvalds/linux@ea64d8d.

Manual fix:
Disable vxlan offloading on the vxlan interface on all nodes: sudo ethtool -K vxlan.calico tx-checksum-ip-generic off. I tested it and it works :).
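
(Presumably this needs to be repeated on every node and will not survive a reboot or the interface being re-created, so it is only a stop-gap. Apply and verify with:)

sudo ethtool -K vxlan.calico tx-checksum-ip-generic off
sudo ethtool -k vxlan.calico | grep tx-checksum-ip-generic   # expect: tx-checksum-ip-generic: off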

Calico's recommended fix:
Calico includes an environment variable that, when passed to the agent, disables the feature that creates this problem (MASQFullyRandom): projectcalico/calico#3145 (comment). This still needs to be tested.
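
(For illustration, the same MASQFullyRandom override could presumably also be expressed through the FelixConfiguration resource used later in this thread instead of an agent environment variable; an untested sketch:)

kubectl patch felixconfigurations.crd.projectcalico.org default --type merge \
  -p '{"spec":{"featureDetectOverride":"MASQFullyRandom=false"}}'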

TO DO:

  • Follow the Calico recommendation and disable the MASQFullyRandom feature. Check that the problem no longer appears
  • Fully understand the disadvantages of disabling MASQFullyRandom compared to disabling vxlan offloading
  • Check whether SUSE backported this fix to 15 SP3
  • Find a way to limit the impact of this by only disabling MASQFullyRandom or vxlan offloading on Ubuntu systems (and SUSE?)

@manuelbuil
Contributor

Disabling the MASQFullyRandom feature does not help. Asking Tigera; perhaps something else must be changed. Note that there is a recent PR to fix this in Calico, but it does not seem to be enabled in our version ==> projectcalico/felix#2811

@manuelbuil
Contributor

manuelbuil commented Aug 9, 2021

Same issue on openSUSE 15 SP3:

10.0.10.9.26831 > 10.0.10.7.4789: [bad udp cksum 0x69a5 -> 0xc8d6!] VXLAN, flags [I] (0x08), vni 4096

Fixed after running sudo ethtool -K vxlan.calico tx-checksum-ip-generic off

@vadorovsky
Contributor

@manuelbuil Are you sure that the kernel commit you linked is the only one?

It's applied in SLE 15 SP3 / Leap 15.3 already:
SUSE/kernel@3dc74ef

and you seem to have issues on SLE/openSUSE anyway.

@manuelbuil
Contributor

@manuelbuil Are you sure that the kernel commit you linked is the only one?

It's applied in SLE 15 SP3 / Leap 15.3 already:
SUSE/kernel@3dc74ef

and you seem to have issues on SLE/openSUSE anyway.

I got the link from projectcalico/calico#3145 (comment).

I reported some issues on openSUSE, but they were related to a dirty environment. Once I deployed fresh, I was able to see the same problem as on Ubuntu.

@vadorovsky vadorovsky self-assigned this Aug 20, 2021
manuelbuil added a commit to manuelbuil/rke2-charts that referenced this issue Aug 23, 2021
This fixes rancher/rke2#1541 even for
kernel version > 5.7

Signed-off-by: Manuel Buil <mbuil@suse.com>
@rancher-max
Contributor

Reopening for testing in rke2

@manuelbuil
Contributor

@rancher-max apart from running dig @10.43.0.10 www.google.com on all nodes, verify that kubectl get felixconfigurations.crd.projectcalico.org default -o yaml gives you this spec:

spec:
  bpfLogLevel: ""
  featureDetectOverride: ChecksumOffloadBroken=true
  logSeverityScreen: Info
  reportingInterval: 0s
  vxlanEnabled: true

We are only passing featureDetectOverride: ChecksumOffloadBroken=true; the rest of the parameters should be filled in by the operator
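
If you only care about that one field, a quicker check along these lines works too:

kubectl get felixconfigurations.crd.projectcalico.org default \
  -o jsonpath='{.spec.featureDetectOverride}'
# expected output: ChecksumOffloadBroken=true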

@rancher-max
Contributor

Leaving this open to validate on 1.22 release line, but confirmed working in v1.21.3-rc7+rke2r2

Validated that the dig command works on all nodes and that the felixconfigurations are set as mentioned above. Also confirmed that running sudo ethtool -k vxlan.calico | grep tx-checksum-ip-generic on all nodes returns the expected tx-checksum-ip-generic: off.

@galal-hussein
Contributor

Validated on master commit 09bb5c2

  • Install rke2 server on three nodes
  • configure rke2 server to run calico as its cni
  • run the dig command on all nodes
# dig @10.43.0.10 google.com

; <<>> DiG 9.16.1-Ubuntu <<>> @10.43.0.10 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5294
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 6c2344a672e1a9de (echoed)
;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		30	IN	A	142.250.69.206

;; Query time: 0 msec
;; SERVER: 10.43.0.10#53(10.43.0.10)
;; WHEN: Wed Sep 22 22:35:47 UTC 2021
;; MSG SIZE  rcvd: 77
  • make sure that calico is configured correctly
# kubectl get felixconfigurations.crd.projectcalico.org default -o yaml
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  annotations:
    meta.helm.sh/release-name: rke2-calico
    meta.helm.sh/release-namespace: kube-system
    projectcalico.org/metadata: '{"uid":"9f051b21-813b-475b-9615-c23692d89279","generation":1,"creationTimestamp":"2021-09-22T22:32:15Z","managedFields":[{"manager":"helm","operation":"Update","apiVersion":"crd.projectcalico.org/v1","time":"2021-09-22T22:32:15Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}},"f:spec":{".":{},"f:featureDetectOverride":{}}}}]}'
  creationTimestamp: "2021-09-22T22:32:15Z"
  generation: 3
  labels:
    app.kubernetes.io/managed-by: Helm
  name: default
  resourceVersion: "968"
  uid: 9f051b21-813b-475b-9615-c23692d89279
spec:
  bpfLogLevel: ""
  featureDetectOverride: ChecksumOffloadBroken=true
  logSeverityScreen: Info
  reportingInterval: 0s
  vxlanEnabled: true

@strelok899

I have the opposite issue.
The fix went into the Helm chart, and I am trying to disable it so I can keep hardware offload, since I have limited resources and my kube-proxy keeps crashing due to lack of CPU.

How can I make the offloading work?

@brandond
Member

Enabling hardware offload will not address issues with insufficient CPU resources. Also, please don't revive old resolved issues to ask unrelated questions; open a new issue or discussion instead.

@rancher rancher locked and limited conversation to collaborators Feb 11, 2024