
Calico upgrade from 3.21.4 to 3.22.x causes issues (nat, bpf) #5957

Closed
igcherkaev opened this issue Apr 21, 2022 · 3 comments

@igcherkaev

Expected Behavior

Upgrading from one minor version to another should be as smooth as possible. All services and workloads in the cluster should remain reachable exactly as they were prior to the upgrade.

Current Behavior

We're seeing weird behavior where the BPF programs choose the wrong backends, and it happens randomly on each node of the cluster. That is, for one service a node may pick the right backend, while for another it picks a wrong one; on the next node the affected set of services/backends may be different. Some extra details may still be available in this Slack thread: https://calicousers.slack.com/archives/C0BCA117T/p1650303706854169

Here are some snippets from the thread (in case it gets archived and/or deleted). Background information:

  • lvdkbm502 - one of the nodes in the cluster (role: master, ip address: 10.152.53.32)
  • 10.158.8.246:443 - an in-cluster service, backed by a pod at 10.158.223.191:9443
  • 10.158.144.220:9101 - a pod for a different service in a different namespace, completely unrelated to 10.158.8.246.
lvdkbm502 ~ # curl -kv http://10.158.8.246:443/ABC
*   Trying 10.158.8.246:443...
* Connected to 10.158.8.246 (10.158.144.220) port 443 (#0)

tc exec bpf debug produced the following trace for this curl call:

curl-2165191 [001] d... 1964894.410344: bpf_trace_printk: CALI-C: NAT: 1st level lookup addr=a9e08f6 port=443 protocol=6.
curl-2165191 [001] d... 1964894.410345: bpf_trace_printk: CALI-C: NAT: 1st level hit; id=25
curl-2165191 [001] d... 1964894.410346: bpf_trace_printk: CALI-C: NAT: 1st level hit; id=25 ordinal=0
curl-2165191 [001] d... 1964894.410348: bpf_trace_printk: CALI-C: NAT: backend selected a9e90dc:9101
curl-2165191 [001] d... 1964894.410349: bpf_trace_printk: CALI-C: Store: ip=a9e90dc port=9101 cookie=2f9a21

Here addr=a9e08f6 port=443 is 10.158.8.246:443, but the selected backend ip=a9e90dc port=9101 is 10.158.144.220:9101, which is completely incorrect.
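
For reference, the hex addresses in the trace can be decoded back to dotted-quad form with a small shell helper like the sketch below (the helper name is mine; it assumes the address prints most-significant byte first, as in the trace above):

hex_to_ip() {
  # Zero-pad to 8 hex digits, then print each byte as a decimal octet.
  local h
  h=$(printf '%08x' "0x$1")
  printf '%d.%d.%d.%d\n' "0x${h:0:2}" "0x${h:2:2}" "0x${h:4:2}" "0x${h:6:2}"
}

hex_to_ip a9e08f6   # 10.158.8.246   (the service VIP)
hex_to_ip a9e90dc   # 10.158.144.220 (the wrongly selected backend)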

The NAT table dumped from the respective calico-node looked correct:

kubectl -n kube-system exec -ti calico-node-wg5x5 -- calico-node -bpf nat dump
...
10.158.8.246 port 443 proto 6 id 0 count 1 local 0
        0:0      10.158.223.191:9443
...

The BPF conntrack table simply confirmed the incorrectly chosen backend and looked like:

ConntrackKey{proto=6 10.152.53.32:40514 <-> 10.158.144.220:9101} -> Entry{Type:0, Created:1965499842169890, LastSeen:1965499847562511, Flags: <none> Data: {A2B:{Seqno:3632491565 SynSeen:true AckSeen:true FinSeen:true RstSeen:false Whitelisted:false Opener:true Ifindex:0} B2A:{Seqno:2784815132 SynSeen:true AckSeen:true FinSeen:true RstSeen:false Whitelisted:true Opener:false Ifindex:0} OrigDst:0.0.0.0 OrigPort:0 OrigSPort:0 TunIP:0.0.0.0}} Age: 6.235599374s Active ago 6.230206753s CLOSED
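
For completeness, that conntrack entry can be pulled from the node the same way as the NAT table; something along these lines (the pod name is the one from the NAT dump above, and the exact subcommand spelling may differ between releases):

kubectl -n kube-system exec -ti calico-node-wg5x5 -- calico-node -bpf conntrack dump | grep 10.158.144.220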

Possible Solution

What fixes it for us is rebooting each node after the upgrade, which isn't a feasible solution for large production clusters with tens or hundreds of nodes.

Since a reboot fixes it, that makes me think some leftover state isn't being cleaned up properly during the upgrade or on startup of calico-node in 3.22.x.
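
One rough way to check what BPF state survives the upgrade on a node is to look at the pinned maps on the host; a sketch, assuming Calico pins its maps as cali_* under /sys/fs/bpf/tc/globals (pin names vary by release, and cali_v4_nat_fe below is an assumed name):

# List Calico's pinned BPF maps on the host.
ls /sys/fs/bpf/tc/globals/ | grep cali

# Show map IDs, types and sizes as the kernel sees them.
bpftool map show | grep -i cali

# Dump a specific pinned map, e.g. the (assumed) v4 NAT frontend table.
bpftool map dump pinned /sys/fs/bpf/tc/globals/cali_v4_nat_fe | head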

Steps to Reproduce (for bugs)

  1. Install calico 3.21.4.
  2. Test multiple services with curl from different nodes.
  3. Upgrade calico to 3.22.1 or 3.22.2 without rebooting nodes.
  4. Test the services again from multiple nodes (see the sketch after this list). In our case the behavior is as described above: more often than not, requests go to a wrong backend.
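
A minimal sketch of step 4, assuming SSH access to the nodes as Flatcar's core user and using the service from the example above (node names are illustrative):

SVC=10.158.8.246:443
for node in lvdkbm501 lvdkbm502 lvdkbm503 lvdkbw501; do
  echo "== $node =="
  # If the BPF NAT picks a wrong backend, the response (or the TLS
  # handshake) differs from what the real backend would return.
  ssh "core@$node" "curl -sk -o /dev/null -w '%{http_code}\n' https://$SVC/"
done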

Context

Basically, we're trying to upgrade Calico so we're not too far behind the latest release. This is the first time in 5+ years that we could not upgrade Calico in place without rebooting everything.

Your Environment

  • Calico version:
$ calicoctl version
Client Version:    v3.22.2
Git commit:        14cf6d6ea
Cluster Version:   v3.22.1
Cluster Type:      typha,kdd,k8s,bgp,lvd
  • Orchestrator version (e.g. kubernetes, mesos, rkt):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.8", GitCommit:"7061dbbf75f9f82e8ab21f9be7e8ffcaae8e0d44", GitTreeState:"clean", BuildDate:"2022-03-16T14:04:34Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • Operating System and version:
$ kubectl get nodes -o wide
NAME        STATUS                     ROLES         AGE      VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                             KERNEL-VERSION     CONTAINER-RUNTIME
lvdkbm501   Ready,SchedulingDisabled   master        22d      v1.22.8   10.152.53.31    <none>        Flatcar Container Linux by Kinvolk 3139.2.0 (Oklo)   5.15.32-flatcar    containerd://1.5.11
lvdkbm502   Ready,SchedulingDisabled   master        22d      v1.22.8   10.152.53.32    <none>        Flatcar Container Linux by Kinvolk 3139.2.0 (Oklo)   5.15.32-flatcar    containerd://1.5.11
lvdkbm503   Ready,SchedulingDisabled   master        22d      v1.22.8   10.152.53.33    <none>        Flatcar Container Linux by Kinvolk 3139.2.0 (Oklo)   5.15.32-flatcar    containerd://1.5.11
lvdkbt501   Ready                      temp-worker   546d     v1.22.8   10.152.53.198   <none>        Flatcar Container Linux by Kinvolk 3033.2.4 (Oklo)   5.10.107-flatcar   containerd://1.5.10
lvdkbt502   Ready                      temp-worker   546d     v1.22.8   10.152.53.213   <none>        Flatcar Container Linux by Kinvolk 3139.2.0 (Oklo)   5.15.32-flatcar    containerd://1.5.11
lvdkbw501   Ready                      worker        252d     v1.22.8   10.152.53.34    <none>        Flatcar Container Linux by Kinvolk 3139.2.0 (Oklo)   5.15.32-flatcar    containerd://1.5.11
lvdkbw502   Ready                      worker        3y252d   v1.22.8   10.152.53.35    <none>        Flatcar Container Linux by Kinvolk 3033.2.4 (Oklo)   5.10.107-flatcar   containerd://1.5.10
lvdkbw503   Ready                      mas,worker    3y252d   v1.22.8   10.152.53.36    <none>        Flatcar Container Linux by Kinvolk 3033.2.4 (Oklo)   5.10.107-flatcar   containerd://1.5.10
lvdkbw504   Ready                      worker        281d     v1.22.8   10.152.53.37    <none>        Flatcar Container Linux by Kinvolk 3033.2.4 (Oklo)   5.10.107-flatcar   containerd://1.5.10
@igcherkaev (Author)

Verified yesterday: same behavior when upgrading to Calico 3.22.0.

@tomastigera (Contributor)

To report some progress, I can reproduce the issue. Sorry it took so long.

@tomastigera (Contributor)

This issue is not present in the upcoming 3.23 and will eventually be fixed in the next 3.22.x release.
