
Calico upgrade from 3.21.4 to 3.22.x causes issues (nat, bpf) #5957

Closed
igcherkaev opened this issue Apr 21, 2022 · 3 comments

@igcherkaev

Expected Behavior

Upgrading from one minor version to another should be as smooth as possible. All services and workloads in the cluster should remain reachable exactly as they were prior to the upgrade.

Current Behavior

We're seeing weird behavior where the BPF programs choose the wrong backends, and it happens randomly on each node of the cluster. That is, for one service a node may pick the right backend, while for another it picks a wrong one; on the next node the affected set of services/backends may be different. Some extra details may still be available in this Slack thread: https://calicousers.slack.com/archives/C0BCA117T/p1650303706854169

Here are some snippets from the thread (in case it gets archived and/or deleted). Background information:

  • lvdkbm502 - one of the nodes in the cluster (role: master, ip address: 10.152.53.32)
  • 10.158.8.246:443 - an in-cluster service, backed by a pod at 10.158.223.191:9443
  • 10.158.144.220:9101 - a pod for a different service in a different namespace, completely unrelated to 10.158.8.246.
lvdkbm502 ~ # curl -kv http://10.158.8.246:443/ABC
*   Trying 10.158.8.246:443...
* Connected to 10.158.8.246 (10.158.144.220) port 443 (#0)

tc exec bpf debug produced the following trace for this curl call:

curl-2165191 [001] d... 1964894.410344: bpf_trace_printk: CALI-C: NAT: 1st level lookup addr=a9e08f6 port=443 protocol=6.
curl-2165191 [001] d... 1964894.410345: bpf_trace_printk: CALI-C: NAT: 1st level hit; id=25
curl-2165191 [001] d... 1964894.410346: bpf_trace_printk: CALI-C: NAT: 1st level hit; id=25 ordinal=0
curl-2165191 [001] d... 1964894.410348: bpf_trace_printk: CALI-C: NAT: backend selected a9e90dc:9101
curl-2165191 [001] d... 1964894.410349: bpf_trace_printk: CALI-C: Store: ip=a9e90dc port=9101 cookie=2f9a21

Here addr=a9e08f6 port=443 is 10.158.8.246:443, but the selected backend ip=a9e90dc port=9101 is 10.158.144.220:9101, which is completely incorrect.
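
For reference, the hex addresses in the trace can be decoded back to dotted-quad form with a small shell helper like the sketch below (the helper name is mine; it assumes the address prints most-significant byte first, as in the trace above):

hex_to_ip() {
  # Zero-pad to 8 hex digits, then print each byte as a decimal octet.
  local h
  h=$(printf '%08x' "0x$1")
  printf '%d.%d.%d.%d\n' "0x${h:0:2}" "0x${h:2:2}" "0x${h:4:2}" "0x${h:6:2}"
}

hex_to_ip a9e08f6   # 10.158.8.246   (the service VIP)
hex_to_ip a9e90dc   # 10.158.144.220 (the wrongly selected backend)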

The NAT table dumped from the respective calico-node looked correct:

kubectl -n kube-system exec -ti calico-node-wg5x5 -- calico-node -bpf nat dump
...
10.158.8.246 port 443 proto 6 id 0 count 1 local 0
        0:0      10.158.223.191:9443
...

The BPF conntrack table simply confirmed the incorrectly chosen backend and looked like:

ConntrackKey{proto=6 10.152.53.32:40514 <-> 10.158.144.220:9101} -> Entry{Type:0, Created:1965499842169890, LastSeen:1965499847562511, Flags: <none> Data: {A2B:{Seqno:3632491565 SynSeen:true AckSeen:true FinSeen:true RstSeen:false Whitelisted:false Opener:true Ifindex:0} B2A:{Seqno:2784815132 SynSeen:true AckSeen:true FinSeen:true RstSeen:false Whitelisted:true Opener:false Ifindex:0} OrigDst:0.0.0.0 OrigPort:0 OrigSPort:0 TunIP:0.0.0.0}} Age: 6.235599374s Active ago 6.230206753s CLOSED
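
For completeness, that conntrack entry can be pulled from the node the same way as the NAT table; something along these lines (the pod name is the one from the NAT dump above, and the exact subcommand spelling may differ between releases):

kubectl -n kube-system exec -ti calico-node-wg5x5 -- calico-node -bpf conntrack dump | grep 10.158.144.220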

Possible Solution

What fixes it for us is rebooting each node after the upgrade, which isn't a feasible solution for large production clusters with tens or hundreds of nodes.

Since a reboot fixes it, that makes me think some leftover state isn't being cleaned up properly during the upgrade or on startup of calico-node in 3.22.x.
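
One rough way to check what BPF state survives the upgrade on a node is to look at the pinned maps on the host; a sketch, assuming Calico pins its maps as cali_* under /sys/fs/bpf/tc/globals (pin names vary by release, and cali_v4_nat_fe below is an assumed name):

# List Calico's pinned BPF maps on the host.
ls /sys/fs/bpf/tc/globals/ | grep cali

# Show map IDs, types and sizes as the kernel sees them.
bpftool map show | grep -i cali

# Dump a specific pinned map, e.g. the (assumed) v4 NAT frontend table.
bpftool map dump pinned /sys/fs/bpf/tc/globals/cali_v4_nat_fe | head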

Steps to Reproduce (for bugs)

  1. Install calico 3.21.4.
  2. Test multiple services with curl from different nodes.
  3. Upgrade calico to 3.22.1 or 3.22.2 without rebooting nodes.
  4. Test the services again from multiple nodes (see the sketch after this list). In our case the behavior is as described above: more often than not, requests go to a wrong backend.
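
A minimal sketch of step 4, assuming SSH access to the nodes as Flatcar's core user and using the service from the example above (node names are illustrative):

SVC=10.158.8.246:443
for node in lvdkbm501 lvdkbm502 lvdkbm503 lvdkbw501; do
  echo "== $node =="
  # If the BPF NAT picks a wrong backend, the response (or the TLS
  # handshake) differs from what the real backend would return.
  ssh "core@$node" "curl -sk -o /dev/null -w '%{http_code}\n' https://$SVC/"
done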

Context

Basically, we're trying to upgrade Calico so we're not too far behind the latest release. This is the first time in 5+ years that we could not upgrade Calico in place without rebooting everything.

Your Environment

  • Calico version:
$ calicoctl version
Client Version:    v3.22.2
Git commit:        14cf6d6ea
Cluster Version:   v3.22.1
Cluster Type:      typha,kdd,k8s,bgp,lvd
  • Orchestrator version (e.g. kubernetes, mesos, rkt):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.8", GitCommit:"7061dbbf75f9f82e8ab21f9be7e8ffcaae8e0d44", GitTreeState:"clean", BuildDate:"2022-03-16T14:04:34Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • Operating System and version:
$ kubectl get nodes -o wide
NAME        STATUS                     ROLES         AGE      VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                             KERNEL-VERSION     CONTAINER-RUNTIME
lvdkbm501   Ready,SchedulingDisabled   master        22d      v1.22.8   10.152.53.31    <none>        Flatcar Container Linux by Kinvolk 3139.2.0 (Oklo)   5.15.32-flatcar    containerd://1.5.11
lvdkbm502   Ready,SchedulingDisabled   master        22d      v1.22.8   10.152.53.32    <none>        Flatcar Container Linux by Kinvolk 3139.2.0 (Oklo)   5.15.32-flatcar    containerd://1.5.11
lvdkbm503   Ready,SchedulingDisabled   master        22d      v1.22.8   10.152.53.33    <none>        Flatcar Container Linux by Kinvolk 3139.2.0 (Oklo)   5.15.32-flatcar    containerd://1.5.11
lvdkbt501   Ready                      temp-worker   546d     v1.22.8   10.152.53.198   <none>        Flatcar Container Linux by Kinvolk 3033.2.4 (Oklo)   5.10.107-flatcar   containerd://1.5.10
lvdkbt502   Ready                      temp-worker   546d     v1.22.8   10.152.53.213   <none>        Flatcar Container Linux by Kinvolk 3139.2.0 (Oklo)   5.15.32-flatcar    containerd://1.5.11
lvdkbw501   Ready                      worker        252d     v1.22.8   10.152.53.34    <none>        Flatcar Container Linux by Kinvolk 3139.2.0 (Oklo)   5.15.32-flatcar    containerd://1.5.11
lvdkbw502   Ready                      worker        3y252d   v1.22.8   10.152.53.35    <none>        Flatcar Container Linux by Kinvolk 3033.2.4 (Oklo)   5.10.107-flatcar   containerd://1.5.10
lvdkbw503   Ready                      mas,worker    3y252d   v1.22.8   10.152.53.36    <none>        Flatcar Container Linux by Kinvolk 3033.2.4 (Oklo)   5.10.107-flatcar   containerd://1.5.10
lvdkbw504   Ready                      worker        281d     v1.22.8   10.152.53.37    <none>        Flatcar Container Linux by Kinvolk 3033.2.4 (Oklo)   5.10.107-flatcar   containerd://1.5.10
@igcherkaev (Author)

Verified yesterday: same behavior when upgrading to Calico 3.22.0.

@tomastigera (Contributor)

To report some progress, I can reproduce the issue. Sorry it took so long.

@tomastigera (Contributor)

This issue is not present in the upcoming 3.23 and will eventually be fixed in the next 3.22.x release.
