Expected Behavior
Upgrading from one minor version to another should be as smooth as possible. All services and workloads in the cluster should be reachable as they were prior to the upgrade.
Current Behavior
We're seeing a weird behavior where the BPF programs choose the wrong backends, seemingly at random, on each node of the cluster. For one service a node may pick the right backend, for another a wrong one, and on the next node this varies and affects a different set of services/backends. Some extra details may still be available in this Slack thread: https://calicousers.slack.com/archives/C0BCA117T/p1650303706854169
Here are some snippets from the thread (in case it gets archived and/or deleted). Background information:
lvdkbm502 - one of the nodes in the cluster (role: master, IP address: 10.152.53.32)
10.158.8.246:443 - an in-cluster service, backed by a pod with IP address/port 10.158.223.191:9443
10.158.144.220:9101 - a pod for a different service in a different namespace, completely unrelated to 10.158.8.246.
lvdkbm502 ~ # curl -kv http://10.158.8.246:443/ABC
* Trying 10.158.8.246:443...
* Connected to 10.158.8.246 (10.158.144.220) port 443 (#0)
tc exec bpf debug produced the following for this curl call:

Where addr=a9e08f6 port=443 is 10.158.8.246:443, but the selected backend is ip=a9e90dc port=9101 = 10.158.144.220:9101, which is totally incorrect.
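For reference, the addr/ip values in that debug output are IPv4 addresses printed as 32-bit hex. A minimal Go sketch for decoding them (assuming the bytes are printed most-significant first, which matches the addresses above):

```go
// Decode the hex "addr"/"ip" values quoted from the tc exec bpf debug output
// into dotted-quad IPv4 addresses, to cross-check which backend the BPF
// program actually selected. Only the two hex strings come from the report;
// the rest is illustrative.
package main

import (
	"fmt"
	"strconv"
)

// hexToIPv4 interprets a hex string such as "a9e08f6" as a big-endian
// 32-bit IPv4 address (0x0A9E08F6 -> 10.158.8.246).
func hexToIPv4(h string) (string, error) {
	v, err := strconv.ParseUint(h, 16, 32)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%d.%d.%d.%d", byte(v>>24), byte(v>>16), byte(v>>8), byte(v)), nil
}

func main() {
	for _, h := range []string{"a9e08f6", "a9e90dc"} {
		ip, err := hexToIPv4(h)
		if err != nil {
			fmt.Println("bad value:", h, err)
			continue
		}
		// Prints: a9e08f6 -> 10.158.8.246 and a9e90dc -> 10.158.144.220
		fmt.Printf("%s -> %s\n", h, ip)
	}
}
```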
The NAT table dumped from the respective calico-node seemed to be correct:

The bpf conntrack table was just confirming the incorrectly chosen backend and would look like:

Possible Solution
What fixes it for us is rebooting each node after the upgrade, which isn't a feasible solution in large clusters with tens or hundreds of nodes in production use.
Since a reboot fixes it, that makes me think some leftovers are not being cleaned up properly between the upgrade and the startup of calico-node in 3.22.x.
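One way to poke at the leftover theory without rebooting might be to look at which BPF map files stay pinned on a node across the upgrade. A rough sketch, assuming the maps are pinned under /sys/fs/bpf/tc/globals and carry a cali name prefix (both are assumptions, not something confirmed in this report):

```go
// List pinned BPF map files so that a before/after-upgrade diff can show
// whether anything from the previous calico-node version is left behind.
// The pin directory and the "cali" prefix below are assumptions.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	pinDir := "/sys/fs/bpf/tc/globals" // assumed pin location

	entries, err := os.ReadDir(pinDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read pin dir:", err)
		os.Exit(1)
	}
	for _, e := range entries {
		// Mark entries that look Calico-related; anything unexpected or stale
		// should stand out when comparing nodes before and after the upgrade.
		marker := "  "
		if strings.HasPrefix(e.Name(), "cali") {
			marker = "* "
		}
		fmt.Println(marker + filepath.Join(pinDir, e.Name()))
	}
}
```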
Steps to Reproduce (for bugs)
1. Install calico 3.21.4.
2. Test multiple services with curl from different nodes.
3. Upgrade calico to 3.22.1 or 3.22.2 without rebooting the nodes.
4. Test the services again from multiple nodes (a rough harness for this is sketched below). In our case the behavior is as described above: more often than not, requests go to the wrong backend.
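For steps 2 and 4, a small harness like the one below could replace the manual curl runs: it probes a list of ClusterIP services from the current node and prints a per-service fingerprint (status code plus Server header), which can then be diffed across nodes and before/after the upgrade. The service list, the https scheme, and the fingerprinting approach are illustrative assumptions:

```go
// Probe each service and print a response fingerprint; a backend from an
// unrelated service will usually answer with a different status/Server header
// (or fail outright), which is what we want to spot.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	services := []string{
		"https://10.158.8.246:443/", // service from the report (scheme assumed); add the rest here
	}

	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Equivalent of curl -k: we care about who answers, not cert validity.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	for _, url := range services {
		resp, err := client.Get(url)
		if err != nil {
			fmt.Printf("%-28s ERROR %v\n", url, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("%-28s %d server=%q\n", url, resp.StatusCode, resp.Header.Get("Server"))
	}
}
```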
Context
Basically we're trying to upgrade calico so we are not too far behind the latest. This is the first time in 5+ years that we could not upgrade calico in place without rebooting everything.
Your Environment