Pods cannot talk to cluster IPs on Ubuntu 22.04 #2103
This looks similar to #1847 (comment), and the workaround in #1847 (comment) has helped there. As @achevuru suggested, Amazon Linux 2 images use iptables-legacy by default as well. We will check and update here if there is something we can do to address this scenario.
Let me try that workaround. If it works, I think it would be helpful if an iptables-nft image could be published. I imagine it wouldn't be too much work to do that.
Thanks, please let us know if it works.
Unfortunately, no luck. The workaround does remove the rules from iptables-legacy, and I now see them in nftables, but Pods still cannot talk to cluster IPs. I can also confirm that I still see nothing interesting in the logs, and Pods do get their IPs.
Any idea how to progress on this? Anywhere we should look for potential issues?
@olemarkus - Sorry for the delay. Since you mentioned pod-to-pod communication is broken: if you haven't already verified this, can you please run tcpdump on the sender pod's host-side veth, the sender node, the receiving node, and the receiving node's host-side veth? This should provide context on where the traffic is getting dropped.
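For reference, a minimal tcpdump sketch along those lines (interface names and the pod IP are placeholders, not values from this issue; `-e` prints the Ethernet header, which is useful if L2 details turn out to matter):

```sh
# On the sending node: the sender pod's host-side veth (placeholder name)
sudo tcpdump -ne -i enixxxxxxxxxxx host 10.0.12.34

# Then the sending node's primary interface, the receiving node's primary
# interface, and finally the receiving pod's host-side veth
sudo tcpdump -ne -i eth0 host 10.0.12.34
```

The hop where the pod's source IP stops appearing is where the traffic is being dropped.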
As far as I can tell, pod-to-pod communication works when Pods talk to each other directly. It's pod -> ClusterIP that does not work, except when the Pod is running in hostNetwork mode.
Were you able to track the packet via tcpdump?
Right. Running tcpdump against A's veth, I see the packets going from the pod IP to the service IP. However, tcpdump on any host interface shows no packets coming in from A's IP. This is with a custom build of aws-node using nft iptables. The DNAT rules seem to be working fine, since connecting from the host to the cluster IP works.
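One way to double-check the DNAT step (a sketch; the ClusterIP is a placeholder and conntrack-tools may need to be installed):

```sh
# If kube-proxy's DNAT matched, conntrack shows the ClusterIP rewritten to a pod IP
sudo conntrack -L -d 172.20.0.10 2>/dev/null
```

An entry whose reply tuple points at a pod IP confirms the service translation happened, so the drop is further along the path.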
So, if I understood right: a connection to a Pod via ClusterIP from a pod without hostNetwork fails, but the same works the other way around? When you say tcpdump against the Pod's veth, are you referring to the veth interface inside the pod's network namespace or the veth interface in the host network namespace? Would you be able to share the iptables output?
I have not tested from B to A. Also worth mentioning that this is very easy to reproduce with the latest kOps using, e.g., an Ubuntu 22.04 image.
Interesting. If you see the packet on the host-side veth, then we know it landed on the host network end, and the behavior should now be similar to a connection that we initiate from the node. Are there any active network policies on the node? We will check the logs/iptables output once we receive them and update here. We will see if we can reproduce with the above image as well.
Sent the iptables output. Also tried disabling rp_filter on the veth interface, but it didn't seem to have much effect. There are no network policies or anything similar on the node, other than whatever Ubuntu 22.04 may be doing by default.
Troubleshooting a similar issue on our cluster. A configuration that works on Ubuntu 20.04 doesn't work after an upgrade to 22.04.
I think I found the issue.
@veshij VPC CNI does add a static ARP entry for the default GW (169.254.1.1, pointing to the host-side veth) inside the pod network namespace, so it is essentially for the host-side veth.
Are you saying the packet is dropped at the host veth because of an L2 header discrepancy, i.e., a mismatch with the host veth's MAC? As you can see, we derive the hostVeth MAC and use it, so the veth MAC must be changing. We can see the veth MAC inside the pod network namespace and compare it against the current value.
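A quick way to compare the two (a minimal sketch; the pod PID and the host-side veth name are placeholders):

```sh
POD_PID=12345            # hypothetical PID of a process in the affected pod
HOST_VETH=enixxxxxxxxxxx # hypothetical host-side veth name

# Static ARP entry for the default gateway inside the pod's network namespace
sudo nsenter -t "$POD_PID" -n ip neigh show 169.254.1.1

# Current MAC of the host-side veth in the root namespace
ip -br link show "$HOST_VETH"
```

If the lladdr in the first output differs from the MAC in the second, the static ARP entry is stale and egress traffic from the pod is dropped at the host veth.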
Yes, that's exactly what happens on my system. I can confirm that on u22 (running a newer kernel) the MAC address of the host's veth doesn't match the static ARP record inside the pod, and moreover, this MAC address is not used on any other interface. Exactly the same CNI binary running on u20 (and an older kernel) has no issues. I'm troubleshooting it a bit further. I don't think it's a bug in the CNI code; currently I suspect either an issue with the netlink implementation/kernel netlink interface, or the MAC address changing over time on the veth interface (something similar to IPv6's privacy extensions).
The test case triggers the issue both in AWS and on-prem, on kernel 5.15.
Looks like it's udev.
https://www.freedesktop.org/software/systemd/man/systemd.link.html
u20:
u22:
We likely want to fix the implementation on the CNI side; I suppose changing the order so that the veth pair is created in the root namespace first and the device is then moved into the netns should be a reasonable workaround.
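For reference, a way to inspect which MAC address policy udev applies to host-side veths, plus the kind of `.link` drop-in commonly used to stop udev from rewriting veth MACs (a sketch only; paths follow standard systemd packaging and the interface name is a placeholder):

```sh
# Default policy shipped by systemd (persistent on newer releases)
grep MACAddressPolicy /usr/lib/systemd/network/99-default.link

# Dry-run udev's net_setup_link builtin against an existing host-side veth
sudo udevadm test-builtin net_setup_link /sys/class/net/enixxxxxxxxxxx

# Possible mitigation: keep the kernel-assigned MAC on veth devices
sudo tee /etc/systemd/network/00-veth-keep-mac.link <<'EOF'
[Match]
Driver=veth

[Link]
MACAddressPolicy=none
EOF
```

The drop-in only affects links created after it is in place, so existing pods would need their veths recreated (or the node rebooted) to pick it up.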
@jayanthvn @achevuru what do you think? More conventional approach:
Another option is to leave almost everything as is:
Unfortunately I'm not sure how to make it work without a sleep of some magic duration (it takes 100-200 ms on my system, but it can be worse if the host is heavily loaded).
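To illustrate the first approach with plain `ip` commands (a sketch of the idea only; interface and namespace names are made up, and the real change would live in the CNI's netlink code):

```sh
# Create the veth pair in the root namespace first (hypothetical names)
sudo ip link add eni-host type veth peer name eni-pod
sudo ip netns add demo-ns

# Wait for udev to finish processing the new links (MACAddressPolicy runs here),
# so the MAC we read next is the one the host side will keep
sudo udevadm settle
HOST_MAC=$(cat /sys/class/net/eni-host/address)

# Only then move the pod end into its namespace and bring both ends up
sudo ip link set eni-pod netns demo-ns
sudo ip netns exec demo-ns ip link set eni-pod up
sudo ip link set eni-host up

# The static ARP entry written inside the namespace now matches the host veth's MAC
sudo ip netns exec demo-ns ip neigh replace 169.254.1.1 lladdr "$HOST_MAC" dev eni-pod nud permanent
```

Creating both ends in the root namespace first lets udev settle the host-side MAC before it is recorded, instead of racing with the move into the netns.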
FWIW, I found this workaround during node setup helped resolve the issue:
Thanks @kwohlfahrt. The proposed PR #2118 changes the CNI to create the veth pair in the root namespace first and then move the device into the netns.
Was this resolved? |
/reopen |
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
This still needs a fix |
It also prevents a new kOps cluster with networking=amazonvpc from coming up healthy. In my case the core-dns-xx and ebs-csi-node pods kept crashing. For core-dns the log read: plugin/error timeout when trying to connect to the Amazon-provided DNS server. For ebs-csi-node the error was about being unable to get the Node (it was trying 100.64. - not sure why). The workaround is to use the 20.04 image instead. The error messages are so cryptic that it took me a while to figure this out.
So running in AWS with spec.networking.amazonvpc and also using awsEBSCSIDriver is broken? Trying to upgrade my test cluster from 1.25 to 1.26, and the ebs-csi-node pod's ebs-plugin container on the new masters keeps crash-looping with this log
Running amazonvpc networking with ebs-csi seems like a pretty common use case to be so broken. |
@pmankad96 @btalbot I suggest filing a support case for this so that it can be investigated further.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
I haven't seen any comments or commits on this, so I presume that Ubuntu 22.04 is still broken on AWS running amazonvpc?
Not yet 🥲
@btalbot Ubuntu 22.04 works on EKS, you just have to set
Closing this as complete, since the troubleshooting doc informs people to set
This issue is now closed. Comments on closed issues are hard for our team to see. |
What happened:
After upgrading clusters to use Ubuntu 22.04 by default, the kOps e2e tests started failing for this CNI: https://testgrid.k8s.io/kops-network-plugins#kops-aws-cni-amazon-vpc
What seems to happen is that Pods do receive IPs, but they fail to talk across nodes. Calling, e.g., a ClusterIP service from the host works, but not from a Pod, so kube-proxy itself should be working just fine.
I cannot see anything wrong in any logs. What I do see is that the AWS-related rules are in legacy iptables, while kube-proxy uses nftables, so my guess is that this mismatch is the cause of this behavior; nft and legacy iptables must not be mixed anyway.
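A quick way to see which backend each component is writing to (a sketch; the AWS-* and KUBE-* chain prefixes are the usual ones created by the CNI and kube-proxy):

```sh
# Which iptables backend the node's alternatives point at
update-alternatives --display iptables 2>/dev/null | head -n 2

# Count CNI (AWS-*) and kube-proxy (KUBE-*) rules in each backend
sudo iptables-legacy-save | grep -cE '^-A (AWS|KUBE)'
sudo iptables-nft-save | grep -cE '^-A (AWS|KUBE)'
```

If the AWS-* rules land in one backend and the KUBE-* rules in the other, that is exactly the mixed legacy/nft situation suspected above.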
Attach logs
Example logs here: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/e2e-kops-aws-cni-amazon-vpc/1577618499142946816/artifacts/i-0d90e121da8bff687/
How to reproduce it (as minimally and precisely as possible):