Some pods don't get host -> pod route on newer Fedora CoreOS version #1514

Closed
masterzen opened this issue Jun 17, 2021 · 2 comments

@masterzen

What happened:
We're testing a Fedora CoreOS upgrade (from 33.20210426.3.0 to 34.20210529.3.0) on a test k8s cluster (non-EKS), and some pods are stuck in CrashLoopBackOff when the machine first boots because the route from the host to the pod hasn't been set.

For instance, the pod with IP 10.102.128.4 has no such route:

# ip route show table main
default via 10.102.128.1 dev ens5 proto dhcp metric 100
default via 10.102.128.1 dev ens6 proto dhcp metric 102
10.102.128.0/18 dev ens5 proto kernel scope link src 10.102.168.153 metric 100
10.102.128.0/18 dev ens6 proto kernel scope link src 10.102.154.115 metric 102
10.102.134.191 dev eniec4c2d67f18 scope link
10.102.163.181 dev eni3408eb5d67f scope link
10.102.165.80 dev eni4cbf85aa67f scope link
10.102.169.89 dev enife9c45d505d scope link
10.102.187.132 dev eni50cdd2eb92b scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1

There's no entry for 10.102.128.4.
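
A quick way to confirm this (just a diagnostic sketch, using the pod IP from above): ip route get shows which route the kernel would actually pick, and with the per-pod /32 missing it falls back to the /18 subnet route on ens5 instead of the pod's eni* veth; ip rule list shows whether a matching policy rule exists:

# ip route get 10.102.128.4
# ip rule list | grep 10.102.128.4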

Note that killing the pod is sometimes enough to make the network work again.

Note that the plugin doesn't report any errors when setting the route:

{"level":"info","ts":"2021-06-17T13:02:20.236Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"Received CNI add request: ContainerID(7e63d95cb137cd122a3fcebf04e1d2a25f9f65e4dcf05c9fb3bc23066f05d76c) Netns(/proc/9747/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=gatekeeper-system;K8S_POD_NAME=gatekeeper-audit-84964f86f-r9bqv;K8S_POD_INFRA_CONTAINER_ID=7e63d95cb137cd122a3fcebf04e1d2a25f9f65e4dcf05c9fb3bc23066f05d76c) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"debug","ts":"2021-06-17T13:02:20.236Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"MTU value set is 9001:"}
{"level":"info","ts":"2021-06-17T13:02:20.245Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"Received add network response for container 7e63d95cb137cd122a3fcebf04e1d2a25f9f65e4dcf05c9fb3bc23066f05d76c interface eth0: Success:true IPv4Addr:\"10.102.128.4\" UseExternalSNAT:true VPCcidrs:\"10.102.0.0/16\" "}
{"level":"debug","ts":"2021-06-17T13:02:20.245Z","caller":"routed-eni-cni-plugin/cni.go:194","msg":"SetupNS: hostVethName=eni1abcefcdbba, contVethName=eth0, netnsPath=/proc/9747/ns/net, deviceNumber=0, mtu=9001"}
{"level":"debug","ts":"2021-06-17T13:02:20.253Z","caller":"driver/driver.go:184","msg":"setupVeth network: disabled IPv6 RA and ICMP redirects on eni1abcefcdbba"}
{"level":"debug","ts":"2021-06-17T13:02:20.254Z","caller":"driver/driver.go:178","msg":"Setup host route outgoing hostVeth, LinkIndex 17"}
{"level":"debug","ts":"2021-06-17T13:02:20.254Z","caller":"driver/driver.go:178","msg":"Successfully set host route to be 10.102.128.4/0"}
{"level":"info","ts":"2021-06-17T13:02:20.254Z","caller":"driver/driver.go:178","msg":"Added toContainer rule for 10.102.128.4/32"}

In the past, we had an issue that looked like this, where systemd was changing the MAC address of the eni interfaces behind aws-cni's back, but this doesn't look exactly like the same issue.
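
For context, that earlier MAC problem is usually addressed with a systemd.link drop-in that stops udev from rewriting the MAC of the host-side veths; the path below is illustrative:

# /etc/systemd/network/98-eni.link (illustrative path)
[Match]
OriginalName=eni*
[Link]
MACAddressPolicy=none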

I'm currently at a loss for how to troubleshoot this issue; can anyone offer some help?

Attach logs

eks_i-0fc36fa426a34ba90_2021-06-17_1558-UTC_0.6.2.tar.gz

What you expected to happen:
I expected pod networking to work as it did with the previous version.

How to reproduce it (as minimally and precisely as possible):

Create a k8s cluster with Fedora CoreOS nodes and aws-cni.

Anything else we need to know?:

We need to test with older kernel and/or systemd combinations to determine which one introduces this issue.

Environment:

  • aws-cni version: 1.7.10
  • Kubernetes version (use kubectl version): 1.18.6
  • CNI Version: 0.9.1
  • OS (e.g: cat /etc/os-release): Fedora CoreOS 34.20210529.3.0
  • Kernel (e.g. uname -a): Linux ip-10-102-168-153 5.12.7-300.fc34.x86_64 #1 SMP Wed May 26 12:58:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
@masterzen added the bug label on Jun 17, 2021
@masterzen
Author

After more troubleshooting, this looks like a race condition between NetworkManager and the aws-cni plugin.

We excluded the eni* interfaces from NetworkManager, and so far the issue no longer appears. We need to perform more tests to validate this solution.

Here's the config file /etc/NetworkManager/conf.d/aws-cni.conf:

[keyfile]
unmanaged-devices=interface-name:eni*;interface-name:veth*
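
After reloading NetworkManager, the eni*/veth* devices should show up with the "unmanaged" state (a quick sanity check; the grep pattern is just illustrative):

# systemctl reload NetworkManager
# nmcli device status | grep -E 'eni|veth'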

For good measure, we also excluded them from systemd-networkd with a file at /etc/systemd/network/10-aws-k8s-cni.network:

[Match]
OriginalName=eni*
[Link]
Unmanaged=yes
ActivationPolicy=manual
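
If systemd-networkd is actually running on the node, networkctl should then report those interfaces as unmanaged once the daemon has re-read its config (a sanity check, not part of the fix itself):

# systemctl restart systemd-networkd
# networkctl list | grep eni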

@jayanthvn
Contributor

Hey @masterzen

Thanks for confirming. Please feel free to reopen the issue if you still need any help.
