Some pods don't get host -> pod route on newer Fedora CoreOS version #1514

Closed
masterzen opened this issue Jun 17, 2021 · 2 comments

@masterzen

What happened:
We're testing a Fedora CoreOS upgrade (from 33.20210426.3.0 to 34.20210529.3.0) on a test k8s cluster (non-EKS), and some pods are stuck in CrashLoopBackOff when the machine first boots because the route from the host to the pod hasn't been set.

For instance, the pod with IP 10.102.128.4 has no such route:

# ip route show table main
default via 10.102.128.1 dev ens5 proto dhcp metric 100
default via 10.102.128.1 dev ens6 proto dhcp metric 102
10.102.128.0/18 dev ens5 proto kernel scope link src 10.102.168.153 metric 100
10.102.128.0/18 dev ens6 proto kernel scope link src 10.102.154.115 metric 102
10.102.134.191 dev eniec4c2d67f18 scope link
10.102.163.181 dev eni3408eb5d67f scope link
10.102.165.80 dev eni4cbf85aa67f scope link
10.102.169.89 dev enife9c45d505d scope link
10.102.187.132 dev eni50cdd2eb92b scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1

There's no entry for 10.102.128.4.
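
A quick way to confirm this (just a diagnostic sketch, using the pod IP from above): ip route get shows which route the kernel would actually pick, and with the per-pod /32 missing it falls back to the /18 subnet route on ens5 instead of the pod's eni* veth; ip rule list shows whether a matching policy rule exists:

# ip route get 10.102.128.4
# ip rule list | grep 10.102.128.4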

Note that killing the pod is sometimes enough to make the network work again.

Note that the plugin doesn't report any errors when setting the route:

{"level":"info","ts":"2021-06-17T13:02:20.236Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"Received CNI add request: ContainerID(7e63d95cb137cd122a3fcebf04e1d2a25f9f65e4dcf05c9fb3bc23066f05d76c) Netns(/proc/9747/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=gatekeeper-system;K8S_POD_NAME=gatekeeper-audit-84964f86f-r9bqv;K8S_POD_INFRA_CONTAINER_ID=7e63d95cb137cd122a3fcebf04e1d2a25f9f65e4dcf05c9fb3bc23066f05d76c) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"debug","ts":"2021-06-17T13:02:20.236Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"MTU value set is 9001:"}
{"level":"info","ts":"2021-06-17T13:02:20.245Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"Received add network response for container 7e63d95cb137cd122a3fcebf04e1d2a25f9f65e4dcf05c9fb3bc23066f05d76c interface eth0: Success:true IPv4Addr:\"10.102.128.4\" UseExternalSNAT:true VPCcidrs:\"10.102.0.0/16\" "}
{"level":"debug","ts":"2021-06-17T13:02:20.245Z","caller":"routed-eni-cni-plugin/cni.go:194","msg":"SetupNS: hostVethName=eni1abcefcdbba, contVethName=eth0, netnsPath=/proc/9747/ns/net, deviceNumber=0, mtu=9001"}
{"level":"debug","ts":"2021-06-17T13:02:20.253Z","caller":"driver/driver.go:184","msg":"setupVeth network: disabled IPv6 RA and ICMP redirects on eni1abcefcdbba"}
{"level":"debug","ts":"2021-06-17T13:02:20.254Z","caller":"driver/driver.go:178","msg":"Setup host route outgoing hostVeth, LinkIndex 17"}
{"level":"debug","ts":"2021-06-17T13:02:20.254Z","caller":"driver/driver.go:178","msg":"Successfully set host route to be 10.102.128.4/0"}
{"level":"info","ts":"2021-06-17T13:02:20.254Z","caller":"driver/driver.go:178","msg":"Added toContainer rule for 10.102.128.4/32"}

In the past, we had an issue that looked like this, where systemd was changing the MAC address of the eni interfaces behind aws-cni's back, but this doesn't look exactly like the same issue.
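
For context, that earlier MAC problem is usually addressed with a systemd.link drop-in that stops udev from rewriting the MAC of the host-side veths; the path below is illustrative:

# /etc/systemd/network/98-eni.link (illustrative path)
[Match]
OriginalName=eni*
[Link]
MACAddressPolicy=none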

I'm currently at a loss for how to troubleshoot this issue; can anyone offer some help?

Attach logs

eks_i-0fc36fa426a34ba90_2021-06-17_1558-UTC_0.6.2.tar.gz

What you expected to happen:
I expected pod networking to work as it did with the previous version.

How to reproduce it (as minimally and precisely as possible):

Create a k8s cluster with Fedora CoreOS nodes and aws-cni.

Anything else we need to know?:

We need to test with older kernel and/or systemd combinations to determine which one introduces this issue.

Environment:

  • aws-cni version: 1.7.10
  • Kubernetes version (use kubectl version): 1.18.6
  • CNI Version: 0.9.1
  • OS (e.g: cat /etc/os-release): Fedora CoreOS 34.20210529.3.0
  • Kernel (e.g. uname -a): Linux ip-10-102-168-153 5.12.7-300.fc34.x86_64 #1 SMP Wed May 26 12:58:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
@masterzen added the bug label on Jun 17, 2021
@masterzen
Author

After more troubleshooting, this looks like a race condition between NetworkManager and the aws-cni plugin.

We excluded the eni* interfaces from NetworkManager, and so far the issue no longer appears. We need to perform more tests to validate this solution.

Here's the config file /etc/NetworkManager/conf.d/aws-cni.conf:

[keyfile]
unmanaged-devices=interface-name:eni*;interface-name:veth*
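
After reloading NetworkManager, the eni*/veth* devices should show up with the "unmanaged" state (a quick sanity check; the grep pattern is just illustrative):

# systemctl reload NetworkManager
# nmcli device status | grep -E 'eni|veth'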

For good measure, we also excluded them from systemd-networkd with a file at /etc/systemd/network/10-aws-k8s-cni.network:

[Match]
OriginalName=eni*
[Link]
Unmanaged=yes
ActivationPolicy=manual
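
If systemd-networkd is actually running on the node, networkctl should then report those interfaces as unmanaged once the daemon has re-read its config (a sanity check, not part of the fix itself):

# systemctl restart systemd-networkd
# networkctl list | grep eni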

@jayanthvn
Contributor

Hey @masterzen

Thanks for confirming. Please feel free to reopen the issue if you still need any help.
