Health checks fail, phantom ENI in logs #1572

therc · 2021-08-11T20:35:51Z

What happened:
One r5.4xlarge machine, with 15 existing IP addresses, has three containers in ContainerCreating state for almost a day now.

Looking at ipamd logs, this stands out:

{"level":"debug","ts":"2021-08-11T18:15:02.964Z","caller":"ipamd/ipamd.go:1106","msg":"Total number of interfaces found: 2 "}
{"level":"debug","ts":"2021-08-11T18:15:02.964Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI MAC address: 0a:16:40:c0:05:81"}
{"level":"debug","ts":"2021-08-11T18:15:02.966Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI: eni-0dd22c2d2aa07e244, MAC 0a:16:40:c0:05:81, device 1"}
{"level":"debug","ts":"2021-08-11T18:15:02.967Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI MAC address: 0a:1d:84:b2:f1:7d"}
{"level":"debug","ts":"2021-08-11T18:15:02.969Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI: eni-0fa8cac2b043fb6e7, MAC 0a:1d:84:b2:f1:7d, device 0"}
{"level":"debug","ts":"2021-08-11T18:15:02.971Z","caller":"ipamd/ipamd.go:557","msg":"A new ENI added but not by ipamd, updating tags by calling EC2"}
{"level":"debug","ts":"2021-08-11T18:15:02.971Z","caller":"awsutils/awsutils.go:1027","msg":"Total number of interfaces found: 2 "}
{"level":"debug","ts":"2021-08-11T18:15:02.971Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI MAC address: 0a:16:40:c0:05:81"}
{"level":"debug","ts":"2021-08-11T18:15:02.977Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI: eni-0dd22c2d2aa07e244, MAC 0a:16:40:c0:05:81, device 1"}
{"level":"debug","ts":"2021-08-11T18:15:02.979Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI MAC address: 0a:1d:84:b2:f1:7d"}
{"level":"debug","ts":"2021-08-11T18:15:02.980Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI: eni-0fa8cac2b043fb6e7, MAC 0a:1d:84:b2:f1:7d, device 0"}
{"level":"error","ts":"2021-08-11T18:15:03.054Z","caller":"ipamd/ipamd.go:1136","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-0dd22c2d2aa07e244 eni-0fa8cac2b043fb6e7]: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-0dd22c2d2aa07e244' does not exist\n\tstatus code: 400, request id: 151824b3-edfc-44d9-8489-98aaece8a31d"}
{"level":"debug","ts":"2021-08-11T18:15:03.054Z","caller":"ipamd/ipamd.go:1136","msg":"Could not find interface: The networkInterface ID 'eni-0dd22c2d2aa07e244' does not exist, ID: eni-0dd22c2d2aa07e244"}
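A minimal sketch for summarizing log lines like the ones above, assuming each line is a JSON object with a "msg" field as shown. The function name `summarize_ipamd_log` and the two regular expressions are illustrative helpers written against the message text in this excerpt. They are not part of the CNI plugin and may not match other plugin versions.

```python
import json
import re

# Patterns derived from the message text in the excerpt above (an assumption:
# other plugin versions may word these messages differently).
FOUND_ENI = re.compile(r"Found ENI: (eni-[0-9a-f]+), MAC ([0-9a-f:]+), device (\d+)")
MISSING_ENI = re.compile(r"The networkInterface ID '(eni-[0-9a-f]+)' does not exist")


def summarize_ipamd_log(lines):
    """Return (found, missing) for an iterable of ipamd log lines.

    found   maps ENI ID -> (MAC address, device number) from "Found ENI" messages.
    missing is the set of ENI IDs that an error message says do not exist.
    """
    found = {}
    missing = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        msg = entry.get("msg", "")
        match = FOUND_ENI.search(msg)
        if match:
            found[match.group(1)] = (match.group(2), int(match.group(3)))
        missing.update(MISSING_ENI.findall(msg))
    return found, missing
```

Passing an open log file to `summarize_ipamd_log` and intersecting `missing` with the keys of `found` would list the ENIs that the node still reports but EC2 no longer knows about.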

I thought this might be due to stale metadata, but the problem persists even after updating to 1.9.0, which is supposed to carry some partial fixes.

An additional question: why a second ENI? Isn't the machine supposed to support 30 addresses per ENI? Then I remembered that the plugin was running with some custom settings to reduce calls to EC2 that would get us rate-limited:

WARM_ENI_TARGET=1
WARM_IP_TARGET=3

The former might explain the second ENI, but not why the plugin gets into this state and never recovers.

What you expected to happen:
the plugin works

How to reproduce it (as minimally and precisely as possible):
No idea how exactly, but WARM_ENI_TARGET>0 might be required. This is happening on just a few machines, out of many hundreds, and this is the most affected by far.

Anything else we need to know?:

Environment:

Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.7-eks-d88609", GitCommit:"d886092805d5cc3a47ed5cf0c43de38ce442dfcb", GitTreeState:"clean", BuildDate:"2021-07-31T00:29:12Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
CNI Version 1.9.0
OS (e.g: cat /etc/os-release):
Kernel (e.g. uname -a): Linux ip-10-1-135-162.ec2.internal 5.4.117-58.216.amzn2.x86_64 #1 SMP Tue May 11 20:50:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux


therc · 2021-08-11T20:36:51Z

Setting WARM_ENI_TARGET to 0, until we start using the new options to reserve prefixes, seems to make the problem go away for now.

jayanthvn · 2021-08-24T18:34:00Z

@therc

Based on the logs, two ENIs were retrieved from IMDS, and one of them is stale. The other ENI you are seeing, eni-0fa8cac2b043fb6e7, is the primary ENI.
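As a rough illustration of what "stale" means here, the sketch below treats an ENI as stale when instance metadata still lists it but EC2 does not. It assumes the two lists of ENI IDs have already been collected by some other means. `find_stale_enis` is a hypothetical helper, not a function from the plugin.

```python
def find_stale_enis(imds_enis, ec2_enis):
    """Return ENI IDs reported by instance metadata but unknown to EC2.

    Both arguments are iterables of ENI ID strings gathered elsewhere
    (how they are gathered is outside this sketch).
    """
    return sorted(set(imds_enis) - set(ec2_enis))


# IDs taken from the logs in this issue: metadata reports both ENIs,
# while EC2 only recognizes the primary one.
stale = find_stale_enis(
    ["eni-0dd22c2d2aa07e244", "eni-0fa8cac2b043fb6e7"],
    ["eni-0fa8cac2b043fb6e7"],
)
print(stale)
```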

Setting WARM_IP_TARGET overrides WARM_ENI_TARGET, so do you have both configured? Regarding the ipamd issue: for the 3 pods stuck in ContainerCreating, can you share the reason one of them is stuck? /var/log/aws-routed-eni/ipamd.log should have the error. Please also share the last few Pool stats log lines from that file.
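The override mentioned above can be modelled roughly as follows. This is only a sketch of the precedence as stated in this comment, not the plugin's actual code. `effective_warm_target` is a hypothetical name, and treating an empty value as "unset" is an assumption.

```python
def effective_warm_target(env):
    """Return (variable name, value) for the warm-pool setting that applies.

    Models the rule that WARM_IP_TARGET, when set, takes precedence over
    WARM_ENI_TARGET. Returns None if neither variable is set.
    Assumption: an empty string counts as unset.
    """
    ip_target = env.get("WARM_IP_TARGET")
    if ip_target not in (None, ""):
        return "WARM_IP_TARGET", int(ip_target)
    eni_target = env.get("WARM_ENI_TARGET")
    if eni_target not in (None, ""):
        return "WARM_ENI_TARGET", int(eni_target)
    return None


# The settings reported in this issue: both variables are configured.
print(effective_warm_target({"WARM_ENI_TARGET": "1", "WARM_IP_TARGET": "3"}))
```

Under this model, the reported configuration is governed by WARM_IP_TARGET alone.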

jayanthvn · 2021-10-03T06:10:42Z

@therc - If you can attach the logs collected by running sudo bash /opt/cni/bin/aws-cni-support.sh on one of the impacted nodes, we can help debug this further.

jayanthvn · 2021-11-09T19:22:02Z

@therc - Can you please share the instance logs? You can run this script - sudo bash /opt/cni/bin/aws-cni-support.sh

github-actions · 2022-04-16T00:14:12Z

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

jayanthvn · 2022-04-19T19:01:24Z

The second ENI is a stale ENI, which is expected behavior. Please feel free to open a new issue to debug the pod that is stuck in ContainerCreating.

github-actions · 2022-04-19T19:01:55Z

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

therc added the bug label Aug 11, 2021

github-actions bot added the stale Issue or PR is stale label Apr 16, 2022

jayanthvn closed this as completed Apr 19, 2022
