Health checks fail, phantom ENI in logs #1572

Closed
therc opened this issue Aug 11, 2021 · 7 comments
Labels: bug, stale


@therc

therc commented Aug 11, 2021

What happened:
One r5.4xlarge machine, with 15 IP addresses already assigned, has had three containers stuck in ContainerCreating state for almost a day now.

Looking at ipamd logs, this stands out:

{"level":"debug","ts":"2021-08-11T18:15:02.964Z","caller":"ipamd/ipamd.go:1106","msg":"Total number of interfaces found: 2 "}
{"level":"debug","ts":"2021-08-11T18:15:02.964Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI MAC address: 0a:16:40:c0:05:81"}
{"level":"debug","ts":"2021-08-11T18:15:02.966Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI: eni-0dd22c2d2aa07e244, MAC 0a:16:40:c0:05:81, device 1"}
{"level":"debug","ts":"2021-08-11T18:15:02.967Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI MAC address: 0a:1d:84:b2:f1:7d"}
{"level":"debug","ts":"2021-08-11T18:15:02.969Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI: eni-0fa8cac2b043fb6e7, MAC 0a:1d:84:b2:f1:7d, device 0"}
{"level":"debug","ts":"2021-08-11T18:15:02.971Z","caller":"ipamd/ipamd.go:557","msg":"A new ENI added but not by ipamd, updating tags by calling EC2"}
{"level":"debug","ts":"2021-08-11T18:15:02.971Z","caller":"awsutils/awsutils.go:1027","msg":"Total number of interfaces found: 2 "}
{"level":"debug","ts":"2021-08-11T18:15:02.971Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI MAC address: 0a:16:40:c0:05:81"}
{"level":"debug","ts":"2021-08-11T18:15:02.977Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI: eni-0dd22c2d2aa07e244, MAC 0a:16:40:c0:05:81, device 1"}
{"level":"debug","ts":"2021-08-11T18:15:02.979Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI MAC address: 0a:1d:84:b2:f1:7d"}
{"level":"debug","ts":"2021-08-11T18:15:02.980Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI: eni-0fa8cac2b043fb6e7, MAC 0a:1d:84:b2:f1:7d, device 0"}
{"level":"error","ts":"2021-08-11T18:15:03.054Z","caller":"ipamd/ipamd.go:1136","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-0dd22c2d2aa07e244 eni-0fa8cac2b043fb6e7]: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-0dd22c2d2aa07e244' does not exist\n\tstatus code: 400, request id: 151824b3-edfc-44d9-8489-98aaece8a31d"}
{"level":"debug","ts":"2021-08-11T18:15:03.054Z","caller":"ipamd/ipamd.go:1136","msg":"Could not find interface: The networkInterface ID 'eni-0dd22c2d2aa07e244' does not exist, ID: eni-0dd22c2d2aa07e244"}

I thought this might be due to stale metadata, but the problem persists even after updating to 1.9.0, which is supposed to carry some partial fixes.
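For anyone triaging similar logs, here is a rough sketch of how the mismatch above could be pulled out of ipamd.log automatically. It relies only on the two message shapes visible in this report ("Found ENI: ..." and "The networkInterface ID '...' does not exist"); the function and pattern names are made up for illustration, and other plugin versions may word these messages differently.

```python
import json
import re

# Patterns assumed from the log lines quoted in this issue only.
ENI_FOUND = re.compile(r"Found ENI: (eni-[0-9a-f]+), MAC ([0-9a-f:]+), device (\d+)")
ENI_MISSING = re.compile(r"The networkInterface ID '(eni-[0-9a-f]+)' does not exist")


def phantom_enis(log_lines):
    """Return {eni_id: (mac, device)} for ENIs that were found locally
    but that EC2 reported as nonexistent."""
    found, missing = {}, set()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log record
        if not isinstance(record, dict):
            continue
        msg = record.get("msg", "")
        match = ENI_FOUND.search(msg)
        if match:
            found[match.group(1)] = (match.group(2), int(match.group(3)))
        missing.update(ENI_MISSING.findall(msg))
    return {eni: found[eni] for eni in sorted(missing) if eni in found}


# Three lines copied from the log excerpt in this issue.
SAMPLE = r'''
{"level":"debug","ts":"2021-08-11T18:15:02.966Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI: eni-0dd22c2d2aa07e244, MAC 0a:16:40:c0:05:81, device 1"}
{"level":"debug","ts":"2021-08-11T18:15:02.969Z","caller":"awsutils/awsutils.go:539","msg":"Found ENI: eni-0fa8cac2b043fb6e7, MAC 0a:1d:84:b2:f1:7d, device 0"}
{"level":"debug","ts":"2021-08-11T18:15:03.054Z","caller":"ipamd/ipamd.go:1136","msg":"Could not find interface: The networkInterface ID 'eni-0dd22c2d2aa07e244' does not exist, ID: eni-0dd22c2d2aa07e244"}
'''

if __name__ == "__main__":
    print(phantom_enis(SAMPLE.splitlines()))
```

On the sample lines this flags eni-0dd22c2d2aa07e244 (device 1) and leaves eni-0fa8cac2b043fb6e7 (device 0) alone.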

An additional question: why a second ENI? Isn't the machine supposed to support 30 addresses per ENI? Then I remembered that the plugin was running with some custom settings to reduce calls to EC2 that would get us rate-limited:

WARM_ENI_TARGET=1
WARM_IP_TARGET=3

So the former might explain the second ENI, but not why the plugin gets into this state and never recovers.
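As a back-of-the-envelope check on the numbers above, here is a small sketch of how many ENIs the 15 assigned addresses plus the warm IP target would need. This is a simplified model and not ipamd's actual allocation logic: it assumes 30 addresses per ENI (the figure quoted above), that each ENI's primary address is not handed to pods, and it ignores WARM_ENI_TARGET entirely. The function name is hypothetical.

```python
import math


def enis_needed(assigned_ips, warm_ip_target, ips_per_eni):
    """Estimate ENIs required to hold assigned plus warm IPs.

    Simplifying assumption: one address per ENI (its primary IP) is not
    available to pods, so each ENI contributes ips_per_eni - 1 addresses.
    """
    usable_per_eni = ips_per_eni - 1
    wanted = assigned_ips + warm_ip_target
    # The primary ENI always exists, so never report fewer than one.
    return max(1, math.ceil(wanted / usable_per_eni))


if __name__ == "__main__":
    # 15 assigned + 3 warm = 18 addresses, which fits in 29 usable slots.
    print(enis_needed(assigned_ips=15, warm_ip_target=3, ips_per_eni=30))  # 1
```

Under these assumptions a single ENI covers the IP target, so any extra ENI would have to come from somewhere else, such as the warm ENI setting.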

What you expected to happen:
the plugin works

How to reproduce it (as minimally and precisely as possible):
No idea how exactly, but WARM_ENI_TARGET>0 might be required. This is happening on just a few machines, out of many hundreds, and this one is the most affected by far.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.7-eks-d88609", GitCommit:"d886092805d5cc3a47ed5cf0c43de38ce442dfcb", GitTreeState:"clean", BuildDate:"2021-07-31T00:29:12Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

  • CNI Version 1.9.0

  • OS (e.g: cat /etc/os-release):

  • Kernel (e.g. uname -a): Linux ip-10-1-135-162.ec2.internal 5.4.117-58.216.amzn2.x86_64 #1 SMP Tue May 11 20:50:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

therc added the bug label Aug 11, 2021
@therc
Author

therc commented Aug 11, 2021

Setting WARM_ENI_TARGET to 0, until we start using the new options to reserve prefixes, seems to make the problem go away for now.

@jayanthvn
Contributor

@therc

Based on the logs, two ENIs were retrieved from IMDS, and one of them is stale. The other ENI you are seeing is the primary ENI, eni-0fa8cac2b043fb6e7.
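To see the IMDS view directly on an affected node, something like the sketch below could be used. It reads the standard instance metadata paths under network/interfaces/macs/ with an IMDSv2 token; the function names are illustrative, and this is not how ipamd itself is implemented. The fetcher is passed in as a parameter so the listing logic can be exercised without a real metadata endpoint.

```python
import urllib.request

IMDS = "http://169.254.169.254/latest"


def imds_fetch(path, timeout=2.0):
    """Fetch one metadata path using an IMDSv2 session token."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode()


def list_imds_enis(fetch=imds_fetch):
    """Return {eni_id: (mac, device_number)} as reported by instance metadata."""
    enis = {}
    for entry in fetch("network/interfaces/macs/").split():
        mac = entry.rstrip("/")  # entries are listed with a trailing slash
        eni_id = fetch(f"network/interfaces/macs/{mac}/interface-id").strip()
        device = int(fetch(f"network/interfaces/macs/{mac}/device-number").strip())
        enis[eni_id] = (mac, device)
    return enis


if __name__ == "__main__":
    # Only meaningful when run on an EC2 instance.
    for eni_id, (mac, device) in sorted(list_imds_enis().items()):
        print(eni_id, mac, device)
```

Any ENI ID printed here that ec2:DescribeNetworkInterfaces rejects with InvalidNetworkInterfaceID.NotFound would be a stale metadata entry of the kind described above.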

Setting WARM_IP_TARGET will override WARM_ENI_TARGET, so do you have both configured? Regarding the IPAMD issue: for the 3 pods stuck in ContainerCreating, can you please share the reason one of them is stuck? /var/log/aws-routed-eni/ipamd.log should have the error. Please also share the last few Pool stats log lines from that file.

@jayanthvn
Contributor

@therc - Could you please attach the logs produced by running sudo bash /opt/cni/bin/aws-cni-support.sh on one of the impacted nodes? With those we can help debug this further.

@jayanthvn
Contributor

@therc - Can you please share the instance logs? You can run this script - sudo bash /opt/cni/bin/aws-cni-support.sh

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions bot added the stale label Apr 16, 2022
@jayanthvn
Contributor

The second ENI is a stale ENI and it is expected behavior. Please feel free to open an issue for debugging the pod which is stuck in container creating.
