
EC2 metadata not in sync is causing aws-node to be stuck in CrashLoopBackOff state #1340

Closed
kakarotbyte opened this issue Dec 22, 2020 · 3 comments
@kakarotbyte

IPAMD is restarting with the below messages.

{"level":"debug","ts":"2020-11-21T12:18:10.565Z","caller":"awsutils/awsutils.go:388","msg":"Update ENI eni-xxxxxxxxxx"}
{"level":"error","ts":"2020-11-21T12:18:10.773Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: ipamd: can not initialize with AWS SDK interface: refreshSGIDs: unable to update the ENI's SG: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-xxxxxxxxxxxx' does not exist\n\tstatus code: 400, request id: aaaaaa-bbbbbb-cccccc-ccccc-sssssss"}

From awsutils line 458, we understand that the ModifyNetworkInterfaceAttribute action is triggering the above error message.

From the CloudTrail API calls I was able to confirm that the ENI was created and deleted by the VPC CNI; however, it was not cleared from EC2 instance metadata. Since the metadataMACPath function uses the snippet below to make the metadata call that gathers the ENI ID, we are seeing the above issue.

eniID, err = cache.ec2Metadata.GetMetadata(metadataMACPath + eniMAC + metadataInterface)

I have confirmed by running the below curl call that the deleted ENI is still present on the metadata.

curl http://169.254.169.254/2020-10-27/meta-data/network/interfaces/macs/<mac-address>/interface-id

Summary

IPAMD should be able to handle an out-of-sync IMDS in cases like the above. Otherwise aws-node goes into CrashLoopBackOff, and pods scheduled onto the node get no IPs and are stuck in a creating state.

A similar issue from the past: #1177

Why is this needed:
The CNI should handle similar IMDS-related issues gracefully instead of leaving the node and all its pods stuck.

@falgofrancis

Is there any workaround for this issue? Currently we drain the node.

@jayanthvn
Contributor

jayanthvn commented Apr 9, 2021

@falgofrancis - #1341 will avoid the aws-node crash on boot-up if the metadata has stale data, and a counter has been added that keeps track of this. This is only part of the fix; we are tracking the race condition that causes IMDS to never sync with the EC2 team. The current workaround is to drain the node.

@jayanthvn
Contributor

Release 1.8 has the fix on the IPAMD side. Internally we are tracking the issue on the EC2 side.
