
EC2 metadata not in sync is causing aws-node to be stuck in CrashLoopBackOff state #1340

Closed
kakarotbyte opened this issue Dec 22, 2020 · 3 comments
@kakarotbyte

IPAMD is restarting with the below messages.

{"level":"debug","ts":"2020-11-21T12:18:10.565Z","caller":"awsutils/awsutils.go:388","msg":"Update ENI eni-xxxxxxxxxx"}
{"level":"error","ts":"2020-11-21T12:18:10.773Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: ipamd: can not initialize with AWS SDK interface: refreshSGIDs: unable to update the ENI's SG: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-xxxxxxxxxxxx' does not exist\n\tstatus code: 400, request id: aaaaaa-bbbbbb-cccccc-ccccc-sssssss"}

From awsutils line 458, we understand that the ModifyNetworkInterfaceAttribute action is triggering the above error message.

From the CloudTrail API calls I was able to confirm that the ENI was created and deleted by the VPC CNI; however, it was not cleared from EC2 instance metadata. Since the metadataMACPath function uses the snippet below to make the metadata call that gathers the ENI ID, we are seeing the above issue.

eniID, err = cache.ec2Metadata.GetMetadata(metadataMACPath + eniMAC + metadataInterface)

I have confirmed by running the below curl call that the deleted ENI is still present on the metadata.

curl http://169.254.169.254/2020-10-27/meta-data/network/interfaces/macs/<mac-address>/interface-id

Summary

IPAMD should be able to handle an out-of-sync IMDS in cases like the above. Otherwise aws-node goes into CrashLoopBackOff, and pods scheduled onto the node get no IPs and are stuck in a creating state.

A similar issue from the past: #1177

Why is this needed:
The CNI should handle similar IMDS-related issues gracefully instead of leaving the node and all its pods stuck.

@falgofrancis

Is there any workaround for this issue? Currently we drain the node.

@jayanthvn
Contributor

jayanthvn commented Apr 9, 2021

@falgofrancis - #1341 will avoid the aws-node crash on boot-up if the metadata has stale data, and a counter has been added that keeps track of this. This is only part of the fix; we are tracking the race condition that causes IMDS to never sync with the EC2 team. The current workaround is to drain the node.

@jayanthvn
Contributor

Release 1.8 has the fix on the IPAMD side. Internally we are tracking the issue on the EC2 side.
