-
Notifications
You must be signed in to change notification settings - Fork 748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gracefully handle failed ENI SG update #1341
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm 👍
Hi! Any updates on this PR? We have had IMDS' ENI data go out of sync and cni pods crashing. |
Hi @jinja2 I will update the PR with @fawadkhaliq comments. Will try to push this in this week. Thanks! |
Fixed merge conflict
@achevuru - If the ENI is stale then we won't add the ENI to ENIPool [https://github.com/aws/amazon-vpc-cni-k8s/blob/9db2ae62ecd0cb56f7fc20b80427fa6ff4e17a42/pkg/awsutils/awsutils.go#L1007] so secondary IPs won't be attached in the create path. Reconciler, will again get AttachedENIs, this time if there is a stale ENI, This condition [https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L963] will be true but describeAllENIs will again prevent adding the ENI to the metadataResult [https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L969 ] and [https://github.com/aws/amazon-vpc-cni-k8s/blob/f96e713debda4cf21dd9ea5f0a122031a9e82b91/pkg/awsutils/awsutils.go#L1011 ] but we need to account the error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
* gracefully handle stale ENI's SG updates * Added counter for IMDS out of sync Fixed merge conflict * Added the counter for IMDS out of sync
What type of PR is this?
Bug
Which issue does this PR fix:
#1340
What does this PR do / Why do we need it:
In refreshSGIDs, if IMDS contains a stale ENI or if ENI SG update fails we used to return an error. Problem is here - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/awsutils/awsutils.go#L440. We have a go routine but also we make this call (L440). If this call returns error, then awsutils.new return error (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L294) and eventually ipamd.New return error causing init failure (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/cmd/aws-k8s-agent/main.go#L56 ). Hence aws-node keeps restarting. We don't need to make a separate call and also the error need not be returned since in the next 30 seconds retry will be done. Once IMDS data is synced from Ec2, stale ENIs will be removed so IPAMD will be gracefully handling this. Also returning error will fail updating all the ENIs since if one of the ENIs in the list is stale then following ENIs wont be updated. This PR will fix even that issue.
In nodeInit, if the attachedENIs have stale ENI then describeAllENIs will fail for that ENI, so added a counter to track the number of times IMDS is out of sync. Also this ENI won't be added to datastore.
Reconciler, will again get AttachedENIs, this time if there is a stale ENI, This condition [https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L963] will be true but describeAllENIs will again prevent adding the ENI to the metadataResult [https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L969 ] and [https://github.com/aws/amazon-vpc-cni-k8s/blob/f96e713debda4cf21dd9ea5f0a122031a9e82b91/pkg/awsutils/awsutils.go#L1011 ] but we need to account the error.
If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
#1340
Testing done on this change:
Automation added to e2e:
No
Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No
Does this change require updates to the CNI daemonset config files to work?:
No
Does this PR introduce any user-facing change?:
No
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.