Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully handle failed ENI SG update #1341

Merged
merged 3 commits into from
May 19, 2021
Merged

Conversation

jayanthvn
Copy link
Contributor

@jayanthvn jayanthvn commented Dec 22, 2020

What type of PR is this?
Bug

Which issue does this PR fix:
#1340

What does this PR do / Why do we need it:
In refreshSGIDs, if IMDS contains a stale ENI or if ENI SG update fails we used to return an error. Problem is here - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/awsutils/awsutils.go#L440. We have a go routine but also we make this call (L440). If this call returns error, then awsutils.new return error (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L294) and eventually ipamd.New return error causing init failure (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/cmd/aws-k8s-agent/main.go#L56 ). Hence aws-node keeps restarting. We don't need to make a separate call and also the error need not be returned since in the next 30 seconds retry will be done. Once IMDS data is synced from Ec2, stale ENIs will be removed so IPAMD will be gracefully handling this. Also returning error will fail updating all the ENIs since if one of the ENIs in the list is stale then following ENIs wont be updated. This PR will fix even that issue.

In nodeInit, if the attachedENIs have stale ENI then describeAllENIs will fail for that ENI, so added a counter to track the number of times IMDS is out of sync. Also this ENI won't be added to datastore.

Reconciler, will again get AttachedENIs, this time if there is a stale ENI, This condition [https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L963] will be true but describeAllENIs will again prevent adding the ENI to the metadataResult [https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L969 ] and [https://github.com/aws/amazon-vpc-cni-k8s/blob/f96e713debda4cf21dd9ea5f0a122031a9e82b91/pkg/awsutils/awsutils.go#L1011 ] but we need to account the error.

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
#1340

Testing done on this change:

{"level":"debug","ts":"2020-12-22T21:14:22.792Z","caller":"awsutils/awsutils.go:387","msg":"Update ENI eni-0bbfc9102ff0d750e"}
{"level":"debug","ts":"2020-12-22T21:14:23.184Z","caller":"awsutils/awsutils.go:387","msg":"Update ENI eni-0e31591083f43cd00"}
{"level":"debug","ts":"2020-12-22T21:14:23.335Z","caller":"awsutils/awsutils.go:387","msg":"refreshSGIDs: unable to update the ENI eni-0e31591083f43cd00 SG - InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-0e31591083f43cd00' does not exist\n\tstatus code: 400, request id: 0b4366ac-6e58-4253-908f-7d805a4255c7"}
{"level":"debug","ts":"2020-12-22T21:14:23.335Z","caller":"awsutils/awsutils.go:387","msg":"Update ENI eni-0e31591083f43cd89"}

Automation added to e2e:

No

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No

Does this change require updates to the CNI daemonset config files to work?:

No

Does this PR introduce any user-facing change?:

No


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copy link
Contributor

@nithu0115 nithu0115 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm 👍

@jinja2
Copy link

jinja2 commented Mar 22, 2021

Hi! Any updates on this PR? We have had IMDS' ENI data go out of sync and cni pods crashing.

@jayanthvn
Copy link
Contributor Author

Hi @jinja2

I will update the PR with @fawadkhaliq comments. Will try to push this in this week.

Thanks!

@jayanthvn
Copy link
Contributor Author

jayanthvn commented May 3, 2021

@achevuru - If the ENI is stale then we won't add the ENI to ENIPool [https://github.com/aws/amazon-vpc-cni-k8s/blob/9db2ae62ecd0cb56f7fc20b80427fa6ff4e17a42/pkg/awsutils/awsutils.go#L1007] so secondary IPs won't be attached in the create path.

Reconciler, will again get AttachedENIs, this time if there is a stale ENI, This condition [https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L963] will be true but describeAllENIs will again prevent adding the ENI to the metadataResult [https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L969 ] and [https://github.com/aws/amazon-vpc-cni-k8s/blob/f96e713debda4cf21dd9ea5f0a122031a9e82b91/pkg/awsutils/awsutils.go#L1011 ] but we need to account the error.

@jayanthvn jayanthvn requested a review from achevuru May 4, 2021 23:15
Copy link

@fawadkhaliq fawadkhaliq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

htoo97 pushed a commit to htoo97/amazon-vpc-cni-k8s that referenced this pull request Jul 4, 2021
* gracefully handle stale ENI's SG updates

* Added counter for IMDS out of sync

Fixed merge conflict

* Added the counter for IMDS out of sync
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants