-
Notifications
You must be signed in to change notification settings - Fork 748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid detaching ENIs on nodes being drained #1223
Comments
There are 2 parts to this - 1. There is an internal tracking ticket with Ec2 team to see if this can be handled. 2. IPAMD can check if the node is marked as unschedulable then prevent ENI deletion. But the cleaner approach is #1. |
Pending POC on the IPAMD changes. Also following up with ec2 team if this can be handled internally. |
Do we have an ETA on this fix please? This is causing us a lot of issues, and substantial coding to workaround it. |
Can this be tagged as a bug please, as this is more than a feature request? |
What is the CNI version you are using and are you having short lived clusters in the account? currently we have a background thread - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/awsutils/awsutils.go#L382 which cleans up leaked ENIs in the account and I agree this would need atleast one aws-node pod running in the account. |
Hello @jayanthvn. Thank you for your response and also for tagging this issue as a bug. CNI version is V1.7.5 as per the below kubectl describe.
Would the background thread you mention help us? How frequently does it run? Not sure what qualifies a cluster's life span as 'short', but this particular use case could possibly qualify as one. We have an EKS cluster but we manage the nodes ourselves. As part of our end-to-end test suite, we create nodes, performs tests and then delete them via Cloudformation. Here are the related CloudFormation stack events and timelines showing the delete error.
|
kOps is also having problems with ENIs being leaked by amazon-vpc upon cluster deletion. I suspect the window between creation of an ENI and attaching it with DeleteOnTermination set is also a factor. |
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
People have been complaining about this since 2019. Can we get this taken care of now please? |
@jayanthvn anything preventing #1927 from being merged? |
@bryantbiggs - It is pending code review. We are tracking it for 1.11.1 release. I will provide an ETA soon. |
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
/not stale |
Hi, any update here? |
#1927 mitigates the issue to certain extent. But we are actively working with EC2/VPC team on fixing the backend calls. |
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
/not stale |
@jayanthvn how do we get this marked as "not stale"? |
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
/not stale |
@Nuru - I had a similar use case to the one you described in #608. I also encountered the same error. When provisioning a managed node group with Terraform ( VersionsEKS: 1.24 My SolutionI ended up creating a Here is an example of the solution in action during a destroy operation using Terraform. Code/* This resource is used to ensure that any "dangling" ENIs are deleted before the destroy operation is performed on
* the security group assigned to the managed node group. This is often, but not always, necessary because managed node
* groups result in the creation of AWS-managed ENIs that cannot be evaluated through the use of Terraform state. The
* AWS-managed ENIs often detach during the destroy operation but are not always deleted. The root cause of this
* behavior is because the EC2 API requires the "detach" and "delete" invocations to be performed separately.
* See https://github.com/aws/amazon-vpc-cni-k8s/issues/608
*/
resource "null_resource" "dangling_eni_cleanup" {
provisioner "local-exec" {
command = "./delete-detached-enis.sh"
when = destroy
working_dir = path.module
}
depends_on = [
aws_security_group.this
]
} #!/bin/sh
# This script was created to provide a solution to "dangling" ENIs that occur during the deletion of a managed EKS node
# group. https://github.com/aws/amazon-vpc-cni-k8s/issues/608#issuecomment-694654359
set -e
query_dangling_enis() {
aws ec2 describe-network-interfaces \
--query 'NetworkInterfaces[?Status==`available`&&contains(Description, `aws-K8S`) == `true`].NetworkInterfaceId'
}
delete_eni() {
echo "Deleting detached ENI: $1"
aws ec2 delete-network-interface --network-interface-id $1
echo "The interface was deleted."
}
echo "Initiating dangling ENI cleanup."
query_dangling_enis | jq -rc '.[]' | while read eni; do
delete_eni $eni
done
echo "All dangling ENIs have been removed."
|
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
/not stale |
|
What would you like to be added:
We should prevent ipamd from trying to free ENIs when a node is about to terminated.
For spot instances we could do something similar to the aws-node-termination-handler and check some metadata endpoints.
For the case where a node is cordoned off before being terminated, meaning it is marked as "unschedulable", we should be able to check this node taints before trying to attach or detach any ENIs.
Why is this needed:
Since there is no EC2 API call to directly "delete" an ENI that is attached, instead they first have to be detached, which takes a few seconds, then deleted. If the instance gets terminated after the ENI has been detached, but before it has been deleted, it will be leaked. This leaked ENI can prevent Security Groups and VPCs from being deleted and require manual clean up.
Related issues: #608 #69, #690
The text was updated successfully, but these errors were encountered: