Leaking Network Interfaces (ENI) #69
Comments
Further to this, when you remove the cni-plugin from the cluster using …
What is responsible for releasing the ENIs? Is there some sort of shutdown procedure that needs to happen on the Nodes in order for it to release the ENIs properly?
@jonmoter Today, the ipamD daemonset on each node is responsible for releasing an ENI when there are too many free IPs in the IP warm pool. When the node is terminated (e.g. scaled down by an auto scaling group), the EC2 control plane will release the ENI and its IPs automatically. In summary, there should NOT be any leaked ENIs, unless an unknown …
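For reference, the size of the warm pool that ipamD maintains can be tuned via the WARM_ENI_TARGET and WARM_IP_TARGET environment variables on the aws-node daemonset. A minimal sketch of adjusting it (the value shown is just an example):

```bash
# Keep one spare ENI's worth of warm IPs; ipamD frees ENIs above this target.
kubectl set env daemonset aws-node -n kube-system WARM_ENI_TARGET=1
```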
Okay, thanks for the clarification. In the past I've had issues with Kubernetes nodes not cleaning up properly on termination, like running … That's why I was wondering if ipamD was responsible for releasing the ENIs on shutdown, or if something like the EC2 control plane should handle that. Thanks!
@jonmoter In case an ENI is leaked in the situation mentioned above (the node gets killed before the ENI is attached to the node but after the ENI is allocated by ipamD), you should be able to manually delete this ENI. Each ENI has a description set as aws-K8S-<instance id> (see the Problem section below).
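A sketch of how you might list candidate leaked ENIs with the AWS CLI, assuming the aws-K8S-* description format described in the Problem section below:

```bash
# List ENIs created by ipamD (description aws-K8S-<instance id>) that are no
# longer attached to anything; these are the candidates for manual deletion.
aws ec2 describe-network-interfaces \
  --filters "Name=description,Values=aws-K8S-*" "Name=status,Values=available" \
  --query 'NetworkInterfaces[].{Id:NetworkInterfaceId,Desc:Description}' \
  --output table
```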
@liwenwu-amazon When is a fix for this issue planned?
Given that we need to create the ENI, attach it to the instance, and then set the termination policy, the current design will always have a gap: the plugin is either creating (and has not yet set the termination policy) or deleting (has removed the termination policy, and possibly detached the ENI) when the process/instance is killed, causing the resource to be leaked. There has been some talk of creating a centralized container to allocate ENIs and IPs and assign them outside of the local daemon. This would allow for cleanup after termination. There is some more design that needs to happen, but it is something I think is worth the investment. I don't have a timeline for it at this point, though.
What is the status of this? This issue is completely blocking us from moving off of our KOPS cluster, which has almost NO problems.
Can confirm I am running into this issue as well; would love to hear from the AWS team if there are any updates.
There is another closed bug that also seems to be related (I can confirm if required). Right now my cluster is working normally, but I will check the next time this happens and update with confirmation.
@liwenwu-amazon Is there progress on this issue?
I confirm running into this issue too.
We are running into the same issue.
So far it seems that the main reason for this issue is doing forced detaches of ENIs. v1.5.0-rc has #458, which does retries instead of forcing.
Resolving, since v1.5.0 is released.
I'm still hitting this issue of leaked ENIs using v1.5.0 on EKS v1.13 when decommissioning (…).
It does seem like the shutdown & cleanup on aws-cni is still having issues cleaning up ENIs when workers are terminated. Update: this seems to primarily happen on ASG nodes in EKS clusters being used for Pods of a LoadBalancer-typed Service. The leaked ENIs are always associated with an instance from this ASG. FWIW, on teardown the LoadBalancer Service & Deployment is deleted before the ASG is removed.
Still having this issue on EKS v1.12 with CNI 1.5.0. Not deleting any nodes, just redeploying a deployment with a large number of pods (~600). Every time I deploy, fewer pods become available because there are no free IPs.
Update: we were able to get past this issue of leaked ENIs by updating to … The specific fixes in …
@metral thanks. Is the following enough for that update? …
@imriss Yes, but I also uncommented the …
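For anyone else following along, the update itself usually amounts to applying the release manifest for the target version; a sketch, with the version and manifest path left as placeholders to fill in from the amazon-vpc-cni-k8s release notes:

```bash
# <version> and <config-version> are placeholders -- take them from the
# release notes of the CNI version you are upgrading to.
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/<version>/config/<config-version>/aws-k8s-cni.yaml

# Confirm the daemonset is running the new image.
kubectl describe daemonset aws-node -n kube-system | grep Image
```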
For context, these fixes worked in our specific test use case because on teardowns we intentionally wait on the: …
We then sleep for a few minutes for good measure before tearing down the cluster. This gives the AWS resources and …
I continue to see this problem after 1.5.3.
@robin-engineml We have identified an issue where ENIs might get leaked: specifically, when a node is drained and then terminated, there is a window after the call to … We have seen a few cases where the first attempt to delete the ENI after the detach fails because the ENI is not yet detached, but before the first retry, the pod or node gets terminated. Then the ENI will be left in a detached state. We are tracking that issue in #608.
@vipulsabhaya Any updates? This is causing IP exhaustion between our two /19 subnets (16k IPs)...
We have a quick fix that we have been using for a few months: you need a Docker image with the AWS CLI and a Kubernetes CronJob that runs every 5 minutes.
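Roughly, the cleanup job boils down to a script like the following, run on a schedule from a CronJob with AWS credentials. This is a sketch of the workaround described above, not the exact script, and it assumes the aws-K8S-* description format from the Problem section:

```bash
#!/bin/sh
# Find ENIs allocated by ipamD (description aws-K8S-*) that are detached
# ("available") and delete them. Intended to run every 5 minutes.
for eni in $(aws ec2 describe-network-interfaces \
    --filters "Name=description,Values=aws-K8S-*" "Name=status,Values=available" \
    --query 'NetworkInterfaces[].NetworkInterfaceId' --output text); do
  echo "Deleting leaked ENI: $eni"
  # Ignore failures: the ENI may have been attached or deleted in the meantime.
  aws ec2 delete-network-interface --network-interface-id "$eni" || true
done
```

Note the caveat in the Workaround section at the bottom of this issue: an ENI can briefly show as available while ipamD is in the middle of creating and attaching it, so a job like this can race with normal operation.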
I think I'm seeing the latter on node termination, but it doesn't always happen on every node in a node group. When I look at the ENIs after node termination, I see that they've had the termination policy removed. What if you don't remove the termination policy on deletion? Would ipamd still be able to delete ENIs? It would seem to me that this would allow EC2 to clean up any ENIs that ipamd missed.
@jlforester You are correct that the ENIs will be cleaned up automatically when the EC2 instance terminates, as long as they are attached. If the CNI detaches the ENI, but gets terminated before it can delete the ENI, we leak it. We don't explicitly remove the policy; I think that happens when you detach it from the instance.
Just spitballing here, then... What if on instance termination you DON'T try to clean it up and let EC2 do it, like with the eth0 ENI? Would that cause any problems like IP leakage? They're already set to delete on termination. Or, what if we use an ASG lifecycle hook to give ipamD more time to clean up its ENIs? Just adding the lifecycle hook to … isn't enough; you'd have to watch for a notification from the ASG that a scale-in event is occurring for that instance.
@jlforester Well, in #645 I added a shutdown listener to do just that, but there will always be the case where the number of pods on the node goes down, we have too many IPs available, and we want to free an ENI. If the node gets killed just after ipamd detaches that ENI, we will lose it. There is also a background cleanup loop in the v1.6 branch, added in #624.
@mogren I just ran a couple of cluster standup/teardown tests using terraform, and it looks like 1.6.0rc4 resolves the leaking ENI issue for me.
I perform about 50 standup/teardown tests per week. This problem still occurs on 1.6.0rc4, but less frequently than previously, in my experience.
Been using 1.6.0 in our test cluster since it came out. I disabled the manual eni-cleanup pod we wrote as a workaround. I now have ~200 leaked (available) ENIs while the cluster is only actively using (in-use) ~30. We average about 8-10 nodes but spin up/down maybe 100 per week. I made sure to clear out the leaked/available ENIs when I upgraded to 1.6. TL;DR: not resolved in 1.6.
Just adding our experience in here too - we've been running 1.6.0 for around a week and can also confirm we're still seeing it happen. I'm not totally sure of the difference, but each of our nodes gets assigned two ENIs - one with no description and one with …
In our case I have just noticed that it is our primary interfaces that are leaking as they are missing the "delete on termination" option 🤦 |
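To check whether an attached ENI will be deleted by EC2 on instance termination, and to flip the flag if not, something like this should work (the eni-... and eni-attach-... IDs are placeholders):

```bash
# Inspect the attachment, including its DeleteOnTermination flag.
aws ec2 describe-network-interface-attribute \
  --network-interface-id eni-0123456789abcdef0 --attribute attachment

# Set DeleteOnTermination=true so EC2 removes the ENI with the instance.
aws ec2 modify-network-interface-attribute \
  --network-interface-id eni-0123456789abcdef0 \
  --attachment AttachmentId=eni-attach-0123456789abcdef0,DeleteOnTermination=true
```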
We encountered this issue as well. It gets worse if you end up running out of IP space in your subnets, like we did. It ended up causing an outage for us during a deploy, as new nodes/pods could not spin up to replace the old ones.
After setting …
@mogggggg The non-tagged one is the default ENI that gets created with the EC2 instance. Could your issue be similar to the problem in the previous comment? That …
@mogren Wow, I totally missed that - looks like we were missing …
I added a follow-up ticket to log a warning for the case where … I think it's finally time to close this ticket. If issues with leaking ENIs come up again, please open a new ticket with the specific details for that case.
Problem
We noticed in some deployments that even though an instance has been terminated, the ENIs allocated by ipamD are NOT released back to EC2. In addition, the secondary IP addresses allocated on these ENIs are also NOT released back to EC2.
When there are too many of these leaked ENIs and secondary IP addresses, the subnet's available IP pool can be depleted, and nodes in the cluster will fail to allocate secondary IP addresses. When this happens, Pods may not be able to get an IP and can get stuck in ContainerCreating.
You can verify if you are running into this issue in the console: look for ENIs in the available state whose description names an instance that has already been terminated. For example, for the description aws-K8S-i-02cf6e80932099598, the instance i-02cf6e80932099598 has already been terminated.
Workaround
Manually delete these ENIs after confirming the instance has already been terminated.
Be careful: when ipamD is in the middle of creating/attaching an ENI, that ENI will also show up as available, but its instance-id should refer to a valid (still-running) instance.
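A sketch of the manual check-then-delete for a single ENI, using the example IDs from above (replace with your own):

```bash
# Confirm the instance named in the ENI description is really terminated.
aws ec2 describe-instances --instance-ids i-02cf6e80932099598 \
  --query 'Reservations[].Instances[].State.Name' --output text

# If it reports "terminated", the available ENI is safe to delete.
aws ec2 delete-network-interface --network-interface-id <eni-id>
```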