Existing pod network not cleanedup when using security groups with stateful sets #1374
Comments
I'm trying to repro the issue but so far no luck. I will retry a few more times. Can you let me know the name of the affected pod in the attached logs? (I couldn't find the cni-test pod in the logs.) I can then dig through them. If it's happening consistently on your cluster, we can schedule a call to dig further into this issue (you can reach me at srajakum@amazon.com).
We should probably add the unique ID to the annotation as well and have ipamd return the ENI details based on that unique ID. This ensures that even when kubelet invokes delete for the old pod after its network has already been removed, we won't delete the new pod's network (this affects both AddNetwork and DelNetwork). I also created an issue for this enhancement: aws/amazon-vpc-resource-controller-k8s#19.
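To illustrate the idea of guarding deletes with a unique ID, here is a minimal sketch. It is not the plugin's or ipamd's actual code: the `store`, `networkRecord`, and `podKey` types, the `add` and `del` methods, and the sandbox ID strings are all hypothetical names chosen for this example.

```go
package main

import "fmt"

// podKey identifies a pod by namespace and name. A StatefulSet pod that is
// re-created keeps the same key, so the key alone cannot tell old from new.
type podKey struct {
	Namespace string
	Name      string
}

// networkRecord is a hypothetical bookkeeping entry: the unique ID of the
// sandbox that owns the network, plus the VLAN ID assigned to it.
type networkRecord struct {
	UniqueID string
	VlanID   int
}

// store maps a pod to the network currently set up for it.
type store map[podKey]networkRecord

// add records the network for a sandbox, replacing any older entry.
func (s store) add(k podKey, uniqueID string, vlanID int) {
	s[k] = networkRecord{UniqueID: uniqueID, VlanID: vlanID}
}

// del removes the entry only when the caller's unique ID matches the stored
// one. It reports whether anything was removed.
func (s store) del(k podKey, uniqueID string) bool {
	rec, ok := s[k]
	if !ok || rec.UniqueID != uniqueID {
		// Stale delete for an older sandbox: leave the new network alone.
		return false
	}
	delete(s, k)
	return true
}

func main() {
	s := store{}
	k := podKey{Namespace: "default", Name: "web-0"}
	s.add(k, "sandbox-old", 1)
	s.add(k, "sandbox-new", 2) // pod re-created on the same node

	fmt.Println(s.del(k, "sandbox-old")) // false: late delete for the old pod is ignored
	fmt.Println(s.del(k, "sandbox-new")) // true: delete for the current pod succeeds
}
```

With only the pod name as the key, the late delete for the old sandbox would have removed the network of the re-created pod; comparing the unique ID turns it into a no-op.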
The pod was called It seems that updating the CNI to
Thanks for the info @hintofbasil. May be this pod -
Yes. That would be the one. Should have written it down earlier. Thanks Sri
@hintofbasil can you ensure you have terminationGracePeriodSeconds set in your YAML? For pods using security groups, we describe the pod during deletion. If terminationGracePeriodSeconds is not set, the pod's data is removed from the Kubernetes datastore (etcd), and the CNI plugin is left with dangling records in ip rule, which affects the pod network.
@sri, we do not. We only set
@hintofbasil sorry I meant
@hintofbasil I have created a PR to clean up the network even if pods are force deleted by the controllers. This will help with the network issues you noticed with new pods.
Hi Sri, it seems we were a bit early to announce that 1.7.8 fixed the issue. Unfortunately, we are still seeing it. I've even installed a version built from master (99ecb4c) and attached further logs from that version. This time the failing pod is
@hintofbasil our next release, which will be out this week, will clean up the dangling rules that were blocking pod traffic. These occur when the pod is deleted from the K8s datastore (etcd) before the CNI is able to read the pod information (its annotation) during deletion. The fix I mentioned above (which is merged to master and the 1.7 branch) will take care of cleaning up all dangling rules. The race itself will be prevented once we have kubernetes/kubernetes#69882, and kubernetes/kubernetes#88543 might also help avoid it to some degree. Regarding prometheus-prometheus-operator-prometheus-0, I see that the pod network is set up properly. Can you send your cluster ARN to srajakum@amazon.com so I can investigate further?
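As a rough illustration of what cleaning up dangling rules could look like, the sketch below filters a list of policy routing rules down to those whose interface no longer belongs to a live pod. It is an assumption-laden sketch, not the merged fix: the `ipRule` type, the `danglingRules` function, and the interface names and table numbers are all made up for this example.

```go
package main

import "fmt"

// ipRule is a simplified, hypothetical view of a policy routing rule: the
// interface it matches on and the routing table it points to.
type ipRule struct {
	IIF   string
	Table int
}

// danglingRules returns the rules whose interface is not in liveIfaces,
// i.e. rules left behind by pods that no longer exist.
func danglingRules(rules []ipRule, liveIfaces map[string]bool) []ipRule {
	var out []ipRule
	for _, r := range rules {
		if !liveIfaces[r.IIF] {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// Hypothetical state: one rule was left behind by a deleted pod,
	// one belongs to a pod that is still running.
	rules := []ipRule{
		{IIF: "vlan-old", Table: 101},
		{IIF: "vlan-live", Table: 102},
	}
	live := map[string]bool{"vlan-live": true}

	for _, r := range danglingRules(rules, live) {
		fmt.Printf("would remove rule: iif %s lookup %d\n", r.IIF, r.Table)
	}
}
```

The point of the sketch is that the cleanup works from what is present on the node, so it does not depend on the pod object still existing in etcd.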
Local store support for pods using security groups is tracked in #1313. It will avoid invoking the API server on the deletion path by instead reading the associated VLAN from a local file.
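A minimal sketch of the local-store idea follows, assuming a small JSON file that maps a container ID to its VLAN ID. The `vlanStore` type, its `put` and `take` methods, the `demo` helper, and the file name are hypothetical; the design in #1313 may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// vlanStore persists containerID -> VLAN ID in a JSON file so the delete
// path can look the VLAN up locally. The file layout is an assumption.
type vlanStore struct {
	path string
}

// load reads the whole mapping; a missing file is treated as an empty store.
func (s vlanStore) load() (map[string]int, error) {
	data, err := os.ReadFile(s.path)
	if errors.Is(err, os.ErrNotExist) {
		return map[string]int{}, nil
	}
	if err != nil {
		return nil, err
	}
	m := map[string]int{}
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, err
	}
	return m, nil
}

// save writes to a temporary file and renames it into place, so a reader
// never observes a partially written file.
func (s vlanStore) save(m map[string]int) error {
	data, err := json.Marshal(m)
	if err != nil {
		return err
	}
	tmp := s.path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, s.path)
}

// put records the VLAN at network setup time.
func (s vlanStore) put(containerID string, vlanID int) error {
	m, err := s.load()
	if err != nil {
		return err
	}
	m[containerID] = vlanID
	return s.save(m)
}

// take returns the VLAN recorded for a container and removes the entry.
// The boolean is false when the container is unknown.
func (s vlanStore) take(containerID string) (int, bool, error) {
	m, err := s.load()
	if err != nil {
		return 0, false, err
	}
	vlanID, ok := m[containerID]
	if !ok {
		return 0, false, nil
	}
	delete(m, containerID)
	return vlanID, true, s.save(m)
}

// demo exercises the store in a throwaway directory.
func demo() (int, bool, error) {
	dir, err := os.MkdirTemp("", "vlan-store")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(dir)

	s := vlanStore{path: filepath.Join(dir, "vlans.json")}
	if err := s.put("container-a", 3); err != nil {
		return 0, false, err
	}
	return s.take("container-a")
}

func main() {
	vlanID, ok, err := demo()
	fmt.Println(vlanID, ok, err)
}
```

Because the lookup on delete touches only the local file, it still succeeds after the pod object has been removed from etcd. The sketch does no locking, so concurrent add and delete calls would need additional coordination.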
Closing this, as we are tracking the issue in #1313, and our documentation at https://docs.aws.amazon.com/eks/latest/userguide/sec-group-reqs.html has been updated to include terminationGracePeriodSeconds in the pod spec to avoid deleting the pod objects from etcd before the network is cleaned up.
What happened:
When using security groups with stateful sets we noticed that pods often lost connectivity when restarted.
The security group they were bound to allowed all connections inbound and outbound on 0.0.0.0/0.
After some investigation we discovered the bug seems to affect pods re-created on the same node with the same name.
Attach logs
eks_i-08aff468a2d6ce527_2021-02-05_1421-UTC_0.6.2.tar.gz
What you expected to happen:
The pod should launch normally
How to reproduce it (as minimally and precisely as possible):
1. Create a securityGroupPolicy.
2. Create a pod which uses the security group policy.
3. Kill the pod, then recreate it, ensuring it is scheduled to the same node.
4. Attempt to make an outbound connection from the pod.
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version):
OS (e.g. cat /etc/os-release):
Kernel (e.g. uname -a):