Pod startup connectivity issue when using calico and vpc-cni #493
Comments
Have the same problem. It feels like this is caused by the asynchronous nature of Calico policies. Here's my theory: Calico policies kick in after all the CNI plugins finish running. Policy programming then races with pod startup, and pods that start network interactions immediately almost always see a delay, because Calico has not injected all the rules yet. The best scenario would be to somehow make the AWS VPC CNI plugin wait for Calico policies, but unfortunately it looks like that won't work. I wrote a PoC CNI plugin today that just sleeps, to check whether Calico Felix "sees" the pod's IP and starts injecting the rules before the CNI finishes execution. It looks like it doesn't. The way I worked around it for now is by creating an initContainer that loops on a DNS query for kubernetes.default. As soon as the query succeeds, the init container exits. This guarantees (sort of) that all the "normal" pods will have networking set up by the time they start.
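For anyone who wants to try the same workaround, here is a minimal sketch of the wait loop an init container could run. It is not the code from the comment above: the function name `wait_for_dns`, the retry interval, and the overall timeout are illustrative assumptions, and any language or a plain shell loop would do the same job.

```python
import socket
import time


def wait_for_dns(hostname, resolve=socket.gethostbyname, interval=0.5,
                 timeout=60.0, sleep=time.sleep, clock=time.monotonic):
    """Block until `hostname` resolves; return the number of attempts made.

    Raises TimeoutError if resolution keeps failing for `timeout` seconds.
    `resolve`, `sleep` and `clock` are injectable so the loop can be tested
    without touching the network (hypothetical parameters, for illustration).
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            resolve(hostname)
            return attempts
        except OSError:
            # socket.gaierror (DNS failure) is a subclass of OSError.
            if clock() >= deadline:
                raise TimeoutError(
                    f"{hostname} did not resolve within {timeout}s")
            sleep(interval)
```

Calling `wait_for_dns("kubernetes.default")` as the init container's entrypoint returns once the name resolves, so the main containers only start after DNS traffic from the pod is getting through. Like the original workaround, this only shows that one path works; it does not prove that every policy rule has been programmed.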
It could be a problem with the asynchronous way Calico applies its policies, but, strangely enough, running a similar setup on GKE didn't show this issue. Their network plugin certainly works in a different way. I will test version 1.5.2, as it has an IPAM improvement: Improvement - Reduce the wait time when checking for pods without IPs (#552, @mogren)
Hi @mlsmaycon did you ever get a chance to test later versions of the plugin and see if this issue went away?
Closing because of no updates.
@mogren Fairly certain we are running into this on: EKS Platform version: eks.7. Calico install was from the official AWS instructions pointing to https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.5/config/v1.5/calico.yaml Very consistently we see a connectivity issue when the first pod launches on a newly scaled up node.
@gmatev Thanks for the update, this is still something we need to reproduce then.
Facing a similar issue to the one mentioned above by @gmatev: consistently seeing a connectivity issue when a pod launches on a newly scaled up node.
I opened a similar issue with Calico. They replied that this is caused by the kubelet being very slow to update the pod IP addresses in the Kubernetes API. The Calico CNI driver has implemented a workaround that updates the IP address directly instead of relying on the kubelet. Would it be possible for the VPC CNI driver to implement the same workaround? Here is the Calico issue: projectcalico/calico#3530
I am sure it is possible, and it might be worth taking a look at how much work this would be. We just have to be sure that it can be done without any impact on clusters not using Calico.
Hi, as @caseydavenport mentioned in projectcalico/calico#3530, the AWS CNI is not aware of when Calico is done setting up the policy, unless we get an endpoint to query that reports the init status. Thank you!
Kubernetes cluster
What are the issues
While running Calico for network policy, a pod that starts up is unable to connect to another IP for a period that varies from 300ms up to 1.2s.
This doesn't happen when you run a pod on a node where Calico's Felix daemon wasn't provisioned.
How we reproduce the issue:
Run a pod through a cronjob that does some network test, like a ping to its own Kubernetes node. This is the example we used:
It tries to ping the host IP until it works. I added a timeout of 100ms in order to get a better measurement of how long it couldn't connect to the host IP.
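The original cronjob manifest is not reproduced here. As a rough illustration of the same measurement, the sketch below retries a probe with a 100ms timeout until it succeeds and reports how long that took. It swaps ICMP ping for a TCP connect, since raw ICMP sockets need extra privileges; `tcp_probe`, `time_until_reachable`, and their parameters are hypothetical names chosen for this example.

```python
import socket
import time


def tcp_probe(host, port, timeout=0.1):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def time_until_reachable(probe, max_wait=10.0, pause=0.01,
                         clock=time.monotonic, sleep=time.sleep):
    """Call `probe()` repeatedly until it returns True.

    Returns (elapsed_seconds, failed_attempts), or (None, failed_attempts)
    if `max_wait` seconds pass without a successful probe.
    """
    start = clock()
    failures = 0
    while clock() - start < max_wait:
        if probe():
            return clock() - start, failures
        failures += 1
        # Short pause so an immediate failure (e.g. connection refused)
        # does not turn the loop into a busy spin.
        sleep(pause)
    return None, failures
```

Inside a pod this could be run as `time_until_reachable(lambda: tcp_probe(host_ip, port))`, where `host_ip` is the node address (for example injected through the downward API field `status.hostIP`) and `port` is any port known to be listening on the node; both are assumptions that depend on the cluster.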
Result
It failed for 1.2s before it could reach its own host.
Logs
The block above repeated one more time in less than 300ms before I got the following:
It is not clear to me whether the issue is caused by Typha not being able to verify that the pod got an IP address, maybe because of a check interval, or by ipamd not making this information available in the datastore fast enough for Calico to do its thing.