Huge network connections latencies after installing Calico in an EKS cluster #3530
This issue is also described here: aws/amazon-vpc-cni-k8s#493
There's a known issue with the kubelet not writing pod IP addresses back into the Kubernetes API quickly enough. I raised this upstream back in 2016: kubernetes/kubernetes#39113. The TL;DR is that the kubelet doesn't update the address quickly enough, so Calico is blocked from programming policy until that address is in the API. If you're using the Calico CNI plugin, there is a workaround that bypasses the kubelet for writing the IP, which makes sure that policy is programmed quickly. However, non-Calico CNI plugins (like the Amazon VPC plugin) don't have that logic, and so rely on the kubelet to program the IP into the API. I suspect that this is what you're encountering.
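One rough way to see this window for yourself (a sketch, assuming kubectl access to the cluster; `test-pod` is just a placeholder name) is to create a pod and poll the API until `.status.podIP` shows up:

```sh
# Hypothetical illustration: measure how long it takes for the kubelet to
# publish a new pod's IP into the Kubernetes API (the window during which
# Calico is blocked from programming policy).
kubectl run test-pod --image=busybox --restart=Never -- sleep 3600
while [ -z "$(kubectl get pod test-pod -o jsonpath='{.status.podIP}')" ]; do
  echo "$(date +%T) podIP not yet in the API"
  sleep 1
done
echo "$(date +%T) podIP published: $(kubectl get pod test-pod -o jsonpath='{.status.podIP}')"
```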
Thank you very much for your response. I guess there is nothing to be done if I want to keep using the VPC CNI plugin.
@caseydavenport Casey, above you mentioned "If you're using the Calico CNI plugin, there is a workaround that bypasses the kubelet for writing the IP which make sure that policy is programmed fast. However, non-Calico CNI plugins (like the Amazon VPC plugin) don't have that logic, and so rely on the kubelet to program the IP into the API." Could you please provide details on this workaround using the Calico CNI plugin? Thanks. |
@alaytonoracle the Calico CNI plugin writes the IP address back to the pod as an annotation, to close the race-condition window caused by the kubelet's update batching delay. Once the kubelet updates the pod status, the annotation is ignored from then on.
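If you want to check this on a pod, something like the following should work (a sketch; the annotation key `cni.projectcalico.org/podIP` is what recent Calico versions write, but treat it as an assumption and verify against your deployed version):

```sh
# Hypothetical check: compare the IP the Calico CNI plugin annotated onto the
# pod with the IP the kubelet eventually reports in the pod status.
POD=test-pod   # placeholder pod name
kubectl get pod "$POD" \
  -o jsonpath='{.metadata.annotations.cni\.projectcalico\.org/podIP}{"\n"}{.status.podIP}{"\n"}'
```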
@caseydavenport Thanks. But in the case of a Kubernetes Job, each job creates a new pod, so the delays occur every time under heavy volumes. I'm not clear on whether there is a workaround for this. If not, do you know if there's any ETA for a fix for the root issue? Thanks for the help.
After installing Calico in an EKS cluster following AWS's instructions, pods take more than 10 seconds to establish a new connection. Subsequent connections are generally much faster, but the first one usually takes more than 10 seconds.
No network policies have been created.
The issue disappears when Calico is removed and iptables is flushed. The latter step is important.
Expected Behavior
Connections with external and internal IP addresses should be established in less than 1 second.
Current Behavior
The first connections that a pod tries to establish take more than 10 seconds on average. Usually only the first connection takes this long and subsequent ones are much faster, although I've also seen timeouts appear after a container had been running for more than 20 minutes.
When running a pod that executes the following script:
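(The original script was not captured in this copy of the issue; a minimal sketch of the kind of connection-timing test described, with an illustrative endpoint and attempt count, could look like this:)

```sh
# Time a handful of consecutive outbound connections from inside the pod.
# time_total includes DNS resolution plus TCP/TLS connection setup.
for i in $(seq 1 5); do
  curl -s -o /dev/null -w "attempt $i: %{time_total}s\n" https://www.google.com
done
```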
Most pods fail, while only a few start without problems. This is an example of the type of errors that I see:
This is the output of a pod that started successfully. The first connection takes 7 seconds while the next ones are much faster:
Connection latencies and the number of timeouts seem to increase when starting multiple containers simultaneously.
Looking at the Calico logs, I can see that the problem is that containers are starting before Calico has written the iptables rules. For instance:
This is the output of the application, showing a delay of 15 seconds. In this case I started 10 containers in parallel (when starting only one container everything works fine and there are no delays).
I can find the following lines in the Calico logs:
So it seems that the problem is that Calico is taking 15 seconds to write the iptables rules.
Steps to Reproduce (for bugs)
Context
This issue makes many systems fail during container start-up with timeout errors. It has a huge impact on DNS resolution, as the default timeout is 5 seconds and most lookups fail.
Increasing the timeout in /etc/resolv.conf helps, but that's just a workaround.
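For reference, this is the kind of change I mean (values are illustrative, and in Kubernetes you would normally set this declaratively via the pod's dnsConfig rather than editing the file in place):

```sh
# Workaround sketch: raise the glibc resolver timeout and retry count inside
# the pod so the first DNS lookups survive the delay before Calico has
# programmed the iptables rules.
echo "options timeout:10 attempts:3" >> /etc/resolv.conf
```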
Your Environment
calico.log