
Duplicate IP Addresses for pods w/o host network #711

Closed
uruddarraju opened this issue Nov 8, 2019 · 12 comments

@uruddarraju
Contributor

uruddarraju commented Nov 8, 2019

We are running a Kubernetes 1.12.6 cluster provisioned with kops.

Networking setup: aws-k8s-cni with calico for network policies
Using: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.2

We saw networking issues for some pods created in our integration testing cluster. This cluster experiences a lot of churn: we consistently spin up hundreds of pods and delete them periodically. Upon digging into the affected pods a little more, we found the following:

admin@ip-10-20-89-225:~$ kubectl get pods --all-namespaces -o wide | grep Running | awk '{print $7}' |  sort | uniq -c | grep "   2"
      2 10.20.106.106
      2 10.20.126.85
      2 10.20.33.133
      2 10.20.35.126
....<10s of more entries>

Checking whether these pods are using the host network:

admin@ip-10-20-89-225:~$ kubectl get pods --all-namespaces -o wide | grep 10.20.72.196
uday                       test-data-ui-6644c8f79-k8g5v                                               0/1     Running             1          9d      10.20.72.196    ip-10-20-75-71.us-west-2.compute.internal     <none>
marco                       mock-server                                                              1/1     Running             0          25m     10.20.72.196    ip-10-20-75-71.us-west-2.compute.internal     <none>

You can see that both of those pods are using the IP 10.20.72.196.
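
As a programmatic alternative to the kubectl pipeline above, here is a minimal client-go sketch (assuming a recent client-go with context-aware List calls and a kubeconfig in the default location) that groups running, non-hostNetwork pods by pod IP and prints any address claimed by more than one pod. The program is illustrative only and is not part of the CNI.

```go
// dupips.go: report non-hostNetwork pods that share the same pod IP.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location ($HOME/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Group non-hostNetwork pods that have an IP by that IP.
	byIP := map[string][]string{}
	for _, p := range pods.Items {
		if p.Spec.HostNetwork || p.Status.PodIP == "" {
			continue
		}
		byIP[p.Status.PodIP] = append(byIP[p.Status.PodIP], p.Namespace+"/"+p.Name)
	}

	// Any IP backed by more than one pod is a duplicate assignment.
	for ip, names := range byIP {
		if len(names) > 1 {
			fmt.Printf("%s is used by %d pods: %v\n", ip, len(names), names)
		}
	}
}
```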

We cleaned up the pods experiencing this duplicate IP issue a few times, but that does not seem to solve the problem (which makes sense, as we still don't know the root cause).

One common pattern we observe in all of the cases above: at least one of the pods has a restart count > 0, or the pods are in CrashLoopBackOff/Error/Completed states.

Ideally, the CNI should only reclaim IPs from non-running pods that have a restartPolicy of Never. I am not sure if that is the behavior today.

@mogren
Contributor

mogren commented Nov 8, 2019

Hi @uruddarraju, the CNI will not reuse IPs directly, but after a one-minute cooldown they will be available to be assigned to pods again.

Do you use the default configuration? What kind of nodes are you running on, and how many pods per node? If you have a lot of churn, it helps to pre-allocate the IPs.

Also, how come you use v1.5.2? I'd recommend upgrading to v1.5.3.
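
To illustrate the cooldown behavior described above, here is a hypothetical Go sketch of a free-IP pool that only hands an address back out once a cooldown has elapsed since it was released. The type and function names are invented for illustration and do not mirror the actual ipamd datastore.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// cooldownPool is a hypothetical free-IP pool: a released address only
// becomes eligible for reassignment after the cooldown period has elapsed.
type cooldownPool struct {
	cooldown time.Duration
	released map[string]time.Time // IP -> time it was released
	assigned map[string]bool      // IP -> currently assigned
}

func newCooldownPool(ips []string, cooldown time.Duration) *cooldownPool {
	p := &cooldownPool{
		cooldown: cooldown,
		released: make(map[string]time.Time),
		assigned: make(map[string]bool),
	}
	for _, ip := range ips {
		p.released[ip] = time.Time{} // zero time: immediately eligible
	}
	return p
}

// Assign returns a free IP whose cooldown has expired, or an error.
func (p *cooldownPool) Assign(now time.Time) (string, error) {
	for ip, releasedAt := range p.released {
		if now.Sub(releasedAt) >= p.cooldown {
			delete(p.released, ip)
			p.assigned[ip] = true
			return ip, nil
		}
	}
	return "", errors.New("no IP available: all free IPs are cooling down")
}

// Release returns an assigned IP to the pool and starts its cooldown.
func (p *cooldownPool) Release(ip string, now time.Time) {
	delete(p.assigned, ip)
	p.released[ip] = now
}

func main() {
	pool := newCooldownPool([]string{"10.20.115.153"}, time.Minute)
	ip, _ := pool.Assign(time.Now())
	pool.Release(ip, time.Now())
	// Immediately re-assigning fails: the address is still cooling down.
	if _, err := pool.Assign(time.Now()); err != nil {
		fmt.Println(err)
	}
}
```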

@uruddarraju
Contributor Author

> Hi @uruddarraju, the CNI will not reuse IPs directly, but after a one-minute cooldown they will be available to be assigned to pods again.

Thanks @mogren. I am not sure that is the problem, though. As you can see in the two examples above, the pods were spun up 9d ago and 25m ago, so I don't think reconciliation races are the problem here. And yes, we run the default configuration; are there any best-practices guides you'd like us to refer to for our deployment?

@jaypipes
Contributor

@uruddarraju sorry for the delay in getting back to you on this! As @mogren mentioned, it would be good to upgrade to at least 1.5.3 and see if the issue with duplicate IPs goes away.

@ataranto

I'm working with @uruddarraju on this problem. Over the past few days we have been able to consistently observe pods receiving duplicate IP addresses, and the duplicate assignment is usually correlated with the startup time of the L-IPAMD process. Here is the subset of log lines that outlines the problem scenario:

On node ip-10-20-118-72.us-west-2.compute.internal, 2 pods (fluentd-es-v2.4.0-bnvqs and pod-a) have received the same IP address (10.20.115.153).

Timeline:

Tue, 19 Nov 2019 15:24:47 -0800: fluentd-es-v2.4.0-bnvqs started
Fri, 29 Nov 2019 10:28:59 -0800: aws-node restarted
Fri, 29 Nov 2019 10:30:34 -0800: pod-a started

Relevant subset of log:

2019-11-29T18:28:59.808Z [INFO] 	Starting L-IPAMD v1.6.0-rc4  ...

2019-11-29T18:28:59.842Z [INFO] 	Waiting for controller cache sync
2019-11-29T18:29:01.342Z [INFO] 	Synced successfully with APIServer

2019-11-29T18:29:01.343Z [DEBUG] 	GetLocalPods start ...
2019-11-29T18:29:01.343Z [DEBUG] 	getLocalPodsWithRetry() found 0 local pods

2019-11-29T18:29:01.348Z [DEBUG] 	Found pod fluentd-es-v2.4.0-bnvqs with container ID: docker://b8c192e5b00b4ce5c19722fc09254f64b0649b6e621c67adbcb6dc454d229a07
2019-11-29T18:29:01.348Z [INFO] 	Add/Update for Pod fluentd-es-v2.4.0-bnvqs on my node, namespace = kube-system, IP = 10.20.115.153

2019-11-29T18:30:34.058Z [DEBUG] 	No container ID found for pod-a
2019-11-29T18:30:34.058Z [INFO] 	Add/Update for Pod pod-a on my node, namespace = namespace-1, IP = 
2019-11-29T18:30:34.066Z [DEBUG] 	No container ID found for pod-a
2019-11-29T18:30:34.066Z [INFO] 	Add/Update for Pod pod-a on my node, namespace = namespace-1, IP = 
2019-11-29T18:30:34.712Z [INFO] 	Received AddNetwork for NS /proc/30864/ns/net, Pod pod-a, NameSpace namespace-1, Container e8c88083893cf91d26843ae506dd6d3816481deb14c1a42343d98acb51d7f90f, ifname eth0
2019-11-29T18:30:34.712Z [DEBUG] 	AssignIPv4Address: IP address pool stats: total: 180, assigned 0
2019-11-29T18:30:34.712Z [INFO] 	AssignPodIPv4Address: Assign IP 10.20.115.153 to pod (name pod-a, namespace namespace-1 container e8c88083893cf91d26843ae506dd6d3816481deb14c1a42343d98acb51d7f90f)
2019-11-29T18:30:34.712Z [INFO] 	Send AddNetworkReply: IPv4Addr 10.20.115.153, DeviceNumber: 0, err: <nil>
2019-11-29T18:30:34.957Z [DEBUG] 	AssignIPv4Address: IP address pool stats: total: 180, assigned 1
2019-11-29T18:30:35.368Z [DEBUG] 	Found pod pod-a with container ID: docker://b6993fffa5bc874d16eec739d440faeb0cd1d5b0bd73169eda72b25b1f764dd0
2019-11-29T18:30:35.368Z [INFO] 	Add/Update for Pod pod-a on my node, namespace = namespace-1, IP = 10.20.115.153

First, it's unclear why getLocalPodsWithRetry() found 0 local pods, as there were many pods (including fluentd-es-v2.4.0-bnvqs) running at the time L-IPAMD was restarted.

Second, even though the code progresses to the point of Found pod fluentd-es-v2.4.0-bnvqs, the discovery mechanism doesn't seem to update the datastore to reflect that 10.20.115.153 is in use. We can see that when the AddNetwork request arrives for the newly scheduled pod-a, the datastore considers zero of our 180 addresses to be assigned.
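
To make the failure mode concrete, here is a hypothetical Go sketch of the invariant that appears to be violated: on restart, every IP already held by a local pod should be marked as assigned in the datastore before any new AddNetwork request is served. The types and function names below are invented for illustration and are not the actual ipamd code.

```go
package main

import (
	"errors"
	"fmt"
)

// datastore is a hypothetical, simplified view of the node-local IP pool.
type datastore struct {
	inUse map[string]string // IP -> pod that holds it
	free  []string          // all IPs attached to the node's ENIs
}

// restore marks the IPs of already-running local pods as assigned. If
// discovery returns an empty list (as in the log above, where
// getLocalPodsWithRetry() found 0 local pods), nothing is marked and the
// pool believes every address is free.
func (d *datastore) restore(localPodIPs map[string]string) {
	for pod, ip := range localPodIPs {
		d.inUse[ip] = pod
	}
}

// assign hands out the first address not recorded as in use.
func (d *datastore) assign(pod string) (string, error) {
	for _, ip := range d.free {
		if _, taken := d.inUse[ip]; !taken {
			d.inUse[ip] = pod
			return ip, nil
		}
	}
	return "", errors.New("pool exhausted")
}

func main() {
	// Correct startup: fluentd's IP is restored, so pod-a gets a different one.
	d := &datastore{inUse: map[string]string{}, free: []string{"10.20.115.153", "10.20.115.154"}}
	d.restore(map[string]string{"fluentd-es-v2.4.0-bnvqs": "10.20.115.153"})
	ip, _ := d.assign("pod-a")
	fmt.Println("pod-a:", ip) // 10.20.115.154

	// Buggy startup: discovery returned no pods, so nothing is restored and
	// pod-a is handed the address fluentd already holds.
	d2 := &datastore{inUse: map[string]string{}, free: []string{"10.20.115.153", "10.20.115.154"}}
	d2.restore(map[string]string{})
	ip2, _ := d2.assign("pod-a")
	fmt.Println("pod-a (after empty discovery):", ip2) // 10.20.115.153 — duplicate
}
```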

@uruddarraju
Contributor Author

Took a stab at this here: #738. @jaypipes, can you take a look, please? It's difficult to add a test case for this use case.

@mogren
Contributor

mogren commented Jan 29, 2020

The related PR needs to be rebased, and still has some required changes.

@uthark
Contributor

uthark commented May 6, 2020

We've seen this too. It happened after an API server outage; the CNI then assigned an IP address that was already in use by another pod running on the same node.

@mogren
Contributor

mogren commented May 7, 2020

Thanks for confirming, @uthark. I'll try to make an updated version of #738 and test this.

@mogren
Contributor

mogren commented Jun 7, 2020

Since #972, we read the locally running pods directly from the CRI socket instead of using the watcher.
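
For context, here is a minimal sketch of what listing locally running pod sandboxes over the CRI socket looks like, assuming the k8s.io/cri-api v1alpha2 client and a dockershim/containerd socket path; it is illustrative only and is not the actual change in #972.

```go
package main

import (
	"context"
	"fmt"
	"net"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

func main() {
	// Socket path is an assumption: /var/run/dockershim.sock for the Docker
	// shim, /run/containerd/containerd.sock for containerd.
	const socket = "/var/run/dockershim.sock"

	// Dial the CRI runtime service over the unix socket.
	conn, err := grpc.Dial(socket,
		grpc.WithInsecure(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)

	// List only sandboxes in the READY state, i.e. pods that are actually
	// running on this node right now.
	resp, err := client.ListPodSandbox(context.TODO(), &runtimeapi.ListPodSandboxRequest{
		Filter: &runtimeapi.PodSandboxFilter{
			State: &runtimeapi.PodSandboxStateValue{State: runtimeapi.PodSandboxState_SANDBOX_READY},
		},
	})
	if err != nil {
		panic(err)
	}
	for _, sb := range resp.Items {
		fmt.Printf("sandbox %s: pod %s/%s\n", sb.Id, sb.Metadata.Namespace, sb.Metadata.Name)
	}
}
```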

@mogren
Contributor

mogren commented Aug 23, 2020

@uthark @uruddarraju Late update, but is this still an issue with v1.6.4 or v1.7.0?

On Aug 23, 2020, @mogren removed the priority/P0 label (Highest priority. Someone needs to actively work on this.).
@mogren
Contributor

mogren commented Sep 4, 2020

v1.7.2-rc1 is a release candidate that includes a lot of fixes to resolve this type of issue.

@mogren
Contributor

mogren commented Sep 10, 2020

@uruddarraju, @ataranto Since v1.5.x, we have moved away from using the informer and instead query the CRI socket directly. These changes should prevent this issue from happening again.
