Some pods are getting stuck without external network #1070

Closed · michalzxc opened this issue Jul 3, 2020 · 24 comments
michalzxc commented Jul 3, 2020

Our pods use vault-agent, so it is very easy to see when a pod gets stuck without network by grepping for pods like:

t5                              webapp-8bcdc6bc-wn54v                                     0/1     Init:0/2                0          21m     10.64.135.103   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
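For reference, the listing above comes from a wide kubectl listing grepped for the init state:

kubectl get pods --all-namespaces -o wide | grep "Init:0"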

In the logs I see connection issues:

Jun 29 07:01:34 ote001spot05-i-0cad12ccb1577476a.ad.dice.fm docker_k8s_vault-agent_partners-77c5988c88-clgh8_edge_cd84d918-b76a-11ea-b9ec-062a67e343cc_0_091bad43079c[9276]: 2020-06-29T07:01:34.839Z [ERROR] auth.handler: error authenticating: error="Put https://vault.bootstrap:8200/v1/auth/kubernetes/login: dial tcp: i/o timeout" backoff=1.101356452
Jun 29 07:01:42 ote001spot05-i-0cad12ccb1577476a.ad.dice.fm docker_k8s_vault-agent_backend-7b6bd7984-9l4vg_edge_38c19fd8-b869-11ea-b9ec-062a67e343cc_0_640b1c1ca1af[9276]: 2020-0

Sometimes pods get stuck like that: when deploying 1500 pods, on average 17 get stuck. Often half or more of the pods on a node are fine while the others get stuck.

Single node:

t25                             partners-744ddcccbf-vftsx                                 1/1     Running                 0          17m     10.64.153.78    ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t25                             seatselection-77d4ddf89b-6nbw6                            1/1     Running                 0          15m     10.64.166.135   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t26                             seatselection-77d4ddf89b-mh9zk                            1/1     Running                 0          15m     10.64.134.196   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t26                             ticketing-687d94c666-t4k4w                                1/1     Running                 0          17m     10.64.150.152   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t28                             bowie-5f8855bcff-67mrq                                    1/1     Running                 0          15m     10.64.173.8     ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t28                             partners-744ddcccbf-dd2vh                                 0/1     Init:0/2                0          16m     10.64.152.173   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t3                              seatselection-77d4ddf89b-pjtbv                            0/1     Init:0/2                0          21m     10.64.169.91    ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t5                              webapp-8bcdc6bc-wn54v                                     0/1     Init:0/2                0          21m     10.64.135.103   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t7                              ticketing-687d94c666-slhnc                                0/1     Init:0/2                0          21m     10.64.156.107   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t8                              hedwig-5596c54f5c-tqq7v                                   0/1     Init:0/2                0          20m     10.64.176.49    ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t8                              kim-58656b9d6b-8qvmd                                      1/1     Running                 0          20m     10.64.180.121   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>

When I delete the aws-node pod on the corresponding node, they all recover instantly as soon as the new one starts.
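A hedged sketch of that workaround (the node name is just the one from the listing above; k8s-app=aws-node is the label on the CNI DaemonSet pods):

kubectl -n kube-system delete pod -l k8s-app=aws-node \
  --field-selector spec.nodeName=ip-10-64-168-217.eu-west-1.compute.internal

The DaemonSet controller recreates the pod right away, and that restart is what brings the networking back for the stuck pods.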

Hoping it might improve things, I added MINIMUM_IP_TARGET=30 and WARM_IP_TARGET=10 to aws-node. Later I also added a "sleep" to aws-node; it may be a placebo, but it seems that when a new node is created by the autoscaler it is better if pods are scheduled before aws-node starts up.
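One way to set those variables, assuming the stock aws-node DaemonSet in kube-system (a sketch, not necessarily how it was done here):

kubectl -n kube-system set env daemonset/aws-node MINIMUM_IP_TARGET=30 WARM_IP_TARGET=10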

It "seems" much better, but even now it is sometimes stucking:

default                         debian-xcgv6                                              1/1     Running       0          107m    10.64.13.160    ip-10-64-62-184.eu-west-1.compute.internal    <none>           <none>
kube-system                     aws-node-s5jzc                                            1/1    Running       0          33s     10.64.62.184    ip-10-64-62-184.eu-west-1.compute.internal    <none>           <none>
kube-system                     kube-proxy-m8m74                                          1/1     Running       0          108m    10.64.62.184    ip-10-64-62-184.eu-west-1.compute.internal    <none>           <none>
prometheus                      node-cadvisor-bigspot-j6ffx                               1/1     Running       0          107m    10.64.62.184    ip-10-64-62-184.eu-west-1.compute.internal    <none>           <none>
prometheus                      node-exporter-z77v6                                       1/1     Running       0          108m    10.64.62.184    ip-10-64-62-184.eu-west-1.compute.internal    <none>           <none>
search-fixes                    background-76ff6b85cb-4l5z5                               1/1     Running       0          6m37s   10.64.18.71     ip-10-64-62-184.eu-west-1.compute.internal    <none>           <none>

In the listing above, background-76ff6b85cb-4l5z5 was stuck until I killed the previous aws-node pod; about 30 seconds after the new aws-node-s5jzc started, it worked.

https://drive.google.com/drive/folders/1dfjrcBOaUc3cE7gEfcLZHlpBh_Lv2oGZ?usp=sharing

More logs:

[root@ote001spot08-127921709917439 ~]# grep background-85c5ddbf9b-jdsll /var/log/aws-routed-eni/ipamd.log
{"level":"debug","ts":"2020-07-03T10:01:34.904Z","caller":"k8sapi/discovery.go:216","msg":"Update for pod background-85c5ddbf9b-jdsll: Pending"}
{"level":"info","ts":"2020-07-03T10:01:34.904Z","caller":"k8sapi/discovery.go:216","msg":"Add/Update for Pod background-85c5ddbf9b-jdsll on my node, namespace = a1, IP = "}
{"level":"debug","ts":"2020-07-03T10:01:34.926Z","caller":"k8sapi/discovery.go:216","msg":"Update for pod background-85c5ddbf9b-jdsll: Pending"}
{"level":"info","ts":"2020-07-03T10:01:34.926Z","caller":"k8sapi/discovery.go:216","msg":"Add/Update for Pod background-85c5ddbf9b-jdsll on my node, namespace = a1, IP = "}
{"level":"debug","ts":"2020-07-03T10:06:22.397Z","caller":"k8sapi/discovery.go:216","msg":"Update for pod background-85c5ddbf9b-jdsll: Pending"}
{"level":"info","ts":"2020-07-03T10:06:22.397Z","caller":"k8sapi/discovery.go:216","msg":"Add/Update for Pod background-85c5ddbf9b-jdsll on my node, namespace = a1, IP = "}

Describe when stuck:

  Type     Reason                  Age                      From                                                  Message
  ----     ------                  ----                     ----                                                  -------
  Normal   Scheduled               9m45s                    default-scheduler                                     Successfully assigned a1/background-85c5ddbf9b-jdsll to ip-10-64-154-188.eu-west-1.compute.internal
  Warning  FailedCreatePodSandBox  9m16s (x13 over 9m43s)   kubelet, ip-10-64-154-188.eu-west-1.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "background-85c5ddbf9b-jdsll": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown
  Normal   SandboxChanged          4m43s (x144 over 9m42s)  kubelet, ip-10-64-154-188.eu-west-1.compute.internal  Pod sandbox changed, it will be killed and re-created.

Plugin logs:

{"level":"info","ts":"2020-07-02T09:40:06.799Z","caller":"driver/driver.go:316","msg":"Delete toContainer rule for 10.64.154.149/32 "}
{"level":"info","ts":"2020-07-02T09:40:06.800Z","caller":"driver/driver.go:316","msg":"Delete fromContainer rule for 10.64.154.149/32 in table 2"}
{"level":"debug","ts":"2020-07-02T09:40:06.800Z","caller":"driver/driver.go:316","msg":"Tear down of NS complete"}
{"level":"info","ts":"2020-07-02T11:26:41.320Z","caller":"routed-eni-cni-plugin/cni.go:252","msg":"Received CNI del request: ContainerID(4f69bc0a130e34d97457635a8fe235f54a6b278563e627ce5ab5b7659ae2f4e3) Netns(/proc/5754/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=prod-7529;K8S_POD_NAME=background-84fd9c4d49-qdqqt;K8S_POD_INFRA_CONTAINER_ID=4f69bc0a130e34d97457635a8fe235f54a6b278563e627ce5ab5b7659ae2f4e3) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"Debug\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"debug","ts":"2020-07-02T11:26:41.327Z","caller":"routed-eni-cni-plugin/cni.go:322","msg":"TeardownNS: addr 10.64.131.128/32, table 2"}
{"level":"info","ts":"2020-07-02T11:26:41.327Z","caller":"driver/driver.go:316","msg":"Delete toContainer rule for 10.64.131.128/32 "}
{"level":"info","ts":"2020-07-02T11:26:41.328Z","caller":"driver/driver.go:316","msg":"Delete fromContainer rule for 10.64.131.128/32 in table 2"}
{"level":"debug","ts":"2020-07-02T11:26:41.328Z","caller":"driver/driver.go:316","msg":"Tear down of NS complete"}
{"level":"info","ts":"2020-07-02T12:25:37.856Z","caller":"routed-eni-cni-plugin/cni.go:128","msg":"Received CNI add request: ContainerID(a43c5abc04202bab9a0769834bc61e18b64ddbd0b7e52594aecd92877fa56f54) Netns(/proc/22410/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=discovery-badges;K8S_POD_NAME=background-7c4dbf5b96-zjs9r;K8S_POD_INFRA_CONTAINER_ID=a43c5abc04202bab9a0769834bc61e18b64ddbd0b7e52594aecd92877fa56f54) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"Debug\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"debug","ts":"2020-07-02T12:25:37.857Z","caller":"routed-eni-cni-plugin/cni.go:128","msg":"MTU value set is 9001:"}
{"level":"info","ts":"2020-07-02T12:25:37.860Z","caller":"routed-eni-cni-plugin/cni.go:128","msg":"Received add network response for pod background-7c4dbf5b96-zjs9r namespace discovery-badges sandbox a43c5abc04202bab9a0769834bc61e18b64ddbd0b7e52594aecd92877fa56f54: 10.64.147.197, table 0, external-SNAT: false, vpcCIDR: [10.64.0.0/16]"}
{"level":"debug","ts":"2020-07-02T12:25:37.860Z","caller":"routed-eni-cni-plugin/cni.go:203","msg":"SetupNS: hostVethName=eni9d2f9c4fffa, contVethName=eth0, netnsPath=/proc/22410/ns/net, table=0, mtu=9001"}
{"level":"debug","ts":"2020-07-02T12:25:37.887Z","caller":"driver/driver.go:173","msg":"SetupNS: disabled IPv6 RA and ICMP redirects on eni9d2f9c4fffa"}
{"level":"debug","ts":"2020-07-02T12:25:37.895Z","caller":"driver/driver.go:173","msg":"Setup host route outgoing hostVeth, LinkIndex 262"}
{"level":"debug","ts":"2020-07-02T12:25:37.896Z","caller":"driver/driver.go:173","msg":"Successfully set host route to be 10.64.147.197/0"}
{"level":"info","ts":"2020-07-02T12:25:37.923Z","caller":"driver/driver.go:173","msg":"Added toContainer rule for 10.64.147.197/32"}

This time it actually got an IP, but only after 11 minutes; most of the time it is not that lucky (there were only 6 pods on this node, so with MINIMUM_IP_TARGET=30 all the IPs were already attached to the VM, waiting to be used).
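One way to confirm what ipamd thinks it has attached is its introspection endpoint on the node; the port and path below are the defaults I recall from the CNI troubleshooting docs and may differ by version:

curl -s http://localhost:61679/v1/enis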

~❯ kubectl --namespace a1 get pod -o wide|grep background
background-85c5ddbf9b-jdsll       1/1     Running     0          14m   10.64.166.246   ip-10-64-154-188.eu-west-1.compute.internal   <none>           <none>

The problem started last week. We had previously been using aws-node 1.5.x for months, but after the issues began we upgraded to the current version last Friday, which didn't help.

We have over 100 nodes; this is a cluster where developers create their personal environments, so with that much churn (sometimes a thousand pods per day) the issue shows up every couple of days, and sometimes even twice a day when there is heavy activity, on random node(s). It looks like an edge case: roughly 98% of pods are fine, but under heavy activity it becomes a problem. When creating 20 environments (around 1500 pods in total), it is almost certain that 5-20 pods will get stuck.

Just to be clear, it doesn't only happen during heavy activity; in some rare cases it happened when a single pod was deployed to a node. But deploying tons of pods is an effective way to check whether the issue is still there or fixed. Yesterday at 5 PM I deployed 1500 pods four times without any issue, but today I got a single stuck pod again. Maybe it is an issue on the AWS side, some endpoint used by aws-node?


jayanthvn commented Jul 3, 2020

Hi @michalzxc,

Which version of the CNI are you using? Also, can you please run aws-cni-support.sh on the node having the issue and share the logs with us?
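For anyone following along, a rough sketch of collecting those logs on the affected node (the script ships with the CNI, typically under /opt/cni/bin on EKS AMIs; the path and output location may vary by version):

sudo bash /opt/cni/bin/aws-cni-support.sh
# writes a support bundle like eks_i-xxxxxxxx_<timestamp>.tar.gz under /var/log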

WARM_ENI_TARGET=1 should be sufficient; MINIMUM_IP_TARGET and WARM_IP_TARGET are probably not needed.

Thanks.


rochacon commented Jul 4, 2020

I've been having the same issue with v1.6.x, tried all versions of this series and still get new pods without a working network from time to time. Currently running v1.6.3.
I was not able to reproduce this consistently; just as the OP mentioned, the issue seems to be random. It seems to be related to IP reuse: I noticed that some of the stuck pods had the same IP as an old job pod (in Completed status).

In my case, the stuck pods can't access CoreDNS or other pods either; restarting the container has no effect, and the only fix I found is to delete the affected pod.

I wonder if this could be related to SNAT changes on 1.6? Or maybe IP cooldown?
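A quick way to check that IP-reuse hunch (just a sketch): list completed pods that still show an IP and compare against the stuck pods' IPs:

kubectl get pods --all-namespaces -o wide --field-selector=status.phase=Succeeded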

@jayanthvn

Hi @rochacon

Can you please run aws-cni-support.sh on the impacted node and share the logs with us? Please also share how many pods you had when the issue occurred.

Thanks.

@michalzxc

~❯  kubectl get pods --all-namespaces -o wide|grep "Init:0"
aqa-14                          background-85c5ddbf9b-58jwl                               0/1     Init:0/2    0          2m26s   <none>          ip-10-64-154-188.eu-west-1.compute.internal   <none>           <none>
edge                            background-85c5ddbf9b-5qmct                               0/1     Init:0/2    0          2m25s   <none>          ip-10-64-154-188.eu-west-1.compute.internal   <none>           <none>

https://drive.google.com/file/d/1_Mo1XnXqgRdT3VAtMFCcVVZaGVqHaM02/view?usp=sharing - result of aws-cni-support.sh

@michalzxc

Another one:

amp                             partners-56d77cd869-m8h2k                                 0/1     Init:0/2      0          5m      10.64.157.143   ip-10-64-149-238.eu-west-1.compute.internal   <none>           <none>
aqa-14                          partners-56d77cd869-pjqts                                 0/1     Init:0/2      0          4m31s   10.64.186.84    ip-10-64-149-238.eu-west-1.compute.internal   <none>           <none>
embeddable-streams-fullscreen   kimlexir-cron-74fcfd99fc-xg5k8                            0/1     Init:0/3      0          4m43s   10.64.151.35    ip-10-64-149-238.eu-west-1.compute.internal   <none>           <none>

https://drive.google.com/file/d/1Y8OGleFJqlb-DecnEI-XT92jMXbPhrnT/view?usp=sharing

@michalzxc

In both cases above, deleting the aws-node pod solves the issue for these pods; the network appears the moment the new aws-node pod starts.


michalzxc commented Jul 6, 2020

Before this we were using 1.5.3; we upgraded to 1.6.3 because of this issue (it appeared out of nowhere, so we thought maybe an AWS API changed somewhere and we needed a newer aws-node, etc.).

@michalzxc

We had been using aws-node 1.5.3 for over a year without this issue. Recently we reduced some resource requests, so possibly it is a result of having more pods (=> more IPs) per node?


jayanthvn commented Jul 6, 2020

Hi @michalzxc

Thanks for sharing the logs. It seems the CNI plugin never got an add request, and the kubelet logs show the errors below:

Jul 06 09:41:15 ote001spot08-127921709917439.ad.dice.fm kubelet[24477]: I0706 09:41:15.010180   24477 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "vault-agent-config" (UniqueName: "kubernetes.io/configmap/d58f6c61-bf6c-11ea-b9ec-062a67e343cc-vault-agent-config") pod "background-85c5ddbf9b-58jwl" (UID: "d58f6c61-bf6c-11ea-b9ec-062a67e343cc")
Jul 06 09:41:15 ote001spot08-127921709917439.ad.dice.fm kubelet[24477]: I0706 09:41:15.010245   24477 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "vault-env" (UniqueName: "kubernetes.io/empty-dir/d58f6c61-bf6c-11ea-b9ec-062a67e343cc-vault-env") pod "background-85c5ddbf9b-58jwl" (UID: "d58f6c61-bf6c-11ea-b9ec-062a67e343cc")
Jul 06 09:41:15 ote001spot08-127921709917439.ad.dice.fm kubelet[24477]: I0706 09:41:15.010281   24477 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "background-token-jq9lv" (UniqueName: "kubernetes.io/secret/d58f6c61-bf6c-11ea-b9ec-062a67e343cc-background-token-jq9lv") pod "background-85c5ddbf9b-58jwl" (UID: "d58f6c61-bf6c-11ea-b9ec-062a67e343cc")
Jul 06 09:41:15 ote001spot08-127921709917439.ad.dice.fm kubelet[24477]: I0706 09:41:15.315170   24477 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "background-token-gwxz8" (UniqueName: "kubernetes.io/secret/d5bea14c-bf6c-11ea-b9ec-062a67e343cc-background-token-gwxz8") pod "background-85c5ddbf9b-5qmct" (UID: "d5bea14c-bf6c-11ea-b9ec-062a67e343cc")
Jul 06 09:41:15 ote001spot08-127921709917439.ad.dice.fm kubelet[24477]: I0706 09:41:15.315236   24477 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "vault-agent-config" (UniqueName: "kubernetes.io/configmap/d5bea14c-bf6c-11ea-b9ec-062a67e343cc-vault-agent-config") pod "background-85c5ddbf9b-5qmct" (UID: "d5bea14c-bf6c-11ea-b9ec-062a67e343cc")
Jul 06 09:41:15 ote001spot08-127921709917439.ad.dice.fm kubelet[24477]: I0706 09:41:15.315349   24477 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "vault-env" (UniqueName: "kubernetes.io/empty-dir/d5bea14c-bf6c-11ea-b9ec-062a67e343cc-vault-env") pod "background-85c5ddbf9b-5qmct" (UID: "d5bea14c-bf6c-11ea-b9ec-062a67e343cc")
**Jul 06 09:41:17 ote001spot08-127921709917439.ad.dice.fm kubelet[24477]: E0706 09:41:17.619705   24477 remote_runtime.go:109] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "background-85c5ddbf9b-58jwl": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown
Jul 06 09:41:17 ote001spot08-127921709917439.ad.dice.fm kubelet[24477]: E0706 09:41:17.619768   24477 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "background-85c5ddbf9b-58jwl_aqa-14(d58f6c61-bf6c-11ea-b9ec-062a67e343cc)" failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "background-85c5ddbf9b-58jwl": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown
Jul 06 09:41:17 ote001spot08-127921709917439.ad.dice.fm kubelet[24477]: E0706 09:41:17.619796   24477 kuberuntime_manager.go:697] createPodSandbox for pod "background-85c5ddbf9b-58jwl_aqa-14(d58f6c61-bf6c-11ea-b9ec-062a67e343cc)" failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "background-85c5ddbf9b-58jwl": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown**


michalzxc commented Jul 7, 2020

I don't see the OCI error anymore; I downgraded Docker and am trying with an older version.

rom plugin/docker: NetworkPlugin cni failed on the status hook for pod "kimlexir-6d5d5f646-9rjjl_mio-event-details": CNI failed to retrieve network names
c8b8a33fb86b7df3226093dcbc664c227b2e06c9a2ac777e052c116ee8" not found in pod's containers
espace path: cannot find network namespace for the terminated container "508c0ac8b8a33fb86b7df3226093dcbc664c227b2e06c9a2ac777e052c116ee8"
ime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "10800f63ca9f3bfb8a6f544ce741ba3fa2472ac332bfae429eb9fd8a6b9283ff
r pod "ticketing-86b847775c-zqx49_t6(c9868e8b-c037-11ea-b9ec-062a67e343cc)" failed: rpc error: code = Unknown desc = failed to set up sandbox container "
or pod "ticketing-86b847775c-zqx49_t6(c9868e8b-c037-11ea-b9ec-062a67e343cc)" failed: rpc error: code = Unknown desc = failed to set up sandbox container 
-c037-11ea-b9ec-062a67e343cc ("ticketing-86b847775c-zqx49_t6(c9868e8b-c037-11ea-b9ec-062a67e343cc)"), skipping: failed to "CreatePodSandbox" for "ticketi
der for "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podc9868e8b_c037_11ea_b9ec_062a67e343cc.slice/docker-10800f63ca9f3bfb8a6f544ce741ba3
ime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "f2be00b1e8ba83fd23cfcea3623498cef437008322163b319cb85cd4e0cc404e
r pod "amp-6979f8b874-755v6_t6(cc07fc06-c037-11ea-b9ec-062a67e343cc)" failed: rpc error: code = Unknown desc = failed to set up sandbox container "f2be00
or pod "amp-6979f8b874-755v6_t6(cc07fc06-c037-11ea-b9ec-062a67e343cc)" failed: rpc error: code = Unknown desc = failed to set up sandbox container "f2be0
-c037-11ea-b9ec-062a67e343cc ("amp-6979f8b874-755v6_t6(cc07fc06-c037-11ea-b9ec-062a67e343cc)"), skipping: failed to "CreatePodSandbox" for "amp-6979f8b87
der for "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podcc07fc06_c037_11ea_b9ec_062a67e343cc.slice/docker-f2be00b1e8ba83fd23cfcea3623498c
rom plugin/docker: NetworkPlugin cni failed on the status hook for pod "amp-6979f8b874-755v6_t6": CNI failed to retrieve network namespace path: cannot f
runtime/proc.go:203","msg":"CNI Plugin version: v1.6.3 ..."}
rom plugin/docker: NetworkPlugin cni failed on the status hook for pod "ticketing-86b847775c-zqx49_t6": CNI failed to retrieve network namespace path: ca
b1e8ba83fd23cfcea3623498cef437008322163b319cb85cd4e0cc404e" not found in pod's containers
63ca9f3bfb8a6f544ce741ba3fa2472ac332bfae429eb9fd8a6b9283ff" not found in pod's containers

Stuck pods:
pontarelixir-5f5898dbb6-dg765, amp-6979f8b874-755v6, janitor-6cdfb457bd-5q5t9

https://drive.google.com/file/d/1YBhpDxO7EClzpdoY7iSnWpZbVkVW6xmF/view?usp=sharing


michalzxc commented Jul 8, 2020

Downgraded to the VM image that was fully stable a month ago.
https://drive.google.com/file/d/1dlbrRRPR7qjqA2_sQFwA_HLDQz8XQRaN/view?usp=sharing

Every 2.0s: kubectl get pods --all-namespaces -o wide|grep "Init:0"                                                                                                               

t1                              backend-ddcf669dd-c7s2x                                   0/1     Init:0/2           0          10m     10.64.30.245    ip-10-64-20-169.eu-west-1.compute.internal    <none>           <none>
t2                              bowie-6bbd5d44bd-zhgr9                                    0/1     Init:0/2           0          10m     10.64.40.179    ip-10-64-20-169.eu-west-1.compute.internal    <none>           <none>
t3                              bowie-6bbd5d44bd-xcvf4                                    0/1     Init:0/2           0          10m     10.64.7.34      ip-10-64-20-169.eu-west-1.compute.internal    <none>           <none>
t3                              kimlexir-77464475f4-hq56d                                 0/1     Init:0/2           0          10m     10.64.28.198    ip-10-64-20-169.eu-west-1.compute.internal    <none>           <none>
t3                              partners-cron-bdf448854-f2gqz                             0/1     Init:0/2           0          10m     10.64.0.23      ip-10-64-20-169.eu-west-1.compute.internal    <none>           <none>
t3                              pontarelixir-6c6c465ff-smngk                              0/1     Init:0/2           0          10m     10.64.56.60     ip-10-64-20-169.eu-west-1.compute.internal    <none>           <none>
t4                              amp-5bd8b89487-vcg2p                                      0/1     Init:0/2           0          10m     10.64.27.222    ip-10-64-20-169.eu-west-1.compute.internal    <none>           <none>
t4                              janitor-cron-7b964c8d55-p2hsr                             0/1     Init:0/2           0          9m58s   10.64.42.234    ip-10-64-20-169.eu-west-1.compute.internal    <none>           <none>

@michalzxc

I can connect to the stuck pods from their own node, but I can't connect from other nodes. Any advice on where to look, and how aws-cni sets up the routing/iptables/etc.?
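For anyone looking at the same thing, these generic commands show most of what the CNI sets up on the node (a sketch, not an exhaustive list):

ip rule list                       # per-pod policy routing rules added by the CNI
ip route show table main           # /32 routes to the pod veths live in the main table
ip route show table 2              # per-ENI tables (2, 3, ...): default route out of each secondary ENI
iptables -t nat -S | grep -i aws   # SNAT chains added by the CNI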


michalzxc commented Jul 8, 2020

@jayanthvn

[root@ote001spot03-i-0a4976e528bc159aa ~]# ip rule list
0:	from all lookup local 
512:	from all to 10.64.143.178 lookup main 
512:	from all to 10.64.147.163 lookup main 
512:	from all to 10.64.174.118 lookup main 
512:	from all to 10.64.191.29 lookup main 
512:	from all to 10.64.167.92 lookup main 
512:	from all to 10.64.190.125 lookup main 
512:	from all to 10.64.183.176 lookup main 
512:	from all to 10.64.189.108 lookup main 
512:	from all to 10.64.136.198 lookup main 
512:	from all to 10.64.158.86 lookup main 
512:	from all to 10.64.180.224 lookup main 
512:	from all to 10.64.178.20 lookup main 
512:	from all to 10.64.136.247 lookup main 
512:	from all to 10.64.191.232 lookup main 
512:	from all to 10.64.140.156 lookup main 
512:	from all to 10.64.163.78 lookup main 
512:	from all to 10.64.151.63 lookup main 
512:	from all to 10.64.150.203 lookup main 
512:	from all to 10.64.180.84 lookup main 
1024:	from all fwmark 0x80/0x80 lookup main 
1536:	from 10.64.158.86 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.180.224 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.178.20 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.136.247 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.191.232 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.140.156 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.163.78 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.151.63 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.150.203 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.180.84 to 10.64.0.0/16 lookup 3 
32766:	from all lookup main 
32767:	from all lookup default

The stuck pod's rule is:

1536:	from 10.64.180.84 to 10.64.0.0/16 lookup 3 

When I check routing tables 2 and 3, the third one is empty:

[root@ote001spot03-i-0a4976e528bc159aa ~]# ip route show table 3
[root@ote001spot03-i-0a4976e528bc159aa ~]# ip route show table 2
default via 10.64.128.1 dev eth0 
10.64.128.1 dev eth0 scope link 

@michalzxc

ip route add default via 10.64.128.1 dev eth1 table 3
ip route add 10.64.128.1 dev eth1 scope link table 3

After that the pod is reachable from other nodes and unstuck.

Why is aws-node failing to set up this routing, and how can it be fixed in a permanent, automatic way?

@michalzxc

I am considering writing a cron script to populate the routing tables on the VMs, but I hope for a better solution on the aws-cni side :)

@michalzxc

This is what I am running every minute in cron:

#!/bin/bash
# Repair per-ENI route tables that aws-node failed to populate.
# Assigns table 2 to the first interface (sorted), table 3 to the next, and so on.

echo "Start Date: $(date)"
INTERFACES="$(ip addr | grep -Eo "eth[0-9]+" | sort -u)"
echo -e "Interfaces:\n${INTERFACES}"
i=2
for int in $INTERFACES; do
        # Gateway of the default route in the main table (same gateway for every ENI in this subnet)
        GW="$(ip route | grep default | head -n 1 | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+')"
        echo "Checking routing table ${i}"
        if [ -z "$(ip route show table ${i} | grep default -A 1 -B 1 | grep scope)" ]; then
                echo "Routing table ${i} does not exist or is empty:"
                ip route show table ${i}
                echo "---"
                echo "Adding routing"
                echo "ip route add default via ${GW} dev ${int} table ${i}:"
                ip route add default via ${GW} dev ${int} table ${i}
                echo "---"
                echo "ip route add ${GW} dev ${int} scope link table ${i}:"
                ip route add ${GW} dev ${int} scope link table ${i}
                echo "---"
        else
                echo "Routing table ${i} exists and looks OK:"
                ip route show table ${i}
                echo "---"
        fi
        i=$((i + 1))
done
echo -e "-----------------------------------\n"

It seems to do the trick.
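For completeness, the cron wiring is just a one-minute entry pointing at the script; the file and log paths below are hypothetical:

# /etc/cron.d/fix-eni-route-tables (hypothetical path)
* * * * * root /usr/local/bin/fix-eni-route-tables.sh >> /var/log/fix-eni-route-tables.log 2>&1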


michalzxc commented Jul 21, 2020

@mogren Hey, could you take a look at this one as well?

@alexandrvb

We are facing the same issue with CNI v1.5.5 and EKS v1.14.9-eks-f459c0. Route table 3 was empty.

Updating to CNI v1.6.3 and rebooting (!) the EKS node helped for now.


SaranBalaji90 commented Aug 28, 2020

From the attached logs (eks_i-0477eb0d8ac0edef0_2020-07-08_0940-UTC_0.6.2):

{"level":"debug","ts":"2020-07-08T09:31:22.094Z","caller":"retry/retry.go:69","msg":"Not able to set route 0.0.0.0/0 via 10.64.0.1 table 2"}
{"level":"error","ts":"2020-07-08T09:31:22.094Z","caller":"ipamd/ipamd.go:677","msg":"Failed to increase pool size: failed to set up ENI eni-06cd9a1ec56ac0250 network: setupENINetwork: unable to replace route entry 0.0.0.0: network is unreachable"}

This results in the same issue as #1094; as you can see below, 10.64.51.85 is still added to the datastore.

{"level":"debug","ts":"2020-07-08T09:33:04.095Z","caller":"ipamd/ipamd.go:508","msg":"Reconcile existing ENI eni-0b5d82ef36c11eb01 IP pool"}
{"level":"debug","ts":"2020-07-08T09:33:04.095Z","caller":"ipamd/ipamd.go:997","msg":"Reconcile and skip primary IP 10.64.11.230 on ENI eni-0b5d82ef36c11eb01"}

{"level":"debug","ts":"2020-07-08T09:33:04.095Z","caller":"ipamd/ipamd.go:508","msg":"Reconcile existing ENI eni-06cd9a1ec56ac0250 IP pool"}
{"level":"info","ts":"2020-07-08T09:33:04.095Z","caller":"ipamd/ipamd.go:1075","msg":"Added ENI(eni-06cd9a1ec56ac0250)'s IP 10.64.51.85 to datastore"}

{"level":"debug","ts":"2020-07-08T09:33:04.095Z","caller":"ipamd/ipamd.go:508","msg":"Reconcile existing ENI eni-017d5e5b45866e889 IP pool"}
{"level":"debug","ts":"2020-07-08T09:33:04.095Z","caller":"ipamd/ipamd.go:997","msg":"Reconcile and skip primary IP 10.64.25.174 on ENI eni-017d5e5b45866e889"}

{"level":"debug","ts":"2020-07-08T09:33:04.095Z","caller":"ipamd/ipamd.go:508","msg":"Reconcile existing ENI eni-01f62d804d30e9aaa IP pool"}
{"level":"debug","ts":"2020-07-08T09:33:04.095Z","caller":"ipamd/ipamd.go:997","msg":"Reconcile and skip primary IP 10.64.20.169 on ENI eni-01f62d804d30e9aaa"}

{"level":"debug","ts":"2020-07-08T09:33:04.095Z","caller":"ipamd/ipamd.go:508","msg":"Successfully Reconciled ENI/IP pool"}

@SaranBalaji90

@michalzxc Thank you for reporting these issues. We are actively working on the fix. Sorry for the delayed response.


mogren commented Sep 2, 2020

The changes in #1177 fix this issue.

mogren closed this as completed on Sep 2, 2020

rochacon commented Sep 3, 2020

@mogren thank you for this fix! Is there a planned release date for it? Will it be backported to the 1.6 series, or will upgrading to 1.7 be required?


mogren commented Sep 3, 2020

@rochacon We are planning to get v1.7.2 out before the end of next week, if all testing continues to go well. We don't plan to backport this to the v1.6.x branch; too much has changed in the code base, so an upgrade will be required.


rochacon commented Sep 3, 2020

OK, thank you.
