Pod restarting at node startup after upgrading to Kubernetes 1.15 #865

Closed
nachomillangarcia opened this issue Mar 13, 2020 · 10 comments
@nachomillangarcia

nachomillangarcia commented Mar 13, 2020

aws-node pods restart once at every node startup, with this message in the logs:

starting IPAM daemon in background ... ok.
checking for IPAM connectivity ...  failed.
timed out waiting for IPAM daemon to start.

This behavior started right after upgrading the EKS cluster to 1.15.10. Specifically, I think the problem is in kube-proxy 1.15.10: I downgraded kube-proxy to 1.14.7 and the issue stopped, and after upgrading it again to 1.15.10, the restarts came back.

In the IPAMD logs on the node, I see that the service always restarts at this point:

2020-03-13T11:16:54.321Z [INFO] 	Setting up host network... 
2020-03-13T11:16:54.322Z [DEBUG] 	Trying to find primary interface that has mac : xx:xx:xx:xx:xx:xx
2020-03-13T11:16:54.322Z [DEBUG] 	Discovered interface: lo, mac: 
2020-03-13T11:16:54.322Z [DEBUG] 	Discovered interface: eth0, mac: xx:xx:xx:xx:xx:xx
2020-03-13T11:16:54.322Z [INFO] 	Discovered primary interface: eth0
2020-03-13T11:16:54.322Z [DEBUG] 	Setting RPF for primary interface: /proc/sys/net/ipv4/conf/eth0/rp_filter
2020-03-13T11:16:54.323Z [DEBUG] 	Found the Link that uses mac address xx:xx:xx:xx:xx:xx and its index is 2 (attempt 1/5)


2020-03-13T11:16:55.840Z [INFO] 	Starting L-IPAMD v1.6.0  ...
2020-03-13T11:16:56.160Z [INFO] 	Testing communication with server
2020-03-13T11:16:56.161Z [INFO] 	Running with Kubernetes cluster version: v1.15+. git version: v1.15.10-eks-bac369. git tree state: clean. commit: bac3690554985327ae4d13e42169e8b1c2f37226. platform: linux/amd64

Is anybody else seeing the same issue after upgrading? Thanks!

@nachomillangarcia changed the title from "Pod restarting at node startup after upgrading to Kubernetes 1.15.10" to "Pod restarting at node startup after upgrading to Kubernetes 1.15" on Mar 13, 2020
@nithu0115
Contributor

Similar issue #872

@jaypipes
Contributor

I'm wondering if the kube-proxy thing might be a red herring. Perhaps we just need to up the timeout for k8s API connectivity from 30 seconds to something like 1 minute?

That line:

checking for IPAM connectivity ...  failed.

occurs when IPAMD fails to connect to the Kubernetes API service within 30 seconds. kube-proxy is required for the DaemonSet running IPAMD to connect to the Kubernetes API service (because kube-proxy sets up the iptables rules on the host for the pod traffic). I'm wondering if kube-proxy 1.15.10 is taking a little longer (>30 seconds or so) to come up on the host after an upgrade, causing a domino effect in which IPAMD times out trying to connect to the k8s API server.

@nachomillangarcia
Author

That seems to be exactly it. I don't see the error when kube-proxy is fully working, only at startup.

It would be great to be able to customize that timeout.

@SaranBalaji90
Contributor

@nachomillangarcia did you update both kube-proxy and aws-node at the same time?

@nachomillangarcia
Author

No, I had been running aws-node 1.6 for weeks before upgrading kube-proxy, with no errors until then.

jaypipes added a commit to jaypipes/amazon-vpc-cni-k8s that referenced this issue Mar 18, 2020
Adds a configurable timeout to the aws-k8s-agent (ipamd) startup in the
entrypoint.sh script. Increases the default timeout from ~30 seconds to
60 seconds.

Users can set the IPAMD_TIMEOUT_SECONDS environment variable to change
the timeout.

Related: aws#625, aws#865 aws#872
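
A minimal sketch of the kind of wait loop this commit message describes, assuming a probe command that succeeds once the daemon is reachable. Only `IPAMD_TIMEOUT_SECONDS` and the 60-second default come from the commit message; the function name `wait_for_ipam` and the probe passed to it are hypothetical, and this is not the actual content of entrypoint.sh.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_ipam and the probe it runs are hypothetical
# stand-ins, not the real entrypoint.sh from amazon-vpc-cni-k8s.

# Retry "$@" once per second until it succeeds or the timeout elapses.
# The timeout is read from IPAMD_TIMEOUT_SECONDS and defaults to 60.
wait_for_ipam() {
    local timeout="${IPAMD_TIMEOUT_SECONDS:-60}"
    local elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for IPAM daemon to start." >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

With a loop of this shape, raising the value of `IPAMD_TIMEOUT_SECONDS` simply gives a slow-starting dependency more one-second attempts before the container exits and is restarted.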
@SaranBalaji90
Contributor

SaranBalaji90 commented Mar 19, 2020

@nachomillangarcia thanks for confirming. I initially thought you had noticed the behavior on existing nodes as well, but reading your issue description again, it seems to happen on node startup only.

One quick way to confirm this would be to look at when kube-proxy entered the Running state on the worker node and compare that with the ipamd restart times.
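
A rough sketch of how that comparison might be scripted, assuming `kubectl` access to the cluster. The label selectors `k8s-app=kube-proxy` and `k8s-app=aws-node` are assumptions about how the two DaemonSets are labeled, and the helper names `pod_time` and `started_before` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. The label selectors used in the example at the bottom are
# assumptions; adjust them to match the DaemonSets in your cluster.

# Print one timestamp field of the first container of the pod that
# matches a label on a given node.
pod_time() {  # usage: pod_time <node> <label> <status-field>
    kubectl get pods -n kube-system -l "$2" \
        --field-selector "spec.nodeName=$1" \
        -o jsonpath="{.items[0].status.containerStatuses[0].$3}"
}

# Kubernetes reports UTC timestamps such as 2020-03-13T11:16:54Z. Two
# timestamps in that fixed format sort chronologically as plain strings.
started_before() {  # usage: started_before <ts_a> <ts_b>; true if a < b
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | LC_ALL=C sort | head -n 1)" = "$1" ]
}

# Example (needs cluster access, so it is left commented out):
# proxy_up=$(pod_time "$NODE" k8s-app=kube-proxy state.running.startedAt)
# cni_died=$(pod_time "$NODE" k8s-app=aws-node lastState.terminated.finishedAt)
# if started_before "$cni_died" "$proxy_up"; then
#     echo "aws-node exited before kube-proxy was running"
# fi
```

Note that `state.running.startedAt` reflects the most recent start of the kube-proxy container, so the comparison is only meaningful if kube-proxy itself has not restarted since the node came up.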

@mogren
Contributor

mogren commented Apr 29, 2020

Hi @nachomillangarcia, have you tried with v1.6.1? Also, did this restart only happen when kube-proxy was updated?

@mogren
Contributor

mogren commented May 19, 2020

We have made some changes to the master upgrade process that should mitigate this problem. Please open a new issue if there are any kube-proxy or CNI issues.

@mogren mogren closed this as completed May 19, 2020
mogren pushed a commit that referenced this issue Jun 24, 2020
* add configurable timeout for ipamd startup

Adds a configurable timeout to the aws-k8s-agent (ipamd) startup in the
entrypoint.sh script. Increases the default timeout from ~30 seconds to
60 seconds.

Users can set the IPAMD_TIMEOUT_SECONDS environment variable to change
the timeout.

Related: #625, #865 #872

* This is a local gRPC call, so just try every 1 second indefinitely

Since we have a liveness probe restarting the pod, we can rely on that to kill the pod.

Co-authored-by: Claes Mogren <mogren@amazon.com>
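
The second bullet of this commit message replaces the deadline with an unbounded retry. A minimal sketch of that pattern, assuming some probe command for the local gRPC endpoint; `wait_forever` is a hypothetical name and this is not the real entrypoint.sh.

```shell
#!/usr/bin/env bash
# Sketch only: wait_forever is a hypothetical name, and the probe command
# passed to it stands in for the local gRPC connectivity check.

# Retry "$@" once per second with no deadline. A loop that never succeeds
# is not stopped here; the caller relies on an external supervisor (in the
# commit message, the liveness probe) to kill the pod.
wait_forever() {
    until "$@"; do
        sleep 1
    done
}
```

The design trade-off is that the script no longer decides when to give up; that decision moves to the probe settings on the pod.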
@tibin-mfl

I'm facing the same issue: the aws-node pod restarts once at every node startup and works after that.
Error:

kubectl logs aws-node-f8tw6   --previous -n kube-system

Copying portmap binary ... Starting IPAM daemon in the background ... ok.
ERROR: logging before flag.Parse: E0904 13:53:37.150548       8 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://10.100.0.1:443/api?timeout=32s: dial tcp 10.100.0.1:443: i/o timeout)
Checking for IPAM connectivity ...  failed.
Timed out waiting for IPAM daemon to start:

EKS Version: 1.17
Platform version: eks.2
Kube-proxy: v1.17.9-eksbuild.1
aws-node: v1.6.3-eksbuild.1

I tried adding a sleep in aws-node to rule out the possibility that this happens because kube-proxy takes time to start, and verified that kube-proxy started before aws-node.

@jurajseffer
Copy link

This happened intermittently after upgrading EKS from 1.14 to 1.15 with kube-proxy v1.14.9-eksbuild.1 and aws-node v1.6.3-eksbuild.1. When it happens, the node takes a very long time to register as healthy and aws-node restarts several times.
