aws-node pod does not start correctly the first time #1702

xtroncode · 2021-10-21T05:20:56Z

What happened:

When a new node is started in a nodegroup, the node takes a long time to be marked Ready because the aws-node (CNI) pod does not start correctly the first time and has to go through 1-2 restarts. The restarts are also delayed because the initial delay for the liveness probe is set to 60 seconds. If we increase the failure threshold for the liveness probe, the aws-node pod is still not marked ready after 7-8 minutes (possibly longer; it does not seem to run at all, as the readiness probes never succeed).

What you expected to happen:
We expect the nodes to be ready in under a minute.

How to reproduce it (as minimally and precisely as possible):

We are seeing this on a newly created EKS cluster, so it should be easily reproducible.

Environment:

Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
CNI Version : v1.9.1-eksbuild.1
OS (e.g: cat /etc/os-release): Amazon Linux 2
Kernel (e.g. uname -a): 5.4.149-73.259.amzn2.x86_64

jayanthvn · 2021-10-21T05:49:26Z

Can you please share the node logs? You can collect them by running this script: sudo bash /opt/cni/bin/aws-cni-support.sh

xtroncode · 2021-10-21T08:59:13Z

Hi,
Please find the logs attached, as requested.

eks_i-0ce4f4f2ac445bafc_2021-10-21_0854-UTC_0.6.2.tar.gz

backjo · 2021-10-21T18:22:22Z

I'd suggest reading https://medium.com/keikoproj/rapid-auto-scaling-on-eks-part-1-bb4de84fc599 - it might be that kube-proxy hasn't started yet by the time the CNI tries to start, in which case it can't connect to the control plane.

jayanthvn · 2021-10-22T15:23:02Z

Yes, as @backjo mentioned, kube-proxy is taking time to start here. aws-node started successfully at 2021-10-21T05:43:11.277Z. The kube-proxy log is shown first, followed by the aws-node (ipamd) log.

{"log":"E1021 05:41:44.902444       1 node.go:161] Failed to retrieve node info: nodes \"ip-10-10-183-57.aepl.com\" not found\n","stream":"stderr","time":"2021-10-21T05:41:44.903022279Z"}
{"log":"E1021 05:41:46.029638       1 node.go:161] Failed to retrieve node info: nodes \"ip-10-10-183-57.aepl.com\" not found\n","stream":"stderr","time":"2021-10-21T05:41:46.029767316Z"}
{"log":"E1021 05:41:48.118122       1 node.go:161] Failed to retrieve node info: nodes \"ip-10-10-183-57.aepl.com\" not found\n","stream":"stderr","time":"2021-10-21T05:41:48.118244512Z"}
{"log":"E1021 05:41:52.582685       1 node.go:161] Failed to retrieve node info: nodes \"ip-10-10-183-57.aepl.com\" not found\n","stream":"stderr","time":"2021-10-21T05:41:52.582958565Z"}
{"log":"E1021 05:42:01.317247       1 node.go:161] Failed to retrieve node info: nodes \"ip-10-10-183-57.aepl.com\" not found\n","stream":"stderr","time":"2021-10-21T05:42:01.317415496Z"}
{"log":"E1021 05:42:18.314653       1 node.go:161] Failed to retrieve node info: nodes \"ip-10-10-183-57.aepl.com\" not found\n","stream":"stderr","time":"2021-10-21T05:42:18.317728691Z"}
{"log":"I1021 05:42:18.314698       1 server_others.go:442] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag\n","stream":"stderr","time":"2021-10-21T05:42:18.317794174Z"}

{"level":"info","ts":"2021-10-21T05:41:48.246Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.1  ..."}
{"level":"info","ts":"2021-10-21T05:41:48.247Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2021-10-21T05:43:11.269Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.1  ..."}
{"level":"info","ts":"2021-10-21T05:43:11.270Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2021-10-21T05:43:11.277Z","caller":"aws-k8s-agent/main.go:42","msg":"Successful communication with the Cluster! Cluster Version is: v1.21+. git version: v1.21.2-eks-0389ca3. git tree state: clean. commit: 8a4e27b9d88142bbdd21b997b532eb6d493df6d2. platform: linux/amd64"}

One option is to set "bindAddress" in the kube-proxy config to 127.0.0.1. This fixes the address family as IPv4, so kube-proxy won't wait to get the node IP, but it will break for IPv6. Ref: #1078 (comment) and https://gist.github.com/M00nF1sh/84d380b4e08017a5bc958658f7010914. We are working on including this in the default kube-proxy manifest.
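
For reference, here is a minimal sketch of what the bindAddress change might look like in the kube-proxy configuration. The ConfigMap name kube-proxy-config, the kube-system namespace, and the data key config are assumptions based on the usual EKS layout and the --config=/var/lib/kube-proxy-config/config path; check them against your cluster. Only the bindAddress line is the actual change, and this is a fragment, not a complete config.

```yaml
# Sketch only: ConfigMap name, namespace, and data key are assumed from the
# usual EKS kube-proxy setup; verify them against your cluster before editing.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy-config
  namespace: kube-system
data:
  config: |-
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    # Pins the address family to IPv4 so kube-proxy does not wait for the
    # node IP. Not suitable for IPv6 clusters.
    bindAddress: 127.0.0.1
    # ... keep the remaining fields of the existing config unchanged ...
```

Editing the existing ConfigMap in place and changing only bindAddress is likely safer than applying a fragment like this, since applying it as written would drop the other fields.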

jayanthvn · 2021-10-26T04:49:46Z

@xtroncode - Did the above workaround work for you?

xtroncode · 2021-10-26T05:27:18Z

Hi @jayanthvn, I haven't been able to try it out yet. Will try it out today and let you know. Thanks.

xtroncode · 2021-10-27T06:36:47Z

Hi @jayanthvn, I tried setting bindAddress to 127.0.0.1 in the kube-proxy-config configmap and it seems to work: when a new node is added, the startup time is much lower. But the configmap is being reset by EKS after a while. Is there a way to make the change persistent?

jayanthvn · 2021-10-27T16:26:03Z

Hi @xtroncode, I assume you are using the kube-proxy managed add-on? We are working on making it part of the default manifest.

xtroncode · 2021-10-28T04:16:34Z

OK, thanks. Is there any workaround for this until it is added as the default?

jayanthvn · 2021-11-02T18:11:33Z

@xtroncode - the workaround for now is to include --hostname-override=$(NODE_NAME) in the kube-proxy command:

spec:
  template:
    spec:
      containers:
        - name: kube-proxy
          command:
            - kube-proxy
            - --hostname-override=$(NODE_NAME)
            - --v=2
            - --config=/var/lib/kube-proxy-config/config
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName

jayanthvn · 2021-11-02T18:11:58Z

I will close this issue for now; please reach out if you need any more information.

ChrisRamsayITV · 2021-11-25T12:30:22Z

Hi @jayanthvn - Is there any update on when this will be included as default? Thanks

jayanthvn · 2021-11-26T04:49:42Z

@ChrisRamsayITV - I will check with the team and get back to you next week.

ettiee · 2021-12-16T10:17:32Z

Hi @jayanthvn, is there any update on when --hostname-override=$(NODE_NAME) will be included as the default in the kube-proxy managed add-on?

The EKS addon for kube-proxy introduced regressions of #124 and #209. We will apply the recommended overrides from aws/containers-roadmap#657 and aws/amazon-vpc-cni-k8s#1702 in the manifests until the EKS addon applies these by default or allows you to override config in the addon.

Baptistee-B · 2022-03-01T08:45:01Z

@jayanthvn Any news on that? Are there any topics we can vote on to push the fix? :)

jayanthvn · 2022-03-14T12:01:49Z

The default kube-proxy manifest for 1.22 will include this change. The release calendar can be found here: https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html

xtroncode added the bug label Oct 21, 2021

jayanthvn closed this as completed Nov 2, 2021

ettiee mentioned this issue Dec 16, 2021

Possible regression of #209 in 1.20 cookpad/terraform-aws-eks#285

Closed
