aws-node pod does not start correctly the first time #1702

Closed
xtroncode opened this issue Oct 21, 2021 · 16 comments

@xtroncode

What happened:

When a new node starts in a nodegroup, it takes a long time to be marked Ready because the aws-node (CNI) pod does not start correctly the first time and has to go through 1-2 restarts. The restarts are also delayed because the initial delay for the liveness probe is set to 60 seconds. If we increase the failure threshold for the liveness probe, the aws-node pod is still not marked ready after 7-8 minutes (possibly longer; it does not seem to run at all, as the readiness probes never succeed).

What you expected to happen:
We expect the nodes to be ready in under a minute

How to reproduce it (as minimally and precisely as possible):

We are seeing this on a newly created EKS cluster, so it should be easily reproducible.

Environment:

  • Kubernetes version (use kubectl version):
    Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

  • CNI Version : v1.9.1-eksbuild.1

  • OS (e.g: cat /etc/os-release): Amazon Linux 2

  • Kernel (e.g. uname -a): 5.4.149-73.259.amzn2.x86_64

@xtroncode xtroncode added the bug label Oct 21, 2021
@jayanthvn
Contributor

Can you please share the node logs? You can collect them by running this script: sudo bash /opt/cni/bin/aws-cni-support.sh

@xtroncode
Author

Hi,
Please find the requested logs attached.

eks_i-0ce4f4f2ac445bafc_2021-10-21_0854-UTC_0.6.2.tar.gz

@backjo
Contributor

backjo commented Oct 21, 2021

I'd suggest reading https://medium.com/keikoproj/rapid-auto-scaling-on-eks-part-1-bb4de84fc599 - it might be that kube-proxy hasn't started yet by the time the CNI tries to start, in which case it can't connect to the control plane.

@jayanthvn
Contributor

Yes, as @backjo mentioned, kube-proxy is taking time here. aws-node started successfully at 2021-10-21T05:43:11.277Z:

{"log":"E1021 05:41:44.902444       1 node.go:161] Failed to retrieve node info: nodes \"ip-10-10-183-57.aepl.com\" not found\n","stream":"stderr","time":"2021-10-21T05:41:44.903022279Z"}
{"log":"E1021 05:41:46.029638       1 node.go:161] Failed to retrieve node info: nodes \"ip-10-10-183-57.aepl.com\" not found\n","stream":"stderr","time":"2021-10-21T05:41:46.029767316Z"}
{"log":"E1021 05:41:48.118122       1 node.go:161] Failed to retrieve node info: nodes \"ip-10-10-183-57.aepl.com\" not found\n","stream":"stderr","time":"2021-10-21T05:41:48.118244512Z"}
{"log":"E1021 05:41:52.582685       1 node.go:161] Failed to retrieve node info: nodes \"ip-10-10-183-57.aepl.com\" not found\n","stream":"stderr","time":"2021-10-21T05:41:52.582958565Z"}
{"log":"E1021 05:42:01.317247       1 node.go:161] Failed to retrieve node info: nodes \"ip-10-10-183-57.aepl.com\" not found\n","stream":"stderr","time":"2021-10-21T05:42:01.317415496Z"}
{"log":"E1021 05:42:18.314653       1 node.go:161] Failed to retrieve node info: nodes \"ip-10-10-183-57.aepl.com\" not found\n","stream":"stderr","time":"2021-10-21T05:42:18.317728691Z"}
{"log":"I1021 05:42:18.314698       1 server_others.go:442] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag\n","stream":"stderr","time":"2021-10-21T05:42:18.317794174Z"}
{"level":"info","ts":"2021-10-21T05:41:48.246Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.1  ..."}
{"level":"info","ts":"2021-10-21T05:41:48.247Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2021-10-21T05:43:11.269Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.1  ..."}
{"level":"info","ts":"2021-10-21T05:43:11.270Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2021-10-21T05:43:11.277Z","caller":"aws-k8s-agent/main.go:42","msg":"Successful communication with the Cluster! Cluster Version is: v1.21+. git version: v1.21.2-eks-0389ca3. git tree state: clean. commit: 8a4e27b9d88142bbdd21b997b532eb6d493df6d2. platform: linux/amd64"}

One option is to set bindAddress (the --bind-address flag) to 127.0.0.1. This fixes the address family as IPv4, so kube-proxy won't wait to get the node IP, but it will break for IPv6. Ref: #1078 (comment) and https://gist.github.com/M00nF1sh/84d380b4e08017a5bc958658f7010914. We are working on including this in the default kube-proxy manifest.
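
For readers who want to see where that setting would live, here is a minimal sketch of the kube-proxy-config ConfigMap with the bind address changed. It is an illustration, not the exact manifest shipped by EKS. It assumes the ConfigMap sits in the kube-system namespace and stores the kube-proxy configuration under a key named config (consistent with the --config=/var/lib/kube-proxy-config/config path used by the kube-proxy container). All other fields of the existing configuration are omitted here and should be left as they are.

```yaml
# Sketch only: kube-proxy-config with bindAddress set to the IPv4 loopback.
# Assumptions: namespace kube-system, data key "config".
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy-config
  namespace: kube-system   # assumption, adjust if your cluster differs
data:
  config: |-
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    # IPv4 only: this workaround is not suitable for IPv6 clusters.
    bindAddress: 127.0.0.1
    # ... keep every other existing field unchanged ...
```

New nodes pick up the value when their kube-proxy pod starts; already-running kube-proxy pods would need a restart to read it.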

@jayanthvn
Contributor

@xtroncode - Did the above workaround work for you?

@xtroncode
Author

Hi @jayanthvn, I haven't been able to try it out yet. Will try it out today and let you know. Thanks.

@xtroncode
Author

Hi @jayanthvn, I tried setting bindAddress to 127.0.0.1 in the kube-proxy-config configmap and it seems to work: when a new node is added, the start time is much lower. But the configmap is reset by EKS after a while. Is there a way to make the change persistent?

@jayanthvn
Contributor

Hi @xtroncode, I assume you are using the kube-proxy managed add-on? We are working on making it part of the default manifest.

@xtroncode
Author

OK, thanks. Is there any workaround for this until it is added as the default?

@jayanthvn
Contributor

@xtroncode - the workaround for now is to add --hostname-override=$(NODE_NAME) to the kube-proxy command:

spec:
  template:
    spec:
      containers:
        - name: kube-proxy
          command:
            - kube-proxy
            - --hostname-override=$(NODE_NAME)
            - --v=2
            - --config=/var/lib/kube-proxy-config/config
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName

@jayanthvn
Contributor

Will close this issue for now, please reach out if you need any more information.

@ChrisRamsayITV

Hi @jayanthvn - Is there any update on when this will be included as default? Thanks

@jayanthvn
Contributor

@ChrisRamsayITV - I will check with the team and get back to you next week.

@ettiee

ettiee commented Dec 16, 2021

Hi @jayanthvn, is there any update on when --hostname-override=$(NODE_NAME) will be included as default in the kube-proxy managed add-on?

ettiee added a commit to cookpad/terraform-aws-eks that referenced this issue Dec 21, 2021
The EKS addon for kube-proxy introduced regressions of #124 and #209.
We will apply the recommended overrides from aws/containers-roadmap#657 and aws/amazon-vpc-cni-k8s#1702 in the manifests until the EKS addon applies these by default or allows you to override config in the addon.
ettiee added a commit to cookpad/terraform-aws-eks that referenced this issue Dec 22, 2021
The EKS addon for kube-proxy introduced regressions of #124 and #209.
We will apply the recommended overrides from aws/containers-roadmap#657 and aws/amazon-vpc-cni-k8s#1702 in the manifests until the EKS addon applies these by default or allows you to override config in the addon.
@Baptistee-B

@jayanthvn Any news on that? Are there any topics we can vote on to push the fix? :)

@jayanthvn
Contributor

The 1.22 default kube-proxy manifest will have this change. The release calendar can be found here - https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html
