CNI takes longer on bigger nodes #1956

Closed
mmclane opened this issue Apr 7, 2022 · 4 comments

mmclane commented Apr 7, 2022

What happened:

I have an EKS cluster running the latest CNI plugin (v1.10.2), and I am using the latest version of Karpenter to scale nodes out dynamically. That works great, but I am seeing some slowness when a new node is added that I can't explain and am trying to understand and, if possible, improve. The following are my observations; I have found these times to be fairly consistent over multiple tests.

Core cluster nodes support 6 pods of our application. When I add a new pod to the cluster and it gets scheduled on an existing node, the pod is ready and available in ~40s.

  • Adding two pods, so that Karpenter creates a new node, results in a t3.medium.
    • Without linkerd-proxy
      • node is ready in ~80-90s
      • pods sit at ContainerCreating for ~3m, waiting on the CNI
      • pods are ready and available at ~3m45s
    • With linkerd-proxy
      • node is ready in ~80-90s
      • init container executes almost immediately
      • pods sit at PodInitializing for ~3m10s (when the proxy becomes ready)
      • pods fully ready in ~4m
  • Adding eight new pods creates a t3.2xlarge.
    • With linkerd-proxy
      • node is ready in ~80-90s
      • aws-node pod takes ~30s to start
      • init containers execute at ~2m20s
      • pods sit at PodInitializing for ~6m40s (when the proxy becomes ready)
      • pods fully ready in ~7m30s
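
Roughly, timings like these can be read off the pod status transitions; something along these lines (the namespace below is a placeholder for my setup) watches them live and extracts them after the fact:

# Watch pods move through Pending -> ContainerCreating / PodInitializing -> Running in real time.
kubectl -n my-app get pods -o wide -w

# Afterwards, compare each pod's creation time against its Ready condition transition.
kubectl -n my-app get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\t"}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}'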

During this time the pods report that they are waiting on the CNI to be available, but the CNI only takes ~30s to come up after the node is ready. That also doesn't explain why it takes so much longer for a larger node than for a smaller one. I suspect #1943 might be related, but I don't know when that will make it into a release.
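
To see where the time goes on the new node, something like the following can help; pod events often show whether sandbox creation keeps retrying while ipamd still has no IPs to hand out (the stuck pod and aws-node pod names below are placeholders):

# Events on a stuck pod and recent cluster events around CNI/sandbox setup.
kubectl describe pod <stuck-pod-name> | sed -n '/Events:/,$p'
kubectl get events -A --sort-by=.lastTimestamp | grep -iE 'sandbox|cni'

# ipamd logs from the aws-node pod on the new node.
kubectl -n kube-system logs aws-node-xxxxx -c aws-node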

Attach logs

{"level":"info","ts":"2022-04-07T13:48:29.853Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-04-07T13:48:29.856Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-04-07T13:48:29.871Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-04-07T13:48:29.873Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
I0407 13:48:30.965474 12 request.go:621] Throttling request took 1.043511864s, request: GET:https://172.20.0.1:443/apis/ui.cattle.io/v1?timeout=32s
{"level":"info","ts":"2022-04-07T13:48:31.885Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-04-07T13:48:33.894Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-04-07T13:48:35.903Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-04-07T13:48:35.931Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2022-04-07T13:48:35.935Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2022-04-07T13:48:35.936Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

I tried to ssh into the pod and run the script, but it wasn't present.
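
For what it's worth, my understanding is that ipamd exposes a local introspection endpoint inside the aws-node pod, and that the log-collection script lives on the node itself rather than in the pod; the port and path below are from memory, so treat them as assumptions:

# Query ipamd's local introspection endpoint from the aws-node pod
# (port 61679 is my assumption; requires curl to be present in the image).
kubectl -n kube-system exec aws-node-xxxxx -c aws-node -- curl -s http://localhost:61679/v1/enis

# Run the log-collection script on the node itself (path is my assumption), e.g. over SSM or SSH to the instance:
sudo bash /opt/cni/bin/aws-cni-support.sh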

What you expected to happen:

I expect larger nodes to become available about as quickly as smaller nodes, and more consistently with them.

How to reproduce it (as minimally and precisely as possible):

  • Use Karpenter to create a small node (t3.small) and note the times, then have it create a large node (t3.xlarge) and note the difference in how long it takes for your pods to become available. A sketch of the reproduction follows below.
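
A rough sketch, assuming a Karpenter provisioner is already configured and using a placeholder deployment name:

# 1. Scale up just past current capacity so Karpenter provisions a small node; note the times.
kubectl scale deployment my-app --replicas=2
kubectl get nodes -L node.kubernetes.io/instance-type -w

# 2. Scale to zero, then request many pods at once so Karpenter provisions a large node.
kubectl scale deployment my-app --replicas=0
kubectl scale deployment my-app --replicas=8
kubectl get pods -o wide -w   # compare how long pods take to become Ready on each node size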

Anything else we need to know?:

Environment: Amazon EKS

  • Kubernetes version (use kubectl version): 1.21
  • CNI Version: v1.10.2
  • OS (e.g: cat /etc/os-release): linux
  • Kernel (e.g. uname -a): 5.4.181-99.354.amzn2.x86_64
jayanthvn (Contributor) commented

We will repro locally and check why the delay is happening.


mmclane commented Apr 13, 2022

Please let me know if there is anything I can do to help.


mmclane commented Apr 19, 2022

This doesn't seem to have anything to do with the CNI.

mmclane closed this as completed Apr 19, 2022
github-actions (bot) commented

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
