CNI takes longer on bigger nodes #1956

Closed
mmclane opened this issue Apr 7, 2022 · 4 comments

mmclane commented Apr 7, 2022

What happened:

I have an EKS cluster running the latest CNI plugin (v1.10.2), and I am using the latest version of Karpenter to scale nodes out dynamically. That works great, but I am seeing some slowness when a new node is added that I can't explain and am trying to understand and, if possible, improve. The following are my observations; I have found these times to be fairly consistent over multiple tests.

Core cluster nodes support 6 pods of our application. When I add a new pod to the cluster and it gets scheduled on an existing node, the pod is ready and available in ~40s.

  • Adding two pods, so that Karpenter creates a new node, results in a t3.medium.
    • Without linkerd-proxy
      • node is ready in ~80-90s
      • pods sit at ContainerCreating for ~3m, waiting on the CNI
      • pods are ready and available at ~3m45s
    • With linkerd-proxy
      • node is ready in ~80-90s
      • init container executes almost immediately
      • pods sit at PodInitializing for ~3m10s (when the proxy becomes ready)
      • pods fully ready in ~4m
  • Adding eight new pods creates a t3.2xlarge.
    • With linkerd-proxy
      • node is ready in ~80-90s
      • aws-node pod takes ~30s to start
      • init containers execute at ~2m20s
      • pods sit at PodInitializing for ~6m40s (when the proxy becomes ready)
      • pods fully ready in ~7m30s
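
Roughly, timings like these can be read off the pod status transitions; something along these lines (the namespace below is a placeholder for my setup) watches them live and extracts them after the fact:

# Watch pods move through Pending -> ContainerCreating / PodInitializing -> Running in real time.
kubectl -n my-app get pods -o wide -w

# Afterwards, compare each pod's creation time against its Ready condition transition.
kubectl -n my-app get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\t"}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}'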

During this time the pods report that they are waiting on the CNI to be available, but the CNI only takes ~30s to come up after the node is ready. That also doesn't explain why it takes so much longer for a larger node than for a smaller one. I suspect #1943 might be related, but I don't know when that will make it into a release.
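
To see where the time goes on the new node, something like the following can help; pod events often show whether sandbox creation keeps retrying while ipamd still has no IPs to hand out (the stuck pod and aws-node pod names below are placeholders):

# Events on a stuck pod and recent cluster events around CNI/sandbox setup.
kubectl describe pod <stuck-pod-name> | sed -n '/Events:/,$p'
kubectl get events -A --sort-by=.lastTimestamp | grep -iE 'sandbox|cni'

# ipamd logs from the aws-node pod on the new node.
kubectl -n kube-system logs aws-node-xxxxx -c aws-node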

Attach logs

{"level":"info","ts":"2022-04-07T13:48:29.853Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-04-07T13:48:29.856Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-04-07T13:48:29.871Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-04-07T13:48:29.873Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
I0407 13:48:30.965474 12 request.go:621] Throttling request took 1.043511864s, request: GET:https://172.20.0.1:443/apis/ui.cattle.io/v1?timeout=32s
{"level":"info","ts":"2022-04-07T13:48:31.885Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-04-07T13:48:33.894Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-04-07T13:48:35.903Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-04-07T13:48:35.931Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2022-04-07T13:48:35.935Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2022-04-07T13:48:35.936Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

I tried to ssh into the pod and run the script, but it wasn't present.
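
For what it's worth, my understanding is that ipamd exposes a local introspection endpoint inside the aws-node pod, and that the log-collection script lives on the node itself rather than in the pod; the port and path below are from memory, so treat them as assumptions:

# Query ipamd's local introspection endpoint from the aws-node pod
# (port 61679 is my assumption; requires curl to be present in the image).
kubectl -n kube-system exec aws-node-xxxxx -c aws-node -- curl -s http://localhost:61679/v1/enis

# Run the log-collection script on the node itself (path is my assumption), e.g. over SSM or SSH to the instance:
sudo bash /opt/cni/bin/aws-cni-support.sh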

What you expected to happen:

I expect larger nodes to become available about as quickly as smaller nodes, and more consistently with them.

How to reproduce it (as minimally and precisely as possible):

  • Use Karpenter to create a small node (t3.small) and note the times, then have it create a large node (t3.xlarge) and note the difference in how long it takes for your pods to become available. A sketch of the reproduction follows below.
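
A rough sketch, assuming a Karpenter provisioner is already configured and using a placeholder deployment name:

# 1. Scale up just past current capacity so Karpenter provisions a small node; note the times.
kubectl scale deployment my-app --replicas=2
kubectl get nodes -L node.kubernetes.io/instance-type -w

# 2. Scale to zero, then request many pods at once so Karpenter provisions a large node.
kubectl scale deployment my-app --replicas=0
kubectl scale deployment my-app --replicas=8
kubectl get pods -o wide -w   # compare how long pods take to become Ready on each node size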

Anything else we need to know?:

Environment: Amazon EKS

  • Kubernetes version (use kubectl version): 1.21
  • CNI Version: v1.10.2
  • OS (e.g: cat /etc/os-release): linux
  • Kernel (e.g. uname -a): 5.4.181-99.354.amzn2.x86_64
jayanthvn (Contributor) commented

We will repro locally and check why the delay is happening.


mmclane commented Apr 13, 2022

Please let me know if there is anything I can do to help.


mmclane commented Apr 19, 2022

This doesn't seem to have anything to do with the CNI.

mmclane closed this as completed Apr 19, 2022
github-actions (bot) commented

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
