What happened:
I have an EKS cluster running the latest CNI plugin (v1.10.2) and I am using the latest version of Karpenter to scale nodes out dynamically. That works great, but I am seeing slowness when a new node is added that I can't explain; I am trying to understand it and, if possible, improve it. The following are my observations, and I have found these times to be fairly consistent over multiple tests.
Our core cluster nodes support six pods of our application. When I add a new pod and it gets scheduled on an existing node, the pod is ready and available in ~40s.
Adding two pods, so that Karpenter has to create a new node, results in a t3.medium.

Without linkerd-proxy:
- node is ready in ~80-90s
- pods sit in ContainerCreating for ~3m, waiting on the CNI
- pods are ready and available at ~3m45s

With linkerd-proxy:
- node is ready in ~80-90s
- the init container executes almost immediately
- pods sit in PodInitializing for ~3m10s (which is when the proxy becomes ready)
- pods are fully ready in ~4m

Adding eight new pods creates a t3.2xlarge.

With linkerd-proxy:
- node is ready in ~80-90s
- the aws-node pod takes ~30s to start
- init containers execute at ~2m20s
- pods sit in PodInitializing for ~6m40s (which is when the proxy becomes ready)
- pods are fully ready in ~7m30s
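One way to capture these timings (a rough sketch, not the exact commands from my runs; <new-node> and <app-pod> are placeholders for the node Karpenter created and one of the application pods):

# When the new node was created and when it became Ready:
kubectl get node <new-node> \
  -o jsonpath='{.metadata.creationTimestamp}{"  "}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'

# When the pod became Ready, and the events it was waiting on:
kubectl get pod <app-pod> \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'
kubectl describe pod <app-pod> | grep -A 10 Events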
During this time the pods report that they are waiting on the CNI to become available, but the CNI only takes ~30s after the node is ready. That also doesn't explain why it takes so much longer on a larger node than on a smaller one. I suspect #1943 might be related, but I don't know when that will make it into a release.
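To see when the CNI actually came up on the new node, something like this works (assuming the default aws-node DaemonSet labels in kube-system; the pod name is a placeholder):

# Find the aws-node pod scheduled on the new node:
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide
# Its startup log lines are timestamped, so the CNI-ready point is easy to spot:
kubectl -n kube-system logs <aws-node-pod> --timestamps | head -n 20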
Attach logs
{"level":"info","ts":"2022-04-07T13:48:29.853Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-04-07T13:48:29.856Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-04-07T13:48:29.871Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-04-07T13:48:29.873Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
I0407 13:48:30.965474 12 request.go:621] Throttling request took 1.043511864s, request: GET:https://172.20.0.1:443/apis/ui.cattle.io/v1?timeout=32s
{"level":"info","ts":"2022-04-07T13:48:31.885Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-04-07T13:48:33.894Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-04-07T13:48:35.903Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-04-07T13:48:35.931Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2022-04-07T13:48:35.935Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2022-04-07T13:48:35.936Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}
I tried to SSH into the pod and run the script, but it wasn't present.
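Roughly what I tried (the pod name is a placeholder; the log collector script path is assumed from the troubleshooting docs and may only exist on the node itself, not inside the pod):

# Look for the script where the host CNI directory is mounted in the aws-node container:
kubectl -n kube-system exec <aws-node-pod> -- ls /host/opt/cni/bin/
# On the node itself (via SSH/SSM) it would normally be run as:
# sudo bash /opt/cni/bin/aws-cni-support.sh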
What you expected to happen:
I expect larger nodes to become available more quickly, and in a time more consistent with smaller nodes.
How to reproduce it (as minimally and precisely as possible):
Use Karpenter to create a small node (t3.small) and note the times, then have it create a large node (t3.xlarge) and note the difference in how long it takes for your pods to become available.
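A minimal way to drive this (assuming a Deployment named my-app with an app=my-app label, sized so the extra replicas no longer fit on the existing nodes; both names are placeholders):

# Scale past existing capacity so Karpenter has to provision a node:
kubectl scale deployment my-app --replicas=8
# Watch node and pod readiness to note the timings:
kubectl get nodes -w
kubectl get pods -l app=my-app -w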
Anything else we need to know?:
Environment: Amazon EKS
- Kubernetes version (kubectl version): 1.21
- OS (cat /etc/os-release): linux
- Kernel (uname -a): 5.4.181-99.354.amzn2.x86_64
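For completeness, the commands behind those values:

kubectl version
cat /etc/os-release
uname -a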