
optionally allow CLUSTER_ENDPOINT to be used rather than the cluster-ip #2138

Merged — 3 commits merged into aws:master on Nov 16, 2022

Conversation

@bwagner5 (Contributor) commented Nov 11, 2022

What type of PR is this?

  • feature

Which issue does this PR fix:
awslabs/amazon-eks-ami#1099

What does this PR do / Why do we need it:

When a node is starting, the kube-proxy and VPC CNI DaemonSets get scheduled and started on the node at roughly the same time. This causes a race in the VPC CNI, since it currently depends on kube-proxy finishing the setup of cluster-ip routes on the node. If the VPC CNI runs its k8s API server check before kube-proxy is done, the attempt waits out the current 5-second timeout, delaying the node from getting to a Ready state.
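For illustration, here is a minimal sketch (not the CNI's actual code; the endpoint, interval, and timeout values are made up, and TLS verification is skipped only to keep it self-contained) of the kind of startup probe involved, showing why each attempt made before kube-proxy installs the service routes costs the full 5-second request timeout:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Probe the API server /version endpoint through the kubernetes service
	// cluster IP. The real check goes through the Kubernetes client with the
	// in-cluster CA and credentials; verification is skipped here for brevity.
	client := &http.Client{
		Timeout:   5 * time.Second, // each failed attempt blocks for up to 5s
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	err := wait.PollImmediate(2*time.Second, time.Minute, func() (bool, error) {
		resp, err := client.Get("https://10.100.0.1:443/version")
		if err != nil {
			return false, nil // cluster-ip route not ready yet; retry
		}
		resp.Body.Close()
		return true, nil
	})
	if err != nil {
		fmt.Println("API server never became reachable:", err)
	}
}
```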

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:

Testing done on this change:

Installed with and without setting env.CLUSTER_ENDPOINT:

helm upgrade --install aws-node-ce -n kube-system charts/aws-vpc-cni/ --set image.override=xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/vpc-cni:ce --set init.image.override=xxxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/vpc-cni:init-ce --set env.CLUSTER_ENDPOINT=https://xxxxx.xxx.us-west-2.eks.amazonaws.com

This log shows a race where kube-proxy wasn't finished when aws-node reached out to the api-server through the cluster ip, resulting in a delay in initialization:

| Event                         | Timestamp | t (s) |
|-------------------------------|-----------|-------|
| kube-proxy Started            | 07:30:54  | 0     |
| AWS Node Container Starts     | 07:30:55  | 1     |
| VPC CNI Init Container Starts | 07:30:55  | 1     |
| VPC CNI Plugin Initialized    | 07:31:05  | 10    |
{"level":"info","ts":"2022-11-11T07:30:56.231Z","caller":"logger/logger.go:52","msg":"Constructed new logger instance"}
{"level":"info","ts":"2022-11-11T07:30:56.235Z","caller":"eniconfig/eniconfig.go:61","msg":"Initialized new logger as an existing instance was not found"}
{"level":"info","ts":"2022-11-11T07:30:56.321Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-11-11T07:30:56.329Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}
{"level":"error","ts":"2022-11-11T07:31:01.329Z","caller":"wait/wait.go:211","msg":"Unable to reach API Server, Get \"https://10.100.0.1:443/version?timeout=5s\": context deadline exceeded"}
{"level":"info","ts":"2022-11-11T07:31:03.351Z","caller":"wait/wait.go:211","msg":"Successful communication with the Cluster! Cluster Version is: v1.22+. git version: v1.22.15-eks-fb459a0. git tree state: clean. commit: be82fa628e60d024275efaa239bfe53a9119c2d9. platform: linux/amd64"}

This PR adds an optional CLUSTER_ENDPOINT environment variable that can be passed to allow aws-node to initialize without waiting on kube-proxy. kube-proxy is still required before aws-node finishes initializing the CNI plugin.

This log shows the components starting at the same time: kube-proxy has not completed, but aws-node can continue bootstrapping using the CLUSTER_ENDPOINT:

| Event                         | Timestamp | t (s) |
|-------------------------------|-----------|-------|
| AWS Node Container Starts     | 07:40:39  | 0     |
| kube-proxy Started            | 07:40:39  | 0     |
| VPC CNI Init Container Starts | 07:40:39  | 0     |
| VPC CNI Plugin Initialized    | 07:40:41  | 2     |
{"level":"info","ts":"2022-11-11T07:40:39.878Z","caller":"logger/logger.go:52","msg":"Constructed new logger instance"}
{"level":"info","ts":"2022-11-11T07:40:39.881Z","caller":"eniconfig/eniconfig.go:61","msg":"Initialized new logger as an existing instance was not found"}
{"level":"info","ts":"2022-11-11T07:40:39.974Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-11-11T07:40:39.982Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}
{"level":"info","ts":"2022-11-11T07:40:40.014Z","caller":"wait/wait.go:211","msg":"Successful communication with the Cluster! Cluster Version is: v1.22+. git version: v1.22.15-eks-fb459a0. git tree state: clean. commit: be82fa628e60d024275efaa239bfe53a9119c2d9. platform: linux/amd64"}

...

{"level":"info","ts":"2022-11-11T07:40:41.407Z","caller":"wait/wait.go:211","msg":"Successful communication with the Cluster! Cluster Version is: v1.22+. git version: v1.22.15-eks-fb459a0. git tree state: clean. commit: be82fa628e60d024275e
faa239bfe53a9119c2d9. platform: linux/amd64"}

The first api-server connectivity check uses the CLUSTER_ENDPOINT passed in, and the second one uses the cluster ip to make sure that kube-proxy is finished. The second check is what blocks the CNI plugin from initializing until kube-proxy is done.
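As a minimal sketch (not the code in this PR; the helper name is made up), the endpoint selection just described looks roughly like this — the first check may use CLUSTER_ENDPOINT, while the second always targets the in-cluster service address so kube-proxy readiness is still verified before the CNI is initialized:

```go
package main

import (
	"fmt"
	"os"
)

// selectEndpoints is a hypothetical helper returning the endpoints used by
// the two connectivity checks described above.
func selectEndpoints() (firstCheck, secondCheck string) {
	// In-cluster service address; reaching it requires kube-proxy's routes.
	clusterIP := fmt.Sprintf("https://%s:%s",
		os.Getenv("KUBERNETES_SERVICE_HOST"), os.Getenv("KUBERNETES_SERVICE_PORT"))

	// First check: prefer CLUSTER_ENDPOINT, if set, so IPAMD can start
	// before kube-proxy has programmed the cluster-ip routes.
	firstCheck = clusterIP
	if ep := os.Getenv("CLUSTER_ENDPOINT"); ep != "" {
		firstCheck = ep
	}

	// Second check: always the cluster IP, so the CNI plugin is only
	// initialized once kube-proxy has finished.
	secondCheck = clusterIP
	return firstCheck, secondCheck
}

func main() {
	first, second := selectEndpoints()
	fmt.Println("first connectivity check against:", first)
	fmt.Println("second connectivity check against:", second)
}
```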

Automation added to e2e:

Will this PR introduce any new dependencies?:

NO

Will this break upgrades or downgrades? Has updating a running cluster been tested?:

Have not tested upgrading a cluster, but this change only affects the endpoint being used.

Does this change require updates to the CNI daemonset config files to work?:

No

Does this PR introduce any user-facing change?:

CLUSTER_ENDPOINT can now be specified to allow the VPC CNI to initialize before kube-proxy has finished setting up cluster IP routes.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@achevuru (Contributor)

@bwagner5 We were exploring removing the API Server connectivity check dependency from the CNI altogether. We just do it once during boot up, and it is not the CNI's job to set up and/or validate the worker node's connectivity to the API Server anyway... So, we can just remove that check altogether.

@bwagner5 (Contributor, Author)

> @bwagner5 We were exploring removing the API Server connectivity check dependency from the CNI altogether. We just do it once during boot up, and it is not the CNI's job to set up and/or validate the worker node's connectivity to the API Server anyway... So, we can just remove that check altogether.

That would be nice, although it doesn't solve the latency problem, since something within the VPC CNI will still try to connect to the apiserver and then likely time out if kube-proxy isn't done yet. AFAIK nothing else guarantees that kube-proxy is done setting up routes though, right? So pods could schedule and not have connectivity, which could result in weird crashing behavior or higher-latency pod start times, like the VPC CNI runs into.

@achevuru (Contributor)

The VPC CNI only checks for API Server connectivity during the initial startup phase, so once we remove it we shouldn't run into any additional init latency due to this race condition. The CNI itself runs a controller when it is operating in custom networking mode, but at that point it is like any other controller running on worker nodes.

As far as other application pods running into issues (because kube-proxy for some reason takes a long time to set up Kubernetes service endpoints), those pods should retry just like the CNI does currently. I guess I'm trying to say the CNI isn't responsible for validating API Server connectivity; that should be a kube-proxy issue. For example, if API server connectivity breaks post CNI startup, these pods will run into the same issue. Currently, we only check it once during init and don't do periodic checks.

Also, the pods will always rely on the service endpoints to reach the API Server, and if we're bypassing that in the CNI by providing the cluster endpoint to improve the startup time of the VPC CNI, then we're hiding the kube-proxy problem from the VPC CNI, but other application pods will now run into the same issue while trying to connect to the API Server, as you outlined above.

@bwagner5 (Contributor, Author) commented Nov 11, 2022

> The VPC CNI only checks for API Server connectivity during the initial startup phase, so once we remove it we shouldn't run into any additional init latency due to this race condition.

Even with the API Server connectivity check removed, I believe the K8s cached client would still hang on synchronization, blocking the rest of the initialization flow:

cache.WaitForCacheSync(stopChan)

I agree that it's no worse than what happens today with the connectivity check, but it still adds unneeded latency when initializing the node.
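A minimal sketch (not the CNI's actual initialization code) of why the cached client still blocks: cache.WaitForCacheSync cannot return until the informer's initial list/watch against the API server succeeds, and with an in-cluster config that still depends on kube-proxy's cluster-ip routes:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// The in-cluster config resolves to the kubernetes service cluster IP,
	// so reaching the API server still requires kube-proxy's routes.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	stopCh := make(chan struct{})
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()
	factory.Start(stopCh)

	// Blocks until the informer's initial list/watch completes; if the
	// cluster-ip routes are not in place yet, initialization stalls here.
	if ok := cache.WaitForCacheSync(stopCh, podInformer.HasSynced); !ok {
		fmt.Println("cache sync did not complete")
	}
}
```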

@achevuru (Contributor)

Yeah, I can see the problem there. But I do remember someone from the NW team tested without the API Server connectivity check to make sure the init goes fine when Kubernetes service endpoints are missing. I'll try to check what was tested...

But the kube-proxy startup latency issue should've been mitigated to a large extent from K8s v1.22 onwards. We identified one of the reasons contributing to this behavior, and it was addressed in v1.22+. I'll share the internal tickets with you. Beyond that, the only other scenario where we saw the race condition emerge is when the cluster has a huge number of services (100+), which can result in kube-proxy taking additional time during startup.

bwagner5 marked this pull request as ready for review on November 15, 2022 23:17
bwagner5 requested a review from a team as a code owner on November 15, 2022 23:17
achevuru previously approved these changes on Nov 16, 2022
README.md Outdated
@@ -410,6 +410,17 @@ Specifies the cluster name to tag allocated ENIs with. See the "Cluster Name tag

---

#### `CLUSTER_ENDPOINT`
Reviewer comment (Contributor):

Nit: Let's mention the release version, CLUSTER_ENDPOINT (v1.12.1+)

bwagner5 (Author) replied:

updated!

@jayanthvn (Contributor) left a comment:

just a minor nit, lgtm.

jayanthvn merged commit 1aac6d0 into aws:master on Nov 16, 2022
jayanthvn added this to the v1.12.1 milestone on Nov 23, 2022
jdn5126 added a commit that referenced this pull request on Dec 12, 2022:
* create publisher with logger (#2119)

* Add missing rules when NodePort support is disabled (#2026)

* Add missing rules when NodePort support is disabled

* the rules that need to be installed for NodePort support and SNAT
  support are very similar. The same traffic mark is needed for both. As
  a result, rules that are currently installed only when NodePort
  support is enabled should also be installed when external SNAT is
  disabled, which is the case by default.
* remove "-m state --state NEW" from a rule in the nat table. This is
  always true for packets that traverse the nat table.
* fix typo in one rule's name (extra whitespace).

Fixes #2025

Co-authored-by: Quan Tian <qtian@vmware.com>

Signed-off-by: Antonin Bas <abas@vmware.com>

* Fix typos and unit tests

Signed-off-by: Antonin Bas <abas@vmware.com>

* Minor improvement to code comment

Signed-off-by: Antonin Bas <abas@vmware.com>

* Address review comments

* Delete legacy nat rule
* Fix an unrelated log message

Signed-off-by: Antonin Bas <abas@vmware.com>

Signed-off-by: Antonin Bas <abas@vmware.com>
Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>
Co-authored-by: Sushmitha Ravikumar <58063229+sushrk@users.noreply.github.com>

* downgrade test go.mod to align with root go.mod (#2128)

* skip addon installation when addon info is not available (#2131)

* Merging test/Makefile and test/go.mod to the root Makefile and go.mod, adjust the .github/workflows and integration test instructions (#2129)

* update troubleshooting docs for CNI image (#2132)

fix location where make command is run

* fix env name in test script (#2136)

* optionally allow CLUSTER_ENDPOINT to be used rather than the cluster-ip (#2138)

* optionally allow CLUSTER_ENDPOINT to be used rather than the kubernetes cluster ip

* remove check for kube-proxy

* add version to readme

* Add resources config option to cni metrics helper (#2141)

* Add resources config option to cni metrics helper

* Remove default-empty resources block; replace with conditional

* Add metrics for ec2 api calls made by CNI and expose via prometheus (#2142)

Co-authored-by: Jay Deokar <jsdeokar@amazon.com>

* increase workflow role duration to 4 hours (#2148)

* Update golang 1.19.2 EKS-D (#2147)

* Update golang

* Move to EKS distro builds

* [HELM]: Move CRD resources to a separate folder as per helm standard (#2144)

Co-authored-by: Jay Deokar <jsdeokar@amazon.com>

* VPC-CNI minimal image builds (#2146)

* VPC-CNI minimal image builds

* update dependencies for ginkgo when running integration tests

* address review comments and break up init main function

* review comments for sysctl

* Simplify binary installation, fix review comments

Since init container is required to always run, let binary installation
for external plugins happen in init container. This simplifies the main
container entrypoint and the dockerfile for each image.

* when IPAMD connection fails, try to teardown pod network using prevResult (#2145)

* add env var to enable nftables (#2155)

* fix failing weekly cron tests (#2154)

* Deprecate AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER and remove no-op setter (#2153)

* Deprecate AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER

* update release version comments

Signed-off-by: Antonin Bas <abas@vmware.com>
Co-authored-by: Jeffrey Nelson <jdnelson@amazon.com>
Co-authored-by: Antonin Bas <antonin.bas@gmail.com>
Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>
Co-authored-by: Sushmitha Ravikumar <58063229+sushrk@users.noreply.github.com>
Co-authored-by: Jerry He <37866862+jerryhe1999@users.noreply.github.com>
Co-authored-by: Brandon Wagner <wagnerbm@amazon.com>
Co-authored-by: Jonathan Ogilvie <679297+jcogilvie@users.noreply.github.com>
Co-authored-by: Jay Deokar <jsdeokar@amazon.com>