
Timeout and reconcile when checking API server connectivity #1943

Merged · 1 commit · Apr 1, 2022

Conversation

@prateekgogia (Contributor) commented Mar 30, 2022

What type of PR is this?
Bug

Which issue does this PR fix:
Some of the instances take ~2 minutes to become ready, while others can connect in <30 seconds.

What does this PR do / Why do we need it:
In the current implementation, the request to the API server hangs without timing out if kube-proxy starts after the aws-node pod. The container gets restarted after ~90-120 seconds, and the client then reconnects to the API server to check connectivity.
With this change, the kube client times out after 1 second if it fails to connect, and retries after 1 second. If it is never able to connect to the API server, the container is restarted as it is today.

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:

Testing done on this change:

I have tested this change on my cluster by adding ~20 nodes with this fix and without this fix.

The following graph shows the time a given node stays in not ready state.

Test with CNI image amazon-k8s-cni:v1.10.2: some of the instances were not ready in the API server for ~2 minutes.
(Screenshot: Screen Shot 2022-03-30 at 10 39 15 AM)

Test image with this fix: all the nodes were ready in <30 seconds.
(Screenshot: Screen Shot 2022-03-30 at 10 40 06 AM)

Automation added to e2e:

No

Will this PR introduce any new dependencies?:

No

Will this break upgrades or downgrades? Has updating a running cluster been tested?:

No

Does this change require updates to the CNI daemonset config files to work?:

No

Does this PR introduce any user-facing change?:

No


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@prateekgogia prateekgogia requested a review from a team as a code owner March 30, 2022 15:53
@prateekgogia prateekgogia force-pushed the add_timeout branch 2 times, most recently from 391cf1e to df8707a Compare March 31, 2022 15:58
@jayanthvn jayanthvn requested a review from M00nF1sh March 31, 2022 21:00

```go
log.Infof("Testing communication with server")
version, err := clientSet.Discovery().ServerVersion()
restCfg.Timeout = 1 * time.Second
```
@M00nF1sh (Contributor) commented Mar 31, 2022:
Can we change this to 5 seconds to be safe? With some slow-response clusters, 1 second might be too aggressive.

Changing to 5 seconds shouldn't impact the effect of this optimization much.

@prateekgogia (Contributor, Author):

How about a 2-second timeout? If an API server takes more than 2 seconds to respond to a version query, it's too slow and we want to know why and fix the issue.

@M00nF1sh:

apiserver will easily take more than 2 seconds when scaling 10,000 pods, 600 nodes

@prateekgogia (Contributor, Author):

The version call shouldn't be impacted by load on the server.
If that's happening, we will root-cause and try to fix it; feel free to cut a support ticket and we can help debug.

@M00nF1sh:

ok, didn't read the actual code, just shouting random comments 👍

@M00nF1sh (Contributor) left a review:

/lgtm
