
Timeout and reconcile when checking API server connectivity #1943

Merged · 1 commit · Apr 1, 2022

Conversation

@prateekgogia (Contributor) commented Mar 30, 2022

What type of PR is this?
Bug

Which issue does this PR fix:
Some of the instances take ~2 minutes to become ready, while others can connect in <30 seconds.

What does this PR do / Why do we need it:
In the current implementation, the request to the API server hangs without timing out if kube-proxy starts after the aws-node pod. The container gets restarted after ~90-120 seconds, and the client then reconnects to the API server to check connectivity.
With this change, the kube client times out after 1 second if it fails to connect, and retries after 1 second. If it is never able to connect to the API server, the container is restarted as it is today.

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:

Testing done on this change:

I have tested this change on my cluster by adding ~20 nodes with this fix and without this fix.

The following graph shows the time a given node stays in not ready state.

Test with CNI image amazon-k8s-cni:v1.10.2: some of the instances were not ready in the API server for ~2 minutes.
(Screenshot: Screen Shot 2022-03-30 at 10 39 15 AM)

Test image with this fix: all the nodes were ready in <30 seconds.
(Screenshot: Screen Shot 2022-03-30 at 10 40 06 AM)

Automation added to e2e:

No

Will this PR introduce any new dependencies?:

No

Will this break upgrades or downgrades? Has updating a running cluster been tested?:

No

Does this change require updates to the CNI daemonset config files to work?:

No

Does this PR introduce any user-facing change?:

No


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@prateekgogia prateekgogia requested a review from a team as a code owner March 30, 2022 15:53
@prateekgogia prateekgogia force-pushed the add_timeout branch 2 times, most recently from 391cf1e to df8707a Compare March 31, 2022 15:58
@jayanthvn jayanthvn requested a review from M00nF1sh March 31, 2022 21:00

```go
log.Infof("Testing communication with server")
version, err := clientSet.Discovery().ServerVersion()
restCfg.Timeout = 1 * time.Second
```
@M00nF1sh (Contributor) commented Mar 31, 2022:
Can we change this to 5 seconds to be safe? With some slow-response clusters, 1 second might be too aggressive.

Changing to 5 seconds shouldn't impact the effect of this optimization much.

@prateekgogia (Contributor, Author):

How about a 2-second timeout? If an API server takes more than 2 seconds to respond to a version query, it's too slow and we want to know why and fix the issue.

@M00nF1sh:

apiserver will easily take more than 2 seconds when scaling 10,000 pods, 600 nodes

@prateekgogia (Contributor, Author):

The version call shouldn't be impacted by load on the server.
If that's happening, we will root-cause and try to fix it; feel free to cut a support ticket and we can help debug.

@M00nF1sh:

ok, didn't read the actual code, just shouting random comments 👍

@M00nF1sh (Contributor) left a review:

/lgtm
