Timeout and reconcile when checking API server connectivity #1943
pkg/k8sapi/k8sutils.go
log.Infof("Testing communication with server")
version, err := clientSet.Discovery().ServerVersion()
restCfg.Timeout = 1 * time.Second
Can we change this to 5 seconds to be safe?
On clusters with slow responses, 1 second might be too aggressive.
Changing to 5 seconds shouldn't reduce the benefit of this optimization much.
How about a 2-second timeout? If an API server takes more than 2 seconds to respond to a version query, it's too slow, and we want to know why and fix the issue.
The apiserver will easily take more than 2 seconds when scaling to 10,000 pods and 600 nodes.
The version call shouldn't be impacted by load on the server.
If that is happening, we will root-cause it and try to fix it. Feel free to cut a support ticket and we can help debug.
ok, didn't read the actual code, just shouting random comments 👍
/lgtm
What type of PR is this?
Bug
Which issue does this PR fix:
Some instances take ~2 minutes to become ready, while others can connect in <30 seconds.
What does this PR do / Why do we need it:
In the current implementation, a request to the API server hangs if kube-proxy starts after the aws-node pod; the request never times out. The container gets restarted after ~90-120 seconds, and the client then reconnects to the API server to check connectivity.
As part of this change, the kube client times out after 1 second if it fails to connect and tries to reconnect after 1 second. If it is never able to connect to the API server, the container will be restarted as it is today.
If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
Testing done on this change:
I have tested this change on my cluster by adding ~20 nodes, both with and without this fix.
The following graph shows the time a given node stays in not ready state.
The test with CNI image amazon-k8s-cni:v1.10.2 shows that some of the instances were not ready for ~2 minutes in the API server.
The test image with this fix shows that all the nodes were ready in <30 seconds.
Automation added to e2e:
No
Will this PR introduce any new dependencies?:
No
Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No
Does this change require updates to the CNI daemonset config files to work?:
No
Does this PR introduce any user-facing change?:
No
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.