Reduce startup latency by removing some unneeded sleeps #2104
Conversation
LGTM, thanks for the explanation and cleanup
Looks like
It will be because of the go1.19 version in the workflows.
The intention was to add an additional 1 sec between retries for the grpc probe. Ideally, a retry should only happen if there is a delay of some sort during bootup, and we've seen issues especially under load.
Not opposed to keeping the 1 sec sleep on the grpc probe, but I would like to understand what it's protecting. I'm not sure how the sleep would help under load, since it's only executed on startup and the service isn't accessible yet. There is an implicit one sec pause when the connection is attempted, which seems sufficient.
During a node bootup or restart, the CNI is one of the initial pods to come up, so load won't affect it, but a restart on a running node that is under resource stress can contribute to delays; that's one of the reasons we had to increase the liveness/readiness (exec) probe timeouts to 10s. Now, this is a forever loop waiting on a particular condition during startup, and this sleep I guess helps make the probe less aggressive, with the downside of up to 1s of additional startup latency. I mean
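For illustration, a minimal Go sketch of the wait-loop shape being discussed; `waitForIpamd`, the address, and the condition details are hypothetical stand-ins, and the per-attempt pacing comes from the 1 sec dial timeout rather than an explicit sleep:

```go
// Hypothetical sketch of the startup wait loop discussed above: poll the
// ipamd gRPC health endpoint until it reports SERVING, letting the 1 sec
// dial timeout pace the retries instead of an extra time.Sleep.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func waitForIpamd(addr string) {
	for {
		// WithBlock makes DialContext wait until the connection is up or the
		// 1 sec deadline expires -- the "implicit one sec pause" per attempt.
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		conn, err := grpc.DialContext(ctx, addr, grpc.WithInsecure(), grpc.WithBlock())
		cancel()
		if err != nil {
			continue // ipamd not reachable yet; the dial timeout already paced us
		}
		resp, err := healthpb.NewHealthClient(conn).Check(
			context.Background(), &healthpb.HealthCheckRequest{})
		conn.Close()
		if err == nil && resp.Status == healthpb.HealthCheckResponse_SERVING {
			return
		}
	}
}

func main() {
	waitForIpamd("127.0.0.1:50051") // hypothetical ipamd address
	log.Println("ipamd is serving")
}
```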
Yeah, I tried patching your change and running make format. Seems like a workflow issue.
We should be good to merge; need to look into the unit test workflow.
What type of PR is this?
cleanup
Which issue does this PR fix:
N/A
What does this PR do / Why do we need it:
This PR reduces the latency of starting up the VPC CNI by 2-4 seconds, which results in a K8s node reaching the Ready state faster.
When checking if kube-apiserver is accessible, the `wait.PollInfinite(2*time.Second, func()...)` func was used, which sleeps before executing the condition func. Since kube-apiserver is generally accessible when the CNI starts, this sleep was delaying an otherwise quick, successful request. I switched the wait func to `wait.PollImmediateInfinite` instead, which executes the condition func first and only sleeps if it is unsuccessful.

There was also a sleep when checking if the ipamd daemon had started, but this sleep was unnecessary since there is already a 1 sec connection timeout in the grpc-health-probe.
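For illustration, a minimal sketch of the difference between the two wait funcs, using `k8s.io/apimachinery/pkg/util/wait`; the `apiServerReachable` condition is a hypothetical stand-in for the real check:

```go
// Minimal sketch of the polling change; apiServerReachable is a hypothetical
// stand-in for the real kube-apiserver accessibility check.
package main

import (
	"net"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func apiServerReachable() (bool, error) {
	conn, err := net.DialTimeout("tcp", "127.0.0.1:6443", time.Second)
	if err != nil {
		return false, nil // not ready yet; keep polling
	}
	conn.Close()
	return true, nil
}

func main() {
	// Before: sleeps 2s first, then checks -- always costs 2s even when the
	// API server is already up.
	// _ = wait.PollInfinite(2*time.Second, apiServerReachable)

	// After: checks immediately, and only sleeps 2s between failed attempts.
	_ = wait.PollImmediateInfinite(2*time.Second, apiServerReachable)
}
```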
If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
Before removing the sleeps:
After removing the sleeps:
Graph with latency reduction:
Testing done on this change:
See above
Automation added to e2e:
N/A
Will this PR introduce any new dependencies?:
Nope
Will this break upgrades or downgrades? Has updating a running cluster been tested?:
N/A
Does this change require updates to the CNI daemonset config files to work?:
No
Does this PR introduce any user-facing change?:
No
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.