Reduce startup latency by removing some unneeded sleeps #2104
Conversation
LGTM, thanks for the explanation and cleanup
Looks like
It will be because of the go1.19 version in the workflows.
The intention was to add an additional 1 sec between retries for the grpc probe. Ideally, a retry should only happen if there is a delay of some sort during bootup, and we've seen issues especially under load.
Not opposed to keeping the 1 sec sleep on the grpc probe, but I would like to understand what it's protecting. I'm not sure how the sleep would help under load, since it's only executed on startup and the service isn't accessible yet. There is an implicit one sec pause when the connection is attempted, which seems sufficient.
During a node bootup or restart, the CNI is one of the initial pods to come up, so load won't affect it, but a restart on a running node that is under resource stress can contribute to delays; that's one of the reasons we had to increase the liveness/readiness (exec) probe timeouts to 10s. Now, this is a forever loop waiting on a particular condition during startup, and this sleep I guess helps make the probe less aggressive, with the downside of up to 1s of additional startup latency. I mean
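For illustration, a minimal Go sketch of the wait-loop shape being discussed; `waitForIpamd`, the address, and the condition details are hypothetical stand-ins, and the per-attempt pacing comes from the 1 sec dial timeout rather than an explicit sleep:

```go
// Hypothetical sketch of the startup wait loop discussed above: poll the
// ipamd gRPC health endpoint until it reports SERVING, letting the 1 sec
// dial timeout pace the retries instead of an extra time.Sleep.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func waitForIpamd(addr string) {
	for {
		// WithBlock makes DialContext wait until the connection is up or the
		// 1 sec deadline expires -- the "implicit one sec pause" per attempt.
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		conn, err := grpc.DialContext(ctx, addr, grpc.WithInsecure(), grpc.WithBlock())
		cancel()
		if err != nil {
			continue // ipamd not reachable yet; the dial timeout already paced us
		}
		resp, err := healthpb.NewHealthClient(conn).Check(
			context.Background(), &healthpb.HealthCheckRequest{})
		conn.Close()
		if err == nil && resp.Status == healthpb.HealthCheckResponse_SERVING {
			return
		}
	}
}

func main() {
	waitForIpamd("127.0.0.1:50051") // hypothetical ipamd address
	log.Println("ipamd is serving")
}
```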
Yeah, I tried patching your change and running make format. Seems like a workflow issue.
We should be good to merge; need to look into the unit test workflow.
What type of PR is this?
cleanup
Which issue does this PR fix:
N/A
What does this PR do / Why do we need it:
This PR reduces the latency of starting up the VPC CNI by 2-4 seconds, which results in a K8s node reaching the Ready state faster.
When checking if kube-apiserver is accessible, the `wait.PollInfinite(2*time.Second, func()...)` func was used, which sleeps before executing the condition func. Since kube-apiserver is generally accessible when the CNI starts, this sleep was delaying an otherwise quick, successful request. I switched the wait func to `wait.PollImmediateInfinite` instead, which executes the condition func first and only sleeps if it is unsuccessful.

There was also a sleep when checking if the ipamd daemon had started, but this sleep was unnecessary since there is already a 1 sec connection timeout in the grpc-health-probe.
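For illustration, a minimal sketch of the difference between the two wait funcs, using `k8s.io/apimachinery/pkg/util/wait`; the `apiServerReachable` condition is a hypothetical stand-in for the real check:

```go
// Minimal sketch of the polling change; apiServerReachable is a hypothetical
// stand-in for the real kube-apiserver accessibility check.
package main

import (
	"net"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func apiServerReachable() (bool, error) {
	conn, err := net.DialTimeout("tcp", "127.0.0.1:6443", time.Second)
	if err != nil {
		return false, nil // not ready yet; keep polling
	}
	conn.Close()
	return true, nil
}

func main() {
	// Before: sleeps 2s first, then checks -- always costs 2s even when the
	// API server is already up.
	// _ = wait.PollInfinite(2*time.Second, apiServerReachable)

	// After: checks immediately, and only sleeps 2s between failed attempts.
	_ = wait.PollImmediateInfinite(2*time.Second, apiServerReachable)
}
```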
If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
Before removing the sleeps:
After removing the sleeps:
Graph with latency reduction:
Testing done on this change:
See above
Automation added to e2e:
N/A
Will this PR introduce any new dependencies?:
Nope
Will this break upgrades or downgrades? Has updating a running cluster been tested?:
N/A
Does this change require updates to the CNI daemonset config files to work?:
No
Does this PR introduce any user-facing change?:
No
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.