Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving startup time #1315

Closed
mkulke opened this issue Aug 8, 2023 · 9 comments · Fixed by #1360
Closed

Improving startup time #1315

mkulke opened this issue Aug 8, 2023 · 9 comments · Fixed by #1360
Assignees

Comments

@mkulke
Copy link
Collaborator

mkulke commented Aug 8, 2023

This is a boot chart from systemd-analyze on ubuntu 22.04 in azure. It shows that we spend ~15s on initialization tasks, before the images are being pulled.

A relatively low hanging fruit would probably be to remove the dependency on cloud-init. We provision files via the cloud-config protocol to the podvm's at the moment. We could replace that with some bespoke logic in the unit files directly and agent-protocol-forwarder would be able to start earlier.

systemd

@bpradipt
Copy link
Member

bpradipt commented Aug 8, 2023

Something like a systemd unit reading the user-data over network and putting the info in the right place ?

@mkulke
Copy link
Collaborator Author

mkulke commented Aug 8, 2023

Something like a systemd unit reading the user-data over network and putting the info in the right place ?

yup

@bpradipt
Copy link
Member

bpradipt commented Aug 9, 2023

I'll give it a try.

@bpradipt bpradipt self-assigned this Aug 9, 2023
@bpradipt
Copy link
Member

With userdata, it takes approx 1 sec from VM creation to agent proxy being ready.

2023/08/24 14:39:54 [adaptor/cloud/azure] created VM successfully: /subscriptions/89a9ba4b-8b66-446a-813d-a9ed3a129e3d/resourceGroups/aks-rg/providers/Microsoft.Compute/virtualMachines/podvm-test-d0453224
2023/08/24 14:39:54 [adaptor/cloud/azure] podNodeIP[0]=10.224.0.5
2023/08/24 14:39:54 [adaptor/cloud] failed to create PeerPod: the server could not find the requested resource (post peerPods.confidentialcontainers.org)
2023/08/24 14:39:54 [adaptor/cloud] created an instance podvm-test-d0453224 for sandbox d04532242afc04b033c13594a93acd1808a2e4f62bdbdd446fe86dbba54bc161
2023/08/24 14:39:54 [tunneler/vxlan] vxlan ppvxlan1 (remote 10.224.0.5:8472, id: 555009) created at /proc/1/task/12/ns/net
2023/08/24 14:39:54 [tunneler/vxlan] vxlan ppvxlan1 created at /proc/1/task/12/ns/net
2023/08/24 14:39:54 [tunneler/vxlan] vxlan ppvxlan1 is moved to /var/run/netns/cni-60e4bf8f-fb9c-0bc7-d3f1-73a325c0608b
2023/08/24 14:39:54 [tunneler/vxlan] Add tc redirect filters between eth0 and vxlan1 on pod network namespace /var/run/netns/cni-60e4bf8f-fb9c-0bc7-d3f1-73a325c0608b
2023/08/24 14:39:54 [adaptor/proxy] Listening on /run/peerpod/pods/d04532242afc04b033c13594a93acd1808a2e4f62bdbdd446fe86dbba54bc161/agent.ttrpc
2023/08/24 14:39:54 [adaptor/proxy] failed to init cri client, the err: cri runtime endpoint is not specified, it is used to get the image name from image digest
2023/08/24 14:39:54 [adaptor/proxy] Trying to establish agent proxy connection to 10.224.0.5:15150
2023/08/24 14:39:55 [adaptor/proxy] established agent proxy connection to 10.224.0.5:15150
2023/08/24 14:39:55 [adaptor/cloud] agent proxy is ready

Without userdata (existing mechanism), it takes approx 7 sec from VM creation to agent proxy being ready.

2023/08/24 14:56:46 [adaptor/cloud/azure] created VM successfully: /subscriptions/89a9ba4b-8b66-446a-813d-a9ed3a129e3d/resourceGroups/aks-rg/providers/Microsoft.Compute/virtualMachines/podvm-test-28aadd09
2023/08/24 14:56:46 [adaptor/cloud/azure] podNodeIP[0]=10.224.0.5
2023/08/24 14:56:46 [adaptor/cloud] failed to create PeerPod: the server could not find the requested resource (post peerPods.confidentialcontainers.org)
2023/08/24 14:56:46 [adaptor/cloud] created an instance podvm-test-28aadd09 for sandbox 28aadd091c517d840dbcea1491dedf637a33d29306026a8a1cf01b021307b65f
2023/08/24 14:56:46 [tunneler/vxlan] vxlan ppvxlan1 (remote 10.224.0.5:8472, id: 555000) created at /proc/1/task/14/ns/net
2023/08/24 14:56:46 [tunneler/vxlan] vxlan ppvxlan1 created at /proc/1/task/14/ns/net
2023/08/24 14:56:46 [tunneler/vxlan] vxlan ppvxlan1 is moved to /var/run/netns/cni-f62cb9c2-7fba-4dd3-1d11-eb7a5871b093
2023/08/24 14:56:46 [tunneler/vxlan] Add tc redirect filters between eth0 and vxlan1 on pod network namespace /var/run/netns/cni-f62cb9c2-7fba-4dd3-1d11-eb7a5871b093
2023/08/24 14:56:46 [adaptor/proxy] Listening on /run/peerpod/pods/28aadd091c517d840dbcea1491dedf637a33d29306026a8a1cf01b021307b65f/agent.ttrpc
2023/08/24 14:56:46 [adaptor/proxy] failed to init cri client, the err: cri runtime endpoint is not specified, it is used to get the image name from image digest
2023/08/24 14:56:46 [adaptor/proxy] Trying to establish agent proxy connection to 10.224.0.5:15150
2023/08/24 14:56:53 [adaptor/proxy] established agent proxy connection to 10.224.0.5:15150
2023/08/24 14:56:53 [adaptor/cloud] agent proxy is ready

@bpradipt
Copy link
Member

@mkulke the above data is only for one iteration. I didn't try with different instance types.
I'll raise a PR (most likely) tomorrow after some cleanups.

@mkulke
Copy link
Collaborator Author

mkulke commented Sep 7, 2023

Some numbers that might help to identify further opportunities for optimization:

VM launched (+0s):

2023/09/07 18:45:50 [adaptor/cloud/azure] CreateInstance: name: "podvm-busybox-caa-5f99bd66c7-tcd92-ed00a45a"

/stat being created on podvm (+32s):

stat -c %z /proc/
2023-09-07 18:46:22.609353200 +0000

waagent starts on podvm (+39s):

head -1 /var/log/waagent.log
2023-09-07T18:46:29.390118Z INFO Daemon Azure Linux Agent Version:2.2.46

VM reported as ready (+46s):

2023/09/07 18:46:36 [adaptor/cloud/azure] created VM successfully: /subscriptions/REDACTED/resourceGroups/mgns/providers/Microsoft.Compute/virtualMachines/podvm-busybox-caa-5f99bd66c7-tcd92-ed00a45a

Sandbox is created (+47s):

2023/09/07 18:46:37 [adaptor/proxy] CreateSandbox: hostname:busybox-caa-5f99bd66c7-tcd92 sandboxId:ed00a45a13d810c283cc9ab6f9e629644705782382837a60c5914fc9db05ec11

So atm we get into container business after ~45s, the first signs of life from the VM are at ~30s. I suspect that if we report the VM as ok earlier (replacing waagent and setting goal-state manually) we might get closer to the 30s mark.

@mkulke mkulke reopened this Sep 11, 2023
@mkulke
Copy link
Collaborator Author

mkulke commented Sep 11, 2023

There is some improvement, but we can possibly do more, hence re-opening

@mkulke
Copy link
Collaborator Author

mkulke commented Oct 9, 2023

At least on Azure there is a problem that the Create-VM API call will only return once the VM has registered itself as ready. We might want to reconsider this approach.

Today an Azure SNP CVM will be provisioned in about 30s until entering the kernel. This is something that we have to account for unless we look at optimizations beyond the PodVM Image.

An average boot process to multi-user.target we can consider to take around 15s. In the PodVM image, a VM can report itself as ready only relatively late in the boot process, since it needs networking to be up.

So for kata-agent to start working we will need to wait 45s really.

A possible optimization in CAA could be to not wait for the Create-VM call to return, i.e. we want to only know whether the VM creation has been successfully triggered, we're not interested necessarily in the "Readyness" of the VM. We will know this from a healthy POD or otherwise. So we could start facilitating connections even if the VM is not ready.

We need to measure things however before implementing. I suspect that user-data is populated asynchronous, and we'll have to wait for user-data to be available anyway which exceeds 45s anyway.

@mkulke
Copy link
Collaborator Author

mkulke commented Dec 11, 2024

From measurements it looks like we optimized what we could so far #1674 removed some 10s

@mkulke mkulke closed this as completed Dec 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants