Improving startup time #1315

mkulke · 2023-08-08T11:32:02Z

This is a boot chart from systemd-analyze on ubuntu 22.04 in azure. It shows that we spend ~15s on initialization tasks, before the images are being pulled.

A relatively low hanging fruit would probably be to remove the dependency on cloud-init. We provision files via the cloud-config protocol to the podvm's at the moment. We could replace that with some bespoke logic in the unit files directly and agent-protocol-forwarder would be able to start earlier.

The text was updated successfully, but these errors were encountered:

bpradipt · 2023-08-08T15:04:09Z

Something like a systemd unit reading the user-data over network and putting the info in the right place ?

mkulke · 2023-08-08T15:24:36Z

Something like a systemd unit reading the user-data over network and putting the info in the right place ?

yup

bpradipt · 2023-08-09T03:50:29Z

I'll give it a try.

bpradipt · 2023-08-24T15:00:34Z

With userdata, it takes approx 1 sec from VM creation to agent proxy being ready.

2023/08/24 14:39:54 [adaptor/cloud/azure] created VM successfully: /subscriptions/89a9ba4b-8b66-446a-813d-a9ed3a129e3d/resourceGroups/aks-rg/providers/Microsoft.Compute/virtualMachines/podvm-test-d0453224
2023/08/24 14:39:54 [adaptor/cloud/azure] podNodeIP[0]=10.224.0.5
2023/08/24 14:39:54 [adaptor/cloud] failed to create PeerPod: the server could not find the requested resource (post peerPods.confidentialcontainers.org)
2023/08/24 14:39:54 [adaptor/cloud] created an instance podvm-test-d0453224 for sandbox d04532242afc04b033c13594a93acd1808a2e4f62bdbdd446fe86dbba54bc161
2023/08/24 14:39:54 [tunneler/vxlan] vxlan ppvxlan1 (remote 10.224.0.5:8472, id: 555009) created at /proc/1/task/12/ns/net
2023/08/24 14:39:54 [tunneler/vxlan] vxlan ppvxlan1 created at /proc/1/task/12/ns/net
2023/08/24 14:39:54 [tunneler/vxlan] vxlan ppvxlan1 is moved to /var/run/netns/cni-60e4bf8f-fb9c-0bc7-d3f1-73a325c0608b
2023/08/24 14:39:54 [tunneler/vxlan] Add tc redirect filters between eth0 and vxlan1 on pod network namespace /var/run/netns/cni-60e4bf8f-fb9c-0bc7-d3f1-73a325c0608b
2023/08/24 14:39:54 [adaptor/proxy] Listening on /run/peerpod/pods/d04532242afc04b033c13594a93acd1808a2e4f62bdbdd446fe86dbba54bc161/agent.ttrpc
2023/08/24 14:39:54 [adaptor/proxy] failed to init cri client, the err: cri runtime endpoint is not specified, it is used to get the image name from image digest
2023/08/24 14:39:54 [adaptor/proxy] Trying to establish agent proxy connection to 10.224.0.5:15150
2023/08/24 14:39:55 [adaptor/proxy] established agent proxy connection to 10.224.0.5:15150
2023/08/24 14:39:55 [adaptor/cloud] agent proxy is ready

Without userdata (existing mechanism), it takes approx 7 sec from VM creation to agent proxy being ready.

2023/08/24 14:56:46 [adaptor/cloud/azure] created VM successfully: /subscriptions/89a9ba4b-8b66-446a-813d-a9ed3a129e3d/resourceGroups/aks-rg/providers/Microsoft.Compute/virtualMachines/podvm-test-28aadd09
2023/08/24 14:56:46 [adaptor/cloud/azure] podNodeIP[0]=10.224.0.5
2023/08/24 14:56:46 [adaptor/cloud] failed to create PeerPod: the server could not find the requested resource (post peerPods.confidentialcontainers.org)
2023/08/24 14:56:46 [adaptor/cloud] created an instance podvm-test-28aadd09 for sandbox 28aadd091c517d840dbcea1491dedf637a33d29306026a8a1cf01b021307b65f
2023/08/24 14:56:46 [tunneler/vxlan] vxlan ppvxlan1 (remote 10.224.0.5:8472, id: 555000) created at /proc/1/task/14/ns/net
2023/08/24 14:56:46 [tunneler/vxlan] vxlan ppvxlan1 created at /proc/1/task/14/ns/net
2023/08/24 14:56:46 [tunneler/vxlan] vxlan ppvxlan1 is moved to /var/run/netns/cni-f62cb9c2-7fba-4dd3-1d11-eb7a5871b093
2023/08/24 14:56:46 [tunneler/vxlan] Add tc redirect filters between eth0 and vxlan1 on pod network namespace /var/run/netns/cni-f62cb9c2-7fba-4dd3-1d11-eb7a5871b093
2023/08/24 14:56:46 [adaptor/proxy] Listening on /run/peerpod/pods/28aadd091c517d840dbcea1491dedf637a33d29306026a8a1cf01b021307b65f/agent.ttrpc
2023/08/24 14:56:46 [adaptor/proxy] failed to init cri client, the err: cri runtime endpoint is not specified, it is used to get the image name from image digest
2023/08/24 14:56:46 [adaptor/proxy] Trying to establish agent proxy connection to 10.224.0.5:15150
2023/08/24 14:56:53 [adaptor/proxy] established agent proxy connection to 10.224.0.5:15150
2023/08/24 14:56:53 [adaptor/cloud] agent proxy is ready

bpradipt · 2023-08-24T15:04:02Z

@mkulke the above data is only for one iteration. I didn't try with different instance types.
I'll raise a PR (most likely) tomorrow after some cleanups.

mkulke · 2023-09-07T19:15:59Z

Some numbers that might help to identify further opportunities for optimization:

VM launched (+0s):

2023/09/07 18:45:50 [adaptor/cloud/azure] CreateInstance: name: "podvm-busybox-caa-5f99bd66c7-tcd92-ed00a45a"

/stat being created on podvm (+32s):

stat -c %z /proc/
2023-09-07 18:46:22.609353200 +0000

waagent starts on podvm (+39s):

head -1 /var/log/waagent.log
2023-09-07T18:46:29.390118Z INFO Daemon Azure Linux Agent Version:2.2.46

VM reported as ready (+46s):

2023/09/07 18:46:36 [adaptor/cloud/azure] created VM successfully: /subscriptions/REDACTED/resourceGroups/mgns/providers/Microsoft.Compute/virtualMachines/podvm-busybox-caa-5f99bd66c7-tcd92-ed00a45a

Sandbox is created (+47s):

2023/09/07 18:46:37 [adaptor/proxy] CreateSandbox: hostname:busybox-caa-5f99bd66c7-tcd92 sandboxId:ed00a45a13d810c283cc9ab6f9e629644705782382837a60c5914fc9db05ec11

So atm we get into container business after ~45s, the first signs of life from the VM are at ~30s. I suspect that if we report the VM as ok earlier (replacing waagent and setting goal-state manually) we might get closer to the 30s mark.

mkulke · 2023-09-11T07:14:31Z

There is some improvement, but we can possibly do more, hence re-opening

mkulke · 2023-10-09T16:35:27Z

At least on Azure there is a problem that the Create-VM API call will only return once the VM has registered itself as ready. We might want to reconsider this approach.

Today an Azure SNP CVM will be provisioned in about 30s until entering the kernel. This is something that we have to account for unless we look at optimizations beyond the PodVM Image.

An average boot process to multi-user.target we can consider to take around 15s. In the PodVM image, a VM can report itself as ready only relatively late in the boot process, since it needs networking to be up.

So for kata-agent to start working we will need to wait 45s really.

A possible optimization in CAA could be to not wait for the Create-VM call to return, i.e. we want to only know whether the VM creation has been successfully triggered, we're not interested necessarily in the "Readyness" of the VM. We will know this from a healthy POD or otherwise. So we could start facilitating connections even if the VM is not ready.

We need to measure things however before implementing. I suspect that user-data is populated asynchronous, and we'll have to wait for user-data to be available anyway which exceeds 45s anyway.

mkulke · 2024-12-11T15:06:26Z

From measurements it looks like we optimized what we could so far #1674 removed some 10s

bpradipt mentioned this issue Aug 9, 2023

Investigate usage of pre-created instance pool for faster pod creation time #1317

Open

bpradipt self-assigned this Aug 9, 2023

bpradipt mentioned this issue Aug 25, 2023

Add support for userData in Azure #1360

Merged

bpradipt closed this as completed in #1360 Sep 10, 2023

mkulke reopened this Sep 11, 2023

mkulke closed this as completed Dec 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving startup time #1315

Improving startup time #1315

mkulke commented Aug 8, 2023

bpradipt commented Aug 8, 2023

mkulke commented Aug 8, 2023

bpradipt commented Aug 9, 2023

bpradipt commented Aug 24, 2023

bpradipt commented Aug 24, 2023

mkulke commented Sep 7, 2023

mkulke commented Sep 11, 2023 •

edited

Loading

mkulke commented Oct 9, 2023

mkulke commented Dec 11, 2024

Improving startup time #1315

Improving startup time #1315

Comments

mkulke commented Aug 8, 2023

bpradipt commented Aug 8, 2023

mkulke commented Aug 8, 2023

bpradipt commented Aug 9, 2023

bpradipt commented Aug 24, 2023

bpradipt commented Aug 24, 2023

mkulke commented Sep 7, 2023

mkulke commented Sep 11, 2023 • edited Loading

mkulke commented Oct 9, 2023

mkulke commented Dec 11, 2024

mkulke commented Sep 11, 2023 •

edited

Loading