Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache node images #11202

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

Payback159
Copy link
Contributor

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test

/kind feature

/kind flake

What this PR does / why we need it:

Reason and discussion can be found in #11092

Which issue(s) this PR fixes:

Fixes #11092

Special notes for your reviewer:

I wanted to publish the first work on the issue. The original idea of caching it in the facts was discussed for good reasons. I am now trying to pull the "node_cache" out of the download loop.

My idea would be to manipulate the downloads list during runtime before processing it in the loop.
First tests look promising but it is not finalized yet, I hope to get early feedback due to the open development! :-)

Does this PR introduce a user-facing change?:

improve caching of node image information to avoid unnecessary multiple accesses to the container runtime.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 16, 2024
@k8s-ci-robot k8s-ci-robot requested review from cyclinder and mzaian May 16, 2024 19:47
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 16, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @Payback159. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 16, 2024
- name: Download | Check if the container image is already present
set_fact:
downloads: "{{ downloads | combine({item.key: item.value | combine({'enabled': false})}) }}"
loop: "{{ downloads | dict2items }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid adding yet another loop.
I agree that "patching" downloads at runtime is a good idea, but using filter in the include_tasks loop below would be better.
If the ansible filters are not up to the tasks, use json_query, it's quite powerful (but can be headache inducing if abused 😆 )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks I didn't want to jump on the loop at first (I was a little put off by the loop 😄).
I have moved the logic to the loop task. It's not a json_query but it should still save us the extra loop.

What do you think?

@Payback159 Payback159 force-pushed the cache-node-images branch from af95c03 to dec395c Compare May 17, 2024 17:55
@Payback159
Copy link
Contributor Author

Payback159 commented May 17, 2024

I have moved the logic to the include_tasks logic.

The first run executes 10 download_container includes (cluster.yml without "image cache" with already provisioned 1 control-plane, 1 node). With a rerun on the same setup with the cache there are then only 3 download_container includes.

I have to take a closer look at these 3 includes, as they were also reloaded again and again with the previous approach (extra loop), I would actually expect 0 includes from download_container after adding the "image cache".

Another open point is the support to pull via sha256, I have to check in advance whether sha256 is defined in the item, but I wanted to concentrate on the feasibility first.

Once I have the sha256 logic, I'll start reviewing the download_container and sub-tasks to see if some variables and rules can be tidied up.

@Payback159 Payback159 force-pushed the cache-node-images branch from dec395c to a81965b Compare May 17, 2024 20:14
@Payback159
Copy link
Contributor Author

I think I have it. The initial cluster.yml downloads 10 containers, the 2nd run downloads 0 containers, because all images are in the cache.

If download_always_pull: true the node_image is registered but the filter logic is canceled by the "or" link and all images are pulled again.

First run

TASK [download : Show node images] *********************************************
ok: [kubespray-control-plane-1] => {
    "msg": ""
}
ok: [kubespray-worker-1] => {
    "msg": ""
--
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'calico_node', 'value': {'enabled': True, 'container': True, 'repo': 'quay.io/calico/node', 'tag': 'v3.27.3', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'calico_cni', 'value': {'enabled': True, 'container': True, 'repo': 'quay.io/calico/cni', 'tag': 'v3.27.3', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'calico_flexvol', 'value': {'enabled': True, 'container': True, 'repo': 'quay.io/calico/pod2daemon-flexvol', 'tag': 'v3.27.3', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'calico_policy', 'value': {'enabled': True, 'container': True, 'repo': 'quay.io/calico/kube-controllers', 'tag': 'v3.27.3', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_file.yml for kubespray-control-plane-1 => (item={'key': 'calico_crds', 'value': {'file': True, 'enabled': True, 'version': 'v3.27.3', 'dest': '/tmp/releases/calico-v3.27.3-kdd-crds/v3.27.3.tar.gz', 'sha256': 'd11a32919bff389f642af5df8180ad3cec586030decd35adb2a7d4a8aa3b298e', 'url': 'https://github.com/projectcalico/calico/archive/v3.27.3.tar.gz', 'unarchive': True, 'unarchive_extra_opts': ['--strip=3', '--wildcards', '*/libcalico-go/config/crd/'], 'owner': 'root', 'mode': '0755', 'groups': ['kube_control_plane']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'pod_infra', 'value': {'enabled': True, 'container': True, 'repo': 'registry.k8s.io/pause', 'tag': '3.9', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'coredns', 'value': {'enabled': True, 'container': True, 'repo': 'registry.k8s.io/coredns/coredns', 'tag': 'v1.11.1', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'nodelocaldns', 'value': {'enabled': True, 'container': True, 'repo': 'registry.k8s.io/dns/k8s-dns-node-cache', 'tag': '1.22.28', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1 => (item={'key': 'dnsautoscaler', 'value': {'enabled': True, 'container': True, 'repo': 'registry.k8s.io/cpa/cluster-proportional-autoscaler', 'tag': 'v1.8.8', 'sha256': '', 'groups': ['kube_control_plane']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1 => (item={'key': 'metrics_server', 'value': {'enabled': True, 'container': True, 'repo': 'registry.k8s.io/metrics-server/metrics-server', 'tag': 'v0.7.0', 'sha256': '', 'groups': ['kube_control_plane']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-worker-1 => (item={'key': 'nginx', 'value': {'enabled': True, 'container': True, 'repo': 'docker.io/library/nginx', 'tag': '1.25.2-alpine', 'sha256': '', 'groups': ['kube_node']}})
Friday 17 May 2024  21:08:44 +0200 (0:00:00.782)       0:02:30.673 ************ 

second run

TASK [download : Show node images] *********************************************
ok: [kubespray-control-plane-1] => {
    "msg": "quay.io/calico/cni:v3.27.3,quay.io/calico/cni:<none>,quay.io/calico/kube-controllers:v3.27.3,quay.io/calico/node:v3.27.3,quay.io/calico/node:<none>,quay.io/calico/pod2daemon-flexvol:v3.27.3,quay.io/calico/pod2daemon-flexvol:<none>,registry.k8s.io/coredns/coredns:v1.11.1,registry.k8s.io/coredns/coredns:<none>,registry.k8s.io/cpa/cluster-proportional-autoscaler:v1.8.8,registry.k8s.io/cpa/cluster-proportional-autoscaler:<none>,registry.k8s.io/dns/k8s-dns-node-cache:1.22.28,registry.k8s.io/dns/k8s-dns-node-cache:<none>,registry.k8s.io/kube-apiserver:v1.29.3,registry.k8s.io/kube-apiserver:<none>,registry.k8s.io/kube-controller-manager:v1.29.3,registry.k8s.io/kube-controller-manager:<none>,registry.k8s.io/kube-proxy:v1.29.3,registry.k8s.io/kube-proxy:<none>,registry.k8s.io/kube-scheduler:v1.29.3,registry.k8s.io/kube-scheduler:<none>,registry.k8s.io/metrics-server/metrics-server:v0.7.0,registry.k8s.io/pause:3.9,registry.k8s.io/pause:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,"
}
ok: [kubespray-worker-1] => {
    "msg": "nginx:1.25.2-alpine,nginx:<none>,quay.io/calico/cni:v3.27.3,quay.io/calico/cni:<none>,quay.io/calico/kube-controllers:v3.27.3,quay.io/calico/kube-controllers:<none>,quay.io/calico/node:v3.27.3,quay.io/calico/node:<none>,quay.io/calico/pod2daemon-flexvol:v3.27.3,quay.io/calico/pod2daemon-flexvol:<none>,registry.k8s.io/coredns/coredns:v1.11.1,registry.k8s.io/coredns/coredns:<none>,registry.k8s.io/dns/k8s-dns-node-cache:1.22.28,registry.k8s.io/dns/k8s-dns-node-cache:<none>,registry.k8s.io/kube-proxy:v1.29.3,registry.k8s.io/kube-proxy:<none>,registry.k8s.io/metrics-server/metrics-server:v0.7.0,registry.k8s.io/metrics-server/metrics-server:<none>,registry.k8s.io/pause:3.9,registry.k8s.io/pause:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,"

#no download_container includes

third run with download_always_pull: true

TASK [download : Show node images] *********************************************
ok: [kubespray-control-plane-1] => {
    "msg": "quay.io/calico/cni:v3.27.3,quay.io/calico/cni:<none>,quay.io/calico/node:v3.27.3,quay.io/calico/node:<none>,quay.io/calico/pod2daemon-flexvol:v3.27.3,quay.io/calico/pod2daemon-flexvol:<none>,registry.k8s.io/coredns/coredns:v1.11.1,registry.k8s.io/coredns/coredns:<none>,registry.k8s.io/cpa/cluster-proportional-autoscaler:v1.8.8,registry.k8s.io/cpa/cluster-proportional-autoscaler:<none>,registry.k8s.io/dns/k8s-dns-node-cache:1.22.28,registry.k8s.io/dns/k8s-dns-node-cache:<none>,registry.k8s.io/kube-apiserver:v1.29.3,registry.k8s.io/kube-apiserver:<none>,registry.k8s.io/kube-controller-manager:v1.29.3,registry.k8s.io/kube-controller-manager:<none>,registry.k8s.io/kube-proxy:v1.29.3,registry.k8s.io/kube-proxy:<none>,registry.k8s.io/kube-scheduler:v1.29.3,registry.k8s.io/kube-scheduler:<none>,registry.k8s.io/pause:3.9,registry.k8s.io/pause:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,"
}
ok: [kubespray-worker-1] => {
    "msg": "nginx:1.25.2-alpine,nginx:<none>,quay.io/calico/cni:v3.27.3,quay.io/calico/cni:<none>,quay.io/calico/kube-controllers:v3.27.3,quay.io/calico/kube-controllers:<none>,quay.io/calico/node:v3.27.3,quay.io/calico/node:<none>,quay.io/calico/pod2daemon-flexvol:v3.27.3,quay.io/calico/pod2daemon-flexvol:<none>,registry.k8s.io/coredns/coredns:v1.11.1,registry.k8s.io/coredns/coredns:<none>,registry.k8s.io/dns/k8s-dns-node-cache:1.22.28,registry.k8s.io/dns/k8s-dns-node-cache:<none>,registry.k8s.io/kube-proxy:v1.29.3,registry.k8s.io/kube-proxy:<none>,registry.k8s.io/metrics-server/metrics-server:v0.7.0,registry.k8s.io/metrics-server/metrics-server:<none>,registry.k8s.io/pause:3.9,registry.k8s.io/pause:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,<none>:<none>,"
--
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'calico_node', 'value': {'enabled': True, 'container': True, 'repo': 'quay.io/calico/node', 'tag': 'v3.27.3', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'calico_cni', 'value': {'enabled': True, 'container': True, 'repo': 'quay.io/calico/cni', 'tag': 'v3.27.3', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'calico_flexvol', 'value': {'enabled': True, 'container': True, 'repo': 'quay.io/calico/pod2daemon-flexvol', 'tag': 'v3.27.3', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'calico_policy', 'value': {'enabled': True, 'container': True, 'repo': 'quay.io/calico/kube-controllers', 'tag': 'v3.27.3', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_file.yml for kubespray-control-plane-1 => (item={'key': 'calico_crds', 'value': {'file': True, 'enabled': True, 'version': 'v3.27.3', 'dest': '/tmp/releases/calico-v3.27.3-kdd-crds/v3.27.3.tar.gz', 'sha256': 'd11a32919bff389f642af5df8180ad3cec586030decd35adb2a7d4a8aa3b298e', 'url': 'https://github.com/projectcalico/calico/archive/v3.27.3.tar.gz', 'unarchive': True, 'unarchive_extra_opts': ['--strip=3', '--wildcards', '*/libcalico-go/config/crd/'], 'owner': 'root', 'mode': '0755', 'groups': ['kube_control_plane']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'pod_infra', 'value': {'enabled': True, 'container': True, 'repo': 'registry.k8s.io/pause', 'tag': '3.9', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'coredns', 'value': {'enabled': True, 'container': True, 'repo': 'registry.k8s.io/coredns/coredns', 'tag': 'v1.11.1', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1, kubespray-worker-1 => (item={'key': 'nodelocaldns', 'value': {'enabled': True, 'container': True, 'repo': 'registry.k8s.io/dns/k8s-dns-node-cache', 'tag': '1.22.28', 'sha256': '', 'groups': ['k8s_cluster']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1 => (item={'key': 'dnsautoscaler', 'value': {'enabled': True, 'container': True, 'repo': 'registry.k8s.io/cpa/cluster-proportional-autoscaler', 'tag': 'v1.8.8', 'sha256': '', 'groups': ['kube_control_plane']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-control-plane-1 => (item={'key': 'metrics_server', 'value': {'enabled': True, 'container': True, 'repo': 'registry.k8s.io/metrics-server/metrics-server', 'tag': 'v0.7.0', 'sha256': '', 'groups': ['kube_control_plane']}})
included: /Users/jelinek/git/kubespray/roles/download/tasks/download_container.yml for kubespray-worker-1 => (item={'key': 'nginx', 'value': {'enabled': True, 'container': True, 'repo': 'docker.io/library/nginx', 'tag': '1.25.2-alpine', 'sha256': '', 'groups': ['kube_node']}})
Friday 17 May 2024  21:54:12 +0200 (0:00:00.803)       0:01:09.291 ************ 

@Payback159 Payback159 marked this pull request as ready for review May 17, 2024 20:23
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 17, 2024
@yankay
Copy link
Member

yankay commented May 20, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 20, 2024
@VannTen
Copy link
Contributor

VannTen commented May 21, 2024

I'm gonna take a look, but not right now, I'm a bit busy

@Payback159
Copy link
Contributor Author

Hello @VannTen,

have you had time to look at the PR? 😇

Copy link
Contributor

@VannTen VannTen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a look after a pretty long time, sorry for the delay.

One remark and some suggestions.

I have worked on that role before for a similar stuff, (it's currently reverted because it broke some other stuff thanks to the way the download role), https://github.com/kubernetes-sigs/kubespray/pull/11105/files maybe that can provide inspiration for the loop preparation stuff

@@ -26,5 +38,9 @@
- not skip_downloads | default(false)
- download.enabled
- item.value.enabled
- (not (item.value.container | default(false))) or (item.value.container and download_container)
- (not (item.value.container | default(false))) or (item.value.container and download_container and (not (((item.value.repo | regex_replace('^docker\.io/(library/)?', '')) ~ ':' ~ item.value.tag in node_images.stdout.split(',')) or (item.value.repo ~ '@sha256:' ~ (item.value.sha256 | default(None)) in node_images.stdout.split(','))) or download_always_pull))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of handling the filtering at the task/condition, level, we should rather do it a the loop level, which means preparing the loop to exclude whichever images are already present (the hard part is figuring how to merge the downloads var with the variable computed from listing images on each host, all without breaking localhost pull and all that...)

The reason is that a skipped task is still a task, and Ansible has quite a bit of overhead for each task, so not having them in the first place is better.

( I also tend to think that it would be more readable, but this is pretty subjective)

Of course, if a condition is substantially simpler than doing this in the loop, the condition is still acceptable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried to move the when-condition into the loop, but at the moment I lack the creativity to map this logic in the loop.

Perhaps you have a few more hints for me. Otherwise I have to admit that I can only manage it with the condition.

# The image_info_command depends on the Container Runtime and will output something like the following:
# nginx:1.15,gcr.io/google-containers/kube-proxy:v1.14.1,gcr.io/google-containers/kube-proxy@sha256:44af2833c6cbd9a7fc2e9d2f5244a39dfd2e31ad91bf9d4b7d810678db738ee9,gcr.io/google-containers/kube-apiserver:v1.14.1,etc...
- name: Download | Generate a list of information about the images on a node # noqa command-instead-of-shell - image_info_command contains a pipe, therefore requiring shell
shell: "{{ image_info_command }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image_info_command is only used in the download role, so we can do whatever we want with it to make this easier.
First, I suggest moving the definitions (main var and the crictl/nerdctl/docker ones) from kubespray-default to either download/defaults, download/vars, or even the vars section of that task, since it's only used here (I think).
Second, we can check if we can tweaks the command to have a more useful format from the start, without needing sheel tricks (the docker_image_info_command in particular is not pretty.)
If all the command support json output, that might be the best way to have structured data easily without reparsing it ourselves.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now moved the commands into download/vars and changed all commands to json output. Docker and nerdctl offer the option of creating your own json structure via gotemplate so that I don't have to worry about different JSON structures later on.

On the one hand, I am a fan of the output of nerdctl and Docker, on the other hand, piping the crictl command into jq makes the whole thing “unclean” again in my eyes. How do you see it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to pipe to jq, you can register the json and then use from_json and other native ansible filters to get the data you need.
For instance, see #11507 which refactor a pipeline into ansible filters.
This also has the advantage that you can define intermediate vars at the task level if that helps with readability (or reuse, if you need to split then zip or that sort of thing).

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 28, 2024
@k8s-ci-robot k8s-ci-robot added do-not-merge/contains-merge-commits do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. labels Sep 24, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Payback159
Once this PR has been reviewed and has the lgtm label, please assign cristicalin for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 24, 2024
Copy link

linux-foundation-easycla bot commented Sep 24, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@@ -26,5 +39,9 @@
- not skip_downloads | default(false)
- download.enabled
- item.value.enabled
- (not (item.value.container | default(false))) or (item.value.container and download_container)
- (not (item.value.container | default(false))) or (item.value.container and download_container and (not ((node_images.stdout_lines | map('from_json') | selectattr('Repository', 'equalto', (item.value.repo | regex_replace('^docker\.io/(library/)?', ''))) | selectattr('Tag', 'equalto', item.value.tag) | list | length > 0) or (node_images.stdout_lines | map('from_json') | selectattr('Digest', 'equalto', 'sha256:' ~ (item.value.sha256 | default(None))) | list | length > 0)) or download_always_pull))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a more thorough look later, but that can probably be split over several lines for readability.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 24, 2024
@Payback159
Copy link
Contributor Author

Perhaps a few dates as well. I have now set download_run_once to true and always test it against a control-plane and a node. The VMs run locally on my computer and I run ansible directly on my system.

  • Host-System
    • Macbook Pro 2020
    • CPU: M1
    • Memory: 16GB
  • ControlPlane Node
    • CPU: 4 vCPUs
    • Memory: 4 GB
    • OS: Ubuntu 24.04
  • K8s-Node
    • CPU: 4 vCPUs
    • Memory: 4 GB
    • OS: Ubuntu 24.04

There are no additional addons enabled, I use the default configurations. I mention it because, roughly speaking, every additional add-on adds at least one container.

I have the nodes provisioned once initially and then I do a run-through with image caching. This took 3 minutes and 7 seconds.

Then I orchestrated the same nodes with download_always_pull and that took 3 minutes and 43 seconds.

This is far from an empirical survey but I believe it can “generously” optimize orchestration time in large existing clusters.

@Payback159
Copy link
Contributor Author

I would like to complete this PR cleanly, but in the course of my work I have also seen a lot about file downloads. I could imagine extending this logic to file downloads in another PR to further optimize the orchestration.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 25, 2024
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 25, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 25, 2024
@Payback159
Copy link
Contributor Author

/retest

@Payback159
Copy link
Contributor Author

Payback159 commented Oct 11, 2024

An update on what has been done since the last review.

I have adjusted all image_info_commands so that they either already output the desired json format (nerdctl, docker) or at least output a json format (crictl). As a result, we can use a command-task instead of shell-task to create the raw node image list. The processing of the image list has thus also "migrated to ansible" and is no longer done via awk or other shell commands.

Since currently crictl can provide json output but no special formatting, I currently have 2 logic sections regarding setting the node_image facts. I add these via include_tasks as defining the task directly in main.yml resulted in "Download | Set node_images (crictl)" always being executed (no matter what image_info_command was) as with_items results in the when-conditions being applied per item. This caused the task to always fail with nerdctl or docker because no valid json for the with_items validation could take place.

I hope that in the future crictl might support a custom json output format, then we could merge the tasks.

I have now run through all 3 variants (nerdctl, docker, crictl) again locally. Both with download_always_pull: true and download_always_pull: false. Everything went smoothly, let's see what CI has to say.

@Payback159 Payback159 requested a review from VannTen October 12, 2024 12:40
@Payback159
Copy link
Contributor Author

Thanks to @tico88612, I was able to discover another logic error that caused the playbook run to break with dedicated etcd nodes. CI is now happy too :D. @VannTen may I ask you again for a review? :)

@VannTen
Copy link
Contributor

VannTen commented Oct 12, 2024 via email

Copy link
Contributor

@VannTen VannTen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comment mostly on the filtering vs condition.

- name: Download | Set node_images
include_tasks: "{{ include_file }}"
vars:
include_file: "set_node_facts{% if image_command_tool == 'crictl' %}_crictl{% endif %}.yml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this include gave us much, and it does break reading flow.
Can you instead put the task here and distinguish using when ?

# {"Digest":"sha256:34fc87c4a60c0b3ba3b3608871f4494de8072c02808a4151953068f2d7c87743","Repository":"flannel/flannel","Tag":"latest"}
- name: Download | Set node_images (nerdctl, docker)
set_fact:
node_images: "{{ node_images_raw.stdout_lines | map('from_json') | list }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use a variable instead of a fact for this.

(node_images | default([])) +
[{'Digest': (item.repoDigests[0] | split('@'))[1], 'Repository': ((item.repoTags[0] | split(':'))[0] | regex_replace('^docker\.io/(library/)?', '')), 'Tag': (item.repoTags[0] | split(':'))[1]}]
}}
with_items: "{{ (node_images_raw.stdout | from_json).images }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid looping over all images using some json_query and filters from community.general:

     node_images: "{{ ids | zip(repos, _tags) | map('zip', ['Digest', 'Repository', 'Tag'])
            | map('map', 'reverse') | map('community.general.dict') }}"
    vars:
      queried: "{{ node_images_raw.stdout | from_json | json_query(query) }}"
      query: "images[*].{hash: @.id, rep: @.repoTags[0]}"
      reps:  "{{ queried | map(attribute='rep') | map('split', ':') }}"
      ids: "{{ queried | map(attribute='hash') }}"
      _tags: "{{ reps | map(attribute=1) }}"
      repos: "{{ reps | map(attribute=0) }}"

(in order to avoid the loop overhead)

@@ -26,5 +59,29 @@
- not skip_downloads | default(false)
- download.enabled
- item.value.enabled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would split this in two tasks (download_container / download_file) to simplify the loops and condition, see below

Comment on lines +62 to +70
- >
(
not (
item.value.container | default(false)
)
)
or (
item.value.container and
download_container and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to get rid of those conditions by filtering the list before feeding it to the loop instead:

vars:
    base_dls: "{{ downloads | combine(kubeadm_images) | dict2items }}"
    containers_dls: "{{ base_dls | selectattr('value.container', 'defined') | selectattr('value.container') }}"

# In case of nerdctl or docker, the output is already in the expected format
# {"Digest":"sha256:847423221ed040798e47df36450edce5581ed642cd087ccb210532764da38b23","Repository":"quay.io/coreos/etcd","Tag":"v3.5.12"}
# {"Digest":"sha256:34fc87c4a60c0b3ba3b3608871f4494de8072c02808a4151953068f2d7c87743","Repository":"flannel/flannel","Tag":"latest"}
- name: Download | Set node_images (nerdctl, docker)
Copy link
Contributor

@VannTen VannTen Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, we only use download.repo/sha256/tag in one place :

$ grep -R 'download\.tag' roles/download/
roles/download/tasks/set_container_facts.yml:      {%- if pull_by_digest %}{{ download.repo }}@sha256:{{ download.sha256 }}{%- else -%}{{ download.repo }}:{{ download.tag }}{%- endif -%}
roles/download/tasks/check_pull_required.yml:    that: "{{ download.repo }}:{{ download.tag }} in docker_images.stdout.split(',')"

Instead of separating those, maybe we could instead have {'by_repo_digest': ..., by_repo_tag : 'flannel/flannel:latest' }

In particular, with this filtering becomes easiers, because you can use contains on a fixed set (obtening from nodes images) while mapping)
(something like containers_dls | selectattr('sha256', 'contains', (node_images | map(attribute='repo_by_digest')) (wrong syntax I think can't remember the proper order on top of my head)

@VannTen
Copy link
Contributor

VannTen commented Dec 10, 2024

Did you had a change to take a look at the review ?

@Payback159
Copy link
Contributor Author

I have seen the reviews and have already started to implement them. Unfortunately there has been too much going on recently, I hope that I'll have more time during the coming holidays to push the adapted version.

@VannTen
Copy link
Contributor

VannTen commented Dec 13, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cache Node Images List in facts
4 participants