kubespray play spends a lot of time doing nothing #9279

Closed
rptaylor opened this issue Sep 15, 2022 · 7 comments · Fixed by #10626
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@rptaylor
Contributor

rptaylor commented Sep 15, 2022

Kubespray plays can take quite a long time, e.g. around 30-60 minutes even for a small cluster of 6 nodes, and many hours for large clusters. (Some timing measurements discussed in #8050)

A lot of the time goes by with ansible output like this:

PLAY [k8s_cluster] **********************************************************************************************************************************************************
Thursday 15 September 2022  20:46:03 +0000 (0:00:00.124)       0:00:06.836 **** 
Thursday 15 September 2022  20:46:06 +0000 (0:00:02.881)       0:00:09.718 **** 
Thursday 15 September 2022  20:46:09 +0000 (0:00:02.926)       0:00:12.645 **** 
Thursday 15 September 2022  20:46:12 +0000 (0:00:02.726)       0:00:15.371 **** 
Thursday 15 September 2022  20:46:15 +0000 (0:00:03.139)       0:00:18.511 **** 
Thursday 15 September 2022  20:46:18 +0000 (0:00:02.987)       0:00:21.498 **** 
Thursday 15 September 2022  20:46:18 +0000 (0:00:00.120)       0:00:21.619 **** 
Thursday 15 September 2022  20:46:18 +0000 (0:00:00.102)       0:00:21.721 **** 
Thursday 15 September 2022  20:46:21 +0000 (0:00:02.846)       0:00:24.568 **** 
Thursday 15 September 2022  20:46:24 +0000 (0:00:02.979)       0:00:27.547 **** 
Thursday 15 September 2022  20:46:27 +0000 (0:00:03.015)       0:00:30.562 **** 
Thursday 15 September 2022  20:46:30 +0000 (0:00:03.025)       0:00:33.588 **** 
Thursday 15 September 2022  20:46:30 +0000 (0:00:00.105)       0:00:33.693 **** 
Thursday 15 September 2022  20:46:33 +0000 (0:00:03.176)       0:00:36.870 **** 
Thursday 15 September 2022  20:46:36 +0000 (0:00:03.163)       0:00:40.034 **** 
Thursday 15 September 2022  20:46:39 +0000 (0:00:03.112)       0:00:43.147 **** 
Thursday 15 September 2022  20:46:42 +0000 (0:00:03.056)       0:00:46.203 **** 
Thursday 15 September 2022  20:46:45 +0000 (0:00:02.953)       0:00:49.157 **** 

During these stretches nothing is visibly achieved, and the Ansible process(es) are usually mostly CPU bound, although running strace on them also shows a lot of repetitive I/O, e.g. stat() calls on files.

I believe this is because of https://github.com/kubernetes-sigs/kubespray/blob/master/roles/kubespray-defaults/meta/main.yml

The kubespray-defaults role is invoked no fewer than 12 times in the cluster.yml play, and each time it pulls in the downloads role as a dependency. The downloads role has several import_tasks, and it also does an import_role of container-engine/crictl and container-engine/nerdctl (until d01b181, anyway). Every task in the downloads role has when: not skip_downloads|default(false) and the meta/main.yml sets skip_downloads: true, but because of how import_ works with conditionals, Ansible still processes all the imports, walking every loop and task on every node and effectively executing them with when: false. So it really is doing nothing, but wasting a lot of time on it.
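To make the mechanism concrete, here is a minimal sketch, not taken from Kubespray (download_tasks.yml is a made-up file name): a conditional attached to a static import is copied onto every imported task and re-evaluated per task and per host, whereas the same conditional on a dynamic include is evaluated once and, when false, skips the whole file.

# Static import: processed at parse time; the `when` is attached to every
# imported task and evaluated per task, per host, so Ansible still walks
# the whole file on every node even though nothing actually runs.
- name: Import download tasks (still processed when skipped)
  ansible.builtin.import_tasks: download_tasks.yml
  when: not skip_downloads | default(false)

# Dynamic include: the `when` is evaluated once on the include itself;
# if it is false, the entire file is skipped in a single step.
- name: Include download tasks (skipped wholesale)
  ansible.builtin.include_tasks: download_tasks.yml
  when: not skip_downloads | default(false)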

Update: the default variables from the kubespray-defaults role (and the download role) are needed, but loading them takes practically no time. There needs to be a way to get the vars without wasting time on the tasks, perhaps by switching to include_role instead of import_role, or by refactoring the download default vars into the kubespray-defaults default vars.
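As a purely illustrative sketch of the refactoring idea (this is not the actual change from #10626; the file name and variable below are placeholders): drop the meta dependency and keep the download defaults inside kubespray-defaults itself, so that importing the role costs only variable loading.

# roles/kubespray-defaults/meta/main.yml -- no dependency on the download role
dependencies: []

# roles/kubespray-defaults/defaults/main/download.yml
# Ansible merges every YAML file found under defaults/main/, so defaults
# moved here are visible wherever kubespray-defaults is imported, without
# any task files being parsed or looped over.
example_download_default: placeholder   # illustrative variable, not a real Kubespray var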

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 22, 2022
@rptaylor
Contributor Author

rptaylor commented Jan 3, 2023

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jan 3, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 3, 2023
@rptaylor
Contributor Author

rptaylor commented Apr 3, 2023

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Apr 3, 2023
@rptaylor
Contributor Author

In a recent deployment it spent about 15 minutes on this. I believe it happens on each play, of which there are several in the playbook, so it adds up to roughly 45-60 minutes of doing nothing. And that is while only running on half of the nodes at a time.

Tuesday 20 June 2023  20:24:21 +0000 (0:00:07.725)       0:31:52.042 ********** 
Tuesday 20 June 2023  20:24:27 +0000 (0:00:06.830)       0:31:58.873 ********** 
...
Tuesday 20 June 2023  20:38:26 +0000 (0:00:11.259)       0:45:57.979 ********** 

@rptaylor
Contributor Author

Very exciting, thanks for working on this @VannTen!

For the record, I am noting some other Ansible scalability issues that can cause slowness with large inventories or dynamically built inventories:

@VannTen
Contributor

VannTen commented Jan 12, 2024

The linked ones are gonna be harder to tackle 😆
