kubespray play spends a lot of time doing nothing #9279

Closed
rptaylor opened this issue Sep 15, 2022 · 7 comments · Fixed by #10626
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@rptaylor
Contributor

rptaylor commented Sep 15, 2022

Kubespray plays can take quite a long time, e.g. around 30-60 minutes even for a small cluster of 6 nodes, and many hours for large clusters. (Some timing measurements discussed in #8050)

A lot of the time goes by with ansible output like this:

PLAY [k8s_cluster] **********************************************************************************************************************************************************
Thursday 15 September 2022  20:46:03 +0000 (0:00:00.124)       0:00:06.836 **** 
Thursday 15 September 2022  20:46:06 +0000 (0:00:02.881)       0:00:09.718 **** 
Thursday 15 September 2022  20:46:09 +0000 (0:00:02.926)       0:00:12.645 **** 
Thursday 15 September 2022  20:46:12 +0000 (0:00:02.726)       0:00:15.371 **** 
Thursday 15 September 2022  20:46:15 +0000 (0:00:03.139)       0:00:18.511 **** 
Thursday 15 September 2022  20:46:18 +0000 (0:00:02.987)       0:00:21.498 **** 
Thursday 15 September 2022  20:46:18 +0000 (0:00:00.120)       0:00:21.619 **** 
Thursday 15 September 2022  20:46:18 +0000 (0:00:00.102)       0:00:21.721 **** 
Thursday 15 September 2022  20:46:21 +0000 (0:00:02.846)       0:00:24.568 **** 
Thursday 15 September 2022  20:46:24 +0000 (0:00:02.979)       0:00:27.547 **** 
Thursday 15 September 2022  20:46:27 +0000 (0:00:03.015)       0:00:30.562 **** 
Thursday 15 September 2022  20:46:30 +0000 (0:00:03.025)       0:00:33.588 **** 
Thursday 15 September 2022  20:46:30 +0000 (0:00:00.105)       0:00:33.693 **** 
Thursday 15 September 2022  20:46:33 +0000 (0:00:03.176)       0:00:36.870 **** 
Thursday 15 September 2022  20:46:36 +0000 (0:00:03.163)       0:00:40.034 **** 
Thursday 15 September 2022  20:46:39 +0000 (0:00:03.112)       0:00:43.147 **** 
Thursday 15 September 2022  20:46:42 +0000 (0:00:03.056)       0:00:46.203 **** 
Thursday 15 September 2022  20:46:45 +0000 (0:00:02.953)       0:00:49.157 **** 

During these stretches nothing is visibly achieved, and the Ansible process(es) are usually mostly CPU bound, although running strace on them also shows a lot of repetitive I/O, e.g. stat() calls on files.

I believe this is because of https://github.com/kubernetes-sigs/kubespray/blob/master/roles/kubespray-defaults/meta/main.yml

The kubespray-defaults role is invoked no fewer than 12 times in the cluster.yml play, and each time it pulls in the downloads role as a dependency. The downloads role has several import_tasks, and it also does an import_role of container-engine/crictl and container-engine/nerdctl (until d01b181, anyway). Every task in the downloads role has when: not skip_downloads|default(false) and the meta/main.yml sets skip_downloads: true, but because of how import_ works with conditionals, Ansible still processes all the imports, walking every loop and task on every node and effectively executing them with when: false. So it really is doing nothing, but wasting a lot of time on it.
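To make the mechanism concrete, here is a minimal sketch, not taken from Kubespray (download_tasks.yml is a made-up file name): a conditional attached to a static import is copied onto every imported task and re-evaluated per task and per host, whereas the same conditional on a dynamic include is evaluated once and, when false, skips the whole file.

# Static import: processed at parse time; the `when` is attached to every
# imported task and evaluated per task, per host, so Ansible still walks
# the whole file on every node even though nothing actually runs.
- name: Import download tasks (still processed when skipped)
  ansible.builtin.import_tasks: download_tasks.yml
  when: not skip_downloads | default(false)

# Dynamic include: the `when` is evaluated once on the include itself;
# if it is false, the entire file is skipped in a single step.
- name: Include download tasks (skipped wholesale)
  ansible.builtin.include_tasks: download_tasks.yml
  when: not skip_downloads | default(false)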

Update: the default variables from the kubespray-defaults role (and the download role) are needed, but loading them takes practically no time. There needs to be a way to get the vars without wasting time on the tasks, perhaps by switching to include_role instead of import_role, or by refactoring the download default vars into the kubespray-defaults default vars.
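As a purely illustrative sketch of the refactoring idea (this is not the actual change from #10626; the file name and variable below are placeholders): drop the meta dependency and keep the download defaults inside kubespray-defaults itself, so that importing the role costs only variable loading.

# roles/kubespray-defaults/meta/main.yml -- no dependency on the download role
dependencies: []

# roles/kubespray-defaults/defaults/main/download.yml
# Ansible merges every YAML file found under defaults/main/, so defaults
# moved here are visible wherever kubespray-defaults is imported, without
# any task files being parsed or looped over.
example_download_default: placeholder   # illustrative variable, not a real Kubespray var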

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 22, 2022
@rptaylor
Contributor Author

rptaylor commented Jan 3, 2023

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jan 3, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 3, 2023
@rptaylor
Contributor Author

rptaylor commented Apr 3, 2023

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Apr 3, 2023
@rptaylor
Contributor Author

In a recent deployment it spent about 15 minutes on this. I believe it happens on each play, of which there are several in the playbook, so it adds up to roughly 45-60 minutes of doing nothing. And that is while only running on half of the nodes at a time.

Tuesday 20 June 2023  20:24:21 +0000 (0:00:07.725)       0:31:52.042 ********** 
Tuesday 20 June 2023  20:24:27 +0000 (0:00:06.830)       0:31:58.873 ********** 
...
Tuesday 20 June 2023  20:38:26 +0000 (0:00:11.259)       0:45:57.979 ********** 

@rptaylor
Contributor Author

Very exciting, thanks for working on this @VannTen!

For the record, I am noting some other Ansible scalability issues that can cause slowness with large inventories or dynamically built inventories:

@VannTen
Contributor

VannTen commented Jan 12, 2024

The linked ones are gonna be harder to tackle 😆
