update affinity assistant creation implementation #6596
Conversation
Skipping CI for Draft Pull Request.
The following is the coverage report on the affected files.
Thanks a bunch @lbernick 🙏 @QuanZhang-William, please let me know if you have any feedback on these changes, thanks!
Thanks @pritidesai! LGTM!
```
affinity-assistant-c7b485007a-0   1/1   Running   0   4s   10.244.2.144   kind-multinode-worker2   <none>   <none>
```

And the `pipelineRun` runs to completion.
I think there's something missing here.
For the pipeline run to succeed, the operator performing the cluster upgrade needs to wait until there is no Tekton running pod on the node before draining it.
Since the affinity assistant moved to a new node, new tasks will start on different nodes, but running tasks must complete before the update may continue, or the pipeline run will fail.
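To illustrate the point, here is a minimal Go sketch of the check an upgrade operator could run before draining a node: wait until no Tekton-managed pod is still running there. The `Pod` struct and the `app.kubernetes.io/managed-by: tekton-pipelines` label selector are simplifying assumptions for illustration, not the actual operator code (a real implementation would list pods via client-go with a field selector on `spec.nodeName`).

```go
package main

import "fmt"

// Pod is a simplified stand-in for a corev1.Pod scheduled on a node.
type Pod struct {
	Name     string
	NodeName string
	Labels   map[string]string
	Running  bool
}

// safeToDrain reports whether a node has no running Tekton-managed pods
// left, i.e. whether draining it would not fail an in-flight pipelineRun.
func safeToDrain(node string, pods []Pod) bool {
	for _, p := range pods {
		if p.NodeName == node && p.Running &&
			p.Labels["app.kubernetes.io/managed-by"] == "tekton-pipelines" {
			return false // a TaskRun pod is still running on this node
		}
	}
	return true
}

func main() {
	pods := []Pod{
		{Name: "taskrun-pod", NodeName: "worker1", Running: true,
			Labels: map[string]string{"app.kubernetes.io/managed-by": "tekton-pipelines"}},
	}
	fmt.Println(safeToDrain("worker1", pods))
	fmt.Println(safeToDrain("worker2", pods))
}
```

An operator would poll this condition and drain only once it returns true for the target node.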
`draining` is disruptive whereas `cordoning` is not. We are emphasizing in this doc that the affinity assistant mechanism is capable of handling the node being `cordoned` (scheduling is disabled).
Now, `cordon` that node to mark it unschedulable for any new pods:

```
kubectl cordon kind-multinode-worker1
node/kind-multinode-worker1 cordoned
```
The node is cordoned:

```
kubectl get node
NAME                           STATUS                     ROLES           AGE   VERSION
kind-multinode-control-plane   Ready                      control-plane   13d   v1.26.3
kind-multinode-worker1         Ready,SchedulingDisabled   <none>          13d   v1.26.3
kind-multinode-worker2         Ready                      <none>          13d   v1.26.3
```
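Under the hood, `kubectl cordon` sets the node's `spec.unschedulable` field, which `kubectl get node` renders as the `SchedulingDisabled` status above. As a minimal sketch (using a simplified `Node` struct rather than the real corev1.Node type), a controller can filter for nodes that still accept new pods like this:

```go
package main

import "fmt"

// Node is a simplified stand-in for corev1.Node; cordoning a node
// sets Spec.Unschedulable, shown by kubectl as SchedulingDisabled.
type Node struct {
	Name          string
	Unschedulable bool
}

// schedulableNodes returns the names of nodes that can still accept new pods.
func schedulableNodes(nodes []Node) []string {
	var out []string
	for _, n := range nodes {
		if !n.Unschedulable {
			out = append(out, n.Name)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{Name: "kind-multinode-control-plane", Unschedulable: false},
		{Name: "kind-multinode-worker1", Unschedulable: true}, // cordoned
		{Name: "kind-multinode-worker2", Unschedulable: false},
	}
	fmt.Println(schedulableNodes(nodes))
}
```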
Before this commit, the affinity assistant was created at the beginning of the `pipelineRun`, and the same affinity assistant was relied on for the entire lifecycle of a PR. Now, there could be a case when the node on which the affinity assistant pod is created goes down. In this case, the rest of the `pipelineRun` is stuck and cannot run to completion, since the affinity assistant (StatefulSet) tries to schedule the rest of the pods (`taskRuns`) on the same node, but that node is cordoned and not scheduling anything new.

This commit always makes an attempt to create the Affinity Assistant (StatefulSet) in case it does not exist. If it exists, the controller checks whether the node on which the Affinity Assistant pod is created is healthy enough to schedule subsequent pods. If not, the controller deletes the Affinity Assistant pod so that the StatefulSet can upscale the replicas (set to 1) on another node in the cluster.

Signed-off-by: Priti Desai <pdesai@us.ibm.com>
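The per-reconcile logic described in the commit message can be sketched as follows. This is a hypothetical simplification: the `Cluster` struct and return strings are invented for illustration, whereas the real controller operates on StatefulSet and Pod objects through client-go.

```go
package main

import "fmt"

// Cluster is a simplified stand-in for the state the controller inspects.
type Cluster struct {
	StatefulSetExists bool
	PodNode           string          // node the affinity assistant pod runs on
	CordonedNodes     map[string]bool // nodes marked unschedulable
}

// ensureAffinityAssistant mirrors the logic described above: create the
// StatefulSet if it is missing; otherwise, if the assistant's node can no
// longer schedule pods, delete the pod so the StatefulSet (replicas=1)
// recreates it on a healthy node.
func ensureAffinityAssistant(c *Cluster) string {
	if !c.StatefulSetExists {
		c.StatefulSetExists = true
		return "created StatefulSet"
	}
	if c.CordonedNodes[c.PodNode] {
		c.PodNode = "" // pod deleted; the StatefulSet will reschedule it
		return "deleted pod on cordoned node"
	}
	return "no action"
}

func main() {
	c := &Cluster{
		StatefulSetExists: true,
		PodNode:           "kind-multinode-worker1",
		CordonedNodes:     map[string]bool{"kind-multinode-worker1": true},
	}
	fmt.Println(ensureAffinityAssistant(c))
}
```

The key design point is that the controller never moves the pod itself; it only deletes the stale pod and relies on the StatefulSet's replica count to bring a replacement up on a schedulable node.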
Hey @afrittoli @jerop, any outstanding review on this? I would like to get it in 0.48. It has been open for the past three weeks and has gone through multiple rounds of review. Please let me know if there is anything I need to address. Thanks!
Just to add a further voice to this, as we would really like to see this in 0.48. This PR is important to us because we currently have a problem when performing maintenance on nodes in our cluster. This PR lets us run more of a lights-out operation vs. requiring manual intervention.
Thanks for the updates!
/lgtm
We would like to upgrade to the latest LTS release, which is 0.47. These changes, along with PR #6860, are required for us to manage the clusters. I would like to request cherry-picking these changes into 0.47. Thanks! /cherry-pick release-v0.47.x
@pritidesai: new pull request created: #6863 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Changes

Before this commit, the affinity assistant was created at the beginning of the `pipelineRun`, and the same affinity assistant was used throughout the entire lifecycle of a `pipelineRun`. Now, there could be a case when the node on which the affinity assistant pod is created is `cordoned` for maintenance purposes. In this case, the rest of the `pipelineRun` is stuck and cannot run to completion, since the affinity assistant (StatefulSet) tries to schedule the rest of the pods (`taskRuns`) on the same node, but that node is `cordoned` and not scheduling anything new.

This commit always makes an attempt to create the Affinity Assistant (StatefulSet) in case it does not exist. If it exists, the controller checks whether the node on which the Affinity Assistant pod is created is healthy enough to schedule subsequent pods. If not, the controller deletes the Affinity Assistant pod so that the StatefulSet can upscale the replicas (set to 1) on another node in the cluster.

Closes #6586.

~~Testing is pending, will update the PR after testing locally.~~ These changes are now tested locally on a cluster with two nodes and work as expected. The affinity assistant pod, along with the remaining `taskRun` pods, is created on a healthy node when the existing node is cordoned.

/kind bug
Submitter Checklist

As the author of this PR, please check off the items in this checklist:

- Includes `/kind <type>`. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep

Release Notes