update affinity assistant creation implementation #6596
Conversation
Skipping CI for Draft Pull Request.
The following is the coverage report on the affected files.
Thanks a bunch @lbernick 🙏 @QuanZhang-William, please let me know if you have any feedback on these changes, thanks!
Thanks @pritidesai! LGTM!
```
affinity-assistant-c7b485007a-0   1/1   Running   0   4s   10.244.2.144   kind-multinode-worker2   <none>   <none>
```

And the `pipelineRun` runs to completion.
I think there's something missing here.
For the pipeline run to succeed, the operator performing the cluster upgrade needs to wait until there is no Tekton running pod on the node before draining it.
Since the affinity assistant moved to a new node, new tasks will start on different nodes, but running tasks must complete before the update may continue, or the pipeline run will fail.
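To illustrate the point, here is a minimal Go sketch of the check an upgrade operator could run before draining a node: wait until no Tekton-managed pod is still running there. The `Pod` struct and the `app.kubernetes.io/managed-by: tekton-pipelines` label selector are simplifying assumptions for illustration, not the actual operator code (a real implementation would list pods via client-go with a field selector on `spec.nodeName`).

```go
package main

import "fmt"

// Pod is a simplified stand-in for a corev1.Pod scheduled on a node.
type Pod struct {
	Name     string
	NodeName string
	Labels   map[string]string
	Running  bool
}

// safeToDrain reports whether a node has no running Tekton-managed pods
// left, i.e. whether draining it would not fail an in-flight pipelineRun.
func safeToDrain(node string, pods []Pod) bool {
	for _, p := range pods {
		if p.NodeName == node && p.Running &&
			p.Labels["app.kubernetes.io/managed-by"] == "tekton-pipelines" {
			return false // a TaskRun pod is still running on this node
		}
	}
	return true
}

func main() {
	pods := []Pod{
		{Name: "taskrun-pod", NodeName: "worker1", Running: true,
			Labels: map[string]string{"app.kubernetes.io/managed-by": "tekton-pipelines"}},
	}
	fmt.Println(safeToDrain("worker1", pods))
	fmt.Println(safeToDrain("worker2", pods))
}
```

An operator would poll this condition and drain only once it returns true for the target node.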
`draining` is disruptive whereas `cordoning` is not. We are emphasizing in this doc that the affinity assistant mechanism is capable of handling the node being `cordoned` (scheduling is disabled).
Now, `cordon` that node to mark it unschedulable for any new pods:

```
kubectl cordon kind-multinode-worker1
node/kind-multinode-worker1 cordoned
```
The node is cordoned:

```
kubectl get node
NAME                           STATUS                     ROLES           AGE   VERSION
kind-multinode-control-plane   Ready                      control-plane   13d   v1.26.3
kind-multinode-worker1         Ready,SchedulingDisabled   <none>          13d   v1.26.3
kind-multinode-worker2         Ready                      <none>          13d   v1.26.3
```
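Under the hood, `kubectl cordon` sets the node's `spec.unschedulable` field, which `kubectl get node` renders as the `SchedulingDisabled` status above. As a minimal sketch (using a simplified `Node` struct rather than the real corev1.Node type), a controller can filter for nodes that still accept new pods like this:

```go
package main

import "fmt"

// Node is a simplified stand-in for corev1.Node; cordoning a node
// sets Spec.Unschedulable, shown by kubectl as SchedulingDisabled.
type Node struct {
	Name          string
	Unschedulable bool
}

// schedulableNodes returns the names of nodes that can still accept new pods.
func schedulableNodes(nodes []Node) []string {
	var out []string
	for _, n := range nodes {
		if !n.Unschedulable {
			out = append(out, n.Name)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{Name: "kind-multinode-control-plane", Unschedulable: false},
		{Name: "kind-multinode-worker1", Unschedulable: true}, // cordoned
		{Name: "kind-multinode-worker2", Unschedulable: false},
	}
	fmt.Println(schedulableNodes(nodes))
}
```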
Before this commit, the affinity assistant was created at the beginning of the `pipelineRun`, and the same affinity assistant was relied on for the entire lifecycle of a PR. Now, there could be a case when the node on which the affinity assistant pod is created goes down. In this case, the rest of the `pipelineRun` is stuck and cannot run to completion, since the affinity assistant (StatefulSet) tries to schedule the rest of the pods (`taskRuns`) on the same node, but that node is cordoned and not scheduling anything new.

This commit always makes an attempt to create the Affinity Assistant (StatefulSet) in case it does not exist. If it exists, the controller checks whether the node on which the Affinity Assistant pod is created is healthy enough to schedule subsequent pods. If not, the controller deletes the Affinity Assistant pod so that the StatefulSet can upscale the replicas (set to 1) on another node in the cluster.

Signed-off-by: Priti Desai <pdesai@us.ibm.com>
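The per-reconcile logic described in the commit message can be sketched as follows. This is a hypothetical simplification: the `Cluster` struct and return strings are invented for illustration, whereas the real controller operates on StatefulSet and Pod objects through client-go.

```go
package main

import "fmt"

// Cluster is a simplified stand-in for the state the controller inspects.
type Cluster struct {
	StatefulSetExists bool
	PodNode           string          // node the affinity assistant pod runs on
	CordonedNodes     map[string]bool // nodes marked unschedulable
}

// ensureAffinityAssistant mirrors the logic described above: create the
// StatefulSet if it is missing; otherwise, if the assistant's node can no
// longer schedule pods, delete the pod so the StatefulSet (replicas=1)
// recreates it on a healthy node.
func ensureAffinityAssistant(c *Cluster) string {
	if !c.StatefulSetExists {
		c.StatefulSetExists = true
		return "created StatefulSet"
	}
	if c.CordonedNodes[c.PodNode] {
		c.PodNode = "" // pod deleted; the StatefulSet will reschedule it
		return "deleted pod on cordoned node"
	}
	return "no action"
}

func main() {
	c := &Cluster{
		StatefulSetExists: true,
		PodNode:           "kind-multinode-worker1",
		CordonedNodes:     map[string]bool{"kind-multinode-worker1": true},
	}
	fmt.Println(ensureAffinityAssistant(c))
}
```

The key design point is that the controller never moves the pod itself; it only deletes the stale pod and relies on the StatefulSet's replica count to bring a replacement up on a schedulable node.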
Hey @afrittoli @jerop, any outstanding review on this? I would like to get it in 0.48. It has been open for the past three weeks and has gone through multiple rounds of review. Please let me know if there is anything I need to address. Thanks!
Just to add a further voice to this, as we would really like to see this in 0.48. This PR is important to us because we currently have a problem when performing maintenance on nodes in our cluster. This PR lets us run more of a lights-out operation vs. requiring manual intervention.
Thanks for the updates!
/lgtm
We would like to upgrade to the latest LTS release, which is 0.47. These changes, along with PR #6860, are required for us to manage the clusters. I would like to request cherry-picking these changes into 0.47. Thanks! /cherry-pick release-v0.47.x
@pritidesai: new pull request created: #6863 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Changes

Before this commit, the affinity assistant was created at the beginning of the `pipelineRun`, and the same affinity assistant was used throughout the entire lifecycle of a `pipelineRun`. Now, there could be a case when the node on which the affinity assistant pod is created is `cordoned` for maintenance purposes. In this case, the rest of the `pipelineRun` is stuck and cannot run to completion, since the affinity assistant (StatefulSet) tries to schedule the rest of the pods (`taskRuns`) on the same node, but that node is `cordoned` and not scheduling anything new.

This commit always makes an attempt to create the Affinity Assistant (StatefulSet) in case it does not exist. If it exists, the controller checks whether the node on which the Affinity Assistant pod is created is healthy enough to schedule subsequent pods. If not, the controller deletes the Affinity Assistant pod so that the StatefulSet can upscale the replicas (set to 1) on another node in the cluster.

Closes #6586.

~~Testing is pending, will update the PR after testing locally.~~ These changes are now tested locally on a cluster with two nodes and work as expected. The affinity assistant pod, along with the remaining `taskRun` pods, is created on a healthy node when the existing node is cordoned.

/kind bug
Submitter Checklist

As the author of this PR, please check off the items in this checklist:

- Includes `/kind <type>`. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep

Release Notes