Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(kubernetes): taint nodes on cluster upgrade #10705

Merged

Conversation

maxime1907
Copy link
Contributor

@maxime1907 maxime1907 commented Dec 8, 2023

What type of PR is this?
/kind bug

What this PR does / why we need it:
Currently we are setting node taints through kubelet with the option --register-with-taints.
This works fine when first deploying a cluster but when we need to upgrade, it just does not work because they are only added at node registration.
This PR attempts to fix this issue by simply using kubectl to taint nodes like its done with labels:

Does this PR introduce a user-facing change?:

fix(kubernetes): taint nodes on cluster upgrade

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 8, 2023
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 8, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @maxime1907. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 8, 2023
@yankay
Copy link
Member

yankay commented Dec 11, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 11, 2023
@yankay yankay closed this Dec 12, 2023
@yankay yankay reopened this Dec 12, 2023
Copy link
Contributor

@VannTen VannTen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to add so much set_fact here, couldn't you use vars instead ?

Regarding using kubectl taint vs --register... Hum, is there no way to re-register on upgrade instead ?

Each new tasks means more playbooks run time.

roles/kubernetes/node-taint/tasks/main.yml Outdated Show resolved Hide resolved
roles/kubernetes/node-taint/tasks/main.yml Outdated Show resolved Hide resolved
@maxime1907
Copy link
Contributor Author

maxime1907 commented Dec 12, 2023

I don't think we need to add so much set_fact here, couldn't you use vars instead ?

Regarding using kubectl taint vs --register... Hum, is there no way to re-register on upgrade instead ?

Each new tasks means more playbooks run time.

Oh right we can just do this:

  1. remove the node from the cluster with kubectl delete node
  2. restart kubelet

I tried this on a vanilla kubernetes 1.27.7 and it works!
It also fixes another problem that i was facing when updating the cloud_provider value on an already running node, it then re register correctly with the ccm
The only problem i see with this behavior is that it changes the node age since its re added to the cluster (with creationTimestamp)
Also it removes the taints/roles on master nodes

But i dont think its a good idea to delete the node because in case of failure it will be bad...
What are your thoughts on this ?

@maxime1907 maxime1907 changed the title fix(kubernetes): taint nodes with kubectl fix(kubernetes): taint nodes on cluster upgrade Dec 12, 2023
@maxime1907 maxime1907 force-pushed the feat/kubectl-node-taints branch 2 times, most recently from f222dcb to 5a483e5 Compare December 13, 2023 10:09
@VannTen
Copy link
Contributor

VannTen commented Dec 14, 2023

I'm not sold on deleting the node either, I was more thinking along the lines of "Maybe there is a way to re-register without deleting". It might be wishful thinking on my end though ^

@maxime1907 maxime1907 force-pushed the feat/kubectl-node-taints branch from 5a483e5 to c908d21 Compare December 19, 2023 12:56
Signed-off-by: Maxime Leroy <19607336+maxime1907@users.noreply.github.com>
@maxime1907 maxime1907 force-pushed the feat/kubectl-node-taints branch from c908d21 to d293326 Compare December 19, 2023 12:57
@maxime1907
Copy link
Contributor Author

maxime1907 commented Dec 19, 2023

I'm not sold on deleting the node either, I was more thinking along the lines of "Maybe there is a way to re-register without deleting". It might be wishful thinking on my end though ^

Yeah i don't think that it's possible to re-register a node without deleting it:

So i guess we can go with this solution as for now

@maxime1907
Copy link
Contributor Author

@VannTen Any updates on this?

@VannTen
Copy link
Contributor

VannTen commented Dec 29, 2023 via email

@VannTen
Copy link
Contributor

VannTen commented Jan 11, 2024

What bugs me is that I don't think this will erase old taints if you happen to remove them from your inventory... not very declarative.
OTOH it's better than the current situation where nodes just keep their labels forever.
/lgtm
/assign @floryut
wdyt ?
Forcing to delete the node to re-register it with new tains is probably a bit extreme... 🤔

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 11, 2024
Copy link
Member

@floryut floryut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What bugs me is that I don't think this will erase old taints if you happen to remove them from your inventory... not very declarative. OTOH it's better than the current situation where nodes just keep their labels forever. /lgtm /assign @floryut wdyt ? Forcing to delete the node to re-register it with new tains is probably a bit extreme... 🤔

Agree with you, this PR is better than the actual process, so let's go with it as is even if the limitation you are mentioning is indeed not perfect

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: floryut, maxime1907

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 23, 2024
@k8s-ci-robot k8s-ci-robot merged commit ab0163a into kubernetes-sigs:master Jan 23, 2024
64 checks passed
@mzaian mzaian mentioned this pull request Apr 26, 2024
pedro-peter pushed a commit to pedro-peter/kubespray that referenced this pull request May 8, 2024
Signed-off-by: Maxime Leroy <19607336+maxime1907@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants