✨ Taint nodes with PreferNoSchedule during rollouts #7631
Conversation
Welcome @hiromu-a5a!
Hi @hiromu-a5a. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @enxebre
/ok-to-test
Force-pushed dcfcd40 to a14517c
Review comment on internal/controllers/machinedeployment/machinedeployment_sync.go (outdated, resolved)
This approach is effectively implementing in-place propagation of an annotation-derived taint. If we continue in that direction, this should be blocked until https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20221003-In-place-propagation-of-Kubernetes-objects-only-changes.md is implemented so we have a clearer path. Alternatively, I can see this being built-in business logic as suggested in https://github.com/kubernetes-sigs/cluster-api/pull/7631/files#r1039744048. cc @sbueringer @fabriziopandini thoughts?
I kind of agree that we should wait for label propagation from MachineSet to Machines to be implemented in the context of the work above. The annotation-to-taint conversion, on the other hand, is specific to this PR and won't be addressed by that work.
@hiromu-a5a thanks for taking up this issue, this will be super valuable for users!
The MD/MS code base is not ideal for first-time contributors, but it seems you are navigating it well. However, we should sync with the label propagation effort in order to avoid duplicating or conflicting work.
cc @ykakarap who is starting to look into this
This is a really good issue to solve. The node labels proposal is also looking at using taints to solve the problem of workloads getting scheduled to nodes where they should not be. I will take a closer look at it in a day or two and see if there are any conflicts and how we can proceed further.
Force-pushed 55aee21 to d6efe70
Force-pushed d6efe70 to b0aca44
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/retitle ✨ Taint nodes with PreferNoSchedule during rollouts
@enxebre Yes. I did.
Changes LGTM
/assign @sbueringer
I'll try to get to this soon
Tangential discussion, but if we supported API input for taint propagation, then MD could use computeDesiredMachineSet to apply the opinionated taints to old MachineSets while reusing the API calls already required to perform a rollout. Code changes lgtm.
/lgtm
/assign @sbueringer
This seems all very reasonable to get in
LGTM label has been added. Git tree hash: 2b1ef82b6213fdd832b79c253676dbe451a90b62
Sorry for the delay. Taking a look now
I will try to take a look as well; it seems like a nice change to have
Skipped reviewing unit tests for now.
Thank you very much for your continued work on this. Mostly smaller findings from my side.
I'll try to review this PR quicker from now on, so we can get it merged soon :)
@hiromu-a5a Do you have time to follow up on the review comments?
@hiromu-a5a Do you have time to follow up? Otherwise we could take over the remaining work. (We would like to get this feature into the next CAPI minor release.)
/close Superseded by: Kudos to @hiromu-a5a for the awesome work!
@chrischdi: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
If the MachineSet or MachineDeployment has many replicas, and each node has many pods, changes to an existing MachineDeployment or MachineSet infrastructure template can result in unnecessary pod churn. As the first node is drained, pods previously running on that node may be scheduled onto nodes that have yet to be replaced but will be torn down soon. This can happen over and over again.
To avoid the above problem, this PR changes the Machine controller to add a `PreferNoSchedule` taint to Nodes in the old MachineDeployment. As mentioned in #7043 (comment), tainting should be triggered by the MachineDeployment controller: the MachineDeployment controller gives the old MachineSets an annotation, the annotation is propagated from MachineSet to Machines, and finally each Machine with the annotation sets the taint on its Node. This behavior may be related to #493.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #7043
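The final step of the flow described above (a Machine applying the taint to its Node) could be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: it uses simplified stand-in types for `corev1.Node` and `corev1.Taint` (the real controller would use `k8s.io/api/core/v1` and patch the Node through the API server), and the helper name `ensureTaint` and the taint key are hypothetical.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes core/v1 types involved;
// real controller code would use k8s.io/api/core/v1 instead.
type Taint struct {
	Key    string
	Effect string
}

type NodeSpec struct {
	Taints []Taint
}

type Node struct {
	Spec NodeSpec
}

// ensureTaint adds the taint to the node if it is not already present.
// It returns true when the node was modified, so the caller knows
// whether an update to the API server is needed.
func ensureTaint(node *Node, taint Taint) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == taint.Key && t.Effect == taint.Effect {
			return false
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, taint)
	return true
}

func main() {
	node := &Node{}
	// Hypothetical taint key for illustration; not the PR's actual key.
	taint := Taint{Key: "machine.cluster.x-k8s.io/outdated", Effect: "PreferNoSchedule"}

	fmt.Println(ensureTaint(node, taint)) // taint added: true
	fmt.Println(ensureTaint(node, taint)) // already present, idempotent: false
	fmt.Println(len(node.Spec.Taints))    // 1
}
```

Because `PreferNoSchedule` is a soft preference, the scheduler avoids the tainted (soon-to-be-replaced) Nodes when possible but can still use them if no other capacity exists, which is what makes the taint safe to apply during a rollout.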