Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 KCP should defer remediation when a control plane machine is still provisioning #9734

Conversation

fabriziopandini
Copy link
Member

What this PR does / why we need it:
With this PR KCP defers remediation when another control plane machine is still provisioning

Which issue(s) this PR fixes:
Fixes #9398

/area control-plane

@k8s-ci-robot k8s-ci-robot added area/control-plane Issues or PRs related to control-plane lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 17, 2023
@fabriziopandini
Copy link
Member Author

cc @sbueringer

controlplane/kubeadm/internal/controllers/remediation.go Outdated Show resolved Hide resolved
Comment on lines 257 to 258
func (c *ControlPlane) HasHealthyMachineStillProvisioning() bool {
return len(c.Machines.Filter(collections.Not(collections.HasUnhealthyCondition), collections.And(collections.Not(collections.HasNode())))) > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, this is checking if any of the machines available don't have a node associated AND are not marked as unhealthy (yet)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes
The idea is that we want to wait for all the machines to be provisioned before remediation, but we should preserve the capability to remediate a machine that fails to provision

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved the filter and added unit tests

Copy link
Contributor

@g-gaston g-gaston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

controlplane/kubeadm/internal/controllers/remediation.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 17, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: e39a29b97024ffb55dbd0b8f9171e4bb509f4e28

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 20, 2023
Copy link
Contributor

@g-gaston g-gaston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 20, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: a92f6d44d7b9c0cafd8241663442cba9e6d8e29a

@sbueringer
Copy link
Member

/test pull-cluster-api-e2e-full-main

Copy link
Member

@sbueringer sbueringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nits, otherwise lgtm

t.Run("Remediation deletes unhealthy machine failed to provision - 4 CP (during 3 CP rolling upgrade)", func(t *testing.T) {
g := NewWithT(t)

m1 := createMachine(ctx, g, ns.Name, "m1-unhealthy-", withMachineHealthCheckFailed(), withWaitBeforeDeleteFinalizer(), withoutNodeRef())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to clarify, does this withoutNodeRef() makes any actual difference from the test above "Remediation deletes unhealthy machine - 4 CP (during 3 CP rolling upgrade)"?
I assume they both would run the same code path because of withMachineHealthCheckFailed. I can still see the value of the test case, but just wanted to clarify that's the case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are testing the same code path.
However, considering that this PR defers remediation in case a healthy machine is still provisioning, I figured out that having a test preventing any regression when a machine is still provisioning (withoutNodeRef), but unhealty, could help.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 5, 2023
@k8s-ci-robot k8s-ci-robot requested a review from g-gaston December 5, 2023 14:01
@sbueringer
Copy link
Member

Thx!!

/lgtm

merge/approve pending squash

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 8, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 75959fb18b3255faeae3ab13540c7b013a225a0c

@fabriziopandini fabriziopandini force-pushed the defer-remediation-when-provisioning branch from a2e25e8 to 2cb95f7 Compare December 13, 2023 12:25
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 13, 2023
@fabriziopandini
Copy link
Member Author

rebased + squashed

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Dec 13, 2023

@fabriziopandini: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-e2e-full-main f82cb20 link true /test pull-cluster-api-e2e-full-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fabriziopandini fabriziopandini force-pushed the defer-remediation-when-provisioning branch from 2cb95f7 to 34c09b4 Compare December 13, 2023 13:47
@sbueringer
Copy link
Member

Thank you!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 13, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 8b2d3fd3f51ed46ce9495c7b02dc06aea85a4e7a

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 13, 2023
@k8s-ci-robot k8s-ci-robot merged commit 24ef14f into kubernetes-sigs:main Dec 13, 2023
21 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.7 milestone Dec 13, 2023
@fabriziopandini fabriziopandini deleted the defer-remediation-when-provisioning branch January 8, 2024 19:36
@sbueringer sbueringer mentioned this pull request Apr 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane Issues or PRs related to control-plane lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

controlplane rollout stuck when existing node and new adding node fail at same time
6 participants