Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🌱 Add Node related condition to Machine conditions #3670

Merged

Conversation

sedefsavas
Copy link

What this PR does / why we need it:
This PR adds a condition for Node health to Machine conditions that will be managed by Machine controller.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes # #3138

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 21, 2020
@sedefsavas
Copy link
Author

cc @fabriziopandini

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 21, 2020
Copy link
Member

@fabriziopandini fabriziopandini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first pass

api/v1alpha3/condition_consts.go Show resolved Hide resolved
api/v1alpha3/condition_consts.go Show resolved Hide resolved
api/v1alpha3/condition_consts.go Outdated Show resolved Hide resolved
controllers/machine_controller.go Outdated Show resolved Hide resolved
controllers/machine_controller_noderef.go Outdated Show resolved Hide resolved
@sedefsavas sedefsavas force-pushed the machine_conditiononly branch from 6a2d6b7 to eb2fff6 Compare September 23, 2020 06:19
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 23, 2020
@sedefsavas sedefsavas marked this pull request as ready for review September 25, 2020 06:43
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 25, 2020
@sedefsavas
Copy link
Author

Unit test fails because CacheTracker panics in the tests that doesn't use testEnv. Probably will need to use testEnv for the tests using CacheTracker.

@sedefsavas sedefsavas force-pushed the machine_conditiononly branch from eb2fff6 to e2c6e86 Compare September 25, 2020 06:56
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 28, 2020
api/v1alpha3/condition_consts.go Outdated Show resolved Hide resolved
controllers/machine_controller_node.go Outdated Show resolved Hide resolved
controllers/machine_controller_node.go Outdated Show resolved Hide resolved
controllers/machine_controller_node_test.go Outdated Show resolved Hide resolved
controllers/machine_controller_node.go Outdated Show resolved Hide resolved
controllers/machine_controller_node.go Outdated Show resolved Hide resolved
@vincepri
Copy link
Member

/milestone v0.4.0

@k8s-ci-robot k8s-ci-robot added this to the v0.4.0 milestone Sep 29, 2020
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 29, 2020
@sedefsavas sedefsavas force-pushed the machine_conditiononly branch from 995c571 to 663a3eb Compare September 29, 2020 19:46
@sedefsavas
Copy link
Author

/test pull-cluster-api-test

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 19, 2020
@sedefsavas sedefsavas force-pushed the machine_conditiononly branch from 42b6249 to 52b6450 Compare October 20, 2020 05:22
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 20, 2020
@sedefsavas
Copy link
Author

This was ready to go, just solved conflicts.
cc @vincepri @fabriziopandini

@fabriziopandini
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 20, 2020
Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to take a bit more time to review but it seems we're potentially changing some behaviors that we should first chat about

}
conditions.MarkFalse(m, clusterv1.MachineNodeHealthyCondition, clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "")
if err := patchMachine(ctx, patchHelper, m); err != nil {
conditions.MarkFalse(m, clusterv1.MachineNodeHealthyCondition, clusterv1.DeletionFailedReason, clusterv1.ConditionSeverityInfo, "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect this to fail as well if the patch doesn't go through above

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Machines are patched in Reconcile after returning from reconcileDelete, so setting this in case it won't fail there.

controllers/machine_controller.go Outdated Show resolved Hide resolved
node, err := r.getNode(remoteClient, providerID)
if err != nil {
if err == ErrNodeNotFound {
// While a NodeRef is set in the status, failing to get that node means the node is deleted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if it's a transient error? The node might not be deleted

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are setting the condition and reconciling, if it is transient it will be fixed right away. Otherwise how to detect the transient error?

if err != nil {
if err == ErrNodeNotFound {
// While a NodeRef is set in the status, failing to get that node means the node is deleted.
// If Status.NodeRef is not set before, node still can be in the provisioning state.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is a little confusing, don't we set the node reference a few lines down?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this PR we were returning directly if the NodeRef is set.

if machine.Status.NodeRef != nil {

Removed that check to be able to check the health of the node.

And here, if node cannot be found when the NodeRef is already assigned, that means the node is deleted. But if NodeRef is not yet set, that means there was never a node before as NodeRef is nil, and hence it is not a NodeNotFoundReason issue.

controllers/machine_controller.go Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 20, 2020
Comment on lines 78 to 79
Kind: node.Kind,
APIVersion: node.APIVersion,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these properly populated? Usually they're not, and they need to be filled in manually

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is same with what was used before:

return &apicorev1.ObjectReference{
Kind: node.Kind,
APIVersion: node.APIVersion,
Name: node.Name,
UID: node.UID,

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 21, 2020
@sedefsavas
Copy link
Author

I had renamed machine_controller_noderef to machine_controller_node after addressing the previous comments.
But since there are new comments, reverted it back for better visibility.

cc @vincepri

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Oct 21, 2020

@sedefsavas: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-cluster-api-apidiff 6a11470 link /test pull-cluster-api-apidiff
pull-cluster-api-make-main 07e4fc3 link /test pull-cluster-api-make-main
pull-cluster-api-test-main 07e4fc3 link /test pull-cluster-api-test-main
pull-cluster-api-apidiff-main 07e4fc3 link /test pull-cluster-api-apidiff-main
pull-cluster-api-build-main 07e4fc3 link /test pull-cluster-api-build-main
pull-cluster-api-e2e-main 07e4fc3 link /test pull-cluster-api-e2e-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

controllers/machine_controller_noderef.go Outdated Show resolved Hide resolved
controllers/machine_controller_noderef.go Outdated Show resolved Hide resolved
controllers/machine_controller_noderef.go Outdated Show resolved Hide resolved
controllers/machine_controller_noderef.go Outdated Show resolved Hide resolved
controllers/machine_controller_noderef.go Outdated Show resolved Hide resolved
controllers/machine_controller_noderef.go Outdated Show resolved Hide resolved
controllers/machine_controller_noderef.go Outdated Show resolved Hide resolved
controllers/machine_controller_noderef.go Outdated Show resolved Hide resolved
controllers/machine_controller_noderef.go Outdated Show resolved Hide resolved
controllers/machine_controller_noderef.go Outdated Show resolved Hide resolved
@vincepri
Copy link
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 21, 2020
@fabriziopandini
Copy link
Member

@sedefsavas if you squash commits, then this one can be merged

MachineNodeHealthyCondition ConditionType = "NodeHealthy"

// WaitingForNodeRefReason (Severity=Info) documents a machine.spec.providerId is not assigned yet.
WaitingForNodeRefReason = "WaitingForNodeRef"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: what about

Suggested change
WaitingForNodeRefReason = "WaitingForNodeRef"
WaitingForNodeProvisioningReason = "WaitingForNodeProvisioning"

NodeRef is a bit too technical for a user that does not know CAPI internals

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a NodeProvisioningReason, I think it will be confusing.
WaitingForNodeRefReason is waiting for machine.Spec.ProviderID and
NodeProvisioningReason is waiting for machine.Status.NodeRef when there is no node found.

controllers/machine_controller_noderef.go Outdated Show resolved Hide resolved
@sedefsavas sedefsavas force-pushed the machine_conditiononly branch from 394b4bb to 9506423 Compare October 26, 2020 15:56
@fabriziopandini
Copy link
Member

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 26, 2020
@fabriziopandini
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2020
@k8s-ci-robot k8s-ci-robot merged commit cb3095b into kubernetes-sigs:release-0.3 Oct 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants