feat: drain and volume detachment status conditions #1876

Open · wants to merge 1 commit into main

Conversation

@jmdeal (Member) commented Dec 11, 2024

Fixes #N/A

Description
Adds status conditions for node drain and volume detachment to improve observability for the individual termination stages. This is a scoped down version of #1837, which takes these changes along with splitting each termination stage into a separate controller. I will continue to work on that refactor, but I'm decoupling to work on higher priority work.
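For context, here is a minimal, self-contained sketch of how the two new conditions might read on a NodeClaim mid-termination. The condition type names mirror the constants used in the diff below (v1.ConditionTypeDrained, v1.ConditionTypeVolumesDetached); the string values and the simplified Condition struct are assumptions for illustration only.

package main

import "fmt"

// Assumed string values for the two new condition types; the real constants
// live in karpenter's v1 API package.
const (
	ConditionTypeDrained         = "Drained"
	ConditionTypeVolumesDetached = "VolumesDetached"
)

// Condition is a simplified stand-in for the Kubernetes condition shape.
type Condition struct {
	Type, Status, Reason, Message string
}

func main() {
	// Mid-termination, a NodeClaim might report that the node is still
	// draining and that its volumes have not finished detaching.
	conditions := []Condition{
		{Type: ConditionTypeDrained, Status: "False", Reason: "Draining", Message: "Draining"},
		{Type: ConditionTypeVolumesDetached, Status: "False", Reason: "AwaitingVolumeDetachment", Message: "AwaitingVolumeDetachment"},
	}
	for _, c := range conditions {
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
	}
}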

How was this change tested?
make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot added the cncf-cla: yes label on Dec 11, 2024
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jmdeal
Once this PR has been reviewed and has the lgtm label, please assign jonathan-innis for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/L label on Dec 11, 2024

coveralls commented Dec 11, 2024

Pull Request Test Coverage Report for Build 12284394559

Details

  • 54 of 85 (63.53%) changed or added relevant lines in 2 files are covered.
  • 11 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.2%) to 80.684%

Changes Missing Coverage:
  pkg/controllers/node/termination/controller.go: 45 of 76 changed/added lines covered (59.21%)

Files with Coverage Reduction:
  pkg/utils/termination/termination.go: 4 new missed lines (87.18%)
  pkg/controllers/node/termination/controller.go: 7 new missed lines (63.72%)

Totals:
  Change from base Build 12266383791: -0.2%
  Covered Lines: 8960
  Relevant Lines: 11105

💛 - Coveralls

@engedaam (Contributor)

/assign @engedaam

if cloudprovider.IsNodeClaimNotFoundError(err) {
return reconcile.Result{}, c.removeFinalizer(ctx, node)
stored := nodeClaim.DeepCopy()
if modified := nodeClaim.StatusConditions().SetFalse(v1.ConditionTypeDrained, "Draining", "Draining"); modified {
Contributor

Do we want both Reason and Message to be Draining? Any extra details we can add here?

Contributor

Scoping this to the drain error handling block also means that we're not going to be adding this status condition if the node was empty in the first place. From a functionality perspective this is fine, but it also makes it a bit confusing to trace the steps in history later. Thoughts on this?

Member Author

> Do we want both Reason and Message to be Draining? Any extra details we can add here?

It would be nice, but would result in a lot of additional writes to the resource. That's why I opted to leave additional information on the event where it can be appropriately deduped.

> Scoping this to the drain error handling block also means that we're not going to be adding this status condition if the node was empty in the first place.

Yeah, this is intentional. If there were no drainable pods on the Node in the first place, it wouldn't make sense to transition the status condition to False; we should transition directly from Unknown to True.
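A rough sketch of the transition being described, assuming a hypothetical hasDrainablePods helper and a SetTrue counterpart to the SetFalse call shown in the diff; this is not the controller's actual code.

// Sketch only: hasDrainablePods is a hypothetical helper, and the real
// controller's error handling and status patching are omitted.
if hasDrainablePods(node) {
	// Pods still need to be evicted, so surface progress as Drained=False.
	nodeClaim.StatusConditions().SetFalse(v1.ConditionTypeDrained, "Draining", "Draining")
} else {
	// Nothing to drain: go straight from Unknown to True without ever
	// reporting False.
	nodeClaim.StatusConditions().SetTrue(v1.ConditionTypeDrained)
}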

Contributor

> It would be nice, but would result in a lot of additional writes to the resource. That's why I opted to leave additional information on the event where it can be appropriately deduped.

Maybe we can do this as a follow-up, but it'd be interesting to have the reason here reflect which group of pods we're currently draining (e.g. non-critical daemon, critical daemon, non-critical non-daemon, critical non-daemon).

I agree with your second point.

Member Author

Discussed some additional follow-ups: we could set the reason based on the group of pods currently being evicted (e.g. critical, system-critical, etc.). This would result in up to 4 additional writes per Node. We're going to decouple this for now, but I'll open an issue to track it as an additional feature once this has merged.
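As a sketch of that follow-up (every reason string and the currentGroup variable below are assumptions, not agreed-upon names), the reason could be derived from the eviction group currently being processed:

// Hypothetical mapping from eviction group to condition reason; at most one
// extra status write per group, i.e. up to four per Node.
reasonForGroup := map[string]string{
	"non-critical non-daemon": "DrainingPods",
	"non-critical daemon":     "DrainingDaemonSetPods",
	"critical non-daemon":     "DrainingCriticalPods",
	"critical daemon":         "DrainingCriticalDaemonSetPods",
}
nodeClaim.StatusConditions().SetFalse(v1.ConditionTypeDrained, reasonForGroup[currentGroup], "Draining")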

Comment on lines +107 to +108
if err := c.kubeClient.Delete(ctx, nodeClaim); err != nil {
return reconcile.Result{}, client.IgnoreNotFound(fmt.Errorf("deleting nodeclaim, %w", err))
Contributor

Wouldn't we want to client.IgnoreNotFound in the error handling block so that we continue with the rest of the controller?

Member Author (@jmdeal), Dec 18, 2024

If the nodeClaim isn't found, we shouldn't be able to proceed with the rest of the loop anyway; this is just a short-circuit. Same answer for anywhere else we short-circuit on NotFound.

Member Author

Discussed offline: we'll return and requeue if the NodeClaim isn't found. An appropriate error message will then be printed by the initial get call on the next reconciliation, and we'll return without requeueing.
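A sketch of the agreed-upon shape (not the committed code), assuming apierrors is k8s.io/apimachinery/pkg/api/errors:

if err := c.kubeClient.Delete(ctx, nodeClaim); err != nil {
	if apierrors.IsNotFound(err) {
		// Requeue instead of swallowing the error; the next reconciliation's
		// initial get will report NotFound and return without requeueing.
		return reconcile.Result{Requeue: true}, nil
	}
	return reconcile.Result{}, fmt.Errorf("deleting nodeclaim, %w", err)
}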

Comment on lines +143 to +147
if err := c.kubeClient.Status().Patch(ctx, nodeClaim, client.MergeFromWithOptions(stored, client.MergeFromWithOptimisticLock{})); err != nil {
if errors.IsConflict(err) {
return reconcile.Result{Requeue: true}, nil
}
return reconcile.Result{}, fmt.Errorf("getting nodeclaim, %w", err)
return reconcile.Result{}, client.IgnoreNotFound(err)
Contributor

Shouldn't we be doing ignore not found in this block and just continuing if it doesn't exist?

Comment on lines +161 to +163
NodesDrainedTotal.Inc(map[string]string{
metrics.NodePoolLabel: node.Labels[v1.NodePoolLabelKey],
})
Contributor

Just an observation that we still emit this metric even if we didn't do any draining.

Member Author

I think this is what we want. I'm considering "drained" the end state, not the process. It would also be confusing / concerning to me as an operator if the total number of nodes drained was less than the total number of nodes terminated, since that would indicate to me that Karpenter is terminating nodes without draining them.

We do need to check, though, that the node actually drained successfully and that we haven't passed over the drain block due to TGP expiration. If that's what you were calling out, you're right and I'll address that.

Member Author

Discussed offline: this isn't actually an issue with drain, but it is an issue with VolumesDetached. We shouldn't set that condition to true if we proceeded due to TGP expiration.
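A rough sketch of that follow-up, with assumed variable names (volumesDetached, tgpExpired) and an assumed reason string:

// Only report VolumesDetached=True when detachment actually finished; if we
// moved on because the terminationGracePeriod expired, record that instead.
if volumesDetached {
	nodeClaim.StatusConditions().SetTrue(v1.ConditionTypeVolumesDetached)
} else if tgpExpired {
	nodeClaim.StatusConditions().SetFalse(v1.ConditionTypeVolumesDetached, "TerminationGracePeriodElapsed", "TerminationGracePeriodElapsed")
}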

// getting the NodeClaim again. This prevents conflict errors on subsequent writes.
// USE CAUTION when determining whether to increase this timeout or remove this line
time.Sleep(time.Second)
nodeClaim, err = nodeutils.NodeClaimForNode(ctx, c.kubeClient, node)
Contributor

Do you think we can just return here and requeue the controller?

Member Author

The main reason I didn't do that was the additional testability burden. This would increase the number of reconciliations required for the termination controller, and requiring multiple reconciliations for instance termination is already hard enough to reason about; I'd really rather not increase that further.

Long-term I'm still tracking #1837, which will split these stages into individual controllers or sub-reconcilers and address this.
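For illustration, one possible shape of that split (names assumed, not taken from #1837):

// Hypothetical per-stage interface: drain, volume detachment, and instance
// termination would each implement this and be run in order by the
// termination controller.
type terminationStage interface {
	Reconcile(ctx context.Context, node *corev1.Node, nodeClaim *v1.NodeClaim) (reconcile.Result, error)
}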

Member Author

Discussed offline: we'll return and requeue after 1 second rather than sleeping.
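A sketch of that requeue-after approach, replacing the sleep shown above; exact placement in the controller is assumed.

// Instead of sleeping and re-fetching the NodeClaim in the same reconciliation:
//   time.Sleep(time.Second)
//   nodeClaim, err = nodeutils.NodeClaimForNode(ctx, c.kubeClient, node)
// hand the delay back to controller-runtime and read a fresh NodeClaim on the
// next pass.
return reconcile.Result{RequeueAfter: time.Second}, nil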

c.recorder.Publish(terminatorevents.NodeAwaitingVolumeDetachmentEvent(node))
stored := nodeClaim.DeepCopy()
if modified := nodeClaim.StatusConditions().SetFalse(v1.ConditionTypeVolumesDetached, "AwaitingVolumeDetachment", "AwaitingVolumeDetachment"); modified {
if err := c.kubeClient.Status().Patch(ctx, nodeClaim, client.MergeFromWithOptions(stored, client.MergeFromWithOptimisticLock{})); err != nil {
Contributor

Same as above: ignore NotFound here rather than at L194.

// getting the NodeClaim again. This prevents conflict errors on subsequent writes.
// USE CAUTION when determining whether to increase this timeout or remove this line
time.Sleep(time.Second)
nodeClaim, err = nodeutils.NodeClaimForNode(ctx, c.kubeClient, node)
Contributor

Same as above on just returning and requeueing.

Labels
cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
size/L (denotes a PR that changes 100-499 lines, ignoring generated files)
5 participants