Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dennisme/rebase pod events #631

Closed
wants to merge 5 commits into from

Conversation

dennisme
Copy link

@dennisme dennisme commented Apr 21, 2022

Issue #635, if available:

Description of changes:
This PR rebases an earlier PR on our fork by @dthorsen that adds pod eventing to draining. We have been running this for a few months and it does help a ton with visibility.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dennisme dennisme requested a review from a team as a code owner April 21, 2022 18:21
@dennisme
Copy link
Author

Tests failing locally

2022/04/21 18:46:09 INF Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2022/04/21 18:46:09 INF Pods on node node_name=e2e-test-7621cf83-worker pod_names=["e2e-test-7621cf83-aemm-amazon-ec2-metadata-mock-5445f44f8fjk9zw","regular-pod-test-b5fb5fc6-jfttf","webhook-test-proxy-hdm49","e2e-test-7621cf83-anth-aws-node-termination-handler-j5zkb","kindnet-mcj95","kube-proxy-vhc7n"]
2022/04/21 18:46:09 INF Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2022/04/21 18:46:09 INF Draining the node
2022/04/21 18:46:09 INF Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1b39097]

goroutine 113 [running]:
github.com/aws/aws-node-termination-handler/pkg/observability.(*K8sEventRecorder).AnnotatedEventf(0xc000848200, 0x24d7808, 0xc00044fc70, 0xc000873350, 0x21a2e27, 0x6, 0x21a8219, 0xb, 0x21c484b, 0x1d, ...)
	<autogenerated>:1 +0x37
github.com/aws/aws-node-termination-handler/pkg/node.Node.CordonAndDrain(0x0, 0xc00004003a, 0x18, 0xc00003e059, 0x39, 0xc000048016, 0x46, 0x101, 0xc0000401f8, 0x9, ...)
	/node-termination-handler/pkg/node/node.go:140 +0x515
main.cordonAndDrainNode(0x0, 0xc00004003a, 0x18, 0xc00003e059, 0x39, 0xc000048016, 0x46, 0x101, 0xc0000401f8, 0x9, ...)
	/node-termination-handler/cmd/node-termination-handler.go:404 +0x12b
main.drainOrCordonIfNecessary(0xc000256240, 0xc0006741e0, 0x0, 0xc00004003a, 0x18, 0xc00003e059, 0x39, 0xc000048016, 0x46, 0x101, ...)
	/node-termination-handler/cmd/node-termination-handler.go:354 +0x665
created by main.main
	/node-termination-handler/cmd/node-termination-handler.go:247 +0x1107
⏰ Took 988sec

@brycahta
Copy link
Contributor

brycahta commented May 3, 2022

@dennisme thanks for the PR. Could you include the issue # or requested feature in the description? If no issue exists, please open one with some context as to why these changes are needed/beneficial.

@dennisme
Copy link
Author

dennisme commented May 3, 2022

@dennisme thanks for the PR. Could you include the issue # or requested feature in the description? If no issue exists, please open one with some context as to why these changes are needed/beneficial.

Yes of course! Also next commit will fix the test 😬

@dennisme dennisme force-pushed the dennisme/rebase-pod-events branch from 2f6d527 to 22a31cb Compare May 4, 2022 06:32
@dennisme
Copy link
Author

dennisme commented May 4, 2022

ok we should be good, test fixed :)

@@ -61,7 +63,8 @@ func TestDryRun(t *testing.T) {
tNode, err := node.New(config.Config{DryRun: true})
h.Ok(t, err)

err = tNode.CordonAndDrain(nodeName, "cordonReason")
fakeRecorder := record.NewFakeRecorder(100)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Make this const please
  • Is it necessary for the buffer size to be this large?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, but if it set to small the tests will hang indefinitely.

@@ -98,6 +101,7 @@ func TestNewFailure(t *testing.T) {
}

func TestDrainSuccess(t *testing.T) {
controllerBool := true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update to be more descriptive, ex: isOwnerController

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call

h.Ok(t, err)
close(fakeRecorder.Events)
Copy link
Contributor

@brycahta brycahta May 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to explicitly close this channel? If so, please update and use defer in case h.Ok fails.

@brycahta brycahta requested a review from snay2 May 9, 2022 19:04
// PodEvictReason is the event reason emitted for Pod evictions during node drain
PodEvictReason = "PodEviction"
// PodEvictMsg is the event message emitted for Pod evictions during node drain
PodEvictMsg = "Pod evicted due to node drain"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this sufficient detail for the event? Do you need it to specify that node-termination-handler initiated the drain in response to a particular AWS event (Spot ITN, ASG event, etc.)?

for k, v := range pod.GetLabels() {
annotations[k] = v
}
recorder.AnnotatedEventf(podRef, annotations, "Normal", PodEvictReason, PodEvictMsg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should "Normal" also be in a constant, like PodEvictReason and PodEvictMsg? What are the other possible values for that argument?

@@ -113,7 +113,7 @@ func InitK8sEventRecorder(enabled bool, nodeName string, sqsMode bool, nodeMetad
}

broadcaster := record.NewBroadcaster()
broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: clientSet.CoreV1().Events("default")})
broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: clientSet.CoreV1().Events("")})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide some context on this change? What is the difference in behavior between "default" and ""?

@@ -800,3 +827,7 @@ func filterPodForDeletion(podName string) func(pod corev1.Pod) drain.PodDeleteSt
return drain.MakePodDeleteStatusOkay()
}
}

type recorderInterface interface {
AnnotatedEventf(object runtime.Object, annotations map[string]string, eventtype, reason, messageFmt string, args ...interface{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: eventtype -> eventType

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good catch. I agree.

@github-actions
Copy link

This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this PR to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label May 25, 2022
@github-actions
Copy link

This PR was closed because it has become stale with no activity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Issues / PRs with no activity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants