Fix slowdown and stoppage in the main event loop #499

mechanical-fish · 2021-10-12T16:01:56Z

Issue #, if available: #498

Description of changes:

Prevent the event loop from receiving stale events marked InProgress and exiting immediately, before reaching urgent events that need handling.
Periodically log the size of the event queue.
Periodically "garbage-collect" the event queue to trim already-handled events and slow the rate of its unbounded growth.
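The first and third changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Event` and `Store` types, the `Completed` field, and the `NextActive` and `GC` methods are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical, simplified stand-in for an interruption event.
type Event struct {
	ID         string
	InProgress bool // a worker has already picked this event up
	Completed  bool // the node was drained or cordoned successfully
}

// Store is a hypothetical event store guarded by a single mutex.
type Store struct {
	mu     sync.Mutex
	events map[string]*Event
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{events: make(map[string]*Event)}
}

// Add records an event, replacing any earlier event with the same ID.
func (s *Store) Add(e *Event) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.events[e.ID] = e
}

// NextActive returns an event that still needs handling. Events that are
// already in progress or completed are skipped, so a stale InProgress
// event cannot hide an urgent one behind it.
func (s *Store) NextActive() (*Event, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, e := range s.events {
		if !e.InProgress && !e.Completed {
			return e, true
		}
	}
	return nil, false
}

// GC drops completed events and returns how many events remain.
func (s *Store) GC() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, e := range s.events {
		if e.Completed {
			delete(s.events, id)
		}
	}
	return len(s.events)
}

func main() {
	s := NewStore()
	s.Add(&Event{ID: "stale", InProgress: true})
	s.Add(&Event{ID: "done", Completed: true})
	s.Add(&Event{ID: "urgent"})

	if e, ok := s.NextActive(); ok {
		fmt.Println("next event:", e.ID)
	}
	fmt.Println("events left after GC:", s.GC())
}
```

Note that in this sketch GC only removes completed events, so an event stuck in InProgress survives collection.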

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


bwagner5

I think GC of the event store is a good workaround. Also, in the drainOrCordonIfNecessary func, the err case could probably delete the event from the store and rely on the SQS visibility timeout elapsing, after which the same or another instance of NTH can retry. wdyt?

https://github.com/aws/aws-node-termination-handler/pull/499/files#diff-858f040f931a5e934ecdcca9652193aa35727018161574b85a4c8abe06078fd4R368

cmd/node-termination-handler.go

mechanical-fish · 2021-10-12T17:17:54Z

I think you're right that the correct thing to do is to delete the event completely if drainOrCordonIfNecessary's call to drainOrCordonNode or cordonNode returns an error. After all, it looks like an event that has InProgress set will never do anything again, and I'm definitely still seeing residual uncollected stale events after the garbage collector runs; this might help eliminate that.
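A rough sketch of that idea, with hypothetical names throughout (`Store`, `Cancel`, `handle`, and `errUnreachable` are made up for illustration, and locking is omitted for brevity): when the drain or cordon step fails, the event is removed from the store instead of being left behind with InProgress set, so a later redelivery of the message can be handled as a fresh event.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable is a hypothetical failure returned by the drain step.
var errUnreachable = errors.New("node unreachable")

// Event is a simplified stand-in for an interruption event.
type Event struct {
	ID         string
	InProgress bool
}

// Store is a simplified event store; a real one would need locking.
type Store struct {
	events map[string]*Event
}

// Cancel forgets an event entirely so it can be handled again if it is redelivered.
func (s *Store) Cancel(id string) {
	delete(s.events, id)
}

// handle marks the event as in progress and runs the drain step.
// On failure it cancels the event rather than leaving it stuck.
func handle(s *Store, e *Event, drain func(*Event) error) error {
	e.InProgress = true
	if err := drain(e); err != nil {
		s.Cancel(e.ID)
		return fmt.Errorf("drain or cordon failed for %s: %w", e.ID, err)
	}
	return nil
}

func main() {
	s := &Store{events: map[string]*Event{"e1": {ID: "e1"}}}
	failing := func(*Event) error { return errUnreachable }

	err := handle(s, s.events["e1"], failing)
	fmt.Println("error:", err)
	fmt.Println("events left in store:", len(s.events))
}
```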

mechanical-fish · 2021-10-12T17:52:05Z

Turns out the answer to "why didn't I put the logging and GC stuff into the event store itself?" is that I didn't want to risk creating deadlocks, but I believe the deadlock-detecting unit test has saved me from myself. I do think this might be a tidier implementation now.
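For context on the deadlock concern: Go's `sync.Mutex` is not reentrant, so housekeeping that runs inside a store method must not try to take the store's lock a second time. One common pattern, sketched here with hypothetical names (`gcLocked`, `gcEvery`, and this `Store` are assumptions, not the PR's code), is to keep the housekeeping in an unexported helper that requires the caller to already hold the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// gcEvery is a hypothetical housekeeping interval, counted in Add calls.
const gcEvery = 3

// Store is a simplified event store that cleans up after itself.
type Store struct {
	mu     sync.Mutex
	events map[string]bool // event ID -> already handled?
	calls  int
}

// Add records an event and, every gcEvery calls, runs housekeeping
// while still holding the same lock.
func (s *Store) Add(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.events[id] = false
	s.calls++
	if s.calls%gcEvery == 0 {
		s.gcLocked()
	}
}

// MarkHandled flags an event as handled so the next GC can drop it.
func (s *Store) MarkHandled(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.events[id]; ok {
		s.events[id] = true
	}
}

// Len returns the number of events currently stored.
func (s *Store) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.events)
}

// gcLocked must be called with s.mu held. It never locks s.mu itself:
// locking again here would block forever.
func (s *Store) gcLocked() {
	for id, handled := range s.events {
		if handled {
			delete(s.events, id)
		}
	}
	fmt.Printf("event store size after GC: %d\n", len(s.events))
}

func main() {
	s := &Store{events: make(map[string]bool)}
	s.Add("a")
	s.MarkHandled("a")
	s.Add("b")
	s.Add("c") // third call triggers housekeeping and drops "a"
	fmt.Println("events stored:", s.Len())
}
```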

mechanical-fish · 2021-10-13T14:41:24Z

I have sneaked one more important bugfix into this PR: I got NTH to run out of worker processes, and it started trying to log messages as fast as it possibly could, which is not the desired behavior! Now it logs at most one message per second.

bwagner5

/lgtm thanks for submitting this!

bwagner5 reviewed Oct 12, 2021

bwagner5 added the Type: Bug Something isn't working label Oct 12, 2021

bwagner5 reviewed Oct 12, 2021

cmd/node-termination-handler.go (outdated)

mechanical-fish added 2 commits October 12, 2021 13:21

Cancel events which fail to drain or cordon

495d08f

Move event store logging and GC into the store itself

b6735c7

snay2 linked an issue Oct 12, 2021 that may be closed by this pull request

An event store bug slows event handling over time #498

Closed

snay2 previously approved these changes Oct 12, 2021

Don't print more than 1 error per second when running out of workers

3697b83

mechanical-fish dismissed snay2’s stale review via 3697b83 October 13, 2021 14:39

bwagner5 approved these changes Oct 13, 2021

bwagner5 merged commit 425faa2 into aws:main Oct 13, 2021

ec2-bot mentioned this pull request Oct 18, 2021

🥳 node-termination-handler v1.13.4 Automated Release! 🥑 aws/eks-charts#614

Merged

mechanical-fish deleted the fix-get-active-event-bug branch October 18, 2021 20:50
