Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix slowdown and stoppage in the main event loop #499

Merged
merged 4 commits into from
Oct 13, 2021

Conversation

mechanical-fish
Copy link
Contributor

Issue #, if available: #498

Description of changes:

  • Prevent the event loop from receiving stale events marked InProgress and exiting immediately, before reaching urgent events that need handling.

  • Periodically log the size of the event queue.

  • Periodically "garbage-collect" the event queue to trim already-handled events and slow the rate of its unbounded growth.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

- Prevent the event loop from exiting immediately if it receives a stale event
marked InProgress

- Periodically log the size of the event queue.

- Periodically "garbage-collect" the event queue to trim already-handled
events and slow the rate of its unbounded growth.
Copy link
Contributor

@bwagner5 bwagner5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think GC of the event store is a good work around. Also, in the drainOrCordonIfNecessary func, the err case can probably be used to delete the event from the store and rely on the sqs visibility timeout elapsing where another instance of NTH can retry or the same, wdyt?

https://github.com/aws/aws-node-termination-handler/pull/499/files#diff-858f040f931a5e934ecdcca9652193aa35727018161574b85a4c8abe06078fd4R368

@bwagner5 bwagner5 added the Type: Bug Something isn't working label Oct 12, 2021
@mechanical-fish
Copy link
Contributor Author

I think you're right that the correct thing to do is to delete the event completely if drainOrCordonIfNecessary's call to drainOrCordonNode or cordonNode return an error. After all, it looks like because the event has InProgress set it will never do anything again, and i'm definitely still seeing residual uncollected stale events after the garbage-collector runs; this might help eliminate that.

@mechanical-fish
Copy link
Contributor Author

mechanical-fish commented Oct 12, 2021

turns out the answer to "why didn't I put the logging and GC stuff into the event store itself?" is that I didn't want to risk creating deadlocks, but I believe the deadlock-detecting unit test has saved me from myself. I do think this might be a tidier implementation now.

@snay2 snay2 linked an issue Oct 12, 2021 that may be closed by this pull request
snay2
snay2 previously approved these changes Oct 12, 2021
@mechanical-fish
Copy link
Contributor Author

I have sneaked one more important bugfix into this PR: I got NTH to run out of worker processes, and it started trying to log messages as fast as it possibly could, which is not the desired behavior! Now it logs at most one message per second.

Copy link
Contributor

@bwagner5 bwagner5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm thanks for submitting this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

An event store bug slows event handling over time
3 participants