multithreaded event processor #317
Conversation
I have mixed feelings about this
Certainly for IMDS mode, this is unnecessary. The event queue is restricted to each node and doesn't change often.
For Queue Processor mode, I see the benefit mostly for medium-sized clusters.
For small clusters I wouldn't think there are so many events coming in that this is very helpful. I'm not sure this scales well for large clusters either; to me the better solution would be to run multiple replicas of the pod.
I wasn't as involved with Queue Processor mode, though, so let me sync up with @bwagner5 next week to get more details. In the meantime, if you can give me some more details about the setup you're running (cluster size, time taken for events to be processed, etc.), maybe I'll be more convinced.
Thank you for your reply. I think one fact that was not taken into account when moving to the queue: a DaemonSet was effectively multithreaded execution with n workers, where n == node count. The queue processor, however, is strictly single-threaded and cannot keep up when the loop is busy. There are a couple of reasons the loop can be busy or blocked in retries (e.g. evicting a pod limited by a PDB), and it eventually misses other events. I will post an example to reproduce later: simply roll more than one node at a time (batchSize) and have one unevictable pod, and the issue reproduces. Maybe AWS doesn't operate this itself yet?
@haugenj I simplified the approach by calling the goroutine directly from main, which requires only small changes. Hopefully the underlying single-threaded problem is clearer now.
@universam1 thank you for your patience, this looks good to me. Can you also add the config variable to the Readme and Helm charts? Example: https://github.com/aws/aws-node-termination-handler/pull/312/files
Force-pushed from f9f57f4 to 583e2b0
@haugenj Thank you - certainly, I updated the Helm chart and Readme. Please let me know if anything is missing!
I'd suggest adding a WaitGroup to track the drainOrCordonIfNecessary goroutines.
Everything else looks great! Thanks for taking this on and contributing! 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks!
Issue #310
Description of changes:
This replaces the single-process execution of events with parallel processing, solving the issue that occurs when NTH is busy / blocked (retrying an eviction) and eventually misses events for other nodes going down at the same time.
Example:
3 nodes roll at a time because of batchSize or a spot interruption.
A deployment has a PDB limit of maxUnavailable: 1 - that blocks NTH in an eviction retry loop, and it misses the third node's eviction.
The number of workers is capped to prevent memory runaway.