Fix race in AbstractSearchAsyncAction request throttling #116264
Conversation
We had a race here where the non-blocking pending execution could be starved of executing threads. This happened when all the current holders of permits from the semaphore released their permits after a producer thread had failed to acquire a permit and then enqueued its task. To avoid this scenario, we need to peek the queue again after releasing the permit, and try to acquire a new permit if there is work left to be done.
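The pattern described above can be sketched as follows. This is an illustrative, self-contained reconstruction, not the actual Elasticsearch `PendingExecutions` code: the class and method names (`PendingSketch`, `enqueueAndMaybeRun`, `drainAndRelease`) are placeholders, but the peek-after-release step is the fix the PR describes.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class PendingSketch {
    private final Semaphore permits;
    private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();

    PendingSketch(int maxConcurrent) {
        permits = new Semaphore(maxConcurrent);
    }

    // Producer side: enqueue first, then try to become a worker.
    void enqueueAndMaybeRun(Runnable task) {
        queue.add(task);
        if (permits.tryAcquire()) {
            drainAndRelease();
        }
        // On a failed tryAcquire the producer relies on a current permit
        // holder to run its task - the starvation happened when every
        // holder released its permit without re-checking the queue.
    }

    private void drainAndRelease() {
        while (true) {
            Runnable task = queue.poll();
            while (task != null) {
                task.run();
                task = queue.poll();
            }
            permits.release();
            // The fix: peek the queue again after releasing, and try to
            // re-acquire a permit if there is still work, so a task
            // enqueued concurrently is never stranded.
            if (queue.peek() == null || permits.tryAcquire() == false) {
                return;
            }
        }
    }

    // Small stress demo: several producers, more tasks than permits.
    static int demo() throws InterruptedException {
        PendingSketch pending = new PendingSketch(2);
        AtomicInteger done = new AtomicInteger();
        Thread[] producers = new Thread[4];
        for (int t = 0; t < producers.length; t++) {
            producers[t] = new Thread(() -> {
                for (int i = 0; i < 250; i++) {
                    pending.enqueueAndMaybeRun(done::incrementAndGet);
                }
            });
            producers[t].start();
        }
        for (Thread p : producers) {
            p.join();
        }
        return done.get(); // with the re-check, all 1000 tasks have run
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo());
    }
}
```

Note the ordering in `drainAndRelease`: the re-check must happen after `release()`, because a producer may enqueue between the last failed `poll()` and the release.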
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Maybe important to note: this fix is a stop-gap measure only, to fix tests. An incoming follow-up will simplify this so that it is not as much of an issue any longer, by pre-computing the tasks per node and then running them (for the most part at least; some you can't pre-compute).
}
-    if (firstError != null) {
-        fail(firstError.getMessage());
+    assertHitCount(prepareSearch("test"), numOfDocs);
What is the rationale behind the changes in this test? I get confused because I thought the code fix below would be enough to fix this test failure that gets unmuted in the same PR.
Ah sorry, I had this cleanup in my reproducer branch and kept it because the test was super dirty from before, with the types-removal leftovers etc. Let me revert this real quick; I'll open a separate test cleanup :)
}

private void executeAndRelease(Consumer<Releasable> task) {
-    while (task != null) {
+    do {
This is not strictly necessary, but I figured it'd be nice to make it obvious that we never work on null here when entering the loop :)
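The point about never seeing null on loop entry can be shown with a small standalone sketch. The names here (`runAll`, `firstTask`) are placeholders for illustration, not the actual PR code; the shape of the loop is what matters: a `do`/`while` makes explicit that the body only runs with a non-null task.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class DoWhileShape {
    // Runs firstTask, then drains the queue; returns how many tasks ran.
    // The do/while documents the invariant: task is non-null in the body.
    public static int runAll(Runnable firstTask, Queue<Runnable> queue) {
        int executed = 0;
        Runnable task = firstTask; // caller guarantees non-null
        do {
            task.run();
            executed++;
            task = queue.poll();   // null once the queue is drained
        } while (task != null);
        return executed;
    }

    public static void main(String[] args) {
        Queue<Runnable> queue = new ArrayDeque<>();
        queue.add(() -> {});
        System.out.println(runAll(() -> {}, queue)); // prints 2
    }
}
```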
All cleaned up! Thanks Luca :)
I am good with getting the fix in. I do wonder if we could write contained unit tests for PendingExecutions that verify its behaviour. It sounds like that would make these issues easier to find and debug compared to IT tests.
Thanks Luca, I'll look into a reproducer UT :)
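A deterministic reproducer along those lines might look like the following. This is a sketch, not the actual Elasticsearch test: `StarvationRepro` embeds a hypothetical pre-fix worker loop (one that releases its permit without re-checking the queue), and latches force the exact losing interleaving from the description above.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class StarvationRepro {
    static final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
    static final Semaphore permits = new Semaphore(1);
    static final CountDownLatch workerDrained = new CountDownLatch(1);
    static final CountDownLatch producerGaveUp = new CountDownLatch(1);

    // Returns true iff the enqueued task ever ran.
    public static boolean taskRan() throws InterruptedException {
        CountDownLatch ran = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            permits.acquireUninterruptibly(); // worker holds the only permit
            Runnable task = queue.poll();     // queue still empty: task == null
            workerDrained.countDown();        // let the producer proceed
            try {
                producerGaveUp.await();       // pause until the producer failed tryAcquire
            } catch (InterruptedException e) {
                return;
            }
            while (task != null) { task.run(); task = queue.poll(); }
            permits.release();                // BUG: no peek of the queue after release
        });
        worker.start();

        workerDrained.await();
        queue.add(ran::countDown);            // producer enqueues its task...
        boolean acquired = permits.tryAcquire(); // ...and fails: worker holds the permit
        if (acquired) {
            permits.release();                // defensive; cannot happen in this interleaving
        }
        producerGaveUp.countDown();
        worker.join();

        // Without the peek-after-release fix, nobody ever runs the task.
        return ran.await(100, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(taskRan());
    }
}
```

The same harness, pointed at an implementation with the re-check, would observe the task running, which is what a contained PendingExecutions unit test could assert.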
…117426) We had a race here where the non-blocking pending execution would be starved of executing threads. This happened when all the current holders of permits from the semaphore would release their permit after a producer thread failed to acquire a permit and then enqueued its task. => need to peek the queue again after releasing the permit and try to acquire a new permit if there's work left to be done to avoid this scenario.
This is long fixed by elastic#116264
This is long fixed by elastic#116264 Fixes elastic#115728
…117638) We had a race here where the non-blocking pending execution would be starved of executing threads. This happened when all the current holders of permits from the semaphore would release their permit after a producer thread failed to acquire a permit and then enqueued its task. => need to peek the queue again after releasing the permit and try to acquire a new permit if there's work left to be done to avoid this scenario.
Also modernised the test a little: now that there are no more types, the loop can be cut in half and run for twice as many iterations, and the assertion is just a hit count assertion :)
non-issue as this hasn't been released yet.