
Fix race in AbstractSearchAsyncAction request throttling #116264

Conversation

original-brownbear
Member

We had a race here where the non-blocking pending execution would be starved of executing threads.
This happened when all the current holders of permits from the semaphore released their permits after a producer thread had failed to acquire a permit and then enqueued its task.
=> we need to peek the queue again after releasing the permit and try to acquire a new permit if there's work left to be done, to avoid this scenario.
Also, modernised the test a little. Now that types are gone, the loop can be cut in half and run for twice as many iterations, and the assertion is just a hit-count assertion :)

Non-issue, as this hasn't been released yet.
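For illustration, here is a minimal, self-contained sketch of the pattern described above. The names (ThrottledPendingExecutions, submit, pollPending) are made up for this sketch; it is not the actual PendingExecutions code inside AbstractSearchAsyncAction.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: producers enqueue work when no permit is free, and every thread that
// releases a permit peeks the queue again, re-acquiring a permit if there is still work
// pending, so queued tasks are never starved of executing threads.
final class ThrottledPendingExecutions {

    private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final Semaphore semaphore;

    ThrottledPendingExecutions(int maxConcurrent) {
        this.semaphore = new Semaphore(maxConcurrent);
    }

    void submit(Runnable task) {
        if (semaphore.tryAcquire()) {
            executeAndRelease(task);
        } else {
            queue.add(task);
            // Re-check right away: all permit holders may have released between the failed
            // tryAcquire above and the enqueue, in which case nobody else will drain the queue.
            executeAndRelease(pollPending());
        }
    }

    private void executeAndRelease(Runnable task) {
        while (task != null) {
            try {
                task.run();
            } finally {
                semaphore.release();
            }
            // The fix: after releasing the permit, peek the queue again and try to acquire a
            // new permit if there is work left to be done.
            task = pollPending();
        }
    }

    // Returns a queued task (with a permit acquired for it) if one is pending, else null.
    private Runnable pollPending() {
        while (queue.peek() != null) {
            if (semaphore.tryAcquire() == false) {
                return null;
            }
            Runnable task = queue.poll();
            if (task != null) {
                return task;
            }
            semaphore.release(); // another thread took the task we peeked; hand the permit back
        }
        return null;
    }
}
```

The key line is the pollPending() call after semaphore.release(): without it, a task enqueued by a producer that lost the tryAcquire race could sit in the queue with no thread left to pick it up.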

@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations label (Meta label for the Search Foundations team in Elasticsearch) Nov 5, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@original-brownbear
Member Author

original-brownbear commented Nov 5, 2024

Maybe important to note: this fix is a stop-gap measure only, to fix tests. An incoming follow-up will simplify this so it isn't as much of an issue any longer, by pre-computing the tasks per node and then running them (for the most part at least; some you can't pre-compute).
This is all just needlessly complicated by the fact that we throttle on the fly as we loop through the per-shard tasks.
The locking(ish) approach here might be somewhat neatly reusable for response merging though :)
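For context, a rough sketch of what "pre-computing the tasks per node" could look like. ShardTask and PerNodeTaskPlanner are hypothetical names used for illustration only, not the actual Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the follow-up idea: group the per-shard tasks by target node up
// front and then execute each node's batch, instead of throttling on the fly while looping
// over all per-shard tasks.
final class PerNodeTaskPlanner {

    // Made-up representation of a per-shard task and the node it should run on.
    record ShardTask(String nodeId, Runnable work) {}

    static Map<String, List<Runnable>> groupByNode(List<ShardTask> shardTasks) {
        Map<String, List<Runnable>> byNode = new LinkedHashMap<>();
        for (ShardTask task : shardTasks) {
            byNode.computeIfAbsent(task.nodeId(), node -> new ArrayList<>()).add(task.work());
        }
        return byNode;
    }
}
```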

        }
        if (firstError != null) {
            fail(firstError.getMessage());
        }
        assertHitCount(prepareSearch("test"), numOfDocs);
Member

What is the rationale behind the changes in this test? I'm confused because I thought the code fix below would be enough to fix the test failure that gets unmuted in the same PR.

Member Author

Ah sorry, I had this cleanup in my reproducer branch and kept it because the test was super dirty from before, with the types-removal leftovers etc. Let me revert this real quick; I'll open a separate test cleanup :)

     }

     private void executeAndRelease(Consumer<Releasable> task) {
-        while (task != null) {
+        do {
Member Author

This is not strictly necessary but I figured it'd be nice to make it obvious that we never work on null here when entering the loop :)
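To illustrate the point, here is the executeAndRelease loop from the earlier ThrottledPendingExecutions sketch rewritten as a do/while; a simplified stand-in, not the actual diff.

```java
// Simplified stand-in reusing the hypothetical names from the ThrottledPendingExecutions
// sketch above. Written as do/while, the loop makes it explicit that the task we enter with
// is never null; callers therefore only invoke it with a task they actually obtained, e.g.:
//
//     Runnable next = pollPending();
//     if (next != null) {
//         executeAndRelease(next);
//     }
private void executeAndRelease(Runnable task) {
    do {
        try {
            task.run();
        } finally {
            semaphore.release();
        }
        // After releasing the permit, peek the queue again and pick up more work if possible.
        task = pollPending();
    } while (task != null);
}
```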

@original-brownbear
Member Author

All cleaned up! Thanks Luca :)

Member

@javanna javanna left a comment

I am good getting the fix in. I do wonder if we could write contained unit tests for PendingExecutions that verify its behaviour. Sounds like that would make these issues easier to find and debug compared to IT tests.
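A minimal sketch of the kind of contained unit test suggested here, written in plain JUnit 4 against the hypothetical ThrottledPendingExecutions sketch further up rather than the real PendingExecutions class: hammer submit() from several producer threads and assert that every task eventually runs exactly once.

```java
import static org.junit.Assert.assertEquals;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import org.junit.Test;

public class ThrottledPendingExecutionsTests {

    @Test
    public void allSubmittedTasksEventuallyRun() throws Exception {
        final int producers = 8;
        final int tasksPerProducer = 1000;
        final ThrottledPendingExecutions executions = new ThrottledPendingExecutions(3);
        final AtomicInteger executed = new AtomicInteger();
        final CountDownLatch start = new CountDownLatch(1);
        final CountDownLatch done = new CountDownLatch(producers);

        for (int i = 0; i < producers; i++) {
            new Thread(() -> {
                try {
                    start.await();
                    for (int j = 0; j < tasksPerProducer; j++) {
                        executions.submit(executed::incrementAndGet);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }

        start.countDown();
        done.await();
        // With the starvation bug, some enqueued tasks would never be picked up; after the
        // fix every submitted task runs before its producer returns from submit().
        assertEquals(producers * tasksPerProducer, executed.get());
    }
}
```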

@original-brownbear
Member Author

Thanks Luca, I'll look into a reproducer UT :)

@original-brownbear original-brownbear merged commit bcd6c1d into elastic:main Nov 7, 2024
16 checks passed
@original-brownbear original-brownbear deleted the fix-race-abstract-search-async-action branch November 7, 2024 14:23
kderusso pushed a commit to kderusso/elasticsearch that referenced this pull request Nov 7, 2024
jozala pushed a commit that referenced this pull request Nov 13, 2024
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 24, 2024
elasticsearchmachine pushed a commit (#117426) that referenced this pull request Nov 24, 2024
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 24, 2024
This is long fixed by elastic#116264
elasticsearchmachine pushed a commit that referenced this pull request Nov 24, 2024
This is long fixed by #116264

Fixes #115728
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 27, 2024
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
@original-brownbear original-brownbear restored the fix-race-abstract-search-async-action branch November 30, 2024 10:08
elasticsearchmachine pushed a commit (#117638) that referenced this pull request Dec 2, 2024
Labels
>non-issue
:Search Foundations/Search (Catch all for Search Foundations)
Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch)
v8.18.0
v9.0.0