
Fix race in AbstractSearchAsyncAction request throttling #116264

Conversation

original-brownbear
Member

We had a race here where the non-blocking pending execution would be starved of executing threads.
This happened when all the current holders of permits from the semaphore released their permits after a producer thread had failed to acquire a permit and then enqueued its task.
=> we need to peek the queue again after releasing the permit and try to acquire a new permit if there's work left to be done, to avoid this scenario.
Also, modernised the test a little. Now that types are gone, the loop can be cut in half and run for twice as many iterations, and the assertion is just a hit-count assertion :)

Non-issue, as this hasn't been released yet.
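For illustration, here is a minimal, self-contained sketch of the pattern described above. The names (ThrottledPendingExecutions, submit, pollPending) are made up for this sketch; it is not the actual PendingExecutions code inside AbstractSearchAsyncAction.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: producers enqueue work when no permit is free, and every thread that
// releases a permit peeks the queue again, re-acquiring a permit if there is still work
// pending, so queued tasks are never starved of executing threads.
final class ThrottledPendingExecutions {

    private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final Semaphore semaphore;

    ThrottledPendingExecutions(int maxConcurrent) {
        this.semaphore = new Semaphore(maxConcurrent);
    }

    void submit(Runnable task) {
        if (semaphore.tryAcquire()) {
            executeAndRelease(task);
        } else {
            queue.add(task);
            // Re-check right away: all permit holders may have released between the failed
            // tryAcquire above and the enqueue, in which case nobody else will drain the queue.
            executeAndRelease(pollPending());
        }
    }

    private void executeAndRelease(Runnable task) {
        while (task != null) {
            try {
                task.run();
            } finally {
                semaphore.release();
            }
            // The fix: after releasing the permit, peek the queue again and try to acquire a
            // new permit if there is work left to be done.
            task = pollPending();
        }
    }

    // Returns a queued task (with a permit acquired for it) if one is pending, else null.
    private Runnable pollPending() {
        while (queue.peek() != null) {
            if (semaphore.tryAcquire() == false) {
                return null;
            }
            Runnable task = queue.poll();
            if (task != null) {
                return task;
            }
            semaphore.release(); // another thread took the task we peeked; hand the permit back
        }
        return null;
    }
}
```

The key line is the pollPending() call after semaphore.release(): without it, a task enqueued by a producer that lost the tryAcquire race could sit in the queue with no thread left to pick it up.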

@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations label (Meta label for the Search Foundations team in Elasticsearch) Nov 5, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@original-brownbear
Member Author

original-brownbear commented Nov 5, 2024

Maybe important to note: this fix is a stop-gap measure only, to fix tests. An incoming follow-up will simplify this so it isn't as much of an issue any longer, by pre-computing the tasks per node and then running them (for the most part at least; some you can't pre-compute).
This is all just needlessly complicated by the fact that we throttle on the fly as we loop through the per-shard tasks.
The locking(ish) approach here might be somewhat neatly reusable for response merging though :)
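For context, a rough sketch of what "pre-computing the tasks per node" could look like. ShardTask and PerNodeTaskPlanner are hypothetical names used for illustration only, not the actual Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the follow-up idea: group the per-shard tasks by target node up
// front and then execute each node's batch, instead of throttling on the fly while looping
// over all per-shard tasks.
final class PerNodeTaskPlanner {

    // Made-up representation of a per-shard task and the node it should run on.
    record ShardTask(String nodeId, Runnable work) {}

    static Map<String, List<Runnable>> groupByNode(List<ShardTask> shardTasks) {
        Map<String, List<Runnable>> byNode = new LinkedHashMap<>();
        for (ShardTask task : shardTasks) {
            byNode.computeIfAbsent(task.nodeId(), node -> new ArrayList<>()).add(task.work());
        }
        return byNode;
    }
}
```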

        }
        if (firstError != null) {
            fail(firstError.getMessage());
        }
        assertHitCount(prepareSearch("test"), numOfDocs);
Member

What is the rationale behind the changes in this test? I'm confused because I thought the code fix below would be enough to fix the test failure that gets unmuted in the same PR.

Member Author

Ah sorry, I had this cleanup in my reproducer branch and kept it because the test was super dirty from before, with the types-removal leftovers etc. Let me revert this real quick; I'll open a separate test cleanup :)

     }

     private void executeAndRelease(Consumer<Releasable> task) {
-        while (task != null) {
+        do {
Member Author

This is not strictly necessary but I figured it'd be nice to make it obvious that we never work on null here when entering the loop :)
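To illustrate the point, here is the executeAndRelease loop from the earlier ThrottledPendingExecutions sketch rewritten as a do/while; a simplified stand-in, not the actual diff.

```java
// Simplified stand-in reusing the hypothetical names from the ThrottledPendingExecutions
// sketch above. Written as do/while, the loop makes it explicit that the task we enter with
// is never null; callers therefore only invoke it with a task they actually obtained, e.g.:
//
//     Runnable next = pollPending();
//     if (next != null) {
//         executeAndRelease(next);
//     }
private void executeAndRelease(Runnable task) {
    do {
        try {
            task.run();
        } finally {
            semaphore.release();
        }
        // After releasing the permit, peek the queue again and pick up more work if possible.
        task = pollPending();
    } while (task != null);
}
```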

@original-brownbear
Member Author

All cleaned up! Thanks Luca :)

Member

@javanna javanna left a comment

I am good getting the fix in. I do wonder if we could write contained unit tests for PendingExecutions that verify its behaviour. Sounds like that would make these issues easier to find and debug compared to IT tests.
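A minimal sketch of the kind of contained unit test suggested here, written in plain JUnit 4 against the hypothetical ThrottledPendingExecutions sketch further up rather than the real PendingExecutions class: hammer submit() from several producer threads and assert that every task eventually runs exactly once.

```java
import static org.junit.Assert.assertEquals;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import org.junit.Test;

public class ThrottledPendingExecutionsTests {

    @Test
    public void allSubmittedTasksEventuallyRun() throws Exception {
        final int producers = 8;
        final int tasksPerProducer = 1000;
        final ThrottledPendingExecutions executions = new ThrottledPendingExecutions(3);
        final AtomicInteger executed = new AtomicInteger();
        final CountDownLatch start = new CountDownLatch(1);
        final CountDownLatch done = new CountDownLatch(producers);

        for (int i = 0; i < producers; i++) {
            new Thread(() -> {
                try {
                    start.await();
                    for (int j = 0; j < tasksPerProducer; j++) {
                        executions.submit(executed::incrementAndGet);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }

        start.countDown();
        done.await();
        // With the starvation bug, some enqueued tasks would never be picked up; after the
        // fix every submitted task runs before its producer returns from submit().
        assertEquals(producers * tasksPerProducer, executed.get());
    }
}
```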

@original-brownbear
Member Author

Thanks Luca, I'll look into a reproducer UT :)

@original-brownbear original-brownbear merged commit bcd6c1d into elastic:main Nov 7, 2024
16 checks passed
@original-brownbear original-brownbear deleted the fix-race-abstract-search-async-action branch November 7, 2024 14:23
kderusso pushed a commit to kderusso/elasticsearch that referenced this pull request Nov 7, 2024
jozala pushed a commit that referenced this pull request Nov 13, 2024
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 24, 2024
elasticsearchmachine pushed a commit (#117426) that referenced this pull request Nov 24, 2024
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 24, 2024
This is long fixed by elastic#116264
elasticsearchmachine pushed a commit that referenced this pull request Nov 24, 2024
This is long fixed by #116264

Fixes #115728
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 27, 2024
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
@original-brownbear original-brownbear restored the fix-race-abstract-search-async-action branch November 30, 2024 10:08
elasticsearchmachine pushed a commit (#117638) that referenced this pull request Dec 2, 2024
Labels
>non-issue
:Search Foundations/Search (Catch all for Search Foundations)
Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch)
v8.18.0
v9.0.0