
Increase the minimum size of the management pool to 2 #90193

Merged — arteam merged 4 commits from increase-size-of-management-pool into elastic:main on Sep 21, 2022

Conversation

@arteam (Contributor) commented on Sep 21, 2022

Make sure that individual `wait_for_completion=true` requests don't block concurrent stats and cat requests.

See #89564
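For context, here is a minimal sketch of the sizing rule in question. It is illustrative only, not the literal diff: the method name and the clamp bounds are assumptions based on the PR title and the scaling-pool pattern. The management pool scales with the node's allocated processors, clamped to a small range, and this change raises the lower bound of that clamp from 1 to 2.

```java
// Illustrative sketch, not the actual Elasticsearch source: the management
// thread pool scales with the node's allocated processors, clamped to a
// small [min, max] range. Raising min from 1 to 2 guarantees that a single
// blocking wait_for_completion=true request cannot occupy the only thread
// and starve concurrent stats and cat requests.
static int managementPoolSize(int allocatedProcessors) {
    final int min = 2; // previously 1, which allowed one-thread pools
    final int max = 5; // assumed upper bound for illustration
    return Math.min(max, Math.max(min, allocatedProcessors));
}
```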

@arteam added the :Distributed/Distributed, auto-backport, v8.5.0, >bug, and >enhancement labels and removed the >bug label on Sep 21, 2022
@elasticsearchmachine (Collaborator) commented:
Hi @arteam, I've created a changelog YAML for you.

@arteam marked this pull request as ready for review on September 21, 2022 16:13
@elasticsearchmachine added the Team:Distributed label on Sep 21, 2022
@elasticsearchmachine (Collaborator) commented:
Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner (Contributor) left a comment:

I don't think this fixes the bug that #89564 exposes, unfortunately: if there were two concurrent calls to the list-tasks API, they could both deadlock within `TaskManager#waitForTaskCompletion` even after this change.

Instead, I think we should make `TaskManager#waitForTaskCompletion` async rather than blocking the management thread as it does today.
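To make the failure mode concrete, here is a small self-contained demonstration (not Elasticsearch code; the pool and latch are stand-ins) of why a blocking wait on a one-thread pool deadlocks: the work that would signal completion is queued behind the very thread that is waiting for it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SingleThreadDeadlockDemo {
    public static void main(String[] args) {
        // Stand-in for a management pool with only one thread.
        ExecutorService management = Executors.newFixedThreadPool(1);
        CountDownLatch taskFinished = new CountDownLatch(1);

        // A wait_for_completion=true handler: blocks its pool thread until
        // some other piece of work signals that the task has finished.
        management.submit(() -> {
            try {
                taskFinished.await(); // occupies the only management thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // The work that would signal completion is queued behind the blocked
        // thread, so it never runs: the program hangs. With two threads (or
        // an async, listener-based wait) it completes immediately.
        management.submit(taskFinished::countDown);
    }
}
```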

@henningandersen (Contributor) left a comment:

LGTM as a short-term fix, provided that we follow up with a proper fix for the task-management wait and, in that fix, lower the minimum again to 1.

@DaveCTurner dismissed their stale review on September 21, 2022 16:34

discussion in another channel

@droberts195 (Contributor) commented:

> Instead, I think we should make `TaskManager#waitForTaskCompletion` async rather than blocking the management thread as it does today.

I agree that is the ideal fix. However, if it cannot be done in time for 8.5.0, then increasing the minimum number of threads to 2 at least keeps ESS in the same situation as in 8.4 and earlier. In 8.4 and earlier, ESS told every node it had at least 2 processors to avoid strange lockups, and those lockups (undiagnosed when they were originally reported) were almost certainly the same problem as #89564. In 8.5, ESS is going to tell nodes with one processor that they have one processor, so ESS users on small nodes will hit those lockups unless something is done in Elasticsearch. So by all means fix this properly for 8.5.0 if there's time, but something has to be done for 8.5.0 even if it's a quick workaround.

@arteam (Contributor, Author) commented on Sep 21, 2022

@elasticmachine run elasticsearch-ci/packaging-tests-windows-sample

@arteam merged commit beea2b0 into elastic:main on Sep 21, 2022
@elasticsearchmachine (Collaborator) commented:
💔 Backport failed

The backport operation could not be completed due to the following error:

There are no branches to backport to. Aborting.

You can use sqren/backport to backport manually by running `backport --upstream elastic/elasticsearch --pr 90193`

@arteam deleted the increase-size-of-management-pool branch on September 21, 2022 18:35
@arteam (Contributor, Author) commented on Sep 21, 2022

The commit went to the main branch before the 8.5 branch was cut, so this change is included in 8.5 and doesn't need a backport.

weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 22, 2022
* main: (186 commits)
  [DOCS] Add 8.5 release notes and fix links (elastic#90201)
  Mute DownsampleActionSingleNodeTests.testCannotRollupWhileOtherRollupInProgress (elastic#90198)
  Bump to version 8.6.0
  Increase the minimum size of the management pool to 2 (elastic#90193)
  Speed `getIntLE` from `BytesReference` (elastic#90147)
  Restrict nodes for testClusterPrimariesActive1 (elastic#90191)
  Fix bug with `BigIntArray` serialization (elastic#90142)
  Synthetic _source: test _source filtering (elastic#90138)
  Modernize cardinality agg tests (elastic#90114)
  Mute failing test (elastic#90186)
  Move assertion in ES85BloomFilterPostingsFormat to fix test (elastic#90150)
  Restrict nodes for testClusterPrimariesActive2 (elastic#90184)
  Batch index delete cluster state updates (elastic#90033)
  Register stable plugins in ActionModule (elastic#90067)
  Mute failing test (elastic#90180)
  [HealthAPI] Disk: Use _ for diagnosis id (elastic#90179)
  [HealtAPI] Disk: use shorter help URLs (elastic#90178)
  Fixing disk health indicator unit tests (elastic#90175)
  Enable the health node and the disk health indicator elastic#84811 (elastic#90085)
  Add missing Disk Indicator health api IDs (elastic#90174)
  ...
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 22, 2022
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Sep 22, 2022
droberts195 added a commit that referenced this pull request Sep 23, 2022
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 4, 2023
Instead of blocking either a transport or management thread here (the latter will lead to a deadlock if there's only a single management thread available), we should just queue up those sends and run them asynchronously via a listener.
This test currently works on `main` only because of elastic#90193; this change makes it work with a single management thread as well, so it works on 7.x again.

closes elastic#92629
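A minimal sketch of the listener-based shape this commit message describes (the class and method names below are hypothetical, not the actual `TaskManager` API): rather than a pool thread sleeping in a poll loop, callers register a callback that is fired by whichever thread unregisters the task, so no thread ever blocks.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.LongConsumer;

// Hypothetical sketch of an async wait-for-completion: no thread blocks.
class AsyncTaskRegistry {
    private final Map<Long, Queue<LongConsumer>> waiters = new ConcurrentHashMap<>();
    private final Map<Long, Boolean> running = new ConcurrentHashMap<>();

    void register(long taskId) {
        running.put(taskId, Boolean.TRUE);
    }

    // Non-blocking replacement for a polling waitForTaskCompletion: either
    // fire the callback immediately or queue it for the completing thread.
    void onCompletion(long taskId, LongConsumer listener) {
        if (running.containsKey(taskId) == false) {
            listener.accept(taskId); // task already finished
            return;
        }
        waiters.computeIfAbsent(taskId, id -> new ConcurrentLinkedQueue<>()).add(listener);
        if (running.containsKey(taskId) == false) {
            notifyWaiters(taskId); // task finished while we were registering
        }
    }

    void unregister(long taskId) {
        running.remove(taskId);
        notifyWaiters(taskId); // completing thread runs the queued callbacks
    }

    private void notifyWaiters(long taskId) {
        // remove() is atomic, so each queued listener fires exactly once
        // even if this races with a concurrent onCompletion re-check.
        Queue<LongConsumer> queue = waiters.remove(taskId);
        if (queue != null) {
            queue.forEach(l -> l.accept(taskId));
        }
    }
}
```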
original-brownbear added a commit that referenced this pull request Jan 4, 2023
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 4, 2023
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 4, 2023
elasticsearchmachine pushed a commit that referenced this pull request Jan 4, 2023
elasticsearchmachine pushed a commit that referenced this pull request Jan 4, 2023
arteam added a commit to arteam/elasticsearch that referenced this pull request Jan 19, 2023
Labels
auto-backport, :Distributed/Distributed, >enhancement, Team:Distributed, v8.5.0
5 participants