Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flakiness in MasterServiceTests.testThrottlingForMultipleTaskTypes #8901

Merged
merged 2 commits into from
Jul 27, 2023

Conversation

andrross
Copy link
Member

The test configured a timeout duration of zero for certain tasks and asserted that all tasks were throttled or timed out. This is not a valid assertion because it is possible for a task to complete before the asynchronous timeout operation runs, which means the task would complete successfully. The fix is to adjust the assertion to allow for successful tasks in this case.

Related Issues

Resolves #5958

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@sohami
Copy link
Collaborator

sohami commented Jul 27, 2023

@andrross I think we should instead handle timeout of 0 to fail the task immediately instead of scheduling the asynchronous task to remove it from the queue on timeout.

@andrross
Copy link
Member Author

@andrross I think we should instead handle timeout of 0 to fail the task immediately instead of scheduling the asynchronous task to remove it from the queue on timeout.

@sohami I don't love the inherent non-determinism here, but is a zero timeout useful in practice here? I'm assuming not (why schedule a task that is already timed out?), so I also don't love putting in additional logic for this case. The test would also just be testing this additional logic of the zero timeout case and not testing the normal timeout path, so I don't think it helps with coverage. Does that make sense or is there a different way to handle this case?

@andrross
Copy link
Member Author

andrross commented Jul 27, 2023

@sohami I think it is probably better to deterministically test the timeout case in a separate test case. I'm thinking something like:

  • create an instance using a single-threaded thread pool
  • start one task and have it block on a latch
  • start a second task with a short timeout and verify it times out (because the thread pool is fully occupied)
  • unblock the first task and clean everything up

What do you think?

@sohami
Copy link
Collaborator

sohami commented Jul 27, 2023

I don't love the inherent non-determinism here, but is a zero timeout useful in practice here? I'm assuming not (why schedule a task that is already timed out?), so I also don't love putting in additional logic for this case.

I don't foresee any use case with 0 timeout but seems like we are allowing that to be set anyways. If we want to allow it than handling it separately would avoid adding extra timeout tasks for which most of the time behavior will be indeterministic. The other option is if there is no use case for timeout of 0 should we fail the task setting it to that value instead ? This will also help to catch issues probably where it is set to 0 incorrectly.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@sohami
Copy link
Collaborator

sohami commented Jul 27, 2023

@sohami I think it is probably better to deterministically test the timeout case in a separate test case. I'm thinking something like:

  • create an instance using a single-threaded thread pool
  • start one task and have it block on a latch
  • start a second task with a short timeout and verify it times out (because the thread pool is fully occupied)
  • unblock the first task and clean everything up

What do you think?

Sounds good to me. thanks!

@andrross andrross force-pushed the fix-master-service-flakiness branch from c6a8457 to 951ff07 Compare July 27, 2023 18:05
The test configured a [timeout duration of zero][1] for certain tasks
and asserted that all tasks were throttled or timed out. This is not a
valid assertion because it is possible for a task to complete before the
[asynchronous timeout operation runs][2], which means the task would
complete successfully. The fix is to adjust the assertion to allow for
successful tasks in this case.

[1]: https://github.com/opensearch-project/OpenSearch/blob/60985bc300d9eafd36c1ab25d46235e1c925c565/server/src/test/java/org/opensearch/cluster/service/MasterServiceTests.java#L941
[2]: https://github.com/opensearch-project/OpenSearch/blob/9fc3f4096958159ec9b53012fc7ced19fd793e1b/server/src/main/java/org/opensearch/common/util/concurrent/PrioritizedOpenSearchThreadPoolExecutor.java#L266

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: Andrew Ross <andrross@amazon.com>
@andrross andrross force-pushed the fix-master-service-flakiness branch from 951ff07 to 8c5f7e0 Compare July 27, 2023 18:06
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@andrross andrross added the backport 2.x Backport to 2.x branch label Jul 27, 2023
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.index.engine.NRTReplicationReaderManagerTests.testUpdateSegmentsWhileRefreshing

@codecov
Copy link

codecov bot commented Jul 27, 2023

Codecov Report

Merging #8901 (8c5f7e0) into main (91bc891) will decrease coverage by 0.06%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##               main    #8901      +/-   ##
============================================
- Coverage     71.05%   70.99%   -0.06%     
  Complexity    57198    57198              
============================================
  Files          4758     4758              
  Lines        269863   269863              
  Branches      39484    39484              
============================================
- Hits         191755   191596     -159     
- Misses        61982    62128     +146     
- Partials      16126    16139      +13     

see 473 files with indirect coverage changes

@sohami sohami merged commit e2a664c into opensearch-project:main Jul 27, 2023
12 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 27, 2023
#8901)

* Fix flakiness in MasterServiceTests.testThrottlingForMultipleTaskTypes

The test configured a [timeout duration of zero][1] for certain tasks
and asserted that all tasks were throttled or timed out. This is not a
valid assertion because it is possible for a task to complete before the
[asynchronous timeout operation runs][2], which means the task would
complete successfully. The fix is to adjust the assertion to allow for
successful tasks in this case.

[1]: https://github.com/opensearch-project/OpenSearch/blob/60985bc300d9eafd36c1ab25d46235e1c925c565/server/src/test/java/org/opensearch/cluster/service/MasterServiceTests.java#L941
[2]: https://github.com/opensearch-project/OpenSearch/blob/9fc3f4096958159ec9b53012fc7ced19fd793e1b/server/src/main/java/org/opensearch/common/util/concurrent/PrioritizedOpenSearchThreadPoolExecutor.java#L266

Signed-off-by: Andrew Ross <andrross@amazon.com>

* Add a deterministic test case for timeout

Signed-off-by: Andrew Ross <andrross@amazon.com>

---------

Signed-off-by: Andrew Ross <andrross@amazon.com>
(cherry picked from commit e2a664c)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@andrross andrross deleted the fix-master-service-flakiness branch July 27, 2023 21:40
baba-devv pushed a commit to baba-devv/OpenSearch that referenced this pull request Jul 28, 2023
opensearch-project#8901)

* Fix flakiness in MasterServiceTests.testThrottlingForMultipleTaskTypes

The test configured a [timeout duration of zero][1] for certain tasks
and asserted that all tasks were throttled or timed out. This is not a
valid assertion because it is possible for a task to complete before the
[asynchronous timeout operation runs][2], which means the task would
complete successfully. The fix is to adjust the assertion to allow for
successful tasks in this case.

[1]: https://github.com/opensearch-project/OpenSearch/blob/60985bc300d9eafd36c1ab25d46235e1c925c565/server/src/test/java/org/opensearch/cluster/service/MasterServiceTests.java#L941
[2]: https://github.com/opensearch-project/OpenSearch/blob/9fc3f4096958159ec9b53012fc7ced19fd793e1b/server/src/main/java/org/opensearch/common/util/concurrent/PrioritizedOpenSearchThreadPoolExecutor.java#L266

Signed-off-by: Andrew Ross <andrross@amazon.com>

* Add a deterministic test case for timeout

Signed-off-by: Andrew Ross <andrross@amazon.com>

---------

Signed-off-by: Andrew Ross <andrross@amazon.com>
kotwanikunal pushed a commit that referenced this pull request Jul 28, 2023
#8901) (#8934)

* Fix flakiness in MasterServiceTests.testThrottlingForMultipleTaskTypes

The test configured a [timeout duration of zero][1] for certain tasks
and asserted that all tasks were throttled or timed out. This is not a
valid assertion because it is possible for a task to complete before the
[asynchronous timeout operation runs][2], which means the task would
complete successfully. The fix is to adjust the assertion to allow for
successful tasks in this case.

[1]: https://github.com/opensearch-project/OpenSearch/blob/60985bc300d9eafd36c1ab25d46235e1c925c565/server/src/test/java/org/opensearch/cluster/service/MasterServiceTests.java#L941
[2]: https://github.com/opensearch-project/OpenSearch/blob/9fc3f4096958159ec9b53012fc7ced19fd793e1b/server/src/main/java/org/opensearch/common/util/concurrent/PrioritizedOpenSearchThreadPoolExecutor.java#L266



* Add a deterministic test case for timeout



---------


(cherry picked from commit e2a664c)

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
andrross added a commit to andrross/OpenSearch that referenced this pull request Jul 29, 2023
After the fix in opensearch-project#8901, I ran the following in a loop all day (about
1400 times) and saw no failures:

```
./gradlew ':server:test' --tests "org.opensearch.cluster.service.MasterServiceTests"
```

Signed-off-by: Andrew Ross <andrross@amazon.com>
baba-devv pushed a commit to baba-devv/OpenSearch that referenced this pull request Jul 29, 2023
opensearch-project#8901)

* Fix flakiness in MasterServiceTests.testThrottlingForMultipleTaskTypes

The test configured a [timeout duration of zero][1] for certain tasks
and asserted that all tasks were throttled or timed out. This is not a
valid assertion because it is possible for a task to complete before the
[asynchronous timeout operation runs][2], which means the task would
complete successfully. The fix is to adjust the assertion to allow for
successful tasks in this case.

[1]: https://github.com/opensearch-project/OpenSearch/blob/60985bc300d9eafd36c1ab25d46235e1c925c565/server/src/test/java/org/opensearch/cluster/service/MasterServiceTests.java#L941
[2]: https://github.com/opensearch-project/OpenSearch/blob/9fc3f4096958159ec9b53012fc7ced19fd793e1b/server/src/main/java/org/opensearch/common/util/concurrent/PrioritizedOpenSearchThreadPoolExecutor.java#L266

Signed-off-by: Andrew Ross <andrross@amazon.com>

* Add a deterministic test case for timeout

Signed-off-by: Andrew Ross <andrross@amazon.com>

---------

Signed-off-by: Andrew Ross <andrross@amazon.com>
saratvemulapalli pushed a commit that referenced this pull request Jul 31, 2023
After the fix in #8901, I ran the following in a loop all day (about
1400 times) and saw no failures:

```
./gradlew ':server:test' --tests "org.opensearch.cluster.service.MasterServiceTests"
```

Signed-off-by: Andrew Ross <andrross@amazon.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 31, 2023
After the fix in #8901, I ran the following in a loop all day (about
1400 times) and saw no failures:

```
./gradlew ':server:test' --tests "org.opensearch.cluster.service.MasterServiceTests"
```

Signed-off-by: Andrew Ross <andrross@amazon.com>
(cherry picked from commit 8b751f8)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kotwanikunal pushed a commit that referenced this pull request Jul 31, 2023
After the fix in #8901, I ran the following in a loop all day (about
1400 times) and saw no failures:

```
./gradlew ':server:test' --tests "org.opensearch.cluster.service.MasterServiceTests"
```


(cherry picked from commit 8b751f8)

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
opensearch-project#8901)

* Fix flakiness in MasterServiceTests.testThrottlingForMultipleTaskTypes

The test configured a [timeout duration of zero][1] for certain tasks
and asserted that all tasks were throttled or timed out. This is not a
valid assertion because it is possible for a task to complete before the
[asynchronous timeout operation runs][2], which means the task would
complete successfully. The fix is to adjust the assertion to allow for
successful tasks in this case.

[1]: https://github.com/opensearch-project/OpenSearch/blob/60985bc300d9eafd36c1ab25d46235e1c925c565/server/src/test/java/org/opensearch/cluster/service/MasterServiceTests.java#L941
[2]: https://github.com/opensearch-project/OpenSearch/blob/9fc3f4096958159ec9b53012fc7ced19fd793e1b/server/src/main/java/org/opensearch/common/util/concurrent/PrioritizedOpenSearchThreadPoolExecutor.java#L266

Signed-off-by: Andrew Ross <andrross@amazon.com>

* Add a deterministic test case for timeout

Signed-off-by: Andrew Ross <andrross@amazon.com>

---------

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: Kaushal Kumar <ravi.kaushal97@gmail.com>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
After the fix in opensearch-project#8901, I ran the following in a loop all day (about
1400 times) and saw no failures:

```
./gradlew ':server:test' --tests "org.opensearch.cluster.service.MasterServiceTests"
```

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: Kaushal Kumar <ravi.kaushal97@gmail.com>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
opensearch-project#8901)

* Fix flakiness in MasterServiceTests.testThrottlingForMultipleTaskTypes

The test configured a [timeout duration of zero][1] for certain tasks
and asserted that all tasks were throttled or timed out. This is not a
valid assertion because it is possible for a task to complete before the
[asynchronous timeout operation runs][2], which means the task would
complete successfully. The fix is to adjust the assertion to allow for
successful tasks in this case.

[1]: https://github.com/opensearch-project/OpenSearch/blob/60985bc300d9eafd36c1ab25d46235e1c925c565/server/src/test/java/org/opensearch/cluster/service/MasterServiceTests.java#L941
[2]: https://github.com/opensearch-project/OpenSearch/blob/9fc3f4096958159ec9b53012fc7ced19fd793e1b/server/src/main/java/org/opensearch/common/util/concurrent/PrioritizedOpenSearchThreadPoolExecutor.java#L266

Signed-off-by: Andrew Ross <andrross@amazon.com>

* Add a deterministic test case for timeout

Signed-off-by: Andrew Ross <andrross@amazon.com>

---------

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: Ivan Brusic <ivan.brusic@flocksafety.com>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
After the fix in opensearch-project#8901, I ran the following in a loop all day (about
1400 times) and saw no failures:

```
./gradlew ':server:test' --tests "org.opensearch.cluster.service.MasterServiceTests"
```

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: Ivan Brusic <ivan.brusic@flocksafety.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
opensearch-project#8901)

* Fix flakiness in MasterServiceTests.testThrottlingForMultipleTaskTypes

The test configured a [timeout duration of zero][1] for certain tasks
and asserted that all tasks were throttled or timed out. This is not a
valid assertion because it is possible for a task to complete before the
[asynchronous timeout operation runs][2], which means the task would
complete successfully. The fix is to adjust the assertion to allow for
successful tasks in this case.

[1]: https://github.com/opensearch-project/OpenSearch/blob/60985bc300d9eafd36c1ab25d46235e1c925c565/server/src/test/java/org/opensearch/cluster/service/MasterServiceTests.java#L941
[2]: https://github.com/opensearch-project/OpenSearch/blob/9fc3f4096958159ec9b53012fc7ced19fd793e1b/server/src/main/java/org/opensearch/common/util/concurrent/PrioritizedOpenSearchThreadPoolExecutor.java#L266

Signed-off-by: Andrew Ross <andrross@amazon.com>

* Add a deterministic test case for timeout

Signed-off-by: Andrew Ross <andrross@amazon.com>

---------

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
After the fix in opensearch-project#8901, I ran the following in a loop all day (about
1400 times) and saw no failures:

```
./gradlew ':server:test' --tests "org.opensearch.cluster.service.MasterServiceTests"
```

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MasterServiceTests.testThrottlingForMultipleTaskTypes
2 participants