
[BUG] Flaky test failure SegmentReplicationAllocationIT.testAllocationWithDisruption #6565

Closed
amkhar opened this issue Mar 7, 2023 · 5 comments · Fixed by #6838
Labels
bug (Something isn't working), distributed framework, flaky-test (Random test failure that succeeds on second run)

Comments

amkhar (Contributor) commented Mar 7, 2023

Describe the bug

SegmentReplicationAllocationIT.testAllocationWithDisruption is failing randomly.
Failure link - https://build.ci.opensearch.org/job/gradle-check/12003/testReport/junit/org.opensearch.indices.replication/SegmentReplicationAllocationIT/testAllocationWithDisruption/

To Reproduce
PR build

Expected behavior
The test should pass consistently.

Plugins
N/A

Screenshots
N/A

amkhar added the bug and untriaged labels on Mar 7, 2023
xuezhou25 added the flaky-test label on Mar 7, 2023
adnapibar (Contributor) commented:

Able to reproduce the issue consistently with the random seed:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationAllocationIT.testAllocationWithDisruption" -Dtests.seed=D9AC0DD44C11343B

dreamer-89 (Member) commented:

One more unstable gradle run: https://build.ci.opensearch.org/job/gradle-check/12707

dreamer-89 (Member) commented:

> Able to reproduce the issue consistently with the random seed ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationAllocationIT.testAllocationWithDisruption" -Dtests.seed=D9AC0DD44C11343B

The test failure here is due to a condition where only 1 node was added but 2 were stopped, leaving too few nodes for rebalancing to be possible under SameShardAllocationDecider. Stopping more nodes than are added is problematic when we start with maxReplicaCount + 1 nodes (1 primary plus maxReplicaCount replicas).
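As a rough illustration of that constraint, here is a minimal sketch of a node add/stop step that keeps the data-node count at or above maxReplicaCount + 1. This is an assumption-laden sketch (the helper name addOrStopNode and its placement inside an OpenSearchIntegTestCase subclass are illustrative), not the actual test code:

```java
// Sketch only: never let the data-node count drop below maxReplicaCount + 1,
// otherwise each copy of a shard cannot be placed on a distinct node and
// rebalancing becomes impossible under SameShardAllocationDecider.
private void addOrStopNode(int maxReplicaCount) throws Exception {
    final int minDataNodes = maxReplicaCount + 1; // 1 primary + maxReplicaCount replicas
    if (randomBoolean() || internalCluster().numDataNodes() <= minDataNodes) {
        internalCluster().startNode();            // growing the cluster is always safe
    } else {
        internalCluster().stopRandomDataNode();   // shrink only while a spare node remains
    }
}
```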

dreamer-89 (Member) commented:

Even with the #6838 fix, the test occasionally fails, for the same reason: the target node cannot accept the primary shard because of SameShardAllocationDecider.

Example failure:
4 nodes; the problematic index (test0) has 4 primary shards with 2 replicas each.
Node oaPUrupWQYK3iRiGemYjOQ (say N1) holds 2 primary shards of test0, while node d2SnfsWaS0qx0LEypSEIpQ (say N2) holds no primary at all; the remaining two nodes hold 1 primary shard each. A primary shard cannot be relocated N1 -> N2 because N2 already contains replica copies of every test0 shard.

...
routing_table (version 36):
-- index [[test1/HBoZ6rVKR32GSj6GajB3_g]]
----shard_id [test1][0]
--------[test1][0], node[oaPUrupWQYK3iRiGemYjOQ], [R], s[STARTED], a[id=T1cHLn3bTa-OloXdam4fZg]
--------[test1][0], node[d2SnfsWaS0qx0LEypSEIpQ], [P], s[STARTED], a[id=vq7TDwcXRNq6Dc30mtLimg]
--------[test1][0], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=u6k96BLMTlCZXAnLAic9RQ]
----shard_id [test1][1]
--------[test1][1], node[XaiKkPGfRYy9zswj_9dOPw], [P], s[STARTED], a[id=qgumP2oIS-mLXl8_O7RX6g]
--------[test1][1], node[oaPUrupWQYK3iRiGemYjOQ], [R], s[STARTED], a[id=LFzrcSXcSg-frk12n7oqEQ]
--------[test1][1], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=Ub6DzexdTzegDOFozwv1cA]

-- index [[test0/-6ZIazh2SyOpSbs5S6qwug]]
----shard_id [test0][0]
--------[test0][0], node[oaPUrupWQYK3iRiGemYjOQ], [P], s[STARTED], a[id=NURU6NT6R5OGRjdVaM4y8Q]
--------[test0][0], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=eZ5QhUujQJGRiYF0cV2Kgg]
--------[test0][0], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=CQmEvDKBTYiMZfWsWvyOtQ]
----shard_id [test0][1]
--------[test0][1], node[XaiKkPGfRYy9zswj_9dOPw], [P], s[STARTED], a[id=cSqRFH1QRaiwuOlq-xT0Xg]
--------[test0][1], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=fCxf47IVQpGZvonE2iyRGg]
--------[test0][1], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=3HNjwLGTSVO3x04yEr6KUA]
----shard_id [test0][2]
--------[test0][2], node[XaiKkPGfRYy9zswj_9dOPw], [R], s[STARTED], a[id=4HIJgAS6QzWQrH3XnYSiig]
--------[test0][2], node[oaPUrupWQYK3iRiGemYjOQ], [P], s[STARTED], a[id=MuxrmkSgSESaKio6mggyRQ]
--------[test0][2], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=J-pCowIESs25_g1NbqqB4Q]
----shard_id [test0][3]
--------[test0][3], node[XaiKkPGfRYy9zswj_9dOPw], [R], s[STARTED], a[id=R88wEpbMTHSs8ntlTfQbBA]
--------[test0][3], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=u9Ndw3EpRCC6mk9d6dPFGQ]
--------[test0][3], node[1tuBtge8R4OdbZnIu-SEiw], [P], s[STARTED], a[id=iauuiFzrT8ieUkm4bNEdIw]

routing_nodes:
-----node_id[XaiKkPGfRYy9zswj_9dOPw][V]
--------[test1][1], node[XaiKkPGfRYy9zswj_9dOPw], [P], s[STARTED], a[id=qgumP2oIS-mLXl8_O7RX6g]
--------[test0][1], node[XaiKkPGfRYy9zswj_9dOPw], [P], s[STARTED], a[id=cSqRFH1QRaiwuOlq-xT0Xg]
--------[test0][3], node[XaiKkPGfRYy9zswj_9dOPw], [R], s[STARTED], a[id=R88wEpbMTHSs8ntlTfQbBA]
--------[test0][2], node[XaiKkPGfRYy9zswj_9dOPw], [R], s[STARTED], a[id=4HIJgAS6QzWQrH3XnYSiig]
-----node_id[oaPUrupWQYK3iRiGemYjOQ][V]
--------[test0][2], node[oaPUrupWQYK3iRiGemYjOQ], [P], s[STARTED], a[id=MuxrmkSgSESaKio6mggyRQ]
--------[test0][0], node[oaPUrupWQYK3iRiGemYjOQ], [P], s[STARTED], a[id=NURU6NT6R5OGRjdVaM4y8Q]
--------[test1][1], node[oaPUrupWQYK3iRiGemYjOQ], [R], s[STARTED], a[id=LFzrcSXcSg-frk12n7oqEQ]
--------[test1][0], node[oaPUrupWQYK3iRiGemYjOQ], [R], s[STARTED], a[id=T1cHLn3bTa-OloXdam4fZg]
-----node_id[d2SnfsWaS0qx0LEypSEIpQ][V]
--------[test1][0], node[d2SnfsWaS0qx0LEypSEIpQ], [P], s[STARTED], a[id=vq7TDwcXRNq6Dc30mtLimg]
--------[test0][3], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=u9Ndw3EpRCC6mk9d6dPFGQ]
--------[test0][2], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=J-pCowIESs25_g1NbqqB4Q]
--------[test0][1], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=fCxf47IVQpGZvonE2iyRGg]
--------[test0][0], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=eZ5QhUujQJGRiYF0cV2Kgg]
-----node_id[1tuBtge8R4OdbZnIu-SEiw][V]
--------[test0][3], node[1tuBtge8R4OdbZnIu-SEiw], [P], s[STARTED], a[id=iauuiFzrT8ieUkm4bNEdIw]
--------[test1][1], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=Ub6DzexdTzegDOFozwv1cA]
--------[test1][0], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=u6k96BLMTlCZXAnLAic9RQ]
--------[test0][1], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=3HNjwLGTSVO3x04yEr6KUA]
--------[test0][0], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=CQmEvDKBTYiMZfWsWvyOtQ]
---- unassigned
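For context, the rule being hit here can be summarized roughly as follows. This is an illustrative sketch of the constraint, not the actual SameShardAllocationDecider source:

```java
import org.opensearch.cluster.routing.RoutingNode;
import org.opensearch.cluster.routing.ShardRouting;

// Illustrative only: a node is rejected as a relocation target if it already
// holds any copy (primary or replica) of the shard being moved.
static boolean canRelocateTo(ShardRouting shard, RoutingNode target) {
    for (ShardRouting assigned : target) {                // copies currently on the target node
        if (assigned.shardId().equals(shard.shardId())) {
            return false;                                 // same-shard copy present -> no
        }
    }
    return true;
}
```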

dreamer-89 (Member) commented Mar 27, 2023

Increasing the starting node count to 5 also does not help. Because there are 5 existing primary shards, it is still possible for one node (say N1) to hold 2 primaries while another node (say N2) holds only replicas and no primary, with the remaining 3 nodes holding one primary each (already balanced). This prevents primary relocation from N1 -> N2 due to SameShardAllocationDecider.

One example of this state came up from a failing test run locally. Node v2xEVr5ERCCCMPDXbk37CQ holds 2 primaries, and node a17bCL1LR1qjY0BWtUtcBw holds the corresponding replicas, so SameShardAllocationDecider prevents primary relocation from v2xEVr5ERCCCMPDXbk37CQ to a17bCL1LR1qjY0BWtUtcBw. Issue #6481 would help in such scenarios.

-- index [[test2/rjaVXYuESnC6XaTqFjntKw]]
----shard_id [test2][0]
--------[test2][0], node[yRnjpEXbQgaLNdGwYpUhHQ], [P], s[STARTED], a[id=Z91DnF6_Sa-Wei_wCL8vlw]
--------[test2][0], node[MMU2NeWCR-WxZ2BptkVLWQ], [R], s[STARTED], a[id=NG9aF9tKR3y7KW7WbFIDmA]
--------[test2][0], node[wKoL7BalQWexhGhH-7vXXw], [R], s[STARTED], a[id=wWWAYqHdSV-WjfKWE-eelA]
----shard_id [test2][1]
--------[test2][1], node[v2xEVr5ERCCCMPDXbk37CQ], [P], s[STARTED], a[id=X3hCMTYXTUC0XYFHZ0hqxg]
--------[test2][1], node[a17bCL1LR1qjY0BWtUtcBw], [R], s[STARTED], a[id=1gw6DNbGQfilITp-h_0dKg]
--------[test2][1], node[wKoL7BalQWexhGhH-7vXXw], [R], s[STARTED], a[id=OjM3OV58Qr-pn9w1BTvgLg]
----shard_id [test2][2]
--------[test2][2], node[yRnjpEXbQgaLNdGwYpUhHQ], [R], s[STARTED], a[id=XoaU2_bQQ-uRKW_Ycgwduw]
--------[test2][2], node[MMU2NeWCR-WxZ2BptkVLWQ], [R], s[STARTED], a[id=U1ParUQ4QdGgRjkrXXOOVg]
--------[test2][2], node[wKoL7BalQWexhGhH-7vXXw], [P], s[STARTED], a[id=VtoVFr-qSl-A2It7YEWwIQ]
----shard_id [test2][3]
--------[test2][3], node[v2xEVr5ERCCCMPDXbk37CQ], [P], s[STARTED], a[id=m5kDCLw1ST6zAEg5q2OWpQ]
--------[test2][3], node[yRnjpEXbQgaLNdGwYpUhHQ], [R], s[STARTED], a[id=zpBSm7t3Tvup5qL74q2swg]
--------[test2][3], node[a17bCL1LR1qjY0BWtUtcBw], [R], s[STARTED], a[id=KRXw5X53Qd6a3d5iM6cSHw]
----shard_id [test2][4]
--------[test2][4], node[MMU2NeWCR-WxZ2BptkVLWQ], [P], s[STARTED], a[id=jX4KNuGTR9GH5sxLnR-xng]
--------[test2][4], node[a17bCL1LR1qjY0BWtUtcBw], [R], s[STARTED], a[id=Vysv0VbSQRGIEj57meTBNA]
--------[test2][4], node[wKoL7BalQWexhGhH-7vXXw], [R], s[STARTED], a[id=8huCM1TvQaOeD-IFSmUaqg]

Finally, increased the number of nodes to the range [5, 10] and used 2 as the maximum shard count, after which the test no longer fails. Updated the PR.
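Roughly, the adjusted test setup described above looks like the following sketch (variable names are illustrative; see the linked PR for the actual change):

```java
// Illustrative parameters: at least 5 data nodes and at most 2 primary shards per index,
// so a balanced state is always reachable without hitting the same-shard constraint.
final int maxShardCount = 2;                       // at most 2 primary shards per index
final int nodeCount     = randomIntBetween(5, 10); // start with 5-10 data nodes
```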
