
[BUG] Flaky test failure SegmentReplicationAllocationIT.testAllocationWithDisruption #6565

Closed
amkhar opened this issue Mar 7, 2023 · 5 comments · Fixed by #6838
Labels
bug (Something isn't working), distributed framework, flaky-test (Random test failure that succeeds on second run)

Comments

amkhar (Contributor) commented Mar 7, 2023

Describe the bug

SegmentReplicationAllocationIT.testAllocationWithDisruption is failing randomly.
Failure link - https://build.ci.opensearch.org/job/gradle-check/12003/testReport/junit/org.opensearch.indices.replication/SegmentReplicationAllocationIT/testAllocationWithDisruption/

To Reproduce
PR build

Expected behavior
The test should pass consistently.

Plugins
N/A

Screenshots
N/A

amkhar added the bug and untriaged labels on Mar 7, 2023
xuezhou25 added the flaky-test label on Mar 7, 2023
adnapibar (Contributor) commented:

Able to reproduce the issue consistently with the random seed:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationAllocationIT.testAllocationWithDisruption" -Dtests.seed=D9AC0DD44C11343B

dreamer-89 (Member) commented:

One more unstable gradle run: https://build.ci.opensearch.org/job/gradle-check/12707

dreamer-89 (Member) commented:

> Able to reproduce the issue consistently with the random seed ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationAllocationIT.testAllocationWithDisruption" -Dtests.seed=D9AC0DD44C11343B

The test failure here is due to a condition where only 1 node was added but 2 were stopped, leaving too few nodes for rebalancing to be possible under SameShardAllocationDecider. Stopping more nodes than are added is problematic when we start with maxReplicaCount + 1 nodes (1 primary plus maxReplicaCount replicas).
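As a rough illustration of that constraint, here is a minimal sketch of a node add/stop step that keeps the data-node count at or above maxReplicaCount + 1. This is an assumption-laden sketch (the helper name addOrStopNode and its placement inside an OpenSearchIntegTestCase subclass are illustrative), not the actual test code:

```java
// Sketch only: never let the data-node count drop below maxReplicaCount + 1,
// otherwise each copy of a shard cannot be placed on a distinct node and
// rebalancing becomes impossible under SameShardAllocationDecider.
private void addOrStopNode(int maxReplicaCount) throws Exception {
    final int minDataNodes = maxReplicaCount + 1; // 1 primary + maxReplicaCount replicas
    if (randomBoolean() || internalCluster().numDataNodes() <= minDataNodes) {
        internalCluster().startNode();            // growing the cluster is always safe
    } else {
        internalCluster().stopRandomDataNode();   // shrink only while a spare node remains
    }
}
```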

dreamer-89 (Member) commented:

Even with the #6838 fix, the test occasionally fails, for the same reason: the target node cannot accept the primary shard because of SameShardAllocationDecider.

Example failure:
4 nodes; the problematic index (test0) has 4 primary shards with 2 replicas each.
Node oaPUrupWQYK3iRiGemYjOQ (say N1) holds 2 primary shards of test0, while node d2SnfsWaS0qx0LEypSEIpQ (say N2) holds no primary at all; the remaining two nodes hold 1 primary shard each. A primary shard cannot be relocated N1 -> N2 because N2 already contains replica copies of every test0 shard.

...
routing_table (version 36):
-- index [[test1/HBoZ6rVKR32GSj6GajB3_g]]
----shard_id [test1][0]
--------[test1][0], node[oaPUrupWQYK3iRiGemYjOQ], [R], s[STARTED], a[id=T1cHLn3bTa-OloXdam4fZg]
--------[test1][0], node[d2SnfsWaS0qx0LEypSEIpQ], [P], s[STARTED], a[id=vq7TDwcXRNq6Dc30mtLimg]
--------[test1][0], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=u6k96BLMTlCZXAnLAic9RQ]
----shard_id [test1][1]
--------[test1][1], node[XaiKkPGfRYy9zswj_9dOPw], [P], s[STARTED], a[id=qgumP2oIS-mLXl8_O7RX6g]
--------[test1][1], node[oaPUrupWQYK3iRiGemYjOQ], [R], s[STARTED], a[id=LFzrcSXcSg-frk12n7oqEQ]
--------[test1][1], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=Ub6DzexdTzegDOFozwv1cA]

-- index [[test0/-6ZIazh2SyOpSbs5S6qwug]]
----shard_id [test0][0]
--------[test0][0], node[oaPUrupWQYK3iRiGemYjOQ], [P], s[STARTED], a[id=NURU6NT6R5OGRjdVaM4y8Q]
--------[test0][0], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=eZ5QhUujQJGRiYF0cV2Kgg]
--------[test0][0], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=CQmEvDKBTYiMZfWsWvyOtQ]
----shard_id [test0][1]
--------[test0][1], node[XaiKkPGfRYy9zswj_9dOPw], [P], s[STARTED], a[id=cSqRFH1QRaiwuOlq-xT0Xg]
--------[test0][1], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=fCxf47IVQpGZvonE2iyRGg]
--------[test0][1], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=3HNjwLGTSVO3x04yEr6KUA]
----shard_id [test0][2]
--------[test0][2], node[XaiKkPGfRYy9zswj_9dOPw], [R], s[STARTED], a[id=4HIJgAS6QzWQrH3XnYSiig]
--------[test0][2], node[oaPUrupWQYK3iRiGemYjOQ], [P], s[STARTED], a[id=MuxrmkSgSESaKio6mggyRQ]
--------[test0][2], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=J-pCowIESs25_g1NbqqB4Q]
----shard_id [test0][3]
--------[test0][3], node[XaiKkPGfRYy9zswj_9dOPw], [R], s[STARTED], a[id=R88wEpbMTHSs8ntlTfQbBA]
--------[test0][3], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=u9Ndw3EpRCC6mk9d6dPFGQ]
--------[test0][3], node[1tuBtge8R4OdbZnIu-SEiw], [P], s[STARTED], a[id=iauuiFzrT8ieUkm4bNEdIw]

routing_nodes:
-----node_id[XaiKkPGfRYy9zswj_9dOPw][V]
--------[test1][1], node[XaiKkPGfRYy9zswj_9dOPw], [P], s[STARTED], a[id=qgumP2oIS-mLXl8_O7RX6g]
--------[test0][1], node[XaiKkPGfRYy9zswj_9dOPw], [P], s[STARTED], a[id=cSqRFH1QRaiwuOlq-xT0Xg]
--------[test0][3], node[XaiKkPGfRYy9zswj_9dOPw], [R], s[STARTED], a[id=R88wEpbMTHSs8ntlTfQbBA]
--------[test0][2], node[XaiKkPGfRYy9zswj_9dOPw], [R], s[STARTED], a[id=4HIJgAS6QzWQrH3XnYSiig]
-----node_id[oaPUrupWQYK3iRiGemYjOQ][V]
--------[test0][2], node[oaPUrupWQYK3iRiGemYjOQ], [P], s[STARTED], a[id=MuxrmkSgSESaKio6mggyRQ]
--------[test0][0], node[oaPUrupWQYK3iRiGemYjOQ], [P], s[STARTED], a[id=NURU6NT6R5OGRjdVaM4y8Q]
--------[test1][1], node[oaPUrupWQYK3iRiGemYjOQ], [R], s[STARTED], a[id=LFzrcSXcSg-frk12n7oqEQ]
--------[test1][0], node[oaPUrupWQYK3iRiGemYjOQ], [R], s[STARTED], a[id=T1cHLn3bTa-OloXdam4fZg]
-----node_id[d2SnfsWaS0qx0LEypSEIpQ][V]
--------[test1][0], node[d2SnfsWaS0qx0LEypSEIpQ], [P], s[STARTED], a[id=vq7TDwcXRNq6Dc30mtLimg]
--------[test0][3], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=u9Ndw3EpRCC6mk9d6dPFGQ]
--------[test0][2], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=J-pCowIESs25_g1NbqqB4Q]
--------[test0][1], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=fCxf47IVQpGZvonE2iyRGg]
--------[test0][0], node[d2SnfsWaS0qx0LEypSEIpQ], [R], s[STARTED], a[id=eZ5QhUujQJGRiYF0cV2Kgg]
-----node_id[1tuBtge8R4OdbZnIu-SEiw][V]
--------[test0][3], node[1tuBtge8R4OdbZnIu-SEiw], [P], s[STARTED], a[id=iauuiFzrT8ieUkm4bNEdIw]
--------[test1][1], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=Ub6DzexdTzegDOFozwv1cA]
--------[test1][0], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=u6k96BLMTlCZXAnLAic9RQ]
--------[test0][1], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=3HNjwLGTSVO3x04yEr6KUA]
--------[test0][0], node[1tuBtge8R4OdbZnIu-SEiw], [R], s[STARTED], a[id=CQmEvDKBTYiMZfWsWvyOtQ]
---- unassigned
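For context, the rule being hit here can be summarized roughly as follows. This is an illustrative sketch of the constraint, not the actual SameShardAllocationDecider source:

```java
import org.opensearch.cluster.routing.RoutingNode;
import org.opensearch.cluster.routing.ShardRouting;

// Illustrative only: a node is rejected as a relocation target if it already
// holds any copy (primary or replica) of the shard being moved.
static boolean canRelocateTo(ShardRouting shard, RoutingNode target) {
    for (ShardRouting assigned : target) {                // copies currently on the target node
        if (assigned.shardId().equals(shard.shardId())) {
            return false;                                 // same-shard copy present -> no
        }
    }
    return true;
}
```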

dreamer-89 (Member) commented Mar 27, 2023

Increasing the starting node count to 5 also does not help. Because there are 5 existing primary shards, it is still possible for one node (say N1) to hold 2 primaries while another node (say N2) holds only replicas and no primary, with the remaining 3 nodes holding one primary each (already balanced). This prevents primary relocation from N1 -> N2 due to SameShardAllocationDecider.

One example of this state came up from a failing test run locally. Node v2xEVr5ERCCCMPDXbk37CQ holds 2 primaries, and node a17bCL1LR1qjY0BWtUtcBw holds the corresponding replicas, so SameShardAllocationDecider prevents primary relocation from v2xEVr5ERCCCMPDXbk37CQ to a17bCL1LR1qjY0BWtUtcBw. Issue #6481 would help in such scenarios.

-- index [[test2/rjaVXYuESnC6XaTqFjntKw]]
----shard_id [test2][0]
--------[test2][0], node[yRnjpEXbQgaLNdGwYpUhHQ], [P], s[STARTED], a[id=Z91DnF6_Sa-Wei_wCL8vlw]
--------[test2][0], node[MMU2NeWCR-WxZ2BptkVLWQ], [R], s[STARTED], a[id=NG9aF9tKR3y7KW7WbFIDmA]
--------[test2][0], node[wKoL7BalQWexhGhH-7vXXw], [R], s[STARTED], a[id=wWWAYqHdSV-WjfKWE-eelA]
----shard_id [test2][1]
--------[test2][1], node[v2xEVr5ERCCCMPDXbk37CQ], [P], s[STARTED], a[id=X3hCMTYXTUC0XYFHZ0hqxg]
--------[test2][1], node[a17bCL1LR1qjY0BWtUtcBw], [R], s[STARTED], a[id=1gw6DNbGQfilITp-h_0dKg]
--------[test2][1], node[wKoL7BalQWexhGhH-7vXXw], [R], s[STARTED], a[id=OjM3OV58Qr-pn9w1BTvgLg]
----shard_id [test2][2]
--------[test2][2], node[yRnjpEXbQgaLNdGwYpUhHQ], [R], s[STARTED], a[id=XoaU2_bQQ-uRKW_Ycgwduw]
--------[test2][2], node[MMU2NeWCR-WxZ2BptkVLWQ], [R], s[STARTED], a[id=U1ParUQ4QdGgRjkrXXOOVg]
--------[test2][2], node[wKoL7BalQWexhGhH-7vXXw], [P], s[STARTED], a[id=VtoVFr-qSl-A2It7YEWwIQ]
----shard_id [test2][3]
--------[test2][3], node[v2xEVr5ERCCCMPDXbk37CQ], [P], s[STARTED], a[id=m5kDCLw1ST6zAEg5q2OWpQ]
--------[test2][3], node[yRnjpEXbQgaLNdGwYpUhHQ], [R], s[STARTED], a[id=zpBSm7t3Tvup5qL74q2swg]
--------[test2][3], node[a17bCL1LR1qjY0BWtUtcBw], [R], s[STARTED], a[id=KRXw5X53Qd6a3d5iM6cSHw]
----shard_id [test2][4]
--------[test2][4], node[MMU2NeWCR-WxZ2BptkVLWQ], [P], s[STARTED], a[id=jX4KNuGTR9GH5sxLnR-xng]
--------[test2][4], node[a17bCL1LR1qjY0BWtUtcBw], [R], s[STARTED], a[id=Vysv0VbSQRGIEj57meTBNA]
--------[test2][4], node[wKoL7BalQWexhGhH-7vXXw], [R], s[STARTED], a[id=8huCM1TvQaOeD-IFSmUaqg]

Finally, increased the number of nodes to the range [5, 10] and used 2 as the maximum shard count, after which the test no longer fails. Updated the PR.
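Roughly, the adjusted test setup described above looks like the following sketch (variable names are illustrative; see the linked PR for the actual change):

```java
// Illustrative parameters: at least 5 data nodes and at most 2 primary shards per index,
// so a balanced state is always reachable without hitting the same-shard constraint.
final int maxShardCount = 2;                       // at most 2 primary shards per index
final int nodeCount     = randomIntBetween(5, 10); // start with 5-10 data nodes
```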
