[Segment Replication] NoClusterManagerNodeIT.testNoClusterManagerActionsWriteClusterManagerBlock fails with segment replication enabled #9837

Closed
Rishikesh1159 opened this issue Sep 6, 2023 · 2 comments
Labels: bug, Indexing:Replication

Describe the bug
Integration test NoClusterManagerNodeIT.testNoClusterManagerActionsWriteClusterManagerBlock fails with segment replication enabled.

REPRODUCE WITH: ./gradlew 'null' --tests "org.opensearch.cluster.NoClusterManagerNodeIT.testNoClusterManagerActionsWriteClusterManagerBlock" -Dtests.seed=8D15B79A4A88A781 -Dtests.locale=en-CA -Dtests.timezone=Africa/Addis_Ababa -Druntime.java=14

NodeNotConnectedException[[node_t0][127.0.0.1:39295] Node not connected
]
	at __randomizedtesting.SeedInfo.seed([8D15B79A4A88A781:1543C53F711AF679]:0)
	at org.opensearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:206)
	at org.opensearch.test.transport.StubbableConnectionManager.getConnection(StubbableConnectionManager.java:93)
	at org.opensearch.transport.TransportService.getConnection(TransportService.java:917)
	at org.opensearch.transport.TransportService.sendRequest(TransportService.java:815)
	at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:280)
	at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.start(TransportSingleShardAction.java:231)
	at org.opensearch.action.support.single.shard.TransportSingleShardAction.doExecute(TransportSingleShardAction.java:125)
	at org.opensearch.action.support.single.shard.TransportSingleShardAction.doExecute(TransportSingleShardAction.java:78)
	at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:218)
	at org.opensearch.action.support.TransportAction.execute(TransportAction.java:188)
	at org.opensearch.action.support.TransportAction.execute(TransportAction.java:107)
	at org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:110)
	at org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:97)
	at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:476)
	at org.opensearch.client.FilterClient.doExecute(FilterClient.java:83)
	at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:476)
	at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:463)
	at org.opensearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:66)
	at org.opensearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:73)
	at org.opensearch.cluster.NoClusterManagerNodeIT.testNoClusterManagerActionsWriteClusterManagerBlock(NoClusterManagerNodeIT.java:294)

To Reproduce
Steps to reproduce the behavior:
-> Enable segment replication.
-> Run the integration test NoClusterManagerNodeIT.testNoClusterManagerActionsWriteClusterManagerBlock.

REPRODUCE WITH: ./gradlew 'null' --tests "org.opensearch.cluster.NoClusterManagerNodeIT.testNoClusterManagerActionsWriteClusterManagerBlock" -Dtests.seed=8D15B79A4A88A781 -Dtests.locale=en-CA -Dtests.timezone=Africa/Addis_Ababa -Druntime.java=14



Poojita-Raj commented Sep 12, 2023

This test also fails in two other ways:

java.lang.AssertionError: timed out waiting for green state

	at __randomizedtesting.SeedInfo.seed([3C6DE3836C4B226B:A43B912657D97393]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.opensearch.test.OpenSearchIntegTestCase.ensureColor(OpenSearchIntegTestCase.java:1012)
	at org.opensearch.test.OpenSearchIntegTestCase.ensureGreen(OpenSearchIntegTestCase.java:943)
	at org.opensearch.test.OpenSearchIntegTestCase.ensureGreen(OpenSearchIntegTestCase.java:932)
	at org.opensearch.test.OpenSearchIntegTestCase.ensureSearchable(OpenSearchIntegTestCase.java:1300)
	at org.opensearch.cluster.NoClusterManagerNodeIT.testNoClusterManagerActionsWriteClusterManagerBlock(NoClusterManagerNodeIT.java:275)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:832)

and

NodeDisconnectedException[[node_t2][127.0.0.1:51616][indices:data/read/get[s]] disconnected
]
	at __randomizedtesting.SeedInfo.seed([86485FDB91B5D6CD:1E1E2D7EAA278735]:0) 

The initial problem, the timeout while waiting for green state, was fixed by adding the following override to the test class:

@Override
protected boolean addMockInternalEngine() {
    return false;
}

This ensures that when segment replication is enabled, replica shards always use the NRTReplicationEngine and never the MockInternalEngine. It also avoids the NPE seen when updating the sequence ID, which occurred because InternalEngine's index method was called when NRTReplicationEngine's index method should have been.
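To make the effect of that override concrete, here is a minimal sketch of the engine choice as described in this comment. It is a toy model, not OpenSearch code: the class `EngineSelectionSketch`, the enum `EngineKind`, and the method `selectEngine` are hypothetical names, and the rule that an installed mock engine takes over every shard copy is an assumption drawn from the description above.

```java
public class EngineSelectionSketch {

    /** Hypothetical labels for the engine types named in this thread. */
    public enum EngineKind { INTERNAL, MOCK_INTERNAL, NRT_REPLICATION }

    /**
     * Simplified model of which engine a shard copy ends up with.
     * Assumption: when the test framework installs the mock engine,
     * it is used for every shard copy, including segrep replicas.
     */
    public static EngineKind selectEngine(boolean segRepEnabled,
                                          boolean isPrimary,
                                          boolean mockEngineInstalled) {
        if (mockEngineInstalled) {
            // The mock engine shadows the engine the replica should have used.
            return EngineKind.MOCK_INTERNAL;
        }
        if (segRepEnabled && !isPrimary) {
            // Segment replication replicas copy segments instead of indexing.
            return EngineKind.NRT_REPLICATION;
        }
        return EngineKind.INTERNAL;
    }

    public static void main(String[] args) {
        // Compare a segrep replica with and without the mock engine installed.
        for (boolean mock : new boolean[] { true, false }) {
            System.out.println("mock engine installed=" + mock
                + " -> segrep replica uses " + selectEngine(true, false, mock));
        }
    }
}
```

In this model, returning false from addMockInternalEngine() corresponds to always passing `mockEngineInstalled = false`, so a segrep replica can only ever get the NRT replication engine.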

The remaining get calls fail when segment replication is enabled and a network disruption is introduced, as this test does. This is because all get calls are realtime by default. With segment replication enabled, realtime get calls are currently supported by routing them to the primary shard, which is the expected behavior.

This test passes intermittently, and in every passing case the get response is served by a primary shard. When the get is instead sent to a replica shard (because the primary is unreachable), it fails, since replicas have not caught up to the primary's realtime data.
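The routing constraint described here can be summarized in a small model. This is only a sketch of the reasoning in this comment, not the actual OpenSearch routing code: `RealtimeGetRoutingSketch`, `ReplicationType`, `Copy`, and `canServeGet` are hypothetical names, and the assumption that a non-realtime get can be answered by a segrep replica (possibly with stale data) is mine rather than something stated in this thread.

```java
public class RealtimeGetRoutingSketch {

    /** Hypothetical labels for the two replication modes discussed here. */
    public enum ReplicationType { DOCUMENT, SEGMENT }

    /** Which shard copy a get request lands on. */
    public enum Copy { PRIMARY, REPLICA }

    /**
     * Returns true if the given copy can answer the get correctly.
     * Under segment replication a replica only sees data after the next
     * segment copy, so it cannot answer a realtime get.
     */
    public static boolean canServeGet(ReplicationType type, boolean realtime, Copy target) {
        if (type == ReplicationType.SEGMENT && realtime && target == Copy.REPLICA) {
            return false;
        }
        // Assumption: every other combination can be served by the target copy.
        return true;
    }

    public static void main(String[] args) {
        // Print the full decision table for both replication types.
        for (ReplicationType type : ReplicationType.values()) {
            for (boolean realtime : new boolean[] { true, false }) {
                for (Copy target : Copy.values()) {
                    System.out.println(type + " realtime=" + realtime + " target=" + target
                        + " -> " + canServeGet(type, realtime, target));
                }
            }
        }
    }
}
```

The only failing row is the one this test hits under a network disruption: a realtime get under segment replication that cannot reach the primary.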

@Poojita-Raj

The expected behavior for segrep differs from docrep for this test, so the failure is expected.
