[CI] o.o.gateway.RecoveryFromGatewayIT.testReuseInFileBasedPeerRecovery #1746

Closed
nknize opened this issue Dec 16, 2021 · 5 comments
Labels
bug (Something isn't working), CI (CI related), discuss (Issues intended to help drive brainstorming and decision making), flaky-test (Random test failure that succeeds on second run), >test-failure (Test failure from CI, local build, etc.)

Comments

@nknize
Collaborator

nknize commented Dec 16, 2021

Failed on unrelated PR #1742. Not reproducible locally. Opening to track if this continues to fail.

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.gateway.RecoveryFromGatewayIT.testReuseInFileBasedPeerRecovery" -Dtests.seed=1183E5842BAA4635 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m -Djava.security.manager=allow" -Dtests.locale=ja-JP -Dtests.timezone=Pacific/Kwajalein -Druntime.java=17
org.opensearch.gateway.RecoveryFromGatewayIT > testReuseInFileBasedPeerRecovery FAILED
    java.lang.AssertionError: shard [test][0] on node [node_t1] has pending operations:
     --> RetentionLeaseBackgroundSyncAction.Request{retentionLeases=RetentionLeases{primaryTerm=1, version=1468, leases={peer_recovery/_23_6236SQekFQ4X2S5HWQ=RetentionLease{id='peer_recovery/_23_6236SQekFQ4X2S5HWQ', retainingSequenceNumber=1333, timestamp=1639665151286, source='peer recovery'}, peer_recovery/txItgQoGQoyJSR_SvgGAIQ=RetentionLease{id='peer_recovery/txItgQoGQoyJSR_SvgGAIQ', retainingSequenceNumber=1333, timestamp=1639665151286, source='peer recovery'}}}, shardId=[test][0], timeout=1m, index='test', waitForActiveShards=0}
    	at org.opensearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:248)
    	at org.opensearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:3231)
    	at org.opensearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:1117)
    	at org.opensearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:434)
    	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
    	at org.opensearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:378)
    	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:91)
    	at org.opensearch.transport.TransportService$8.doRun(TransportService.java:944)
    	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792)
    	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    	at java.base/java.lang.Thread.run(Thread.java:833)
        at __randomizedtesting.SeedInfo.seed([1183E5842BAA4635:471B0010ABF29B6E]:0)
        at org.opensearch.test.InternalTestCluster.lambda$assertNoPendingIndexOperations$12(InternalTestCluster.java:1434)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1060)
        at org.opensearch.test.InternalTestCluster.assertNoPendingIndexOperations(InternalTestCluster.java:1421)
        at org.opensearch.test.InternalTestCluster.beforeIndexDeletion(InternalTestCluster.java:1349)
        at org.opensearch.test.OpenSearchIntegTestCase.beforeIndexDeletion(OpenSearchIntegTestCase.java:636)
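
The trace above shows the pending-operations check running inside `OpenSearchTestCase.assertBusy`, so the assertion was presumably retried until a time budget ran out while the retention lease sync was still holding a permit. The following is a minimal sketch of that retry-until-timeout pattern, not OpenSearch's actual implementation: the class name `AssertBusySketch`, the `Check` interface, the backoff values, and the simulated counter are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AssertBusySketch {
    /** Hypothetical stand-in for a check that may throw AssertionError. */
    public interface Check {
        void run() throws Exception;
    }

    /**
     * Retries the check with a capped exponential backoff until it passes
     * or the time budget is exhausted. On timeout the last AssertionError
     * is rethrown, which is how a slow background operation surfaces as a
     * test failure.
     */
    public static void assertBusy(Check check, long maxWaitMillis) throws Exception {
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return; // check passed
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // budget exhausted, report the last failure
                }
            }
            Thread.sleep(sleepMillis);
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated in-flight operations that drain by one on every check.
        AtomicInteger pending = new AtomicInteger(3);
        assertBusy(() -> {
            int remaining = pending.getAndUpdate(n -> Math.max(0, n - 1));
            if (remaining > 0) {
                throw new AssertionError("shard has pending operations: " + remaining);
            }
        }, 1_000);
        System.out.println("no pending operations");
    }
}
```

With this shape, a check only fails the test if the condition never becomes true within the budget, which would be consistent with an intermittent failure that depends on how long a background sync takes on a given CI run.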
@nknize nknize added bug Something isn't working >test-failure Test failure from CI, local build, etc. CI CI related v2.0.0 Version 2.0.0 untriaged labels Dec 16, 2021
@anasalkouz anasalkouz added flaky-test Random test failure that succeeds on second run and removed untriaged labels Dec 28, 2021
@dreamer-89
Member

Another occurrence: PR 2026

Gradle log

@dblock
Member

dblock commented Feb 14, 2022

#2069 (comment)

@anasalkouz anasalkouz added v2.1.0 Issues and PRs related to version 2.1.0 and removed v2.0.0 Version 2.0.0 labels Apr 12, 2022
@saratvemulapalli saratvemulapalli added v2.2.0 and removed v2.1.0 Issues and PRs related to version 2.1.0 labels Jun 28, 2022
@kartg
Member

kartg commented Aug 2, 2022

It doesn't look like this flaky test failure has been referenced since April. That said, I could not find any explicit fix for this test that would suggest the issue has been resolved. Should we close this issue and assume it has fixed itself along the way?

@kartg kartg added discuss Issues intended to help drive brainstorming and decision making and removed v2.2.0 labels Aug 2, 2022
@CEHENKLE
Member

CEHENKLE commented Aug 3, 2022

The real friends are the tests we made along the way ;)

I'm for shooting it and seeing if it reappears, but interested to hear what other folks think.

@joshpalis
Member

Ran this test 1000 times in isolation and was not able to reproduce the failure. Closing, as there have been no occurrences since April.

./gradlew ':server:internalClusterTest' --tests "org.opensearch.gateway.RecoveryFromGatewayIT.testReuseInFileBasedPeerRecovery" -Dtests.seed=1183E5842BAA4635 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m -Djava.security.manager=allow" -Dtests.locale=ja-JP -Dtests.timezone=Pacific/Kwajalein -Dtests.iters=1000 

> Configure project :qa:os
Cannot add task 'destructiveDistroTest.docker' as a task with that name already exists.
=======================================
OpenSearch Build Hamster says Hello!
  Gradle Version        : 7.6
  OS Info               : Linux 5.4.225-139.416.amzn2int.x86_64 (amd64)
  JDK Version           : 17 (OpenJDK)
  JAVA_HOME             : /opt/jdk-17
  Random Testing Seed   : 1183E5842BAA4635
  In FIPS 140 mode      : false
=======================================

> Task :server:internalClusterTest
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/local/home/jpalis/repos/flaky-tests/OpenSearch/test/framework/build/distributions/framework-3.0.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/local/home/jpalis/.gradle/wrapper/dists/gradle-7.6-all/9f832ih6bniajn45pbmqhk2cw/gradle-7.6/lib/plugins/gradle-testing-base-7.6.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release

BUILD SUCCESSFUL in 17m 56s
43 actionable tasks: 1 executed, 42 up-to-date
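
One possible reading of why 1000 iterations with the original seed still passed: a fixed seed pins the pseudo-random choices a test makes, but it does not pin thread scheduling. The sketch below illustrates that distinction only and is not a diagnosis of this failure. The class `SeedVersusTimingSketch` and both of its methods are hypothetical, and the seed is reused from the report purely as an example value.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.atomic.AtomicBoolean;

public class SeedVersusTimingSketch {
    /** Draws pseudo-random values from a seed, the way a seeded test derives its inputs. */
    public static int[] seededChoices(long seed, int count) {
        Random random = new Random(seed);
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextInt(1000);
        }
        return values;
    }

    /** Returns true if an immediate check still sees the background task in flight. */
    public static boolean checkSeesPendingWork() throws InterruptedException {
        AtomicBoolean pending = new AtomicBoolean(true);
        Thread background = new Thread(() -> pending.set(false));
        background.start();
        boolean observed = pending.get(); // races with the background thread
        background.join();
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        long seed = 0x1183E5842BAA4635L; // example value taken from the report

        // Same seed, same sequence: this part of a run is reproducible.
        boolean sameInputs = Arrays.equals(seededChoices(seed, 5), seededChoices(seed, 5));
        System.out.println("inputs identical across runs: " + sameInputs);

        // The outcome of the race depends on scheduling, not on the seed,
        // so the count printed here can differ from run to run.
        int sawPending = 0;
        for (int i = 0; i < 1000; i++) {
            if (checkSeesPendingWork()) {
                sawPending++;
            }
        }
        System.out.println("iterations that saw pending work: " + sawPending + " of 1000");
    }
}
```

If the original failure was timing-dependent in this way, repeating the test in isolation on an otherwise idle machine may simply never hit the unlucky interleaving that a loaded CI host did.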
