SmokeTestMultiNodeClientYamlTestSuiteIT/indices.stats/20_translog failed retaining too much translog #46425

Closed
DaveCTurner opened this issue Sep 6, 2019 · 1 comment · Fixed by #46476
Labels: :Distributed Indexing/Engine, >test-failure, v8.0.0-alpha1

DaveCTurner commented Sep 6, 2019

On repeated runs of org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT I hit the following failure:

org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT > test {yaml=indices.stats/20_translog/Translog retention without soft_deletes} FAILED
    java.lang.AssertionError: Failure at [indices.stats/20_translog:62]: field [indices.test.primaries.translog.size_in_bytes] is not less than or equal to [$creation_size]
    Expected: a value less than or equal to <110>
         but: <285> was greater than <110>

        Caused by:
        java.lang.AssertionError: field [indices.test.primaries.translog.size_in_bytes] is not less than or equal to [$creation_size]
        Expected: a value less than or equal to <110>
             but: <285> was greater than <110>
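
For context, the assertion at `indices.stats/20_translog:62` presumably has roughly the following shape. This is a sketch inferred from the error message only (the field path and the `creation_size` stash key appear there); the surrounding steps and the exact `indices.stats` arguments are assumptions, not copied from the test file.

```yaml
# Sketch only: step layout is assumed, not taken from 20_translog.yml.
- do:
    indices.stats:
      metric: [ translog ]
# Stash the translog size measured right after index creation.
- set: { indices.test.primaries.translog.size_in_bytes: creation_size }

# ... index a document, flush, then fetch the stats again (assumed steps) ...

- do:
    indices.stats:
      metric: [ translog ]
# The failing check: retained translog should be no larger than at creation.
- lte: { indices.test.primaries.translog.size_in_bytes: $creation_size }
```

In the failing run the stashed value was 110 bytes and the later reading was 285 bytes.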

The REPRODUCE WITH line said:

REPRODUCE WITH: ./gradlew ':qa:smoke-test-multinode:integTestRunner' --tests "org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT" -Dtests.method="test {yaml=indices.stats/20_translog/Translog retention without soft_deletes}" -Dtests.seed=D3278D281C0378A6 -Dtests.security.manager=true -Dtests.jvms=4 -Dtests.locale=es-IC -Dtests.timezone=Kwajalein -Dcompiler.java=12 -Druntime.java=12

However, this did not reproduce for me in ~30 retries.

I see a small number of similar failures from CI too, for instance https://gradle-enterprise.elastic.co/s/wi2oqndt4254w/console-log?task=:qa:smoke-test-multinode:integTestRunner.

There's not much information in the logs either:

  1> [2019-09-06T16:30:09,601][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] [yaml=indices.stats/20_translog/Translog retention without soft_deletes] before test
  1> [2019-09-06T16:30:10,653][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "_shards" : {
  1>         "total" : 2,
  1>         "successful" : 2,
  1>         "failed" : 0
  1>       },
  1>       "_all" : {
  1>         "primaries" : {
  1>           "translog" : {
  1>             "operations" : 1,
  1>             "size_in_bytes" : 285,
  1>             "uncommitted_operations" : 0,
  1>             "uncommitted_size_in_bytes" : 55,
  1>             "earliest_last_modified_age" : 0
  1>           }
  1>         },
  1>         "total" : {
  1>           "translog" : {
  1>             "operations" : 1,
  1>             "size_in_bytes" : 285,
  1>             "uncommitted_operations" : 0,
  1>             "uncommitted_size_in_bytes" : 55,
  1>             "earliest_last_modified_age" : 0
  1>           }
  1>         }
  1>       },
  1>       "indices" : {
  1>         "test" : {
  1>           "uuid" : "ZymXAFZgRFaFdk3a1DetOg",
  1>           "primaries" : {
  1>             "translog" : {
  1>               "operations" : 1,
  1>               "size_in_bytes" : 285,
  1>               "uncommitted_operations" : 0,
  1>               "uncommitted_size_in_bytes" : 55,
  1>               "earliest_last_modified_age" : 0
  1>             }
  1>           },
  1>           "total" : {
  1>             "translog" : {
  1>               "operations" : 1,
  1>               "size_in_bytes" : 285,
  1>               "uncommitted_operations" : 0,
  1>               "uncommitted_size_in_bytes" : 55,
  1>               "earliest_last_modified_age" : 0
  1>             }
  1>           }
  1>         }
  1>       }
  1>     },
  1>     "creation_size" : 110
  1>   }
  1> }]
  1> [2019-09-06T16:30:11,095][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] [yaml=indices.stats/20_translog/Translog retention without soft_deletes] after test

More logs attached here: failure-1567755613.tar.gz

DaveCTurner added the >test-failure, :Distributed Indexing/Engine, and v8.0.0 labels Sep 6, 2019
@elasticmachine

Pinging @elastic/es-distributed

dnhatn self-assigned this Sep 7, 2019
dnhatn added a commit that referenced this issue Sep 9, 2019
We leave replicas unassigned until the reroute that follows the start of
the primary shard. If a cluster health request with
wait_for_no_initializing_shards is executed before that reroute, it
returns immediately even though some replicas are about to start
initializing. Peer recoveries of those replicas can prevent the translog
on the primary from being trimmed.

We add wait_for_events to the cluster health request so that it
executes after the reroute.

Closes #46425
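
A minimal sketch of what the change described in this commit message might look like in a YAML REST test. The `wait_for_events` parameter name comes from the commit message; the `languid` priority value and the placement of the step are assumptions.

```yaml
# Sketch only: wait for pending cluster-state tasks (including the reroute
# that assigns replicas) before checking that nothing is initializing.
- do:
    cluster.health:
      wait_for_no_initializing_shards: true
      wait_for_events: languid   # assumed value; the lowest priority, so queued higher-priority tasks run first
```
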
dnhatn added a commit that referenced this issue Sep 10, 2019
dnhatn added a commit that referenced this issue Sep 11, 2019