Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UpgradeClusterClientYamlTestSuiteIT fails intermittently #30456

Closed
alpar-t opened this issue May 8, 2018 · 30 comments
Closed

UpgradeClusterClientYamlTestSuiteIT fails intermittently #30456

alpar-t opened this issue May 8, 2018 · 30 comments
Assignees
Labels
:ml Machine learning >test-failure Triaged test failures from CI v6.4.0 v7.0.0-beta1

Comments

@alpar-t
Copy link
Contributor

alpar-t commented May 8, 2018

Discovered while building #30367 Failed

Reproducible using :

./gradlew :x-pack:qa:rolling-upgrade:with-system-key:v6.4.0-SNAPSHOT#mixedClusterTestRunner
-Dtests.seed=D2CB09031F59783F
-Dtests.class=org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT
-Dtests.security.manager=true -Dtests.locale=en-BB -Dtests.timezone=Australia/ACT
-Dtests.rest.suite=mixed_cluster

Saw different failures while trying to reproduce it in isolation:

 2> REPRODUCE WITH: ./gradlew :x-pack:qa:rolling-upgrade:with-system-key:v6.4.0-SNAPSHOT#mixedClusterTestRunner -Dtests.seed=D2CB09031F59783F -Dtests.class=org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT -Dtests.method="test {p0=mixed_cluster/30_ml_jobs_crud/Test get old cluster job}" -Dtests.security.manager=true -Dtests.locale=en-BB -Dtests.timezone=Australia/ACT -Dtests.rest.suite=mixed_cluster
FAILURE 1.40s | UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/30_ml_jobs_crud/Test get old cluster job} <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: Failure at [mixed_cluster/30_ml_jobs_crud:33]: field [jobs.0.node] doesn't have a true value
   > Expected: not null
   >      but: was null
   >    at __randomizedtesting.SeedInfo.seed([D2CB09031F59783F:5A9F36D9B1A515C7]:0)
   >    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:343)
   >    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:320)
   >    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   >    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   >    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
   >    at java.base/java.lang.Thread.run(Thread.java:844)
   > Caused by: java.lang.AssertionError: field [jobs.0.node] doesn't have a true value
   > Expected: not null
   >      but: was null
   >    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >    at org.elasticsearch.test.rest.yaml.section.IsTrueAssertion.doAssert(IsTrueAssertion.java:55)
   >    at org.elasticsearch.test.rest.yaml.section.Assertion.execute(Assertion.java:76)
   >    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:336)
   >    ... 37 more

and

> Task :x-pack:qa:rolling-upgrade:with-system-key:v6.4.0-SNAPSHOT#oldClusterTestRunner
==> Test Summary: 1 suite, 10 tests, 1 ignored (1 assumption)
   [junit4] JVM J0:     1.08 ..    22.93 =    21.85s
   [junit4] Execution time total: 22 seconds
   [junit4] Tests summary: 1 suite, 10 tests, 1 ignored (1 assumption)

> Task :x-pack:qa:rolling-upgrade:with-system-key:v6.4.0-SNAPSHOT#mixedClusterTestRunner
   [junit4] <JUnit4> says ciao! Master seed: D2CB09031F59783F
==> Test Info: seed=D2CB09031F59783F; jvm=1; suite=1

Suite: org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT
  1> [2018-05-08T21:32:10,505][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/40_ml_datafeed_crud/Put job and datafeed in mixed cluster}]: before test
  1> [2018-05-08T21:32:10,636][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] initializing REST clients against [http://[::1]:33111]
  1> [2018-05-08T21:32:11,479][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] initializing yaml client, minimum es version: [6.4.0] master version: [6.4.0] hosts: [http://[::1]:33111]
  1> [2018-05-08T21:32:12,210][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/40_ml_datafeed_crud/Put job and datafeed in mixed cluster}]: after test
  1> [2018-05-08T21:32:12,234][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/20_security/Verify user and role in mixed cluster}]: before test
  1> [2018-05-08T21:32:12,965][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/20_security/Verify user and role in mixed cluster}]: after test
  1> [2018-05-08T21:32:12,974][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/10_basic/Index data and search on the mixed cluster}]: before test
  1> [2018-05-08T21:32:13,795][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/10_basic/Index data and search on the mixed cluster}]: after test
  1> [2018-05-08T21:32:13,805][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/30_ml_jobs_crud/Test get old cluster job}]: before test
  1> [2018-05-08T21:32:14,042][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "count" : 1,
  1>       "jobs" : [
  1>         {
  1>           "job_id" : "old-cluster-job",
  1>           "data_counts" : {
  1>             "job_id" : "old-cluster-job",
  1>             "processed_record_count" : 0,
  1>             "processed_field_count" : 0,
  1>             "input_bytes" : 0,
  1>             "input_field_count" : 0,
  1>             "invalid_date_count" : 0,
  1>             "missing_field_count" : 0,
  1>             "out_of_order_timestamp_count" : 0,
  1>             "empty_bucket_count" : 0,
  1>             "sparse_bucket_count" : 0,
  1>             "bucket_count" : 0,
  1>             "input_record_count" : 0
  1>           },
  1>           "model_size_stats" : {
  1>             "job_id" : "old-cluster-job",
  1>             "result_type" : "model_size_stats",
  1>             "model_bytes" : 146088,
  1>             "total_by_field_count" : 4,
  1>             "total_over_field_count" : 0,
  1>             "total_partition_field_count" : 2,
  1>             "bucket_allocation_failures_count" : 0,
  1>             "memory_status" : "ok",
  1>             "log_time" : 1525779101000,
  1>             "timestamp" : 1403481600000
  1>           },
  1>           "state" : "closed"
  1>         }
  1>       ]
  1>     }
  1>   }
  1> }]
  1> [2018-05-08T21:32:14,168][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/30_ml_jobs_crud/Test get old cluster job}]: after test
  2> REPRODUCE WITH: ./gradlew :x-pack:qa:rolling-upgrade:with-system-key:v6.4.0-SNAPSHOT#mixedClusterTestRunner -Dtests.seed=D2CB09031F59783F -Dtests.class=org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT -Dtests.method="test {p0=mixed_cluster/30_ml_jobs_crud/Test get old cluster job}" -Dtests.security.manager=true -Dtests.locale=en-BB -Dtests.timezone=Australia/ACT -Dtests.rest.suite=mixed_cluster
FAILURE 0.38s | UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/30_ml_jobs_crud/Test get old cluster job} <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: Failure at [mixed_cluster/30_ml_jobs_crud:19]: jobs.0.data_counts.processed_record_count didn't match expected value:
   > jobs.0.data_counts.processed_record_count: expected [2] but was [0]
   >    at __randomizedtesting.SeedInfo.seed([D2CB09031F59783F:5A9F36D9B1A515C7]:0)
   >    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:343)
   >    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:320)
   >    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   >    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   >    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
   >    at java.base/java.lang.Thread.run(Thread.java:844)
   > Caused by: java.lang.AssertionError: jobs.0.data_counts.processed_record_count didn't match expected value:
   > jobs.0.data_counts.processed_record_count: expected [2] but was [0]
   >    at org.elasticsearch.test.rest.yaml.section.MatchAssertion.doAssert(MatchAssertion.java:87)
   >    at org.elasticsearch.test.rest.yaml.section.Assertion.execute(Assertion.java:76)
   >    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:336)
   >    ... 37 more
  1> [2018-05-08T21:32:14,183][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/50_token_auth/Get the indexed token and use if to authenticate}]: before test
  1> [2018-05-08T21:32:14,671][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] There are still tasks running after this test that might break subsequent tests [internal:index/shard/recovery/prepare_translog, internal:index/shard/recovery/start_recovery, internal:index/shard/recovery/translog_ops].
  1> [2018-05-08T21:32:14,671][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/50_token_auth/Get the indexed token and use if to authenticate}]: after test
  1> [2018-05-08T21:32:14,704][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/30_ml_jobs_crud/Test get job with rules}]: before test
  1> [2018-05-08T21:32:15,014][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] There are still tasks running after this test that might break subsequent tests [internal:index/shard/recovery/start_recovery, internal:index/shard/recovery/translog_ops].
  1> [2018-05-08T21:32:15,014][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/30_ml_jobs_crud/Test get job with rules}]: after test
  1> [2018-05-08T21:32:15,045][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/20_security/verify users for default password migration in mixed cluster}]: before test
  1> [2018-05-08T21:32:15,431][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/20_security/verify users for default password migration in mixed cluster}]: after test
  1> [2018-05-08T21:32:15,440][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/10_basic/Start scroll in mixed cluster for upgraded}]: before test
  1> [2018-05-08T21:32:17,380][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/10_basic/Start scroll in mixed cluster for upgraded}]: after test
  1> [2018-05-08T21:32:17,398][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/10_basic/Basic scroll mixed}]: before test
  1> [2018-05-08T21:32:18,322][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/10_basic/Basic scroll mixed}]: after test
  1> [2018-05-08T21:32:18,337][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/30_ml_jobs_crud/Create a job in the mixed cluster and write some data}]: before test
  1> [2018-05-08T21:32:19,989][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/30_ml_jobs_crud/Create a job in the mixed cluster and write some data}]: after test
  1> [2018-05-08T21:32:19,995][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/40_ml_datafeed_crud/Test old cluster datafeed}]: before test
  1> [2018-05-08T21:32:20,135][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test {p0=mixed_cluster/40_ml_datafeed_crud/Test old cluster datafeed}]: after test
  2> NOTE: leaving temporary files on disk at: /home/alpar.torok/elasticsearch/x-pack/qa/rolling-upgrade/with-system-key/build/testrun/v6.4.0-SNAPSHOT#mixedClusterTestRunner/J0/temp/org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT_D2CB09031F59783F-001
  2> NOTE: test params are: codec=Asserting(Lucene70): {}, docValues:{}, maxPointsInLeafNode=203, maxMBSortInHeap=6.586752071408995, sim=RandomSimilarity(queryNorm=false): {}, locale=en-BB, timezone=Australia/ACT
  2> NOTE: Linux 4.15.0-1006-gcp amd64/Oracle Corporation 10.0.1 (64-bit)/cpus=8,threads=1,free=500541920,total=536870912
  2> NOTE: All tests run in this JVM: [UpgradeClusterClientYamlTestSuiteIT]
Completed [1/1] in 9.96s, 11 tests, 1 failure <<< FAILURES! 

First failure reproduced 3 times for, the second one only once, and got 3 successful runs as well.

@alpar-t alpar-t added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label May 8, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

alpar-t pushed a commit that referenced this issue May 8, 2018
@jasontedor
Copy link
Member

These look like ML issues, I am reassigning.

@jasontedor jasontedor added :ml Machine learning and removed :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels May 8, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@colings86 colings86 added the >test-failure Triaged test failures from CI label May 8, 2018
@davidkyle
Copy link
Member

Failures have been spotted on the 6.x and master branches. All failures on master occurred upgrading from 6.4 and on 6.x upgrading from 6.2.1, 6.2.3, 6.2.5 and 6.3.0.

I pulled the logs for one instance a unfortunately they don't provide much insight without debug level logging.

mixedcluster.log
oldcluster.log
upgradedcluster.log

@droberts195
Copy link
Contributor

I think the problem is one that we've had intermittently for ages. I imagine something has recently changed elsewhere in ES or the test infrastructure that makes it more likely that a bug that has existed basically forever to be triggered.

To summarise we've had two theories about what might cause this:

  1. Our results index has async durability. We've done testing in the past that shows this dramatically improves performance for our usage pattern. And it's usually reasonable because in the event of a complete node failure losing a result in the translog is no worse than losing a result that existed in memory but hadn't yet been submitted for indexing. But what we're doing wrong is that our flush and close endpoints are supposed to guarantee that when they return everything prior to the call is safely persisted. So ideally our flush and close endpoints should do something that ensures the translog is synced in addition to what they currently do before returning.
  2. When we write our results documents we're not waiting for acknowledgement from all replicas, only the primary. In the case of these rolling upgrade tests it's possible that the primary shard of our results index is on the node that gets taken out of the cluster to be replaced with one running the upgraded version. If our results document had not made it to the replica then it won't exist in the upgraded cluster. As with (1), in general this is fine, as a recent result could have been lost in memory if a node was brutally removed from the cluster. But with flush and close we're not giving the guarantee we should be. Ideally before returning from a flush or close we'd do something that ensured every document indexed to the results index for the job had been written to both primary and replica.

I suspect in the case of the tests (2) is more likely, because fsync would only have an impact in the case of OS or hardware failure, not abrupt shutdown of a user-space program. However, to meet the guarantees of ML's flush and close for real-world failures we need to find ways to fix both problems without degrading performance in cases where flush and close are called rarely.

@ywelsch
Copy link
Contributor

ywelsch commented May 9, 2018

When we write our results documents we're not waiting for acknowledgement from all replicas, only the primary

How is that possible? I'm not aware of any option in ES that would allow you to only wait on acknowledgement from the primary.

@bleskes
Copy link
Contributor

bleskes commented May 9, 2018

@droberts195 I'm not sure what you mean with:

we're not waiting for acknowledgement from all replicas, only the primary

ES does it all the time. Do you mean you don't ensure all replicas are fully allocated?

@droberts195
Copy link
Contributor

Do you mean you don't ensure all replicas are fully allocated?

Yes, sorry, this is what I meant. I think we need to add a wait_for_active_shards=all on some of our indexing operations.

@bleskes
Copy link
Contributor

bleskes commented May 9, 2018

@droberts195 double checking, but you mean in tests only, right? I'm not sure that's the right move for production.

@droberts195
Copy link
Contributor

@bleskes if wait_for_active_shards=all is wrong for production then the way we could probably solve this in the tests is to wait for green status after each of the "old cluster" tests. That should guarantee that both the primary and replica for our results index exist before we kill one of the nodes.

Our results index is created on demand from a template when the first result is indexed, and the job those tests run is tiny - it only receives 2 input documents - so I guess it's quite likely that the replica is not ready at the point where one of the nodes is killed prior to upgrade. @davidkyle has managed to reproduce this locally often enough to try out possible solutions.

@bleskes
Copy link
Contributor

bleskes commented May 9, 2018

@bleskes if wait_for_active_shards=all is wrong for production then the way we could probably solve this in the tests is to wait for green status after each of the "old cluster" tests. That should guarantee that both the primary and replica for our results index exist before we kill one of the nodes.

+1. The problem with wait_for_active_shard=all in production is that it will block indexing when a node dies and a shard copy is lost until the it is completely rebuilt on another node (which may not exist). I think this is not acceptable.

@davidkyle
Copy link
Member

The failures are characterised by missing documents after a node has been upgraded, for example when a job is closed in the old cluster its model state is persisted but often this document cannot be found in the mixed cluster after a single node is upgraded. Here are some of the issues I believe may be the cause

  1. Translog settings
    The ML indices are configured with index.translog.durability: async. The rolling upgrade tests kill -9 the node after the suite is run before upgrading, it seems likely that the missing documents are in the translog which does not have an oppourtunity to flush before being brutally killed. However, I have not been able to confirm this hypothesis, defaulting the index.translog.durability setting or flushing the indices after the test does help reduce the frequency of failures but failures still occur.

  2. Replica Recovery
    If the problem is not the translog durabilty settting then could it be the missing document are in a shard that is still initialising or recovering? Waiting for no initialising shards in test setup does not help the problem.

  3. Auto Expand Replicas
    Indices are created on demand with templates and have index.auto_expand_replicas: 0-1 set. Could this combination of settings be problematic.

  4. Flush Sync
    I can get the tests to pass consistently with a combination of flush and flush sync at the end of tests I believe this enables speedy recovery of replicas after upgrade. I'm not certain what the cause is and I feel like I'm wrapping more and more duct tape (flushes) around the problem until something holds.

@davidkyle
Copy link
Member

davidkyle commented May 11, 2018

@ywelsch This is the reproduce command for the master branch it fails fairly frequently for me

./gradlew -q :x-pack:qa:rolling-upgrade:with-system-key:bwcTest -Dtests.class=org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT -Dtests.method="test {p0=*/30_ml_jobs_crud/*}" 

Remember to unmute the test on line 33 of REPO_HOME/x-pack/qa/rolling-upgrade/src/test/java/org/elasticsearch/upgrades/UpgradeClusterClientYamlTestSuiteIT.java

@ywelsch
Copy link
Contributor

ywelsch commented May 12, 2018

I have reproduced the issue and also have an explanation and fix in the works. The short version is that this is caused by my recent PR #30423 (sorry for that).

The steps that lead to this situation are as follows:

  • 2 node cluster (minimum_master_nodes = 2), .ml-state index that has durability: async and uses auto-expand-replicas.
  • The document old-cluster-job_model_state_1526126821#1 is written to both primary and replica of [.ml-state][4] and acknowledged:
[2018-05-12T14:07:01,544][TRACE][o.e.i.s.IndexShard       ] [node-1] [.ml-state][4] index [doc][old-cluster-job_model_state_1526126821#1] (seq# [-2])
[2018-05-12T14:07:01,546][TRACE][o.e.a.b.TransportShardBulkAction] [node-1] [[.ml-state][4]] op [indices:data/write/bulk[s]] completed on primary for request [BulkShardRequest [[.ml-state][4]] containing [index {[.ml-state][doc][old-cluster-job_model_state_1526126821#1], source[n/a, actual length: [8.8kb], max length: 2kb]}]]
[2018-05-12T14:07:01,547][TRACE][o.e.i.s.IndexShard       ] [node-0] [.ml-state][4] index [doc][old-cluster-job_model_state_1526126821#1] (seq# [0])
[2018-05-12T14:07:01,551][TRACE][o.e.a.b.TransportShardBulkAction] [node-0] action [indices:data/write/bulk[s][r]] completed on shard [[.ml-state][4]] for request [BulkShardRequest [[.ml-state][4]] containing [index {[.ml-state][doc][old-cluster-job_model_state_1526126821#1], source[n/a, actual length: [8.8kb], max length: 2kb]}]]
[2018-05-12T14:07:01,552][TRACE][o.e.a.b.TransportShardBulkAction] [node-1] operation succeeded. action [indices:data/write/bulk[s]],request [BulkShardRequest [[.ml-state][4]] containing [index {[.ml-state][doc][old-cluster-job_model_state_1526126821#1], source[n/a, actual length: [8.8kb], max length: 2kb]}]]
  • Before restarting node1, the cluster state looks as follows:
[2018-05-12T14:07:01,777][TRACE][o.e.c.s.MasterService    ] [node-0] cluster state updated, source [finalize_job_execution [old-cluster-job]]
cluster uuid: G7JOOS_BQPqMEI0zw09cWg
version: 55
   {node-1}{PBVBkxlbQNmFM7-IbxrUqg}{5eo-iP2CSaGdKRaiTP4Gfg}{127.0.0.1}{127.0.0.1:42143}{testattr=test, ml.machine_memory=67423862784, ml.max_open_jobs=20, ml.enabled=true}
   {node-0}{DzJ4MOcvQH6v9pl0gIPByg}{_e0tM49IReiYFX5K_DRT6Q}{127.0.0.1}{127.0.0.1:33498}{ml.machine_memory=67423862784, testattr=test, ml.max_open_jobs=20, ml.enabled=true}, local, master
--------[.ml-state][4], node[PBVBkxlbQNmFM7-IbxrUqg], [P], s[STARTED], a[id=YCQPAECxTUWZnTV_UtxjQQ]
--------[.ml-state][4], node[DzJ4MOcvQH6v9pl0gIPByg], [R], s[STARTED], a[id=ni6u4fegQai4ydnx0pdhYg]
  • The node to be restarted (node1) had the primary for [.ml-state][4].
  • After the restart, node1 now has the name mixed-node-0, and the cluster state looks as follows:
[2018-05-12T14:07:17,603][TRACE][o.e.c.s.ClusterApplierService] [mixed-node-0] cluster state updated, source [apply cluster state (from master [master {node-0}{DzJ4MOcvQH6v9pl0gIPByg}{_e0tM49IReiYFX5K_DRT6Q}{127.0.0.1}{127.0.0.1:33498}{testattr=test, ml.machine_memory=67423862784, ml.max_open_jobs=20, ml.enabled=true} committed version [56]])]
cluster uuid: G7JOOS_BQPqMEI0zw09cWg
version: 56
nodes: 
   {mixed-node-0}{PBVBkxlbQNmFM7-IbxrUqg}{0KD922s0Q0eLKfbuqOhREg}{127.0.0.1}{127.0.0.1:44336}{ml.machine_memory=67423862784, upgraded=first, testattr=test, ml.max_open_jobs=20, ml.enabled=true}, local
   {node-0}{DzJ4MOcvQH6v9pl0gIPByg}{_e0tM49IReiYFX5K_DRT6Q}{127.0.0.1}{127.0.0.1:33498}{testattr=test, ml.machine_memory=67423862784, ml.max_open_jobs=20, ml.enabled=true}, master
--------[.ml-state][4], node[null], [P], recovery_source[existing recovery], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2018-05-12T12:07:17.488Z], delayed=true, details[node_left[PBVBkxlbQNmFM7-IbxrUqg]], allocation_status[fetching_shard_data]]
--------[.ml-state][4], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2018-05-12T12:07:17.490Z], delayed=false, allocation_status[no_attempt]]
  • Both primary and replica are unassigned, and in a subsequent update the primary of [.ml-state][4] gets allocated to the restarted node:
[2018-05-12T14:07:18,046][DEBUG][o.e.i.e.Engine           ] [mixed-node-0] [.ml-state][4] Safe commit [CommitPoint{segment[segments_2], userData[{history_uuid=kdzLwGGvS9GmFWJBITqyqw, local_checkpoint=-1, max_seq_no=-1, max_unsafe_auto_id_timestamp=-1, translog_generation=1, translog_uuid=zoDotg6hTVmzcUsOphr6Gg}]}], last commit [CommitPoint{segment[segments_2], userData[{history_uuid=kdzLwGGvS9GmFWJBITqyqw, local_checkpoint=-1, max_seq_no=-1, max_unsafe_auto_id_timestamp=-1, translog_generation=1, translog_uuid=zoDotg6hTVmzcUsOphr6Gg}]}]

[2018-05-12T14:07:18,068][TRACE][o.e.i.s.IndexShard       ] [mixed-node-0] [.ml-state][4] recovery completed from [shard_store], took [92ms]
    index    : files           [1] with total_size [230b], took[2ms]
             : recovered_files [0] with total_size [0b]
             : reusing_files   [1] with total_size [230b]
    verify_index    : took [0s], check_index [0s]
    translog : number_of_operations [0], took [83ms]
  • As the node was restarted using kill -9 (that's what we do in our rolling-upgrade tests), the translog was not flushed to disk, and the node therefore has lost the document. The recovery stats above show that no document is recovered.

Due to the async durability of the index, the test was relying on the fact that the shard on the restarted node would not be the one to be selected as primary after the restart, but that the shard copy on the non-started node would either remain primary or be promoted to primary.

This was inadvertently changed by PR #30423 (which I consider a bug), now making it so that when node1 leaves and rejoins the cluster in the same update, that we first auto-expand replicas from 1 to 0, and only then look at failing shards of the removed node, only to auto-expand back from 0 to 1 replicas in the same update. In our situation here, the primary was on the node to be removed and readded. The auto-expand-replicas functionality now removed the replica on node0 (the node that was not restarted), only to notice in a follow-up step that the sole remaining primary now was on the node that was removed. The primary shard would then be failed and moved to unassigned. In the next step we would readd former node1 and add back a replica due to auto-expansion. Now the master would prefer allocating the primary to the node that previously had the primary, which was former node1. As this node was the one that experienced data loss due to the kill -9, the replica on node0 would be peer-recovered from that broken copy and the overall cluster would experience data loss, ultimately resulting in

[2018-05-12T14:07:21,683][ERROR][o.e.x.m.j.p.StateStreamer] Expected 1 documents for model state for old-cluster-job snapshot 1526126821 but failed to find old-cluster-job_model_state_1526126821#1

I will be working on a fix for this.

@ywelsch ywelsch self-assigned this May 12, 2018
@droberts195
Copy link
Contributor

Thanks so much for investigating this @ywelsch. It would have taken us ages to work out what was happening.

ywelsch added a commit that referenced this issue May 14, 2018
#30423 combined auto-expansion in the same cluster state update where nodes are removed. As
the auto-expansion step would run before deassociating the dead nodes from the routing table, the
auto-expansion would possibly remove replicas from live nodes instead of dead ones. This commit
reverses the order to ensure that when nodes leave the cluster that the auto-expand-replica
functionality only triggers after failing the shards on the removed nodes. This ensures that active
shards on other live nodes are not failed if the primary resided on a now dead node.
Instead, one of the replicas on the live nodes first gets promoted to primary, and the auto-
expansion (removing replicas) only triggers in a follow-up step (but still same cluster state update).

Relates to #30456 and follow-up of #30423
ywelsch added a commit that referenced this issue May 14, 2018
#30423 combined auto-expansion in the same cluster state update where nodes are removed. As
the auto-expansion step would run before deassociating the dead nodes from the routing table, the
auto-expansion would possibly remove replicas from live nodes instead of dead ones. This commit
reverses the order to ensure that when nodes leave the cluster that the auto-expand-replica
functionality only triggers after failing the shards on the removed nodes. This ensures that active
shards on other live nodes are not failed if the primary resided on a now dead node.
Instead, one of the replicas on the live nodes first gets promoted to primary, and the auto-
expansion (removing replicas) only triggers in a follow-up step (but still same cluster state update).

Relates to #30456 and follow-up of #30423
@ywelsch
Copy link
Contributor

ywelsch commented May 15, 2018

I've verified that #30553 fixes the cause for seeing this issue occurring more often.

I think the problem is one that we've had intermittently for ages.

As I mentioned in an e-mail, the issue (of losing documents) will sometimes manifest if you don’t wait for the indices with async durability to be fully allocated to both nodes in this test. This should happen rarely in practice when running the test (as replicas should allocate pretty quickly), but could be the cause of the more spurious test failures that you were seeing before this issue got opened.

I will revert 27429ec to reenable the tests, but would also like to suggest adding an @After clause to UpgradeClusterClientYamlTestSuiteIT that waits for the .ml-state and .ml-anomalies-shared indices (which are both durability:async) to be fully allocated before killing the node.

@droberts195 droberts195 assigned davidkyle and unassigned ywelsch May 15, 2018
@droberts195
Copy link
Contributor

but would also like to suggest adding an @After clause to UpgradeClusterClientYamlTestSuiteIT that waits for the .ml-state and .ml-anomalies-shared indices

Maybe we could also do this just in the relevant ML tests? Then we won't cause problems if somebody is trying to just reproduce a problem in, say, the monitoring or watcher tests and is not running the ML ones at all.

Also, what is the best way to confirm that the replicas of an index with auto_expand_replicas has had its replicas allocated? Will wait_for_no_initializing_shards do it? Or is there a danger with that that at the instant it's called that the index hasn't yet auto-expanded?

@ywelsch
Copy link
Contributor

ywelsch commented May 15, 2018

Maybe we could also do this just in the relevant ML tests? Then we won't cause problems if somebody is trying to just reproduce a problem in, say, the monitoring or watcher tests and is not running the ML ones at all.

right, I didn't think about that.

Also, what is the best way to confirm that the replicas of an index with auto_expand_replicas has had its replicas allocated

As the two problematic indices (.ml-state and .ml-anomalies-shared) are both created in the old cluster with 2 nodes, and the cluster can only function with 2 nodes due to minimum_master_nodes = 2, the auto-expansion functionality (based on #30553) will always have index.number_of_replicas:1 in this cluster, even with nodes restarting. I think it's therefore sufficient to wait for green health of those two indices at the end of each ML test in old-cluster as well as mixed-cluster. The mixed-cluster can always allocate the replica, as the primary will always be on the older node (if it was on the node that got restarted, the active replica on the other node is promoted to primary, and if it was on the node that did not get restarted, it's going to stay there as primary) .

@davidkyle
Copy link
Member

There were 3 issues here the first an actual wire incompatibility problem fixed in #30512, secondly a bug in auto expanding replicas in dead nodes fixed in #30553 and finally because of async durability an extra precaution was added in #30615

@davidkyle
Copy link
Member

Argh it's back

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu&&virtual/2439/console

java.lang.AssertionError: Failure at [upgraded_cluster/30_ml_jobs_crud:33]: jobs.0.data_counts.processed_record_count didn't match expected value:
jobs.0.data_counts.processed_record_count: expected [2] but was [0]

This failed on master upgrading from 6.4 I'm not sure which build of 6.4 that was and if it had all the fixes in. I'll keep an eye on this

@davidkyle davidkyle reopened this May 16, 2018
@ywelsch
Copy link
Contributor

ywelsch commented May 16, 2018

@davidkyle That build used a last good commit from before you pushed your wait-for-green change:

Using last successful commit 9434f25 from the job elastic+elasticsearch+master+intake with the build id 20180516072627-C0E2E615 that was started at 2018-05-16T07:26:34.919Z and finished 57m 1s ago

@davidkyle
Copy link
Member

Closing as the last failure was based on an old commit.

Thanks @ywelsch

@imotov imotov reopened this May 30, 2018
@imotov
Copy link
Contributor

imotov commented May 30, 2018

@imotov
Copy link
Contributor

imotov commented May 30, 2018

@imotov
Copy link
Contributor

imotov commented May 30, 2018

When it rains it pours https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/5918/console As @jasontedor correctly pointed out this one is a complelty different issue. I opened #30982 for it.

@jasontedor
Copy link
Member

To me this looks like a different issue and I think that a new issue should be opened.

@imotov
Copy link
Contributor

imotov commented May 30, 2018

Yes, I think you are right the last one is something completely new. I shouldn't have merged them both here.

@ywelsch
Copy link
Contributor

ywelsch commented May 31, 2018

The first failure that @imotov reported is the same as the first failure that was originally reported by @atorok (i.e. field [jobs.0.node] doesn't have a true value).

While we fixed the second failure reported by @atorok (jobs.0.data_counts.processed_record_count didn't match expected value), I believe the first failure is of a different kind and has a different root cause.
I suspect an issue with persistent task assignment, and the guarantees that the different API calls give us.
@droberts195 / @davidkyle can one of you add an - match: { "acknowledged": true } after the open_job calls? I think it would be interesting to find out if the persistent task associated job was indeed properly assigned from the perspective of open_job.

@droberts195
Copy link
Contributor

I think the most recent failures are all due to the autodetect process core dumping with a SEGV. The first failure @imotov reported above contains this (and on Linux signal 11 is SEGV):

19:49:03 [2018-05-30T19:48:57,900][ERROR][o.e.x.m.j.p.l.CppLogMessageHandler] [controller/121564] [CDetachedProcessSpawner.cc@184] Child process with PID 121774 was terminated by signal 11

These C++ failures are split out into #30982. As far as I can see all yesterday's failures are caused by this new problem.

The full logs for the first failure reported by @atorok are no longer available, but there a total of three problems that were fixed for this issue. The first was a wire transport format incompatibility between master and 6.x that was fixed in #30512. @davidkyle found that very quickly during the original investigation of this issue, so I suspect this was the cause of the first reported failure.

@davidkyle
Copy link
Member

Closing as the investigation has shown that the recent failures are due to a SEGV in autodetect and is tracked in #30982

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >test-failure Triaged test failures from CI v6.4.0 v7.0.0-beta1
Projects
None yet
Development

No branches or pull requests

9 participants