Investigating TestMutationMdiToDedicated failure #614

Closed
barkbay opened this issue Apr 9, 2019 · 8 comments · Fixed by #1805

Labels: >bug Something isn't working

barkbay commented Apr 9, 2019

The e2e test TestMutationMdiToDedicated seems to fail consistently. On my laptop I get dozens of errors like this one during the step Cluster_health_should_not_have_been_red_during_mutation_process (the name of the test is misleading, as any error, e.g. a timeout, will trigger a failure):

Cluster health check failure at 2019-04-09 09:11:08.481942 +0200 CEST m=+232.212360466: Get http://test-mutation-mdi-to-dedicated-es.e2e.svc.cluster.local:9200/_cluster/health: no pod addresses found in service endpoints

I have tried to understand why the test is failing; here is what I found:

The first interesting thing is that the operator may remove the last pod in a Ready state (from a K8S pov), leaving the Kubernetes service without any endpoint to route a request:

(screenshot omitted)

I guess it is because the K8S state is not taken into account when the operator checks if it can delete a pod:

https://github.com/elastic/k8s-operators/blob/bac6e037efac300d51c0f68c71942121ba7627ef/operators/pkg/controller/elasticsearch/mutation/podrestrictions.go#L50-L55
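
For illustration, here is a minimal sketch (with hypothetical helper names, not the operator's actual API) of the kind of guard that could prevent deleting the last Ready pod, assuming client-go / corev1 types:

```go
package example

import corev1 "k8s.io/api/core/v1"

// isReady reports whether the pod's Ready condition is True.
func isReady(pod corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// canDelete refuses to delete the candidate pod if it is the only pod
// currently Ready, since the service would be left without any endpoint.
func canDelete(candidate corev1.Pod, allPods []corev1.Pod) bool {
	if !isReady(candidate) {
		return true // removing a non-ready pod cannot remove the last endpoint
	}
	otherReady := 0
	for _, p := range allPods {
		if p.Name != candidate.Name && isReady(p) {
			otherReady++
		}
	}
	return otherReady > 0
}
```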

A first question is: should we remove a node that is the last one in the Ready state?
Note that this behavior is mitigated when some data is present: data migration gives a node some time to move into a Ready state.

But the real problem is that when the former master leaves the cluster it can take more than 1 minute for the cluster to recover:

[2019-04-09T06:48:17,110][INFO ][o.e.d.z.ZenDiscovery     ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] master_left [{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}], reason [shut_down]
[2019-04-09T06:48:17,111][WARN ][o.e.d.z.ZenDiscovery     ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] master left (reason = shut_down), current nodes: nodes: 
   {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true}, local
   {test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}, master
   {test-mutation-mdi-to-dedicated-es-9666rsgvfl}{Ki4h1ZkrTqqI3BXPtRqJBw}{1fofStBBTR6RmySl29HyBw}{172.17.0.12}{172.17.0.12:9300}{xpack.installed=true}

[2019-04-09T06:48:20,219][INFO ][o.e.c.s.MasterService    ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] zen-disco-elected-as-master ([0] nodes joined), reason: new_master {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true}
[2019-04-09T06:48:47,943][WARN ][r.suppressed             ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] path: /_cat/master, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
...
[2019-04-09T06:48:50,236][WARN ][o.e.d.z.PublishClusterStateAction] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] timed out waiting for all nodes to process published state [19] (timeout [30s], pending nodes: [{test-mutation-mdi-to-dedicated-es-9666rsgvfl}{Ki4h1ZkrTqqI3BXPtRqJBw}{1fofStBBTR6RmySl29HyBw}{172.17.0.12}{172.17.0.12:9300}{xpack.installed=true}])
[2019-04-09T06:48:50,237][WARN ][o.e.d.z.PublishClusterStateAction] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] publishing cluster state with version [19] failed for the following nodes: [[{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}]]
[2019-04-09T06:48:50,239][INFO ][o.e.c.s.ClusterApplierService] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] new_master {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true}, reason: apply cluster state (from master [master {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true} committed version [19] source [zen-disco-elected-as-master ([0] nodes joined)]])
[2019-04-09T06:48:51,444][WARN ][o.e.c.NodeConnectionsService] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] failed to connect to node {test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [test-mutation-mdi-to-dedicated-es-j7cxq7f4nc][172.17.0.10:9300] connect_exception
...
[2019-04-09T06:48:57,780][WARN ][r.suppressed             ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] path: /_cat/master, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
...
[2019-04-09T06:49:07,842][WARN ][r.suppressed             ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] path: /_cat/master, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
...
[2019-04-09T06:49:10,037][WARN ][o.e.c.s.MasterService    ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] cluster state update task [zen-disco-elected-as-master ([0] nodes joined)] took [49.9s] above the warn threshold of 30s
[2019-04-09T06:49:10,039][INFO ][o.e.c.s.MasterService    ] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] zen-disco-node-failed({test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}), reason(transport disconnected)[{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true} transport disconnected], reason: removed {{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true},}
[2019-04-09T06:49:20,405][INFO ][o.e.c.s.ClusterApplierService] [test-mutation-mdi-to-dedicated-es-vlnkt2svz4] removed {{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {test-mutation-mdi-to-dedicated-es-vlnkt2svz4}{Hm0TkDYlRoygmT206xXEzQ}{B8xYzZPsS6OtFKatRhpOew}{172.17.0.11}{172.17.0.11:9300}{xpack.installed=true} committed version [20] source [zen-disco-node-failed({test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true}), reason(transport disconnected)[{test-mutation-mdi-to-dedicated-es-j7cxq7f4nc}{oHRJa9kbTa6qf8oknEPLbg}{vNTpy9onR5mpCew57WRAMQ}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true} transport disconnected]]])

I'm not sure I understand how this test has been successful in the past.
I think we should stop increasing the timeout and instead evaluate the expected downtime. If the cluster recovery can't be optimized, we could decide that, for instance, a downtime of 90 seconds is acceptable for this test.
If after 90 seconds there is still a timeout or the cluster is red, then the test should be considered failed.

sebgl commented Apr 9, 2019

> I'm not sure I understand how this test has been successful in the past

I think it was at some point, but then we added more checks (like making sure the cluster stayed green) and the tests started failing for other reasons, which needed some time and investigation.

It's nice that you're investigating this 👍 Looks like we cannot avoid the master election downtime?

barkbay commented Apr 9, 2019

As a side note, I don't understand why the minimum master nodes value is always updated to 1, since at some point there may be 2 masters in flight:

grep -r "Setting minimum master nodes to" ./*
./MDI_TEST_FAILURE/operator.log:2019-04-09T06:45:08.308Z	INFO	version	Setting minimum master nodes to 1
./MDI_TEST_FAILURE/operator.log:2019-04-09T06:45:08.994Z	INFO	version	Setting minimum master nodes to 1
./MDI_TEST_FAILURE/operator.log:2019-04-09T06:45:17.711Z	INFO	version	Setting minimum master nodes to 1
./MDI_TEST_FAILURE/operator.log:2019-04-09T06:45:57.433Z	INFO	version	Setting minimum master nodes to 1
./MDI_TEST_FAILURE/operator.log:2019-04-09T06:45:58.644Z	INFO	version	Setting minimum master nodes to 1
./MDI_TEST_FAILURE/operator.log:2019-04-09T06:48:16.919Z	INFO	version	Setting minimum master nodes to 1
./MDI_TEST_FAILURE/operator.log:2019-04-09T06:49:28.511Z	INFO	version	Setting minimum master nodes to 1
./MDI_TEST_FAILURE/operator.log:2019-04-09T06:49:29.111Z	INFO	version	Setting minimum master nodes to 1
./MDI_TEST_FAILURE/operator.log:2019-04-09T06:49:33.122Z	INFO	version	Setting minimum master nodes to 1
./MDI_TEST_FAILURE/operator.log:2019-04-09T06:49:33.756Z	INFO	version	Setting minimum master nodes to 1
./MDI_TEST_FAILURE/operator.log:2019-04-09T06:49:36.562Z	INFO	version	Setting minimum master nodes to 1
./MDI_TEST_FAILURE/operator.log:2019-04-09T06:50:27.153Z	INFO	version	Setting minimum master nodes to 1
./MDI_TEST_FAILURE/operator.log:2019-04-09T06:50:28.026Z	INFO	version	Setting minimum master nodes to 1

Is there not a risk of having a split brain?

sebgl commented Apr 9, 2019

It's possible we could have a bug here!

Looking at where we update minimum_master_nodes in the code:

  1. When we create a new pod, it is started with a configuration that includes minimum_master_nodes set according to the expected target configuration (e.g. to create a cluster with 5 masters, all pods will start with minimum_master_nodes=3; when adding 2 new master nodes to this 5-master cluster, they'll both get minimum_master_nodes=4).
  2. When all changes have been made, we also update both the transient and persistent cluster settings for minimum master nodes to match what we should have (considering the available pods).
  3. When we delete a pod, we update both the transient and persistent cluster settings for minimum master nodes to match what we currently have in the cluster (5 pods available? update it to 3).

What we don't do is update the minimum_master_nodes just after creating a pod though.
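
As a reminder, the target values in those examples follow the usual 6.x quorum rule (sketched here for reference only, not a quote of the operator's code):

```go
package example

// quorum returns the minimum_master_nodes value that avoids split brain
// for a given number of master-eligible nodes: a strict majority.
func quorum(masterEligible int) int {
	return masterEligible/2 + 1
}

// quorum(5) == 3  (creating a cluster with 5 masters)
// quorum(7) == 4  (the same cluster after adding 2 more masters)
// quorum(2) == 2  (the 1 MDI + 1 dedicated master situation discussed here)
```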

Trying to fit this to the e2e test where we mutate from 1MDI to 1M and 1D:

  1. Create 1 MDI: gets created with minimum_master_nodes=1
  2. Add 1 master and 1 data: both get created with minimum_master_nodes=1 (a single master in the target MD cluster)
  3. Cluster is up with 3 nodes, including 2 masters. minimum_master_nodes is 1. Changes are not done yet.
  4. Wait for shards migration to be over.
  5. Before deleting the MDI node, update minimum_master_nodes (computed from the current pod being removed => 1).
  6. Changes are done. Update minimum_master_nodes once again.

I think we should set the minimum_master_nodes after every pod creation? That is between 2. and 3. If we do it too early (before the new master pod is created), we provoke some downtime though.
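
For reference, a rough sketch of the dynamic update in question, assuming a plain HTTP client against the 6.x cluster settings API (the real operator goes through its own Elasticsearch client; the function name here is hypothetical):

```go
package example

import (
	"bytes"
	"fmt"
	"net/http"
)

// setMinimumMasterNodes updates discovery.zen.minimum_master_nodes (6.x)
// in both the transient and persistent cluster settings.
func setMinimumMasterNodes(esURL string, n int) error {
	body := fmt.Sprintf(
		`{"transient":{"discovery.zen.minimum_master_nodes":%d},"persistent":{"discovery.zen.minimum_master_nodes":%d}}`,
		n, n)
	req, err := http.NewRequest(http.MethodPut, esURL+"/_cluster/settings", bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("cluster settings update failed: %s", resp.Status)
	}
	return nil
}
```

Called between steps 2 and 3 (once the new master-eligible pod has actually joined), setMinimumMasterNodes(esURL, 2) would raise the setting before the old MDI node is removed.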

@nkvoll @pebrc what do you think? I think we did have this right at some point, then made some changes in the code, and now I'm not sure it is behaving correctly.

barkbay commented Apr 10, 2019

> If we do it too early (before the new master pod is created), we provoke some downtime though.

But if we do it too late, I think this can lead to a split-brain situation.
Looking at the logs, this is raised as a warning by Elasticsearch:

[2019-04-10T08:23:29,797][WARN ][o.e.d.z.ElectMasterService] [test-mutation-mdi-to-dedicated-es-9sl99nfx7r] value for setting "discovery.zen.minimum_master_nodes" is too low. This can result in data loss! Please set it to at least a quorum of master-eligible nodes (current value: [1], total number of master-eligible nodes used for publishing in this round: [2])

Regarding the election process, I have enabled trace logs; they are available here:

For information:

  • First master (test-mutation-mdi-to-dedicated-es-9sl99nfx7r) is started at 08:20:11,386
  • First master is shutting down at 08:23:41,099
  • Second master eligible node (test-mutation-mdi-to-dedicated-es-n77c4vcfxv) is started at 08:22:40,880
  • Dedicated data node (test-mutation-mdi-to-dedicated-es-4n6zhp2tc6) is started at 08:22:39,700

If we have a closer look at what is happening on the second eligible master:

  1. It starts an election at 08:23:41,377 👍
  2. At 08:23:42,394 it is trying to open a connection to the former master (172.17.0.10:9300)
  3. 08:23:42,521 it receives a "connect_exception"
  4. 08:23:43,381 it tries again to connect to 172.17.0.10:9300
  5. 08:23:44,382 "won election" is logged 👍
  6. but at 08:24:14,483 there is this message: "timed out waiting to be elected. waited [30s]. pending master node joins [0]"
  7. 08:24:24 we still have a 'MasterNotDiscoveredException', which I guess means that we still have no master
  8. 08:24:33,480 the former master seems to be removed from the cluster state: "removed {{test-mutation-mdi-to-dedicated-es-9sl99nfx7r}{CrBUA6rISaen7XsxvyvxXA}{YnsP3uGcT72wIXf3pB9juA}{172.17.0.10}{172.17.0.10:9300}{xpack.installed=true},}, reason: apply cluster state"
  9. Situation seems stable at 08:24:36,391

From what I understand, a new election is triggered quickly, but it takes some time to apply the new cluster state because it looks like the new master is waiting for an ACK from the former one.
I'm wondering if file-based discovery would improve the situation, as the former master would be removed more quickly from the cluster state?

@pebrc pebrc added the >bug Something isn't working label Apr 11, 2019
@pebrc pebrc added this to the Alpha milestone Apr 11, 2019
@barkbay barkbay mentioned this issue May 6, 2019

barkbay commented May 6, 2019

Summary of the discussion I had with @boaz on this case:

  • Having a red cluster is expected when a master is lost
  • During the test a new master is elected as expected, but the new master then has to update the cluster state. When it attempts to publish the new state, it tries to establish a connection to the former master. We would expect a quick failure here because nothing is supposed to be listening anymore. It seems that something in the K8S environment prevents this quick failure.

TODO:

If there is nothing we can really do about that, we could keep the test but flag it as "slow".

DaveCTurner commented May 7, 2019

> It seems that something in the K8S environment prevents this quick failure.

I think that "something" is that when a container (pod?) shuts down its network interface goes away pretty much immediately, meaning that reconnection attempts receive no response. Contrast this to a process running outside of a container where the process can shut down but the network interface will live on, at least for long enough to reject any further traffic to the shut-down node.

I think this is consistent with what we've seen elsewhere (see e.g. elastic/helm-charts#63 or helm/charts#8785).

There are a few workarounds that I know of right now:

@DaveCTurner

Re. #614 (comment), unfortunately there's no 100% safe way to add or remove master-eligible nodes from a 6.x cluster with one or two master nodes.

You can make it slightly safer than the process described there by starting the new master with minimum_master_nodes=2 in its static config (elasticsearch.yml or command-line args). This'll prevent it from electing itself master if for some reason it can't find the real master at all. The value of the dynamic setting overwrites the static setting once the node has joined the cluster, so this only offers limited protection.

Then I agree that it'd be good to update the dynamic minimum_master_nodes setting to 2 as soon as possible after the new master has joined. There's an unavoidable period of time (in 6.x) where there are two master-eligible nodes with minimum_master_nodes=1, but the shorter this period is, the better.

Similarly when shutting the old master node down: reduce minimum_master_nodes back to 1 just before shutting it down. Again, this risky state is unavoidable (in 6.x), but the shorter it is, the better.
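
Put together, the ordering described in the last three paragraphs could be sketched roughly like this (reusing the hypothetical setMinimumMasterNodes helper from the earlier sketch; an illustration of the sequence, not the operator's implementation):

```go
package example

// replaceSingleMaster sketches the 6.x sequence described above for swapping
// the only master-eligible node, assuming the setMinimumMasterNodes helper
// from the earlier sketch is in scope.
func replaceSingleMaster(esURL string) error {
	// 1. Start the new master-eligible pod with minimum_master_nodes=2 in its
	//    static config (elasticsearch.yml), so it cannot elect itself alone.

	// 2. As soon as the new node has joined, raise the dynamic setting to 2
	//    to shorten the risky window with two masters and a quorum of 1.
	if err := setMinimumMasterNodes(esURL, 2); err != nil {
		return err
	}

	// 3. Just before shutting the old master down, drop back to 1 so the
	//    remaining master-eligible node can win the next election on its own.
	if err := setMinimumMasterNodes(esURL, 1); err != nil {
		return err
	}

	// 4. Delete the old master-eligible pod.
	return nil
}
```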

@pebrc pebrc removed this from the Alpha milestone May 10, 2019

barkbay commented Sep 25, 2019

Closing this one, as new efforts have been made to improve the situation while working on StatefulSets.
See #1281 and #1693.
