
Cluster UUID changed when upgrading single master 6.8.x to new nodeSet 7.3.x #2397

Closed
sebgl opened this issue Jan 10, 2020 · 3 comments · Fixed by #2404
Labels
>bug Something isn't working

Comments

sebgl commented Jan 10, 2020

We got this in E2E tests: https://devops-ci.elastic.co/job/cloud-on-k8s-stack/148/testReport/github/com_elastic_cloud-on-k8s_test_e2e_es/Run_tests_for_different_ELK_stack_versions_in_GKE___7_4_2___TestVersionUpgradeSingleMaster68xToNewNodeSet73x_Cluster_UUID_should_have_been_preserved/

=== RUN   TestVersionUpgradeSingleMaster68xToNewNodeSet73x/Cluster_UUID_should_have_been_preserved
{"level":"error","@timestamp":"2020-01-10T02:42:11.083Z","message":"stopping early","ver":"0.0.0-00000000","error":"test failure","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\ngithub.com/elastic/cloud-on-k8s/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:42\ngithub.com/elastic/cloud-on-k8s/test/e2e/test.RunMutations\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:37\ngithub.com/elastic/cloud-on-k8s/test/e2e/test.RunMutation\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:78\ngithub.com/elastic/cloud-on-k8s/test/e2e/es.RunESMutation\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/es/mutation_test.go:299\ngithub.com/elastic/cloud-on-k8s/test/e2e/es.TestVersionUpgradeSingleMaster68xToNewNodeSet73x\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/es/version_upgrade_test.go:73\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:909"}
    --- FAIL: TestVersionUpgradeSingleMaster68xToNewNodeSet73x/Cluster_UUID_should_have_been_preserved (0.03s)
        cluster_uuid.go:60: 
            	Error Trace:	cluster_uuid.go:60
            	Error:      	Not equal: 
            	            	expected: "w-fU-AmzS1yBcwJxk7TOgQ"
            	            	actual  : "w7NTboYAQXq49Xkgg6s3Cw"
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1 +1 @@
            	            	-w-fU-AmzS1yBcwJxk7TOgQ
            	            	+w7NTboYAQXq49Xkgg6s3Cw
            	Test:       	TestVersionUpgradeSingleMaster68xToNewNodeSet73x/Cluster_UUID_should_have_been_preserved
sebgl added the >bug (Something isn't working) label on Jan 10, 2020
sebgl commented Jan 10, 2020

Logs from the 7.3.x instance:

{"type": "deprecation", "timestamp": "2020-01-10T02:41:23,715+0000", "level": "WARN", "component": "o.e.d.c.s.Settings", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0",  "message": "[discovery.zen.minimum_master_nodes] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."  }
{"type": "server", "timestamp": "2020-01-10T02:41:25,900+0000", "level": "INFO", "component": "o.e.x.s.a.s.FileRolesStore", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0",  "message": "parsed [2] roles from file [/usr/share/elasticsearch/config/roles.yml]"  }
{"type": "server", "timestamp": "2020-01-10T02:41:26,362+0000", "level": "INFO", "component": "o.e.x.m.p.l.CppLogMessageHandler", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0",  "message": "[controller/101] [Main.cc@110] controller (64 bit): Version 7.3.0 (Build ff2f774f78ce63) Copyright (c) 2019 Elasticsearch BV"  }
{"type": "server", "timestamp": "2020-01-10T02:41:26,767+0000", "level": "DEBUG", "component": "o.e.a.ActionModule", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0",  "message": "Using REST wrapper from plugin org.elasticsearch.xpack.security.Security"  }
{"type": "server", "timestamp": "2020-01-10T02:41:27,108+0000", "level": "INFO", "component": "o.e.d.DiscoveryModule", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0",  "message": "using discovery type [zen] and seed hosts providers [settings, file]"  }
{"type": "server", "timestamp": "2020-01-10T02:41:28,086+0000", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0",  "message": "initialized"  }
{"type": "server", "timestamp": "2020-01-10T02:41:28,087+0000", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0",  "message": "starting ..."  }
{"type": "server", "timestamp": "2020-01-10T02:41:28,231+0000", "level": "INFO", "component": "o.e.t.TransportService", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0",  "message": "publish_address {10.92.2.69:9300}, bound_addresses {0.0.0.0:9300}"  }
{"type": "server", "timestamp": "2020-01-10T02:41:28,239+0000", "level": "INFO", "component": "o.e.b.BootstrapChecks", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0",  "message": "bound or publishing to a non-loopback address, enforcing bootstrap checks"  }
{"type": "server", "timestamp": "2020-01-10T02:41:28,255+0000", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0",  "message": "setting initial configuration to VotingConfiguration{Hfmfb7AMQuOVa11ntzJTwg}"  }
{"type": "server", "timestamp": "2020-01-10T02:41:28,524+0000", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0",  "message": "elected-as-master ([1] nodes joined)[{test-version-up-68x-to-new-73x-bhf5-es-master73x-0}{Hfmfb7AMQuOVa11ntzJTwg}{9Gq4IyaKTRuFEM1Q95UOng}{10.92.2.69}{10.92.2.69:9300}{dim}{ml.machine_memory=2147483648, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 1, version: 1, reason: master node changed {previous [], current [{test-version-up-68x-to-new-73x-bhf5-es-master73x-0}{Hfmfb7AMQuOVa11ntzJTwg}{9Gq4IyaKTRuFEM1Q95UOng}{10.92.2.69}{10.92.2.69:9300}{dim}{ml.machine_memory=2147483648, xpack.installed=true, ml.max_open_jobs=20}]}"  }
{"timestamp": "2020-01-10T02:41:28+0000", "message": "readiness probe failed", "curl_rc": "7"}
{"type": "server", "timestamp": "2020-01-10T02:41:28,706+0000", "level": "INFO", "component": "o.e.c.c.CoordinationState", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0",  "message": "cluster UUID set to [w7NTboYAQXq49Xkgg6s3Cw]"  }
{"type": "server", "timestamp": "2020-01-10T02:41:28,751+0000", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0",  "message": "master node changed {previous [], current [{test-version-up-68x-to-new-73x-bhf5-es-master73x-0}{Hfmfb7AMQuOVa11ntzJTwg}{9Gq4IyaKTRuFEM1Q95UOng}{10.92.2.69}{10.92.2.69:9300}{dim}{ml.machine_memory=2147483648, xpack.installed=true, ml.max_open_jobs=20}]}, term: 1, version: 1, reason: Publication{term=1, version=1}"  }
{"type": "server", "timestamp": "2020-01-10T02:41:28,834+0000", "level": "INFO", "component": "o.e.h.AbstractHttpServerTransport", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0", "cluster.uuid": "w7NTboYAQXq49Xkgg6s3Cw", "node.id": "Hfmfb7AMQuOVa11ntzJTwg",  "message": "publish_address {10.92.2.69:9200}, bound_addresses {0.0.0.0:9200}"  }
{"type": "server", "timestamp": "2020-01-10T02:41:28,835+0000", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "test-version-up-68x-to-new-73x-bhf5", "node.name": "test-version-up-68x-to-new-73x-bhf5-es-master73x-0", "cluster.uuid": "w7NTboYAQXq49Xkgg6s3Cw", "node.id": "Hfmfb7AMQuOVa11ntzJTwg",  "message": "started"  }

It looks like the 7.3.x node got cluster.initial_master_nodes set to its own name, which should not happen while a v6.x master is still running.

sebgl commented Jan 10, 2020

From the operator logs:

{"level":"info","@timestamp":"2020-01-10T02:40:40.781Z","logger":"driver","message":"Creating master node","ver":"1.0.0-8728753f","namespace":"e2e-0d5tb-mercury","statefulset_name":"test-version-up-68x-to-new-73x-bhf5-es-master68x","actualReplicas":0,"targetReplicas":1}
(...)
{"level":"info","@timestamp":"2020-01-10T02:41:08.730Z","logger":"driver","message":"Creating master node","ver":"1.0.0-8728753f","namespace":"e2e-0d5tb-mercury","statefulset_name":"test-version-up-68x-to-new-73x-bhf5-es-master73x","actualReplicas":0,"targetReplicas":1}
{"level":"info","@timestamp":"2020-01-10T02:41:08.730Z","logger":"zen2","message":"Setting `cluster.initial_master_nodes`","ver":"1.0.0-8728753f","namespace":"e2e-0d5tb-mercury","es_name":"test-version-up-68x-to-new-73x-bhf5","cluster.initial_master_nodes":"test-version-up-68x-to-new-73x-bhf5-es-master73x-0"}
(...)
{"level":"info","@timestamp":"2020-01-10T02:41:13.041Z","logger":"elasticsearch-uuid","message":"Annotating bootstrapped cluster with its UUID","ver":"1.0.0-8728753f","namespace":"e2e-0d5tb-mercury","es_name":"test-version-up-68x-to-new-73x-bhf5"}

It looks like we:

  1. create the first 6.X master
  2. create the 7.X master, with cluster.initial_master_nodes set
  3. set the cluster_uuid annotation

I think the operator's zen2 algorithm implicitly expects 3 to happen before 2, and therefore considered the cluster not bootstrapped yet. Here the v6-to-v7 upgrade happens before the operator notices the cluster has a UUID, since that notice is based on the observed ES state (refreshed every 10 seconds).
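
For illustration, here is a minimal sketch of the kind of annotation-based decision described above. The annotation key and function names are assumptions for the sake of the example, not the actual ECK code:

```go
// Minimal sketch (assumed names, not the real ECK code) of an operator that
// tracks bootstrapping via a cluster UUID annotation on the Elasticsearch
// resource, and only sets cluster.initial_master_nodes when that annotation
// is missing.
package zen2

// Assumed annotation key, written once the operator observes the cluster UUID.
const clusterUUIDAnnotation = "elasticsearch.k8s.elastic.co/cluster-uuid"

// clusterIsBootstrapped reports whether the operator has already recorded a
// cluster UUID. It lags reality: the annotation is only written after the
// async observer (polling every ~10s) has seen the UUID.
func clusterIsBootstrapped(annotations map[string]string) bool {
	_, ok := annotations[clusterUUIDAnnotation]
	return ok
}

// shouldSetInitialMasterNodes decides whether new v7 masters get
// cluster.initial_master_nodes. If the annotation is still missing (step 3
// above has not happened yet), an already-bootstrapped v6 cluster is treated
// as a new one, and the v7 node bootstraps a second cluster with a fresh UUID.
func shouldSetInitialMasterNodes(annotations map[string]string) bool {
	return !clusterIsBootstrapped(annotations)
}
```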

sebgl commented Jan 10, 2020

Highlights from a discussion with @pebrc and @barkbay on Slack:

  • In the "real world", this bug can only occur when a user creates a v6.x cluster, then upgrades it to v7.x before the operator has noticed that the cluster already has a UUID.
  • It seems almost impossible to get a perfect fix for this. There will always be a chance that:
    1. we request the cluster and observe it has no UUID yet
    2. behind the scenes, the cluster actually gets a UUID
    3. we do the v7 upgrade based on observation 1 (that the cluster is not bootstrapped yet)
  • The worst-case scenario would be to "lose" a v6.x cluster that was created only a few seconds before the upgrade to v7, which does not seem like a big deal.

There are still 2 things we can improve here:

  1. Do not rely on the async observer state to check the cluster UUID. Instead, synchronously request Elasticsearch in ReconcileClusterUUID iff the annotation is not set yet and ES is reachable (see the sketch below).
  2. Before applying any mutation in E2E tests, make sure the first cluster version has bootstrapped (has a UUID annotation).
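
A rough sketch of what improvement 1 could look like, assuming the cluster UUID is tracked via an annotation and that Elasticsearch exposes its UUID through a client call; the client interface and helper names below are placeholders, not the actual ECK API:

```go
// Hypothetical sketch of improvement 1: when the UUID annotation is missing,
// query Elasticsearch synchronously instead of trusting the async observer
// state. All names are placeholders, not the actual ECK API.
package bootstrap

import "context"

// Assumed annotation key used to mark the cluster as bootstrapped.
const clusterUUIDAnnotation = "elasticsearch.k8s.elastic.co/cluster-uuid"

// esClient is an assumed minimal interface returning the cluster UUID
// (e.g. from GET /, which reports "_na_" until the cluster is bootstrapped).
type esClient interface {
	GetClusterUUID(ctx context.Context) (string, error)
}

// cluster is a stand-in for the Elasticsearch custom resource metadata.
type cluster struct {
	Annotations map[string]string
}

// reconcileClusterUUID annotates the cluster with its UUID once it is
// bootstrapped, checking Elasticsearch directly rather than the observer.
func reconcileClusterUUID(ctx context.Context, c *cluster, client esClient, esReachable bool) error {
	if _, ok := c.Annotations[clusterUUIDAnnotation]; ok {
		return nil // already annotated: the cluster is known to be bootstrapped
	}
	if !esReachable {
		return nil // cannot check yet, retry on the next reconciliation
	}
	// Synchronous request, instead of the (possibly stale) observer state.
	uuid, err := client.GetClusterUUID(ctx)
	if err != nil {
		return err
	}
	if uuid == "" || uuid == "_na_" {
		return nil // cluster not bootstrapped yet
	}
	if c.Annotations == nil {
		c.Annotations = map[string]string{}
	}
	// In the operator this would be persisted with a Kubernetes update/patch.
	c.Annotations[clusterUUIDAnnotation] = uuid
	return nil
}
```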
