Rolling upgrade tests with 3 nodes #25336
Comments
+1, I think 3 nodes is a good way to test; it may shake out things that we don't see with primary promotion (such as things similar to #25277)
Discussed in FixitFriday, sounds like a good idea in general, but we want to get @rjernst's opinion about this too since it touches on build issues as well.
I don't think a bwc integration test is the place to find issues. We won't have randomization here, and the intent of these tests is to directly test the integration of components in the system and how they behave in a backward-compatible way. Testing internal behavior of how the system operates should be done with unit tests IMO.
I chatted with @ywelsch and we came to an agreement. I am ok with modifying the rolling restart test to use 3 nodes, under the premise that a user doing a rolling restart with 2 nodes is not realistic, and the entire point of these tests is to be "representative" of a basic user. With 3 nodes, it also makes sense to test indices with both 1 and 2 replicas through the rolling restart. What I would like to not see is any test mentioning "primary failover" or anything like that. And if we ever catch something at this level, a unit test should be added, as finding issues with the rolling restart tests is a last resort.
Pinging @elastic/es-distributed
@ywelsch I presume we still want to do this (the test still uses 2 nodes as far as I can tell). I vaguely remember you trying and it ended up being more difficult, but I'm not sure. If that's true, can you share your experiences? Otherwise it might be a good opportunity for @vladimirdolzhenko to learn the qa tests universe.
The tricky bit is to make the Gradle changes. Setting this up for a 2 node cluster is already tricky, see https://github.com/elastic/elasticsearch/blob/master/qa/rolling-upgrade/build.gradle |
Switches the rolling upgrade tests from upgrading two nodes to upgrading three nodes, which is much more realistic and much better able to find unexpected bugs. It upgrades the nodes one at a time and runs tests between each upgrade. As such this now has four test runs:
1. Old
2. One third upgraded
3. Two thirds upgraded
4. Upgraded
It sets system properties so the tests can figure out which stage they are in. It reuses the same yml tests for the "one third" and "two thirds" cases because they are *almost* the same case. This rewrites the yml-based indexing tests to be Java based because the yml-based tests can't handle different expected values for the counts, and the indexing tests need that when they are run twice. Closes #25336
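A minimal sketch of the stage-aware logic described above, assuming a hypothetical `tests.upgrade_stage` system property, made-up stage names, and a 100-docs-per-round figure; the real property names and counts used by the build may differ.

```java
// Hypothetical sketch of why the indexing tests had to move to Java: the
// expected document count depends on which of the four runs we are in.
// The property name "tests.upgrade_stage", the stage names, and the
// 100-docs-per-round figure are illustrative assumptions.
public class UpgradeStageSketch {

    enum Stage { OLD, ONE_THIRD_UPGRADED, TWO_THIRDS_UPGRADED, UPGRADED }

    // Each run indexes one more batch, so later runs expect more documents --
    // exactly the per-run variation a shared yml test cannot express.
    static long expectedDocs(Stage stage, long docsPerRound) {
        int completedRounds = stage.ordinal() + 1; // OLD = 1 round, UPGRADED = 4 rounds
        return completedRounds * docsPerRound;
    }

    public static void main(String[] args) {
        Stage stage = Stage.valueOf(System.getProperty("tests.upgrade_stage", "OLD"));
        System.out.println("expected docs at stage " + stage + ": " + expectedDocs(stage, 100));
    }
}
```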
I'm reopening this issue as #30728 only fixed this for OSS Elasticsearch. The x-pack rolling upgrade tests under
Pinging @elastic/es-core-infra |
Ah, yes, the with-security rolling upgrade tests. I can take a look at that today. We get quite a bit of coverage by running the oss tests with the default distribution, but this will cover more xpack apis and covers the pre-6.3 xpack to post-6.3 xpack upgrade.
This is much more realistic and can find more issues. This causes the "mixed cluster" tests to be run twice so I had to fix the tests to work in that case. In most cases I did as little as possible to get them working, but in a few cases I went a little beyond that to make them easier for me to debug while getting them to work. My test changes:
1. Remove the "basic indexing" tests and replace them with a copy of the tests used in the OSS. We have no way of sharing code between these two projects so for now I copy.
2. Skip a few tests in the "one third" upgraded scenario:
   * creating a scroll to be reused when the cluster is fully upgraded
   * creating some ml data to be used when the cluster is fully upgraded
3. Drop many "assert yellow and that the cluster has two nodes" assertions. These assertions duplicate those made by the wait condition and they fail now that we have three nodes.
4. Switch many "assert green and that the cluster has two nodes" assertions to 3 nodes. These assertions are distinct from the wait condition and, while I imagine they aren't required in all cases, now is not the time to find that out. Thus, I made them work.
5. Rework the index audit trail test so it is more obvious that it is the same test expecting different numbers based on the shape of the cluster. The conditions for which numbers are expected are fairly complex because the index audit trail is shut down until the template for it is upgraded, and the template is upgraded when a master node is elected that has the new version of the software (see the sketch after this description).
6. Add some more information to debug the index audit trail test because it helped me figure out what was going on.
I also dropped the `waitCondition` from the `rolling-upgrade-basic` tests because it wasn't needed. Closes #25336
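To make point 5 above concrete, here is a rough sketch of a stage-dependent expectation. The rule and names below are assumptions chosen only to illustrate why the expected audit counts depend on cluster shape; the actual logic in the PR is more involved.

```java
// Illustrative only: a stage-dependent expectation of the kind point 5
// describes. The rule (audit docs only accumulate once a new-version master
// has been elected, because only then is the audit template upgraded)
// paraphrases the description above; the real counts are more complex.
public class AuditTrailExpectationSketch {

    static long expectedAuditDocs(boolean newVersionMasterElected,
                                  int roundsSinceTemplateUpgrade,
                                  long auditDocsPerRound) {
        // The index audit trail is shut down until its template is upgraded,
        // so before a new-version master is elected nothing is written.
        if (newVersionMasterElected == false) {
            return 0;
        }
        return roundsSinceTemplateUpgrade * auditDocsPerRound;
    }

    public static void main(String[] args) {
        System.out.println("one third upgraded, old master still elected: "
                + expectedAuditDocs(false, 0, 10));
        System.out.println("two thirds upgraded, new master for one round: "
                + expectedAuditDocs(true, 1, 10));
    }
}
```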
TL;DR: I'm adding a BWC layer in #24841 that will only be triggered in a mixed-version cluster with at least 3 nodes and an index with at least 3 shard copies (i.e. `index.number_of_replicas` > 1). Q: Should we change our existing `qa/rolling-upgrade` tests to run with 3 nodes?
Details: When a primary fails and a replica is promoted to primary, the new primary will trigger a primary-replica resync, making sure that all the replicas are realigned with the new primary. This is a transport action that will only go into 6.0, not into 5.x. As such, I added a BWC layer to the transport action so that it will not send these resync requests to 5.6 nodes (i.e. replicas on 5.6 nodes won't be realigned with the new 6.0 primary, same as we do today in < 6.0).
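For illustration, a sketch of creating an index with the shape described above (1 primary + 2 replicas, i.e. at least 3 shard copies), so a 3-node mixed-version cluster can actually exercise the resync path. The index name is made up and the Request-based form of the low-level REST client is assumed.

```java
import java.io.IOException;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

// Sketch under assumptions: index name and settings are illustrative, and the
// Request-based low-level REST client API is assumed to be available. With
// 1 primary + 2 replicas there are 3 shard copies, so on a 3-node cluster a
// primary failure forces a replica promotion and the resync described above.
public class ThreeCopyIndexSetup {

    static void createUpgradeTestIndex(RestClient client) throws IOException {
        Request createIndex = new Request("PUT", "/upgrade_test_index");
        createIndex.setJsonEntity(
            "{ \"settings\": { "
                + "\"index.number_of_shards\": 1, "
                + "\"index.number_of_replicas\": 2 "  // number_of_replicas > 1, per the TL;DR
                + "} }");
        client.performRequest(createIndex);
    }
}
```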