Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rolling upgrade tests with 3 nodes #25336

Closed
ywelsch opened this issue Jun 21, 2017 · 10 comments · Fixed by #31112 or #30728
Closed

Rolling upgrade tests with 3 nodes #25336

ywelsch opened this issue Jun 21, 2017 · 10 comments · Fixed by #31112 or #30728
Assignees
Labels
:Core/Infra/Core Core issues without another label >test Issues or PRs that are addressing/adding tests

Comments

@ywelsch
Copy link
Contributor

ywelsch commented Jun 21, 2017

TL;DR: I'm adding a BWC layer in #24841 that will only be triggered in a mixed-version cluster with at least 3 nodes and an index with at least 3 shard copies (i.e. index.number_of_replicas > 1). Q: Should we change our existing qa/rolling-upgrade tests to run with 3 nodes?

Details: When a primary fails and a replica is promoted to primary, the new primary will trigger a primary-replica resync, making sure that all the replicas are realigned with the new primary. This is a transport action that will only go into 6.0, not into 5.x. As such I added a BWC layer to the transport action that it will not send these resync requests to 5.6 nodes (i.e. replicas on 5.6 nodes won't be realigned with the new 6.0 primary, same as we do today in < 6.0).

@ywelsch ywelsch added discuss >test Issues or PRs that are addressing/adding tests labels Jun 21, 2017
@dakrone
Copy link
Member

dakrone commented Jun 21, 2017

+1, I think 3 nodes is a good way to test, it may shake out things that we don't see with primary promotion (such as things similar to #25277)

@cbuescher cbuescher removed the discuss label Jun 23, 2017
@cbuescher
Copy link
Member

cbuescher commented Jun 23, 2017

Discussed in FixitFriday, sound like a good idea in general but we want to get @rjernst opinion about this too since it touches on build issues as well.

@rjernst
Copy link
Member

rjernst commented Jun 26, 2017

I don't think a bwc integration test is the place to find issues. We won't have randomization here, and the intent of these tests is to directly test integration of components in the system with how they behave in a backward compatible way. Testing internal behavior of how the system operates should be done with unit tests IMO.

@rjernst
Copy link
Member

rjernst commented Jun 27, 2017

I chatted with @ywelsch and we came to an agreement. I am ok with modifying the the rolling restart test to use 3 nodes, under the premise that a user doing a rolling restart with 2 nodes is not realistic, and the entire point of these tests is to be "representative" of a basic user. With 3 nodes, it also makes sense to have an index with both 1 and 2 replicas test through the rolling restart. What I would like to not see is any test mentioning "primary failover" or anything like that. And if we ever catch something at this level, a unit test should be added, as finding issues with the rolling restart tests is a last resort.

@colings86 colings86 added the :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. label Apr 24, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@bleskes
Copy link
Contributor

bleskes commented Apr 24, 2018

@ywelsch I presume we still want to do this (the test still uses 2 nodes as far as I can tell). I vaguely remember you trying and it ended up being more difficult, but I'm not sure. If that's true, can you share your experiences? Otherwise it might be a good opportunity for @vladimirdolzhenko do learn the qa tests universe.

@ywelsch
Copy link
Contributor Author

ywelsch commented Apr 24, 2018

The tricky bit is to make the Gradle changes. Setting this up for a 2 node cluster is already tricky, see https://github.com/elastic/elasticsearch/blob/master/qa/rolling-upgrade/build.gradle
Doing it for a 3 node cluster is even harder as it requires more state transitions, which are tricky to model with Gradle dependencies and easy to get wrong (and very hard to debug). Unless @vladimirdolzhenko wants to become a Gradle expert, I don't recommend this as a ramp-up task.

@nik9000 nik9000 self-assigned this May 17, 2018
nik9000 added a commit that referenced this issue May 21, 2018
Switches the rolling upgrade tests from upgrading two nodes to upgrading
three nodes which is much more realistic and much better able to find
unexpected bugs. It upgrades the nodes one at a time and runs tests
between each upgrade. As such this now has four test runs:

1. Old
2. One third upgraded
3. Two thirds upgraded
4. Upgraded

It sets system properties so the tests can figure out which stage they
are in. It reuses the same yml tests for the "one third" and "two
thirds" cases because they are *almost* the same case.

This rewrites the yml-based indexing tests to be Java based because the
yml-based tests can't handle different expected values for the counts.
And the indexing tests need that when they are run twice.

Closes #25336
nik9000 added a commit that referenced this issue May 21, 2018
Switches the rolling upgrade tests from upgrading two nodes to upgrading
three nodes which is much more realistic and much better able to find
unexpected bugs. It upgrades the nodes one at a time and runs tests
between each upgrade. As such this now has four test runs:

1. Old
2. One third upgraded
3. Two thirds upgraded
4. Upgraded

It sets system properties so the tests can figure out which stage they
are in. It reuses the same yml tests for the "one third" and "two
thirds" cases because they are *almost* the same case.

This rewrites the yml-based indexing tests to be Java based because the
yml-based tests can't handle different expected values for the counts.
And the indexing tests need that when they are run twice.

Closes #25336
@ywelsch
Copy link
Contributor Author

ywelsch commented Jun 5, 2018

I'm reopening this issue as #30728 only fixed this for OSS Elasticsearch. The x-pack rolling upgrade tests under x-pack/qa still use 2 nodes.

@ywelsch ywelsch reopened this Jun 5, 2018
@ywelsch ywelsch added the :Core/Infra/Core Core issues without another label label Jun 5, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@ywelsch ywelsch removed the :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. label Jun 5, 2018
@nik9000
Copy link
Member

nik9000 commented Jun 5, 2018 via email

nik9000 added a commit to nik9000/elasticsearch that referenced this issue Jun 5, 2018
This is much more realistic and can find more issues. This causes the
"mixed cluster" tests to be run twice so I had to fix the tests to work
in that case. In most cases I did as little as possible to get them
working but in a few cases I went a little beyond that to make them
easier for me to debug while getting them to work. My test changes:

1. Remove the "basic indexing" tests and replace them with a copy of the
tests used in the OSS. We have no way of sharing code between these two
projects so for now I copy.
2. Skip the a few tests in the "one third" upgraded scenario:
  * creating a scroll to be reused when the cluster is fully upgraded
  * creating some ml data to be used when the cluster is fully ugpraded
3. Drop many "assert yellow and that the cluster has two nodes"
assertions. These assertions duplicate those made by the wait condition
and they fail now that we have three nodes.
4. Switch many "assert green and that the cluster has two nodes" to 3
nodes. These assertions are unique from the wait condition and, while
I imagine they aren't required in all cases, now is not the time to
find that out. Thus, I made them work.
5. Rework the index audit trail test so it is more obvious that it is
the same test expecting different numbers based on the shape of the
cluster. The conditions for which number are expected are fairly
complex because the index audit trail is shut down until the template
for it is upgraded and the template is upgraded when a master node is
elected that has the new version of the software.
6. Add some more information to debug the index audit trail test because
it helped me figure out what was going on.

I also dropped the `waitCondition` from the `rolling-upgrade-basic`
tests because it wasn't needed.

Closes elastic#25336
nik9000 added a commit that referenced this issue Jun 6, 2018
This is much more realistic and can find more issues. This causes the
"mixed cluster" tests to be run twice so I had to fix the tests to work
in that case. In most cases I did as little as possible to get them
working but in a few cases I went a little beyond that to make them
easier for me to debug while getting them to work. My test changes:

1. Remove the "basic indexing" tests and replace them with a copy of the
tests used in the OSS. We have no way of sharing code between these two
projects so for now I copy.
2. Skip the a few tests in the "one third" upgraded scenario:
  * creating a scroll to be reused when the cluster is fully upgraded
  * creating some ml data to be used when the cluster is fully ugpraded
3. Drop many "assert yellow and that the cluster has two nodes"
assertions. These assertions duplicate those made by the wait condition
and they fail now that we have three nodes.
4. Switch many "assert green and that the cluster has two nodes" to 3
nodes. These assertions are unique from the wait condition and, while
I imagine they aren't required in all cases, now is not the time to
find that out. Thus, I made them work.
5. Rework the index audit trail test so it is more obvious that it is
the same test expecting different numbers based on the shape of the
cluster. The conditions for which number are expected are fairly
complex because the index audit trail is shut down until the template
for it is upgraded and the template is upgraded when a master node is
elected that has the new version of the software.
6. Add some more information to debug the index audit trail test because
it helped me figure out what was going on.

I also dropped the `waitCondition` from the `rolling-upgrade-basic`
tests because it wasn't needed.

Closes #25336
nik9000 added a commit that referenced this issue Jun 6, 2018
This is much more realistic and can find more issues. This causes the
"mixed cluster" tests to be run twice so I had to fix the tests to work
in that case. In most cases I did as little as possible to get them
working but in a few cases I went a little beyond that to make them
easier for me to debug while getting them to work. My test changes:

1. Remove the "basic indexing" tests and replace them with a copy of the
tests used in the OSS. We have no way of sharing code between these two
projects so for now I copy.
2. Skip the a few tests in the "one third" upgraded scenario:
  * creating a scroll to be reused when the cluster is fully upgraded
  * creating some ml data to be used when the cluster is fully ugpraded
3. Drop many "assert yellow and that the cluster has two nodes"
assertions. These assertions duplicate those made by the wait condition
and they fail now that we have three nodes.
4. Switch many "assert green and that the cluster has two nodes" to 3
nodes. These assertions are unique from the wait condition and, while
I imagine they aren't required in all cases, now is not the time to
find that out. Thus, I made them work.
5. Rework the index audit trail test so it is more obvious that it is
the same test expecting different numbers based on the shape of the
cluster. The conditions for which number are expected are fairly
complex because the index audit trail is shut down until the template
for it is upgraded and the template is upgraded when a master node is
elected that has the new version of the software.
6. Add some more information to debug the index audit trail test because
it helped me figure out what was going on.

I also dropped the `waitCondition` from the `rolling-upgrade-basic`
tests because it wasn't needed.

Closes #25336
nik9000 added a commit that referenced this issue Jun 8, 2018
This is much more realistic and can find more issues. This causes the
"mixed cluster" tests to be run twice so I had to fix the tests to work
in that case. In most cases I did as little as possible to get them
working but in a few cases I went a little beyond that to make them
easier for me to debug while getting them to work. My test changes:

1. Remove the "basic indexing" tests and replace them with a copy of the
tests used in the OSS. We have no way of sharing code between these two
projects so for now I copy.
2. Skip the a few tests in the "one third" upgraded scenario:
  * creating a scroll to be reused when the cluster is fully upgraded
  * creating some ml data to be used when the cluster is fully ugpraded
3. Drop many "assert yellow and that the cluster has two nodes"
assertions. These assertions duplicate those made by the wait condition
and they fail now that we have three nodes.
4. Switch many "assert green and that the cluster has two nodes" to 3
nodes. These assertions are unique from the wait condition and, while
I imagine they aren't required in all cases, now is not the time to
find that out. Thus, I made them work.
5. Rework the index audit trail test so it is more obvious that it is
the same test expecting different numbers based on the shape of the
cluster. The conditions for which number are expected are fairly
complex because the index audit trail is shut down until the template
for it is upgraded and the template is upgraded when a master node is
elected that has the new version of the software.
6. Add some more information to debug the index audit trail test because
it helped me figure out what was going on.

I also dropped the `waitCondition` from the `rolling-upgrade-basic`
tests because it wasn't needed.

Closes #25336
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label >test Issues or PRs that are addressing/adding tests
Projects
None yet
8 participants