Rolling upgrade tests with 3 nodes #25336
Comments
+1, I think 3 nodes is a good way to test; it may shake out things that we don't see with primary promotion (such as things similar to #25277)
Discussed in FixitFriday, sounds like a good idea in general, but we want to get @rjernst's opinion about this too since it touches on build issues as well.
I don't think a bwc integration test is the place to find issues. We won't have randomization here, and the intent of these tests is to directly test the integration of components in the system and how they behave in a backward-compatible way. Testing internal behavior of how the system operates should be done with unit tests IMO.
I chatted with @ywelsch and we came to an agreement. I am ok with modifying the rolling restart test to use 3 nodes, under the premise that a user doing a rolling restart with 2 nodes is not realistic, and the entire point of these tests is to be "representative" of a basic user. With 3 nodes, it also makes sense to test indices with both 1 and 2 replicas through the rolling restart. What I would like to not see is any test mentioning "primary failover" or anything like that. And if we ever catch something at this level, a unit test should be added, as finding issues with the rolling restart tests is a last resort.
Pinging @elastic/es-distributed
@ywelsch I presume we still want to do this (the test still uses 2 nodes as far as I can tell). I vaguely remember you trying and it ended up being more difficult, but I'm not sure. If that's true, can you share your experiences? Otherwise it might be a good opportunity for @vladimirdolzhenko to learn the qa tests universe.
The tricky bit is to make the Gradle changes. Setting this up for a 2 node cluster is already tricky, see https://github.com/elastic/elasticsearch/blob/master/qa/rolling-upgrade/build.gradle |
Switches the rolling upgrade tests from upgrading two nodes to upgrading three nodes, which is much more realistic and much better able to find unexpected bugs. It upgrades the nodes one at a time and runs tests between each upgrade. As such this now has four test runs:
1. Old
2. One third upgraded
3. Two thirds upgraded
4. Upgraded
It sets system properties so the tests can figure out which stage they are in. It reuses the same yml tests for the "one third" and "two thirds" cases because they are *almost* the same case. This rewrites the yml-based indexing tests to be Java based because the yml-based tests can't handle different expected values for the counts, and the indexing tests need that when they are run twice. Closes #25336
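A minimal sketch of the stage-aware logic described above, assuming a hypothetical `tests.upgrade_stage` system property, made-up stage names, and a 100-docs-per-round figure; the real property names and counts used by the build may differ.

```java
// Hypothetical sketch of why the indexing tests had to move to Java: the
// expected document count depends on which of the four runs we are in.
// The property name "tests.upgrade_stage", the stage names, and the
// 100-docs-per-round figure are illustrative assumptions.
public class UpgradeStageSketch {

    enum Stage { OLD, ONE_THIRD_UPGRADED, TWO_THIRDS_UPGRADED, UPGRADED }

    // Each run indexes one more batch, so later runs expect more documents --
    // exactly the per-run variation a shared yml test cannot express.
    static long expectedDocs(Stage stage, long docsPerRound) {
        int completedRounds = stage.ordinal() + 1; // OLD = 1 round, UPGRADED = 4 rounds
        return completedRounds * docsPerRound;
    }

    public static void main(String[] args) {
        Stage stage = Stage.valueOf(System.getProperty("tests.upgrade_stage", "OLD"));
        System.out.println("expected docs at stage " + stage + ": " + expectedDocs(stage, 100));
    }
}
```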
I'm reopening this issue as #30728 only fixed this for OSS Elasticsearch. The x-pack rolling upgrade tests under
Pinging @elastic/es-core-infra |
Ah, yes, the with-security rolling upgrade tests. I can take a look at that today. We get quite a bit of coverage by running the oss tests with the default distribution, but this will cover more xpack apis and covers the pre-6.3 xpack to post-6.3 xpack upgrade.
This is much more realistic and can find more issues. This causes the "mixed cluster" tests to be run twice so I had to fix the tests to work in that case. In most cases I did as little as possible to get them working, but in a few cases I went a little beyond that to make them easier for me to debug while getting them to work. My test changes:
1. Remove the "basic indexing" tests and replace them with a copy of the tests used in the OSS. We have no way of sharing code between these two projects so for now I copy.
2. Skip a few tests in the "one third" upgraded scenario:
   * creating a scroll to be reused when the cluster is fully upgraded
   * creating some ml data to be used when the cluster is fully upgraded
3. Drop many "assert yellow and that the cluster has two nodes" assertions. These assertions duplicate those made by the wait condition and they fail now that we have three nodes.
4. Switch many "assert green and that the cluster has two nodes" assertions to 3 nodes. These assertions are distinct from the wait condition and, while I imagine they aren't required in all cases, now is not the time to find that out. Thus, I made them work.
5. Rework the index audit trail test so it is more obvious that it is the same test expecting different numbers based on the shape of the cluster. The conditions for which numbers are expected are fairly complex because the index audit trail is shut down until the template for it is upgraded, and the template is upgraded when a master node is elected that has the new version of the software (see the sketch after this description).
6. Add some more information to debug the index audit trail test because it helped me figure out what was going on.
I also dropped the `waitCondition` from the `rolling-upgrade-basic` tests because it wasn't needed. Closes #25336
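To make point 5 above concrete, here is a rough sketch of a stage-dependent expectation. The rule and names below are assumptions chosen only to illustrate why the expected audit counts depend on cluster shape; the actual logic in the PR is more involved.

```java
// Illustrative only: a stage-dependent expectation of the kind point 5
// describes. The rule (audit docs only accumulate once a new-version master
// has been elected, because only then is the audit template upgraded)
// paraphrases the description above; the real counts are more complex.
public class AuditTrailExpectationSketch {

    static long expectedAuditDocs(boolean newVersionMasterElected,
                                  int roundsSinceTemplateUpgrade,
                                  long auditDocsPerRound) {
        // The index audit trail is shut down until its template is upgraded,
        // so before a new-version master is elected nothing is written.
        if (newVersionMasterElected == false) {
            return 0;
        }
        return roundsSinceTemplateUpgrade * auditDocsPerRound;
    }

    public static void main(String[] args) {
        System.out.println("one third upgraded, old master still elected: "
                + expectedAuditDocs(false, 0, 10));
        System.out.println("two thirds upgraded, new master for one round: "
                + expectedAuditDocs(true, 1, 10));
    }
}
```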
TL;DR: I'm adding a BWC layer in #24841 that will only be triggered in a mixed-version cluster with at least 3 nodes and an index with at least 3 shard copies (i.e. `index.number_of_replicas` > 1). Q: Should we change our existing `qa/rolling-upgrade` tests to run with 3 nodes?
Details: When a primary fails and a replica is promoted to primary, the new primary will trigger a primary-replica resync, making sure that all the replicas are realigned with the new primary. This is a transport action that will only go into 6.0, not into 5.x. As such, I added a BWC layer to the transport action so that it will not send these resync requests to 5.6 nodes (i.e. replicas on 5.6 nodes won't be realigned with the new 6.0 primary, same as we do today in < 6.0).
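For illustration, a sketch of creating an index with the shape described above (1 primary + 2 replicas, i.e. at least 3 shard copies), so a 3-node mixed-version cluster can actually exercise the resync path. The index name is made up and the Request-based form of the low-level REST client is assumed.

```java
import java.io.IOException;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

// Sketch under assumptions: index name and settings are illustrative, and the
// Request-based low-level REST client API is assumed to be available. With
// 1 primary + 2 replicas there are 3 shard copies, so on a 3-node cluster a
// primary failure forces a replica promotion and the resync described above.
public class ThreeCopyIndexSetup {

    static void createUpgradeTestIndex(RestClient client) throws IOException {
        Request createIndex = new Request("PUT", "/upgrade_test_index");
        createIndex.setJsonEntity(
            "{ \"settings\": { "
                + "\"index.number_of_shards\": 1, "
                + "\"index.number_of_replicas\": 2 "  // number_of_replicas > 1, per the TL;DR
                + "} }");
        client.performRequest(createIndex);
    }
}
```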