UpgradeClusterClientYamlTestSuiteIT fails intermittently #30456
Pinging @elastic/es-distributed
These look like ML issues, so I am reassigning.
Pinging @elastic/ml-core
Failures have been spotted on the 6.x and master branches. All failures on master occurred upgrading from 6.4, and on 6.x upgrading from 6.2.1, 6.2.3, 6.2.5 and 6.3.0. I pulled the logs for one instance and unfortunately they don't provide much insight without debug-level logging.
I think the problem is one that we've had intermittently for ages. I imagine something has recently changed elsewhere in ES or the test infrastructure that makes it more likely that a bug which has existed basically forever will be triggered. To summarise, we've had two theories about what might cause this:
I suspect that in the case of the tests (2) is more likely, because
How is that possible? I'm not aware of any option in ES that would allow you to only wait on acknowledgement from the primary.
@droberts195 I'm not sure what you mean by:
ES does it all the time. Do you mean you don't ensure all replicas are fully allocated?
Yes, sorry, this is what I meant. I think we need to add a
@droberts195 Double-checking, but you mean in tests only, right? I'm not sure that's the right move for production.
@bleskes Our results index is created on demand from a template when the first result is indexed, and the job those tests run is tiny - it only receives 2 input documents - so I guess it's quite likely that the replica is not ready at the point where one of the nodes is killed prior to upgrade. @davidkyle has managed to reproduce this locally often enough to try out possible solutions.
+1. The problem with
The failures are characterised by missing documents after a node has been upgraded; for example, when a job is closed in the old cluster its model state is persisted, but often this document cannot be found in the mixed cluster after a single node is upgraded. Here are some of the issues I believe may be the cause:
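To make the symptom concrete, a check along the following lines in a YAML REST test would fail in the mixed cluster whenever the persisted state documents go missing. This is only a sketch: the `.ml-state` index name and the expected count of at least 1 are illustrative assumptions, not taken from the actual test suite.

```yaml
# Sketch: verify that persisted job state survived the node upgrade.
# The index name and expected count are illustrative assumptions.
- do:
    indices.refresh:
      index: .ml-state
- do:
    count:
      index: .ml-state
- gte: { count: 1 }
```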
@ywelsch This is the reproduce command for the master branch; it fails fairly frequently for me:
Remember to unmute the test on line 33 of
I have reproduced the issue and also have an explanation and fix in the works. The short version is that this is caused by my recent PR #30423 (sorry for that). The steps that lead to this situation are as follows:
Due to the async durability of the index, the test was relying on the fact that the shard on the restarted node would not be the one selected as primary after the restart, but that the shard copy on the non-restarted node would either remain primary or be promoted to primary. This was inadvertently changed by PR #30423 (which I consider a bug), now making it so that when node1 leaves and rejoins the cluster in the same update, we first auto-expand replicas from 1 to 0, and only then look at failing shards of the removed node, only to auto-expand back from 0 to 1 replicas in the same update. In our situation here, the primary was on the node to be removed and re-added. The auto-expand-replicas functionality now removed the replica on node0 (the node that was not restarted), only to notice in a follow-up step that the sole remaining primary was now on the node that was removed. The primary shard would then be failed and moved to unassigned. In the next step we would re-add former node1 and add back a replica due to auto-expansion. Now the master would prefer allocating the primary to the node that previously had the primary, which was former node1. As this node was the one that experienced data loss due to the
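For context on the two settings interacting in the explanation above, here is a minimal sketch (in the YAML REST test format) of an index that combines auto-expanded replicas with async translog durability. The index name and exact values are illustrative assumptions rather than the actual ML index templates.

```yaml
# Sketch: an index combining the two settings discussed above.
# The index name and values are assumptions for illustration only.
- do:
    indices.create:
      index: test-async-auto-expand
      body:
        settings:
          index.auto_expand_replicas: "0-1"
          index.translog.durability: async
```

With `0-1` auto-expansion the replica count tracks the number of data nodes, which is why a node leaving and rejoining can trigger the 1 → 0 → 1 transitions described above; and with async durability, the copy on a killed node can be missing documents that had not yet been fsynced, which is the data loss referred to at the end of the comment.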
I will be working on a fix for this.
Thanks so much for investigating this @ywelsch. It would have taken us ages to work out what was happening.
#30423 combined auto-expansion in the same cluster state update where nodes are removed. As the auto-expansion step would run before deassociating the dead nodes from the routing table, the auto-expansion would possibly remove replicas from live nodes instead of dead ones. This commit reverses the order to ensure that when nodes leave the cluster the auto-expand-replica functionality only triggers after failing the shards on the removed nodes. This ensures that active shards on other live nodes are not failed if the primary resided on a now dead node. Instead, one of the replicas on the live nodes first gets promoted to primary, and the auto-expansion (removing replicas) only triggers in a follow-up step (but still in the same cluster state update). Relates to #30456 and follow-up of #30423
I've verified that #30553 fixes the cause of this issue occurring more often.
As I mentioned in an e-mail, the issue (of losing documents) will sometimes manifest if you don't wait for the indices with async durability to be fully allocated to both nodes in this test. This should happen rarely in practice when running the test (as replicas should allocate pretty quickly), but could be the cause of the more spurious test failures that you were seeing before this issue got opened. I will revert 27429ec to re-enable the tests, but would also like to suggest adding an
Maybe we could also do this just in the relevant ML tests? Then we won't cause problems if somebody is trying to reproduce a problem in, say, the monitoring or watcher tests and is not running the ML ones at all. Also, what is the best way to confirm that the replicas of an index with
Right, I didn't think about that.
As the two problematic indices (
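The question above about confirming replica allocation is cut off, but a sketch of the kind of check being discussed would be a cluster health call scoped to the index in question, waiting until all shard copies are assigned before the node is killed. The index name and timeout below are illustrative assumptions:

```yaml
# Sketch: block until every shard copy of the given index is assigned.
# The index name and timeout are assumptions, not values from this issue.
- do:
    cluster.health:
      index: .ml-anomalies-shared
      wait_for_status: green
      wait_for_no_relocating_shards: true
      timeout: 70s
```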
Argh it's back
This failed on master upgrading from 6.4. I'm not sure which build of 6.4 that was, or whether it had all the fixes in. I'll keep an eye on this.
@davidkyle That build used the last good commit from before you pushed your wait-for-green change:
Closing as the last failure was based on an old commit. Thanks @ywelsch
To me this looks like a different issue and I think that a new issue should be opened.
Yes, I think you are right; the last one is something completely new. I shouldn't have merged them both here.
The first failure that @imotov reported is the same as the first failure that was originally reported by @atorok (i.e. While we fixed the second failure reported by @atorok (
I think the most recent failures are all due to the
These C++ failures are split out into #30982. As far as I can see, all of yesterday's failures are caused by this new problem. The full logs for the first failure reported by @atorok are no longer available, but there were a total of three problems that were fixed for this issue. The first was a wire transport format incompatibility between master and 6.x that was fixed in #30512. @davidkyle found that very quickly during the original investigation of this issue, so I suspect this was the cause of the first reported failure.
Closing as the investigation has shown that the recent failures are due to a SEGV in autodetect, which is tracked in #30982.
Discovered while building #30367.
Failed. Reproducible using:
Saw different failures while trying to reproduce it in isolation:
and
The first failure reproduced 3 times, the second one only once, and there were 3 successful runs as well.