Improve zen1 support with StatefulSets #1281

Closed · Tracked by #1299
sebgl opened this issue Jul 18, 2019 · 19 comments
Labels: justdoit (Continuous improvement not related to a specific feature)

sebgl commented Jul 18, 2019

How should we deal with zen1 in ECK reconciliation loops?

I have in mind the following, open for discussion:

Reconciliation:
---------------

We go through the following algorithm over and over again,
every time something changes in k8s resources
(e.g. a pod was created/deleted based on what we scheduled).

/!\ We don't explicitly create/delete nodes: we schedule them for
async creation/deletion (handled by k8s).

---------------

## 1. Retrieve state of the world

- get the expected nodes we should have
- get the nodes we currently have

## 2. Create missing nodes

- create `elasticsearch.yml` for each missing node,
   with `minimum_master_nodes=(expected_master_nodes/2)+1` (see the quorum sketch below)
- `expected_master_nodes` here does not represent the current
   state of resources, but the one we expect once all nodes are there
   (e.g. if we add 2 new masters to 2 existing ones, the 2 new ones will have m_m_n=3)
- schedule missing nodes for creation (with their configuration)
- modify all existing nodes' configuration to take the updated
  `minimum_master_nodes` into account, in case they restart
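
A minimal sketch of that quorum computation (hypothetical helper name, not the actual operator code):

```go
package sketch // hypothetical package for illustration

// quorum returns the zen1 minimum_master_nodes value for a given number
// of master-eligible nodes: a strict majority, i.e. n/2 + 1.
func quorum(masterNodes int) int {
	if masterNodes == 0 {
		return 0
	}
	return masterNodes/2 + 1
}
```

For the example above, adding 2 masters to 2 existing ones gives quorum(4) = 3, which is what both the new nodes and the existing nodes should get in their configuration.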

## 3. Update cluster settings API

/!\ Important: at that time the nodes we scheduled for creation above
may not be created yet

- Update /_cluster/settings (both persistent and transient)
   with `minimum_master_nodes=(current_masters/2)+1`

`current_masters` here is ignoring any new node from step 2
that may not have been created yet.
Once these new nodes are actually created, we'll go through
the entire algorithm again.
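
For illustration, the settings update in this step could look like the following sketch, which calls the standard Elasticsearch cluster settings API directly over HTTP (the real operator would go through its own ES client; helper and parameter names here are hypothetical):

```go
package sketch // hypothetical package for illustration

import (
	"context"
	"fmt"
	"net/http"
	"strings"
)

// updateMinimumMasterNodes sets discovery.zen.minimum_master_nodes in both the
// persistent and transient cluster settings, as described in step 3.
func updateMinimumMasterNodes(ctx context.Context, client *http.Client, esURL string, mmn int) error {
	body := fmt.Sprintf(
		`{"persistent":{"discovery.zen.minimum_master_nodes":%d},"transient":{"discovery.zen.minimum_master_nodes":%d}}`,
		mmn, mmn)
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, esURL+"/_cluster/settings", strings.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("updating cluster settings: unexpected status %s", resp.Status)
	}
	return nil
}
```

Here `mmn` would be `quorum(current_masters)`, computed from the masters actually present and ignoring nodes that are scheduled but not yet created.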

## 4. Delete nodes

- Only delete one master node at a time.
- If node(s) to delete is a master:
    - special case if going from 2 masters to 1 master (unsafe):
         update /_cluster/settings with `minimum_master_nodes=1`
    - else (normal case): don't update /_cluster/settings yet
    - schedule the master node for deletion

Nodes are not effectively deleted here (async call),
but we'll go through the entire algorithm again once
the deletion has actually happened, and update `/_cluster/settings` in step 3.
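
A hedged sketch of the step 4 decision for a single master removal (hypothetical names; the setter would be the settings call sketched in step 3):

```go
package sketch // hypothetical package for illustration

// beforeMasterRemoval applies the zen1 special case from step 4: when going
// from 2 masters to 1, minimum_master_nodes must be lowered to 1 before the
// node is scheduled for deletion. In the normal case nothing is done here;
// the settings update happens in step 3 of a later reconciliation, once the
// deletion has actually taken effect.
func beforeMasterRemoval(currentMasters int, setMinimumMasterNodes func(int) error) error {
	if currentMasters == 2 {
		// Unsafe 2 -> 1 case: lower the setting first.
		return setMinimumMasterNodes(1)
	}
	// Normal case: don't touch /_cluster/settings yet.
	return nil
}
```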

Thoughts would be appreciated @ywelsch @DaveCTurner :)

@sebgl added the "discuss (We need to figure this out)" label Jul 18, 2019

sebgl commented Jul 18, 2019

A suggestion from @ywelsch on Slack:

The only perhaps a bit more controversial one is step 2, where the existing running masters are only made aware of the changes post-hoc, leaving a period where the minimum_master_nodes cluster setting is less than (total_master_nodes/2)+1.
One solution there would be to only add one master at a time (with the only exception during cluster bootstrapping perhaps)


sebgl commented Jul 18, 2019

Following #1262, I think we need to:

  • update m_m_n before deleting a master node in the case 2->1
  • probably make sure we add only one master node at a time (except when bootstrapping the cluster)

@sebgl changed the title from "Ssets refactoring and zen1 considerations" to "Improve zen1 support with StatefulSets" Jul 19, 2019
@sebgl added the "justdoit (Continuous improvement not related to a specific feature)" and "loe:medium" labels and removed the "discuss" label Jul 19, 2019
@sebgl self-assigned this Aug 6, 2019

barkbay commented Aug 7, 2019

  • probably make sure we add only one master node at a time (except when bootstrapping the cluster)

Actually, maybe there is a small improvement: if the current number n of masters is odd, you may be able to add 2 new masters at a time. The reason is that n+1 and n+2 masters share the same m_m_n.

For example: if you have 3 nodes and the user wants to move to 5 (or more) nodes, you can directly add the 2 new nodes, because the new m_m_n is still 3 whether there are 4 or 5 masters.
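
A quick check of that integer arithmetic, reusing the quorum sketch from earlier in the thread:

```go
package sketch // hypothetical package for illustration

import "fmt"

// With n = 3 current masters, m_m_n stays at 3 whether we end up with 4 or 5
// masters, since 4/2+1 == 5/2+1 == 3 with integer division. Both new masters
// can therefore be added in one step without weakening the majority.
func printQuorums() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("masters=%d minimum_master_nodes=%d\n", n, quorum(n))
	}
}
```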


sebgl commented Aug 7, 2019

Actually, maybe there is a small improvement: if the current number n of masters is odd, you may be able to add 2 new masters at a time. The reason is that n+1 and n+2 masters share the same m_m_n.

True.
Not sure that's worth handling explicitly, though; it would add a bit of complexity to the code for a rather small optimisation.


sebgl commented Aug 12, 2019

@DaveCTurner what's the recommended approach when dealing with a mixed v6/v7 cluster?

Say I'm moving from 3 v6.x nodes to 5 v7.x nodes.
Is it safe to update minimum_master_nodes through the API if the current master node is a v7 node?
Is it safe to exclude a node from voting config exclusions if the current master node is a v6 node?

One way of avoiding this problem might be to make sure we don't perform (up|down)scales and rolling upgrades at the same time.


pebrc commented Aug 12, 2019

Is it safe to exclude a node from voting config exclusions if the current master node is a v6 node?

This will fail because the 6.x master node does not have the 7.x voting config exclusions API.


ywelsch commented Aug 12, 2019

@DaveCTurner is on vacation (for the next 2 weeks) so I'm answering here.

Zen2 nodes always prefer to elect Zen1 nodes as master as long as there's a master-eligible Zen1 node in the cluster.

Voting configurations only make sense once the cluster has been Zen2-bootstrapped. In a rolling upgrade, this bootstrapping automatically happens once the Zen2 nodes notice that all Zen1 master-eligible nodes have left the cluster. Only after bootstrapping (which establishes the first voting configuration) can a Zen2 node become master.

This means that in a rolling upgrade with mixed Zen1/Zen2 master-eligible nodes, as long as there is a Zen1 master, minimum_master_nodes should be adjusted, as there are no voting configurations (and therefore also no exclusions).

If you're expanding a 3 node cluster to 5 nodes during a 6.x to 7.x upgrade with a grow-and-shrink, it should look as follows:

  • initially have 3 6.x nodes, min_master_nodes set to 2
  • set min_master_nodes to 3, add 7.x node (3 is a majority of 4 nodes)
  • set min_master_nodes to 3, add 7.x node (3 is a majority of 5 nodes)
  • set min_master_nodes to 4, add 7.x node (4 is a majority of 6 nodes)
  • set min_master_nodes to 4, add 7.x node (4 is a majority of 7 nodes)
  • set min_master_nodes to 5, add 7.x node (5 is a majority of 8 nodes)
  • remove 6.x node (master will still be a 6.x node), set min_master_nodes to 4 (4 is a majority of 7 nodes)
  • remove 6.x node (master will still be a 6.x node), set min_master_nodes to 4 (4 is a majority of 6 nodes)
  • remove 6.x node (master will still be a 6.x node). This is the last Zen1 node. The Zen2 nodes will at that point auto-bootstrap with an initial voting configuration containing the 5 Zen2 nodes, and a Zen2 master will be elected. There is no need to adjust min_master_nodes (it can be set to null).

Is it safe to exclude a node from voting config exclusions if the current master node is a v6 node?

It's nonsensical because there are no voting configurations at that point.

Is it safe to update minimum_master_nodes through the API if the current master node is a v7 node?

It's nonsensical because once you have a Zen2 master, only voting configurations matter.


sebgl commented Aug 12, 2019

Thanks @ywelsch!
IIUC, a simple way to handle this in our code when we make changes to the cluster (sketched below):

  • if there is at least one v6 master node running, go the zen1 way (update minimum_master_nodes)
  • else (no v6 master node), go the zen2 way (voting config exclusions)
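
A minimal sketch of that branching (hypothetical types and helpers; the actual operator code differs):

```go
package sketch // hypothetical package for illustration

// node is a simplified view of a running cluster node for this sketch.
type node struct {
	masterEligible bool
	majorVersion   int // e.g. 6 or 7
}

// useZen1 returns true if at least one v6 master-eligible node is running, in
// which case minimum_master_nodes must be managed (the zen1 way). Otherwise,
// voting config exclusions are used (the zen2 way).
func useZen1(runningNodes []node) bool {
	for _, n := range runningNodes {
		if n.masterEligible && n.majorVersion < 7 {
			return true
		}
	}
	return false
}
```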


ywelsch commented Aug 12, 2019

Yes, makes sense.


sebgl commented Aug 14, 2019

@ywelsch one tricky bit raised in the PR for the zen1 2->1 master special case is the availability of the master nodes. We can be in the following situation:

  • 3 master nodes cluster
  • 2 out of the 3 master nodes are actually running. One is unavailable for whatever reason (it was just created but is not running yet, not enough resources on k8s, the host is down, a network issue, etc.)
  • The user wants to downscale from 3 to 2 master nodes, effectively removing one of the 2 running masters (this might be part of a larger 3->1 downscale).

Even though we downscale, in theory, from 3 to 2 nodes as seen from k8s, we effectively downscale from 2 to 1 nodes as seen from ES. If we update zen1 minimum_master_nodes=1 before removing the node, and the third node comes back to life before the node is removed, we are temporarily running 3 masters with minimum_master_nodes=1.

This is made a bit harder since ECK works with out-of-date node state (it can see a node as unavailable that would, in real-time reality (😄), be available).

We could also argue that the master to remove should be the unavailable one, but the user is allowed to spread the master nodes specification across different groups of nodes (different config, availability zones, etc.), so they may explicitly request the removal of one of the 2 running masters.

Any thoughts?


ywelsch commented Aug 14, 2019

As you noticed, this type of down-scaling is completely unsafe, and the quickest way to get data loss. As @pebrc commented on the PR, you can't just take the currently available nodes into account to determine minimum_master_nodes (same applies to all other cases, not only those outlined here). I think that the operator should fail the down-scaling process in the above case.


sebgl commented Aug 14, 2019

We could decide to always make sure the expected masters are running before updating minimum_master_nodes through the ES API (no more than expected, from removals or rolling upgrades not yet completed; no fewer, from creations not yet completed).
We would requeue any downscale operation until we reach this stable state.
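
A minimal sketch of such a guard (hypothetical names, assuming we can build the sets of expected and actually-observed master node names):

```go
package sketch // hypothetical package for illustration

import "fmt"

// checkMastersStable returns an error when the expected master nodes do not
// exactly match the masters currently observed in the cluster; the caller
// should then requeue instead of updating minimum_master_nodes.
func checkMastersStable(expected, observed map[string]struct{}) error {
	if len(expected) != len(observed) {
		return fmt.Errorf("expected %d masters but observed %d, requeuing", len(expected), len(observed))
	}
	for name := range expected {
		if _, ok := observed[name]; !ok {
			return fmt.Errorf("expected master %s not observed in the cluster, requeuing", name)
		}
	}
	return nil
}
```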

I'm worried about cases where the user would be stuck with masters that will never reach a running state. For example:

  • masters that were misconfigured (e.g. an unknown ES setting)
  • masters that cannot be scheduled on a k8s node (wrong affinity or no more resources available)

If we refuse to perform any downscale at this point, the user cannot remove those misconfigured StatefulSets?


pebrc commented Aug 14, 2019

If we refuse to perform any downscale at this point, the user cannot remove those misconfigured StatefulSets?

Could we detect some instances where such a node would be OK to remove? For example, a boot-looping node (due to misconfiguration) would be detectable. I know it adds complexity, but it might help us get out of that situation.

anyasabo commented

We could decide to always make sure the expected masters are running before updating minimum_master_nodes through the ES API

👍 this makes sense to me. We know how many we expect to be running and can check how many are actually members of the cluster.

Personally I'm not too worried about handling a boot-looping pod within the operator right now; that definitely seems like it can be an optimization for later. For now some manual work seems acceptable. We might need to expand the troubleshooting documentation as we run into more cases like that, and then we can think about automating them.

sebgl added a commit to sebgl/cloud-on-k8s that referenced this issue Aug 19, 2019
When doing any zen2 API call (voting config exclusions), we checked if
there is at least one v7+ master in the cluster in order to proceed.
This is wrong because if the current master is a v6 node, the call will
fail.
In a cluster with mixed v6 -> v7 versions, the current master is
supposed to be a v6 node. Zen2 APIs can be called once there is no more
v6 node in the cluster, at which point the current master is a v7 node.

For more context, see
elastic#1281 (comment)
sebgl added a commit that referenced this issue Aug 20, 2019

* Only update zen2 settings if there is no v6 master node
* Rename functions to respect the IsCompatibleWith convention

sebgl commented Aug 21, 2019

I think it's safe to close this now.
I created #1605 to focus on the remaining task of adding only one master node at a time, and how to deal with unavailable masters.

@sebgl closed this as completed Aug 21, 2019

DaveCTurner commented Aug 26, 2019

Zen2 nodes always prefer to elect Zen1 nodes as master as long as there's a master-eligible Zen1 node in the cluster.

This is not quite true. It should be:

Zen2 nodes always prefer to elect Zen1 nodes as master as long as there's a master-eligible Zen1 node in the cluster and they have not already bootstrapped a Zen2 cluster.

The normal flow for a rolling upgrade from 6.x to 7.x is as Yannick describes, but there is a failure case that isn't covered. If more than half of the master-eligible nodes are upgraded to 7.x and then the remaining 6.x master-eligible nodes are temporarily disconnected then the 7.x nodes will assume the 6.x master-eligible nodes have been shut down and will automatically bootstrap a 7.x cluster. When the 6.x nodes reconnect they will discover the 7.x cluster and will join it. You will now be in a state where the process described in this comment doesn't do the right thing, because it will be assuming this is still a 6.x cluster and manipulating minimum_master_nodes.

As a more concrete example, imagine you are upgrading a 3-master-node cluster and get to the point of having two 7.x master nodes and one 6.x master node. If the 6.x master node is briefly disconnected then the 7.x master nodes will bootstrap a cluster and elect a master, and the voting configuration will be the elected master alone. If you now shut down the elected master without using a voting config exclusion then the cluster will fall apart, even if the other two master nodes are fine.

I raise this because I'm not sure what else might happen mid-upgrade. If an upgrade gets stuck for some reason and you need to restart a 7.x node to fix it then you may need to handle the cluster being unavailable for a while.

There's also a mechanism for preventing 6.x nodes from even joining a cluster that comprises entirely 7.x nodes, which might happen if the 6.x nodes are all temporarily disconnected.


sebgl commented Aug 26, 2019

@DaveCTurner thanks for the details.
What's the best way to know if a zen2 cluster has been bootstrapped?

@sebgl reopened this Aug 26, 2019
DaveCTurner commented

If the cluster state has a nonempty voting configuration then it has definitely bootstrapped. However, if you observe an empty voting configuration then you don't know: bootstrapping can technically happen at any time, so you may be looking at stale information.

The best way to handle this is, I think, to make sure that you can still escape from this kind of intermediate state by pushing forwards with the upgrade despite failures.
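
A hedged sketch of the "definitely bootstrapped" check, assuming the 7.x cluster state API exposes the committed voting configuration under `metadata.cluster_coordination.last_committed_config` (field names to be verified against the ES version in use):

```go
package sketch // hypothetical package for illustration

import (
	"context"
	"encoding/json"
	"net/http"
)

// clusterStateCoordination mirrors the part of GET /_cluster/state/metadata
// this sketch cares about.
type clusterStateCoordination struct {
	Metadata struct {
		ClusterCoordination struct {
			LastCommittedConfig []string `json:"last_committed_config"`
		} `json:"cluster_coordination"`
	} `json:"metadata"`
}

// isDefinitelyBootstrapped returns true only when a non-empty committed voting
// configuration is observed. A false result is inconclusive: bootstrapping may
// have happened since this (possibly stale) cluster state was produced.
func isDefinitelyBootstrapped(ctx context.Context, client *http.Client, esURL string) (bool, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, esURL+"/_cluster/state/metadata", nil)
	if err != nil {
		return false, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	var state clusterStateCoordination
	if err := json.NewDecoder(resp.Body).Decode(&state); err != nil {
		return false, err
	}
	return len(state.Metadata.ClusterCoordination.LastCommittedConfig) > 0, nil
}
```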


sebgl commented Sep 25, 2019

Closing this big issue in favor of keeping #1792 open to fix the corner case we discovered in the last few posts.

Other than that, everything looks fine with zen1 so far. We'll open other issues if we discover new bugs/corner cases.
