Improve zen1 support with StatefulSets #1281

Closed · Tracked by #1299
sebgl opened this issue Jul 18, 2019 · 19 comments
Labels: justdoit (Continuous improvement not related to a specific feature)

sebgl commented Jul 18, 2019

How should we deal with zen1 in ECK reconciliation loops?

I have in mind the following, open for discussion:

Reconciliation:
---------------

We go through the following algorithm over and over again,
every time something changes in k8s resources
(e.g. a pod was created/deleted based on what we scheduled).

/!\ We don't explicitly create/delete nodes: we schedule them for
async creation/deletion (handled by k8s).

---------------

## 1. Retrieve state of the world

- get the expected nodes we should have
- get the nodes we currently have

## 2. Create missing nodes

- create `elasticsearch.yml` for each missing node,
   with `minimum_master_nodes=(expected_master_nodes/2)+1` (see the quorum sketch below)
- `expected_master_nodes` here does not represent the current
   state of resources, but the one we expect once all nodes are there
   (e.g. if we add 2 new masters to 2 existing ones, the 2 new ones will have m_m_n=3)
- schedule missing nodes for creation (with their configuration)
- modify all existing nodes' configuration to take the updated
  `minimum_master_nodes` into account, in case they restart
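
A minimal sketch of that quorum computation (hypothetical helper name, not the actual operator code):

```go
package sketch // hypothetical package for illustration

// quorum returns the zen1 minimum_master_nodes value for a given number
// of master-eligible nodes: a strict majority, i.e. n/2 + 1.
func quorum(masterNodes int) int {
	if masterNodes == 0 {
		return 0
	}
	return masterNodes/2 + 1
}
```

For the example above, adding 2 masters to 2 existing ones gives quorum(4) = 3, which is what both the new nodes and the existing nodes should get in their configuration.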

## 3. Update cluster settings API

/!\ Important: at that time the nodes we scheduled for creation above
may not be created yet

- Update /_cluster/settings (both persistent and transient)
   with `minimum_master_nodes=(current_masters/2)+1`

`current_masters` here is ignoring any new node from step 2
that may not have been created yet.
Once these new nodes are actually created, we'll go through
the entire algorithm again.
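
For illustration, the settings update in this step could look like the following sketch, which calls the standard Elasticsearch cluster settings API directly over HTTP (the real operator would go through its own ES client; helper and parameter names here are hypothetical):

```go
package sketch // hypothetical package for illustration

import (
	"context"
	"fmt"
	"net/http"
	"strings"
)

// updateMinimumMasterNodes sets discovery.zen.minimum_master_nodes in both the
// persistent and transient cluster settings, as described in step 3.
func updateMinimumMasterNodes(ctx context.Context, client *http.Client, esURL string, mmn int) error {
	body := fmt.Sprintf(
		`{"persistent":{"discovery.zen.minimum_master_nodes":%d},"transient":{"discovery.zen.minimum_master_nodes":%d}}`,
		mmn, mmn)
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, esURL+"/_cluster/settings", strings.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("updating cluster settings: unexpected status %s", resp.Status)
	}
	return nil
}
```

Here `mmn` would be `quorum(current_masters)`, computed from the masters actually present and ignoring nodes that are scheduled but not yet created.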

## 4. Delete nodes

- Only delete one master node at a time.
- If node(s) to delete is a master:
    - special case if going from 2 masters to 1 master (unsafe):
         update /_cluster/settings with `minimum_master_nodes=1`
    - else (normal case): don't update /_cluster/settings yet
    - schedule the master node for deletion

Nodes are not effectively deleted here (async call),
but we'll go through the entire algorithm again once
the deletion has actually happened, and update `/_cluster/settings` in step 3.
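
A hedged sketch of the step 4 decision for a single master removal (hypothetical names; the setter would be the settings call sketched in step 3):

```go
package sketch // hypothetical package for illustration

// beforeMasterRemoval applies the zen1 special case from step 4: when going
// from 2 masters to 1, minimum_master_nodes must be lowered to 1 before the
// node is scheduled for deletion. In the normal case nothing is done here;
// the settings update happens in step 3 of a later reconciliation, once the
// deletion has actually taken effect.
func beforeMasterRemoval(currentMasters int, setMinimumMasterNodes func(int) error) error {
	if currentMasters == 2 {
		// Unsafe 2 -> 1 case: lower the setting first.
		return setMinimumMasterNodes(1)
	}
	// Normal case: don't touch /_cluster/settings yet.
	return nil
}
```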

Thoughts would be appreciated @ywelsch @DaveCTurner :)

@sebgl added the "discuss (We need to figure this out)" label Jul 18, 2019

sebgl commented Jul 18, 2019

A suggestion from @ywelsch on Slack:

The only perhaps a bit more controversial one is step 2, where the existing running masters are only made aware of the changes post-hoc, leaving a period where the minimum_master_nodes cluster setting is less than (total_master_nodes/2)+1.
One solution there would be to only add one master at a time (with the only exception during cluster bootstrapping perhaps)


sebgl commented Jul 18, 2019

Following #1262, I think we need to:

  • update m_m_n before deleting a master node in the case 2->1
  • probably make sure we add only one master node at a time (except when bootstrapping the cluster)

@sebgl changed the title from "Ssets refactoring and zen1 considerations" to "Improve zen1 support with StatefulSets" Jul 19, 2019
@sebgl added the "justdoit (Continuous improvement not related to a specific feature)" and "loe:medium" labels and removed the "discuss" label Jul 19, 2019
@sebgl self-assigned this Aug 6, 2019

barkbay commented Aug 7, 2019

  • probably make sure we add only one master node at a time (except when bootstrapping the cluster)

Actually, maybe there is a small improvement: if the current number n of masters is odd, you may be able to add 2 new masters at a time. The reason is that n+1 and n+2 masters share the same m_m_n.

For example: if you have 3 nodes and the user wants to move to 5 (or more) nodes, you can directly add the 2 new nodes, because the new m_m_n is still 3 whether there are 4 or 5 masters.
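
A quick check of that integer arithmetic, reusing the quorum sketch from earlier in the thread:

```go
package sketch // hypothetical package for illustration

import "fmt"

// With n = 3 current masters, m_m_n stays at 3 whether we end up with 4 or 5
// masters, since 4/2+1 == 5/2+1 == 3 with integer division. Both new masters
// can therefore be added in one step without weakening the majority.
func printQuorums() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("masters=%d minimum_master_nodes=%d\n", n, quorum(n))
	}
}
```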


sebgl commented Aug 7, 2019

Actually, maybe there is a small improvement: if the current number n of masters is odd, you may be able to add 2 new masters at a time. The reason is that n+1 and n+2 masters share the same m_m_n.

True.
Not sure that's worth handling explicitly, though; it would add a bit of complexity to the code for a rather small optimisation.


sebgl commented Aug 12, 2019

@DaveCTurner what's the recommended approach when dealing with a mixed v6/v7 cluster?

Say I'm moving from 3 v6.x nodes to 5 v7.x nodes.
Is it safe to update minimum_master_nodes through the API if the current master node is a v7 node?
Is it safe to exclude a node from voting config exclusions if the current master node is a v6 node?

One way of avoiding this problem might be to make sure we don't perform (up|down)scales and rolling upgrades at the same time.


pebrc commented Aug 12, 2019

Is it safe to exclude a node from voting config exclusions if the current master node is a v6 node?

This will fail because the 6.x master node does not have the 7.x voting config exclusions API.


ywelsch commented Aug 12, 2019

@DaveCTurner is on vacation (for the next 2 weeks) so I'm answering here.

Zen2 nodes always prefer to elect Zen1 nodes as master as long as there's a master-eligible Zen1 node in the cluster.

Voting configurations only make sense once the cluster has been Zen2-bootstrapped. In a rolling upgrade, this bootstrapping automatically happens once the Zen2 nodes notice that all Zen1 master-eligible nodes have left the cluster. Only after bootstrapping (which establishes the first voting configuration) can a Zen2 node become master.

This means that in a rolling upgrade with mixed Zen1/Zen2 master-eligible nodes, as long as there is a Zen1 master, minimum_master_nodes should be adjusted, as there are no voting configurations (and therefore also no exclusions).

If you're expanding a 3 node cluster to 5 nodes during a 6.x to 7.x upgrade with a grow-and-shrink, it should look as follows:

  • initially have 3 6.x nodes, min_master_nodes set to 2
  • set min_master_nodes to 3, add 7.x node (3 is a majority of 4 nodes)
  • set min_master_nodes to 3, add 7.x node (3 is a majority of 5 nodes)
  • set min_master_nodes to 4, add 7.x node (4 is a majority of 6 nodes)
  • set min_master_nodes to 4, add 7.x node (4 is a majority of 7 nodes)
  • set min_master_nodes to 5, add 7.x node (5 is a majority of 8 nodes)
  • remove 6.x node (master will still be a 6.x node), set min_master_nodes to 4 (4 is a majority of 7 nodes)
  • remove 6.x node (master will still be a 6.x node), set min_master_nodes to 4 (4 is a majority of 6 nodes)
  • remove 6.x node (master will still be a 6.x node). This is the last Zen1 node. The Zen2 nodes will at that point auto-bootstrap with an initial voting configuration containing the 5 Zen2 nodes, and a Zen2 master will be elected. There is no need to adjust min_master_nodes (it can be set to null).

Is it safe to exclude a node from voting config exclusions if the current master node is a v6 node?

It's nonsensical because there are no voting configurations at that point.

Is it safe to update minimum_master_nodes through the API if the current master node is a v7 node?

It's nonsensical because once you have a Zen2 master, only voting configurations matter.


sebgl commented Aug 12, 2019

Thanks @ywelsch!
IIUC, a simple way to handle this in our code when we make changes to the cluster (sketched below):

  • if there is at least one v6 master node running, go the zen1 way (update minimum_master_nodes)
  • else (no v6 master node), go the zen2 way (voting config exclusions)
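
A minimal sketch of that branching (hypothetical types and helpers; the actual operator code differs):

```go
package sketch // hypothetical package for illustration

// node is a simplified view of a running cluster node for this sketch.
type node struct {
	masterEligible bool
	majorVersion   int // e.g. 6 or 7
}

// useZen1 returns true if at least one v6 master-eligible node is running, in
// which case minimum_master_nodes must be managed (the zen1 way). Otherwise,
// voting config exclusions are used (the zen2 way).
func useZen1(runningNodes []node) bool {
	for _, n := range runningNodes {
		if n.masterEligible && n.majorVersion < 7 {
			return true
		}
	}
	return false
}
```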


ywelsch commented Aug 12, 2019

Yes, makes sense.


sebgl commented Aug 14, 2019

@ywelsch one tricky bit raised in the PR for the zen1 2->1 master special case is the availability of the master nodes. We can be in the following situation:

  • 3 master nodes cluster
  • 2 out of the 3 master nodes are actually running. One is unavailable for whatever reason (it was just created but is not running yet, not enough resources on k8s, the host is down, a network issue, etc.)
  • The user wants to downscale from 3 to 2 master nodes, effectively removing one of the 2 running masters (this might be part of a larger 3->1 downscale).

Even though we downscale, in theory, from 3 to 2 nodes as seen from k8s, we effectively downscale from 2 to 1 nodes as seen from ES. If we update zen1 minimum_master_nodes=1 before removing the node, and the third node comes back to life before the node is removed, we are temporarily running 3 masters with minimum_master_nodes=1.

This is made a bit harder since ECK works with out-of-date node state (it can see a node as unavailable that would, in real-time reality (😄), be available).

We could also argue that the master to remove should be the unavailable one, but the user is allowed to spread the master nodes specification across different groups of nodes (different config, availability zones, etc.), so they may explicitly request the removal of one of the 2 running masters.

Any thoughts?


ywelsch commented Aug 14, 2019

As you noticed, this type of down-scaling is completely unsafe, and the quickest way to get data loss. As @pebrc commented on the PR, you can't just take the currently available nodes into account to determine minimum_master_nodes (same applies to all other cases, not only those outlined here). I think that the operator should fail the down-scaling process in the above case.


sebgl commented Aug 14, 2019

We could decide to always make sure the expected masters are running before updating minimum_master_nodes through the ES API (no more than expected, from removals or rolling upgrades not yet completed; no fewer, from creations not yet completed).
We would requeue any downscale operation until we reach this stable state.
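
A minimal sketch of such a guard (hypothetical names, assuming we can build the sets of expected and actually-observed master node names):

```go
package sketch // hypothetical package for illustration

import "fmt"

// checkMastersStable returns an error when the expected master nodes do not
// exactly match the masters currently observed in the cluster; the caller
// should then requeue instead of updating minimum_master_nodes.
func checkMastersStable(expected, observed map[string]struct{}) error {
	if len(expected) != len(observed) {
		return fmt.Errorf("expected %d masters but observed %d, requeuing", len(expected), len(observed))
	}
	for name := range expected {
		if _, ok := observed[name]; !ok {
			return fmt.Errorf("expected master %s not observed in the cluster, requeuing", name)
		}
	}
	return nil
}
```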

I'm worried about cases where the user would be stuck with masters that will never reach a running state. For example:

  • masters that were misconfigured (e.g. an unknown ES setting)
  • masters that cannot be scheduled on a k8s node (wrong affinity or no more resources available)

If we refuse to perform any downscale at this point, the user cannot remove those misconfigured StatefulSets?


pebrc commented Aug 14, 2019

If we refuse to perform any downscale at this point, the user cannot remove those misconfigured StatefulSets?

Could we detect some instances where such a node would be OK to remove? For example, a boot-looping node (due to misconfiguration) would be detectable. I know it adds complexity, but it might help us get out of that situation.

anyasabo commented

We could decide to always make sure the expected masters are running before updating minimum_master_nodes through the ES API

👍 this makes sense to me. We know how many we expect to be running and can check how many are actually members of the cluster.

Personally I'm not too worried about handling a boot-looping pod within the operator right now; that definitely seems like it can be an optimization for later. For now some manual work seems acceptable. We might need to expand the troubleshooting documentation as we run into more cases like that, and then we can think about automating them.

sebgl added a commit to sebgl/cloud-on-k8s that referenced this issue Aug 19, 2019
When doing any zen2 API call (voting config exclusions), we checked if
there is at least one v7+ master in the cluster in order to proceed.
This is wrong because if the current master is a v6 node, the call will
fail.
In a cluster with mixed v6 -> v7 versions, the current master is
supposed to be a v6 node. Zen2 APIs can be called once there is no more
v6 node in the cluster, at which point the current master is a v7 node.

For more context, see
elastic#1281 (comment)
sebgl added a commit that referenced this issue Aug 20, 2019

* Only update zen2 settings if there is no v6 master node
* Rename functions to respect the IsCompatibleWith convention

sebgl commented Aug 21, 2019

I think it's safe to close this now.
I created #1605 to focus on the remaining task of adding only one master node at a time, and how to deal with unavailable masters.

@sebgl closed this as completed Aug 21, 2019

DaveCTurner commented Aug 26, 2019

Zen2 nodes always prefer to elect Zen1 nodes as master as long as there's a master-eligible Zen1 node in the cluster.

This is not quite true. It should be:

Zen2 nodes always prefer to elect Zen1 nodes as master as long as there's a master-eligible Zen1 node in the cluster and they have not already bootstrapped a Zen2 cluster.

The normal flow for a rolling upgrade from 6.x to 7.x is as Yannick describes, but there is a failure case that isn't covered. If more than half of the master-eligible nodes are upgraded to 7.x and then the remaining 6.x master-eligible nodes are temporarily disconnected then the 7.x nodes will assume the 6.x master-eligible nodes have been shut down and will automatically bootstrap a 7.x cluster. When the 6.x nodes reconnect they will discover the 7.x cluster and will join it. You will now be in a state where the process described in this comment doesn't do the right thing, because it will be assuming this is still a 6.x cluster and manipulating minimum_master_nodes.

As a more concrete example, imagine you are upgrading a 3-master-node cluster and get to the point of having two 7.x master nodes and one 6.x master node. If the 6.x master node is briefly disconnected then the 7.x master nodes will bootstrap a cluster and elect a master, and the voting configuration will be the elected master alone. If you now shut down the elected master without using a voting config exclusion then the cluster will fall apart, even if the other two master nodes are fine.

I raise this because I'm not sure what else might happen mid-upgrade. If an upgrade gets stuck for some reason and you need to restart a 7.x node to fix it then you may need to handle the cluster being unavailable for a while.

There's also a mechanism for preventing 6.x nodes from even joining a cluster that comprises entirely 7.x nodes, which might happen if the 6.x nodes are all temporarily disconnected.


sebgl commented Aug 26, 2019

@DaveCTurner thanks for the details.
What's the best way to know if a zen2 cluster has been bootstrapped?

@sebgl reopened this Aug 26, 2019
DaveCTurner commented

If the cluster state has a nonempty voting configuration then it has definitely bootstrapped. However, if you observe an empty voting configuration then you don't know: bootstrapping can technically happen at any time, so you may be looking at stale information.

The best way to handle this is, I think, to make sure that you can still escape from this kind of intermediate state by pushing forwards with the upgrade despite failures.
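
A hedged sketch of the "definitely bootstrapped" check, assuming the 7.x cluster state API exposes the committed voting configuration under `metadata.cluster_coordination.last_committed_config` (field names to be verified against the ES version in use):

```go
package sketch // hypothetical package for illustration

import (
	"context"
	"encoding/json"
	"net/http"
)

// clusterStateCoordination mirrors the part of GET /_cluster/state/metadata
// this sketch cares about.
type clusterStateCoordination struct {
	Metadata struct {
		ClusterCoordination struct {
			LastCommittedConfig []string `json:"last_committed_config"`
		} `json:"cluster_coordination"`
	} `json:"metadata"`
}

// isDefinitelyBootstrapped returns true only when a non-empty committed voting
// configuration is observed. A false result is inconclusive: bootstrapping may
// have happened since this (possibly stale) cluster state was produced.
func isDefinitelyBootstrapped(ctx context.Context, client *http.Client, esURL string) (bool, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, esURL+"/_cluster/state/metadata", nil)
	if err != nil {
		return false, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	var state clusterStateCoordination
	if err := json.NewDecoder(resp.Body).Decode(&state); err != nil {
		return false, err
	}
	return len(state.Metadata.ClusterCoordination.LastCommittedConfig) > 0, nil
}
```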


sebgl commented Sep 25, 2019

Closing this big issue in favor of keeping #1792 open to fix the corner case we discovered in the last few posts.

Other than that, everything looks fine with zen1 so far. We'll open other issues if we discover new bugs/corner cases.
