
Downscale one master at a time & don't remove last running master #1549

Merged: 3 commits merged into elastic:master on Aug 14, 2019

Conversation

@sebgl (Contributor) commented Aug 12, 2019

Add safety measures to StatefulSet downscales:

  • Make sure we remove one master node at a time
    This plays nicely with zen settings in general (especially zen1).

  • Don't remove a master if it's the last master running in the cluster
    This prevents a situation where we could accidentally remove the last
    running master, losing cluster_state, while waiting for other masters
    to be running during a sset mutation (e.g. sset renamed).

Both are implemented as "invariants" in an "invariants" struct we use
throughout the downscale process.

Relates #1281.
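
For illustration, a minimal Go sketch of what such invariant checks could look like; the names used here (downscaleState, canRemoveMaster) are hypothetical and do not reflect the actual operator code:

// Hedged sketch only: a possible shape for the two downscale invariants
// described above. Names are illustrative, not the ECK implementation.
package downscale

// downscaleState captures the cluster information the two invariants need.
type downscaleState struct {
	masterRemovalInProgress bool // a master is already being removed in this downscale
	runningMasters          int  // number of master nodes currently running
}

// canRemoveMaster returns true if removing one more master node preserves
// both invariants.
func canRemoveMaster(s downscaleState) bool {
	if s.masterRemovalInProgress {
		return false // invariant 1: remove only one master at a time
	}
	if s.runningMasters <= 1 {
		return false // invariant 2: never remove the last running master
	}
	return true
}

Keeping both checks as a simple predicate over the current state makes them cheap to evaluate before each individual pod removal during the downscale loop.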

@sebgl added the justdoit (Continuous improvement not related to a specific feature) label on Aug 12, 2019
@pebrc (Collaborator) left a comment

LGTM!

@anyasabo (Contributor) left a comment

LGTM, literally only naming nits

@barkbay (Contributor) commented Aug 13, 2019

LGTM!

As a side note, I was a little bit surprised that the operator allows a scale-down from 5 to 3 master nodes while masters 0 and 1 are unavailable:

NAME                                    READY   STATUS    RESTARTS   AGE     IP           NODE                                                 NOMINATED NODE   READINESS GATES
pod/elasticsearch-sample-es-default-0   0/1     Pending   0          75s     <none>       <none>                                               <none>           <none>
pod/elasticsearch-sample-es-default-1   0/1     Pending   0          73s     <none>       <none>                                               <none>           <none>
pod/elasticsearch-sample-es-default-2   1/1     Running   0          6m41s   10.60.1.3    gke-michael-dev-cluster-default-pool-f0e38722-3p4q   <none>           <none>
pod/elasticsearch-sample-es-default-3   1/1     Running   0          3m26s   10.60.0.13   gke-michael-dev-cluster-default-pool-cd30e5ea-vvw1   <none>           <none>
pod/elasticsearch-sample-es-default-4   1/1     Running   0          3m25s   10.60.2.3    gke-michael-dev-cluster-default-pool-3098b032-v54c   <none>           <none>

After the scale-down request, the user has only one master alive while their initial request was to keep 3 of them:

NAME                                    READY   STATUS    RESTARTS   AGE   IP          NODE                                                 NOMINATED NODE   READINESS GATES
pod/elasticsearch-sample-es-default-0   0/1     Pending   0          19m   <none>      <none>                                               <none>           <none>
pod/elasticsearch-sample-es-default-1   0/1     Pending   0          19m   <none>      <none>                                               <none>           <none>
pod/elasticsearch-sample-es-default-2   1/1     Running   0          24m   10.60.1.3   gke-michael-dev-cluster-default-pool-f0e38722-3p4q   <none>           <none>

But I guess that's the expected behavior?

@david-kow (Contributor) left a comment

LGTM, one nit.

@sebgl (Contributor, Author) commented Aug 14, 2019

@barkbay

After the scale-down request, the user has only one master alive while their initial request was to keep 3 of them

Interesting, thanks for testing this corner case. I think that's what we'd expect? The operator cannot do much about nodes staying in a Pending state unfortunately. Btw how did you manage to get into that situation? I'd love to easily reproduce these scenarios.

One important thing though: if you wanted to downscale from 5 to 2, the operator would make sure we don't remove pod/elasticsearch-sample-es-default-2 (the last running master) if 0 and 1 are not running.
We would still be in a "dangerous" situation where only 1 of 3 master nodes is running. I don't think we can do better?
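
To make that concrete, here is a small, hypothetical Go simulation of the 5-to-2 scenario with pods 0 and 1 Pending; pod names and logic are illustrative only, not the operator's actual code:

// Hedged sketch: simulates successive reconciliations of a 5 -> 2 master
// downscale where only pods 2, 3 and 4 are running.
package main

import "fmt"

func main() {
	running := map[string]bool{
		"es-default-0": false, // Pending
		"es-default-1": false, // Pending
		"es-default-2": true,  // Running
		"es-default-3": true,  // Running
		"es-default-4": true,  // Running
	}
	// Pods the downscale wants to remove, highest ordinal first.
	toRemove := []string{"es-default-4", "es-default-3", "es-default-2"}

	runningMasters := 0
	for _, up := range running {
		if up {
			runningMasters++
		}
	}

	// Each reconciliation removes at most one master (invariant 1) and never
	// the last running one (invariant 2).
	for reconciliation := 1; len(toRemove) > 0; reconciliation++ {
		pod := toRemove[0]
		if running[pod] && runningMasters <= 1 {
			fmt.Printf("reconciliation %d: keeping %s (last running master)\n", reconciliation, pod)
			break
		}
		fmt.Printf("reconciliation %d: removing %s\n", reconciliation, pod)
		if running[pod] {
			runningMasters--
		}
		toRemove = toRemove[1:]
	}
	// Expected output:
	// reconciliation 1: removing es-default-4
	// reconciliation 2: removing es-default-3
	// reconciliation 3: keeping es-default-2 (last running master)
}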

@sebgl (Contributor, Author) commented Aug 14, 2019

@pebrc or @anyasabo can you double-check the last commit?

@pebrc (Collaborator) left a comment

Re-LGTM

@sebgl merged commit 1d55099 into elastic:master on Aug 14, 2019
Labels: justdoit (Continuous improvement not related to a specific feature)
Projects: None yet
5 participants