Optimize cluster mutations with PVC reuse #1160
Conversation
As specified in the rolling upgrade doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html This allows indices to still be created while a rolling upgrade is in progress.
These are 2 different settings we want to control separately.
// in order to reuse their PVC for pods to be created.
func (c ChangeBudget) AllowsPVCReuse() bool {
	// only allow pod replacement (deletion first, then creation) if we can go below the target number of pods
	return c.MaxUnavailable > 0
I think we eventually need a better way to express that. See #1153
func (c *clientV6) DisableShardAllocation(ctx context.Context) error {
	allocationSetting := ClusterRoutingAllocation{AllocationSettings{Enable: "none"}}

func (c *clientV6) DisableReplicasShardAllocation(ctx context.Context) error {
	allocationSetting := ClusterRoutingAllocation{AllocationSettings{Enable: "primaries"}}
It's "the official way" from the documentation since a few ES versions, and I think it makes sense (we don't have to refuse new indices if we plan to shut down a node).
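For reference, here is a minimal sketch of the Elasticsearch cluster settings call that sits behind both functions: "none" blocks all shard allocation, while "primaries" still lets primaries (and therefore new indices) be allocated. This is not the operator's actual client code; `setAllocationEnable` and `esURL` are illustrative names.

```go
package allocation

import (
	"bytes"
	"fmt"
	"net/http"
)

// setAllocationEnable PUTs a transient cluster.routing.allocation.enable value
// ("none", "primaries" or "all") to the Elasticsearch cluster settings API.
func setAllocationEnable(esURL, value string) error {
	body := fmt.Sprintf(`{"transient": {"cluster.routing.allocation.enable": %q}}`, value)
	req, err := http.NewRequest(http.MethodPut, esURL+"/_cluster/settings", bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

// Typical rolling-restart usage: setAllocationEnable(esURL, "primaries") before
// shutting a node down, then setAllocationEnable(esURL, "all") once it has rejoined.
```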
	return results.WithResult(defaultRequeue)
}

if len(podsState.RunningReady) >= int(es.Spec.NodeCount()) {
I'm not 100% confident with this condition.
The problem is the following: when we replace a node (delete one, create another one), we need to:

1. disable shards allocation before deleting the node: this is to avoid all shards suddenly being replicated on all other nodes while the node is down
2. re-enable shards allocation once the node is back in the cluster

Step 1 is easy, since we know when we'll delete a node.
Step 2 is hard, since we don't keep track of which node was "restarted": how do we know that one of the nodes we have in the cluster at the moment actually comes from a rolling upgrade? Also, we want to avoid re-enabling shards allocation if the node is still in the process of being restarted.

The best I could come up with is checking the number of nodes in the cluster. In most cases, when doing a rolling upgrade we'll have fewer nodes than when the cluster is "complete": we must not enable shards allocation at this point. If we have the expected number of nodes, chances are all nodes are currently part of the cluster and no node is being rolled.
I'm not exactly sure how this could conflict with `maxSurge > 0`, where we could have the expected number of nodes even though one node is being restarted.
Open for better implementation ideas :)
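For illustration, a rough sketch of that heuristic, with the surrounding types paraphrased (`PodsState` and the function name below are simplified stand-ins, not the operator's real structures):

```go
package rolling

// PodsState is a simplified stand-in for the operator's view of current pods.
type PodsState struct {
	RunningReady map[string]struct{} // pods that are running and ready
}

// shouldReenableShardAllocation returns true when the observed cluster looks
// "complete": at least as many running-and-ready pods as the spec requests.
// Below that number, a node is most likely still being rolled, so shard
// allocation must stay disabled.
func shouldReenableShardAllocation(podsState PodsState, expectedNodeCount int) bool {
	return len(podsState.RunningReady) >= expectedNodeCount
}
```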
One refinement could be:

- empty change set means the cluster is stable -> enable shard allocation
- `maxSurge = 0` and `RunningReady == Spec.NodeCount()`: that is the reuse/restart-only case. We are good to enable shard allocation (btw, technically we would have to wait until the node joins the ES cluster, I believe)
- `maxSurge > 0` and `maxUnavailable > 0`: that is the tricky case
  - first question: why would users have that configured other than by mistake?
  - the only case I can think of is when you have a mixed topology of emptyDir/tmpfs and PVCs
  - we should probably in that case do the grow/shrink i.e. non-reuse changes first and only start with the reuse deletions once we have nothing to add and only PVC reuse deletions left (adjusting performable changes accordingly by adding another pass that enforces that invariant)
  - the problem remains that we can only guess that a reuse is in progress when replica shard allocation is disabled, and the user could change the spec at any time to add more nodes, in which case we would re-enable shard allocation too early, causing unnecessary IO. But maybe that is OK? The only alternative I see is resorting to annotations on the Elasticsearch resource or keeping state in memory about ongoing operations (which would be lost on crashes or restarts, leading again to unnecessary IO as a trade-off)
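To make the refinement concrete, here is a sketch of the decision it describes, assuming simplified inputs (the names below are illustrative, not the operator's actual API):

```go
package rolling

// ChangeBudget is a simplified stand-in for updateStrategy.changeBudget.
type ChangeBudget struct {
	MaxSurge       int
	MaxUnavailable int
}

// canEnableShardAllocation applies the rules sketched above.
func canEnableShardAllocation(pendingChanges int, budget ChangeBudget, runningReady, expectedNodes int) bool {
	if pendingChanges == 0 {
		// empty change set: the cluster is stable
		return true
	}
	if budget.MaxSurge == 0 && runningReady == expectedNodes {
		// reuse/restart-only case: all expected nodes are back
		return true
	}
	// tricky mixed case (maxSurge > 0 and maxUnavailable > 0): stay conservative
	return false
}
```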
// TODO: optimize by doing the call only if shards allocation is currently disabled
ctx, cancel := context.WithTimeout(context.Background(), esclient.DefaultReqTimeout)
defer cancel()
if err := esClient.EnableShardAllocation(ctx); err != nil {
The way things work now, we'll do this call at every single reconciliation (hopefully reconciliations don't happen all the time, but only when there are things to reconcile). We actually don't have to do the call if allocations are already enabled. To optimize, see #1161.
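A possible shape for that optimization, sketched with a hypothetical client interface (`GetShardAllocationEnable` is an illustrative method name, not necessarily what #1161 will introduce):

```go
package rolling

import "context"

// allocationClient is a hypothetical interface; method names are illustrative.
type allocationClient interface {
	// GetShardAllocationEnable returns the current value of
	// cluster.routing.allocation.enable, e.g. "all", "primaries" or "none".
	GetShardAllocationEnable(ctx context.Context) (string, error)
	EnableShardAllocation(ctx context.Context) error
}

// enableShardAllocationIfNeeded avoids issuing a settings update on every
// reconciliation by first checking whether allocation is already enabled.
func enableShardAllocationIfNeeded(ctx context.Context, c allocationClient) error {
	current, err := c.GetShardAllocationEnable(ctx)
	if err != nil {
		return err
	}
	if current == "all" || current == "" {
		// already enabled (or unset, which defaults to "all"): nothing to do
		return nil
	}
	return c.EnableShardAllocation(ctx)
}
```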
// Safety check we can move on with pod deletion
// TODO: check in-between nodes if there is more than 1 to delete, otherwise that
// check may not be valid for the 2nd node after the first one is deleted
// pod deletion for PVC reuse.
func (d *PodDeletionHandler) clusterReadyForNodesRestart() bool {
	// Safety check: move on with nodes shutdown only if the cluster is green
	// TODO: optimize by doing a per-node check instead of a cluster-wide health check:
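For context, a minimal sketch of that safety check using a raw call to the `_cluster/health` API rather than the operator's own client (the function name and `esURL` are illustrative):

```go
package rolling

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type clusterHealth struct {
	Status string `json:"status"` // "green", "yellow" or "red"
}

// clusterIsGreen calls the _cluster/health API and reports whether the cluster
// health is green, i.e. whether it is safe to shut a node down for PVC reuse.
func clusterIsGreen(esURL string) (bool, error) {
	resp, err := http.Get(esURL + "/_cluster/health")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var health clusterHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		return false, err
	}
	return health.Status == "green", nil
}
```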
@@ -135,3 +144,45 @@ func getAndRemoveMatchingPod(
		RemainingPods: podsWithConfig,
	}, nil
}

// calculatePodsToReplace modifies changes to account for pods to delete
comment out of sync with function name?
}

// Then delete pods which will not be replaced hence need data to be migrated
if err := d.deleteWithDataMigration(d.performableChanges.ToDelete.WithoutPVCReuse()); err != nil {
I don't think we can combine deletion with data migration and deletion with PVC reuse. Aren't the allocation settings for both operations mutually exclusive?
If you disable shard allocation for the reuse algo (even if it's only replicas), how are you going to migrate data away from nodes that have data on them?
So I think what needs to happen is that you need to check whether one of these two operations is in progress and finish that one before starting with the other.
So while reuse is running (i.e. shard allocation disabled), no attempt to delete with migration, and vice versa.
Also sounds like we are missing an e2e test for that.
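One way to express that mutual exclusion, sketched with placeholder types (how "reuse is in progress" gets detected, whether via the allocation setting or an annotation, is left open):

```go
package rolling

// deletions is an illustrative grouping of the two kinds of pod deletions.
type deletions struct {
	withPVCReuse  []string // pods to delete and recreate reusing their PVC
	withMigration []string // pods to delete only after migrating data away
}

// nextDeletions returns the only kind of deletion allowed to proceed:
// finish whichever operation is already in progress before starting the other,
// since their shard allocation settings conflict.
func nextDeletions(d deletions, reuseInProgress, migrationInProgress bool) []string {
	switch {
	case reuseInProgress:
		return d.withPVCReuse
	case migrationInProgress:
		return d.withMigration
	case len(d.withPVCReuse) > 0:
		return d.withPVCReuse
	default:
		return d.withMigration
	}
}
```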
> If you disable shard allocation for the reuse algo (even if it's only replicas), how are you going to migrate data away from nodes that have data on them?

Yes, I also thought about that. We discussed it very briefly with @nkvoll, and realized maybe it's fine to have both in parallel: PVC reuse will temporarily block ongoing data migrations, but that's supposed to be temporary. But thinking this through again, you're right, it does not work well. If we use the maxSurge budget to create new nodes (grow and shrink), effectively taking the place of a node to replace (PVC reuse), then we might have a deadlock.

> So while reuse is running (i.e. shard allocation disabled), no attempt to delete with migration, and vice versa.

Would you check that "reuse is running" by retrieving the ES shards allocation setting?
It's quite hard to know we are in a reuse phase in the current implementation: there is technically no difference between creating a pod for grow and shrink and creating a pod for PVC reuse (which, by design, sounds nice?).
I'm wondering if persisting a state in the cluster annotations would help here.

> Also sounds like we are missing an e2e test for that.

+1. I planned to add it with #1156 (https://github.com/elastic/cloud-on-k8s/issues/1156), but maybe it's major enough to deserve being in this PR.
> I'm wondering if persisting a state in the cluster annotations would help here.

I was trying to find a way without resorting to annotations, but in the end it could be necessary to introduce one (see my other comment further down).

> but maybe it's major enough to deserve being in this PR.

I think I am fine with doing it in #1156.
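If an annotation is introduced, it could be as simple as the sketch below (the annotation key and helper names are made up for illustration; they are not part of the project):

```go
package rolling

// pvcReuseAnnotation is a made-up annotation key, for illustration only.
const pvcReuseAnnotation = "elasticsearch.k8s.elastic.co/pvc-reuse-in-progress"

// markPVCReuseInProgress sets the marker on the resource's annotations map,
// so the "reuse in progress" state survives operator restarts.
func markPVCReuseInProgress(annotations map[string]string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[pvcReuseAnnotation] = "true"
	return annotations
}

// pvcReuseInProgress reports whether the marker annotation is set.
func pvcReuseInProgress(annotations map[string]string) bool {
	return annotations[pvcReuseAnnotation] == "true"
}
```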
We agreed on relying on StatefulSets in #1173, which makes this PR obsolete in the longer term.
When doing any cluster mutation (configuration change, resources changes, spec changes, etc.), attempt to optimize the mutation by finding a match between pods to delete and pods to create.
When such match is found, delete the corresponding pod, but not its PVC. The PVC can then be reused for the pod to create, using our orphan PVC mechanism, already in place.
This is only enabled if the user specifies a value for `updateStrategy.changeBudget.maxUnavailable` larger than 0. This setting means "it's OK to delete a pod and go below the target number of pods", for example: "it's OK to temporarily have only 4 nodes in my cluster even though I requested 5". This concept maps quite well to reusing PVCs, since we need to first delete a node, then create another one with the same PVC.
This setting is not enabled by default right now, so reusing PVCs is an opt-in feature.
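As a rough illustration of the opt-in rule (the types below mirror `updateStrategy.changeBudget.maxUnavailable` but are simplified, not copied verbatim from the project):

```go
package main

import "fmt"

// ChangeBudget is a simplified stand-in for updateStrategy.changeBudget.
type ChangeBudget struct {
	MaxSurge       int
	MaxUnavailable int
}

// AllowsPVCReuse mirrors the rule described above: PVC reuse is only possible
// if we are allowed to temporarily drop below the target number of pods.
func (c ChangeBudget) AllowsPVCReuse() bool {
	return c.MaxUnavailable > 0
}

func main() {
	// A user requesting 5 nodes but tolerating 4 during a mutation opts in.
	fmt.Println(ChangeBudget{MaxUnavailable: 1}.AllowsPVCReuse()) // true

	// The default (maxUnavailable = 0) keeps PVC reuse disabled.
	fmt.Println(ChangeBudget{}.AllowsPVCReuse()) // false
}
```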
There are a few tricky points in this PR, which I'll highlight in the PR comments.
Some follow-up issues for things this PR does not cover:
Fixes #312.