Optimize cluster mutations with PVC reuse #1160
Conversation
As specified in the rolling upgrade doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html This allows indices to still be created while a rolling upgrade is in progress.
These are 2 different settings we want to control separately.
// in order to reuse their PVC for pods to be created.
func (c ChangeBudget) AllowsPVCReuse() bool {
	// only allow pod replacement (deletion first, then creation) if we can go below the target number of pods
	return c.MaxUnavailable > 0
I think we eventually need a better way to express that. See #1153
func (c *clientV6) DisableShardAllocation(ctx context.Context) error {
	allocationSetting := ClusterRoutingAllocation{AllocationSettings{Enable: "none"}}

func (c *clientV6) DisableReplicasShardAllocation(ctx context.Context) error {
	allocationSetting := ClusterRoutingAllocation{AllocationSettings{Enable: "primaries"}}
It's "the official way" from the documentation since a few ES versions, and I think it makes sense (we don't have to refuse new indices if we plan to shut down a node).
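For reference, here is a minimal sketch of the Elasticsearch cluster settings call that sits behind both functions: "none" blocks all shard allocation, while "primaries" still lets primaries (and therefore new indices) be allocated. This is not the operator's actual client code; `setAllocationEnable` and `esURL` are illustrative names.

```go
package allocation

import (
	"bytes"
	"fmt"
	"net/http"
)

// setAllocationEnable PUTs a transient cluster.routing.allocation.enable value
// ("none", "primaries" or "all") to the Elasticsearch cluster settings API.
func setAllocationEnable(esURL, value string) error {
	body := fmt.Sprintf(`{"transient": {"cluster.routing.allocation.enable": %q}}`, value)
	req, err := http.NewRequest(http.MethodPut, esURL+"/_cluster/settings", bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

// Typical rolling-restart usage: setAllocationEnable(esURL, "primaries") before
// shutting a node down, then setAllocationEnable(esURL, "all") once it has rejoined.
```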
	return results.WithResult(defaultRequeue)
}

if len(podsState.RunningReady) >= int(es.Spec.NodeCount()) {
I'm not 100% confident with this condition.
The problem is the following: when we replace a node (delete one, create another one), we need to:

1. disable shards allocation before deleting the node: this is to avoid all shards suddenly being replicated on all other nodes while the node is down
2. re-enable shards allocation once the node is back in the cluster

Step 1 is easy, since we know when we'll delete a node.
Step 2 is hard, since we don't keep track of which node was "restarted": how do we know that one of the nodes we have in the cluster at the moment actually comes from a rolling upgrade? Also, we want to avoid re-enabling shards allocation if the node is still in the process of being restarted.

The best I could come up with is checking the number of nodes in the cluster. In most cases, when doing a rolling upgrade we'll have fewer nodes than when the cluster is "complete": we must not enable shards allocation at this point. If we have the expected number of nodes, chances are all nodes are currently part of the cluster and no node is being rolled.
I'm not exactly sure how this could conflict with `maxSurge > 0`, where we could have the expected number of nodes even though one node is being restarted.
Open for better implementation ideas :)
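For illustration, a rough sketch of that heuristic, with the surrounding types paraphrased (`PodsState` and the function name below are simplified stand-ins, not the operator's real structures):

```go
package rolling

// PodsState is a simplified stand-in for the operator's view of current pods.
type PodsState struct {
	RunningReady map[string]struct{} // pods that are running and ready
}

// shouldReenableShardAllocation returns true when the observed cluster looks
// "complete": at least as many running-and-ready pods as the spec requests.
// Below that number, a node is most likely still being rolled, so shard
// allocation must stay disabled.
func shouldReenableShardAllocation(podsState PodsState, expectedNodeCount int) bool {
	return len(podsState.RunningReady) >= expectedNodeCount
}
```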
One refinement could be:

- empty change set means the cluster is stable -> enable shard allocation
- `maxSurge = 0` and `RunningReady == Spec.NodeCount()`: that is the reuse/restart-only case. We are good to enable shard allocation (btw, technically we would have to wait until the node joins the ES cluster, I believe)
- `maxSurge > 0` and `maxUnavailable > 0`: that is the tricky case
  - first question: why would users have that configured other than by mistake?
  - the only case I can think of is when you have a mixed topology of emptyDir/tmpfs and PVCs
  - we should probably in that case do the grow/shrink i.e. non-reuse changes first and only start with the reuse deletions once we have nothing to add and only PVC reuse deletions left (adjusting performable changes accordingly by adding another pass that enforces that invariant)
  - the problem remains that we can only guess that a reuse is in progress when replica shard allocation is disabled, and the user could change the spec at any time to add more nodes, in which case we would re-enable shard allocation too early, causing unnecessary IO. But maybe that is OK? The only alternative I see is resorting to annotations on the Elasticsearch resource or keeping state in memory about ongoing operations (which would be lost on crashes or restarts, leading again to unnecessary IO as a trade-off)
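To make the refinement concrete, here is a sketch of the decision it describes, assuming simplified inputs (the names below are illustrative, not the operator's actual API):

```go
package rolling

// ChangeBudget is a simplified stand-in for updateStrategy.changeBudget.
type ChangeBudget struct {
	MaxSurge       int
	MaxUnavailable int
}

// canEnableShardAllocation applies the rules sketched above.
func canEnableShardAllocation(pendingChanges int, budget ChangeBudget, runningReady, expectedNodes int) bool {
	if pendingChanges == 0 {
		// empty change set: the cluster is stable
		return true
	}
	if budget.MaxSurge == 0 && runningReady == expectedNodes {
		// reuse/restart-only case: all expected nodes are back
		return true
	}
	// tricky mixed case (maxSurge > 0 and maxUnavailable > 0): stay conservative
	return false
}
```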
// TODO: optimize by doing the call only if shards allocation is currently disabled
ctx, cancel := context.WithTimeout(context.Background(), esclient.DefaultReqTimeout)
defer cancel()
if err := esClient.EnableShardAllocation(ctx); err != nil {
The way things work now, we'll do this call at every single reconciliation (hopefully reconciliations don't happen all the time, but only when there are things to reconcile). We actually don't have to do the call if allocations are already enabled. To optimize, see #1161.
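A possible shape for that optimization, sketched with a hypothetical client interface (`GetShardAllocationEnable` is an illustrative method name, not necessarily what #1161 will introduce):

```go
package rolling

import "context"

// allocationClient is a hypothetical interface; method names are illustrative.
type allocationClient interface {
	// GetShardAllocationEnable returns the current value of
	// cluster.routing.allocation.enable, e.g. "all", "primaries" or "none".
	GetShardAllocationEnable(ctx context.Context) (string, error)
	EnableShardAllocation(ctx context.Context) error
}

// enableShardAllocationIfNeeded avoids issuing a settings update on every
// reconciliation by first checking whether allocation is already enabled.
func enableShardAllocationIfNeeded(ctx context.Context, c allocationClient) error {
	current, err := c.GetShardAllocationEnable(ctx)
	if err != nil {
		return err
	}
	if current == "all" || current == "" {
		// already enabled (or unset, which defaults to "all"): nothing to do
		return nil
	}
	return c.EnableShardAllocation(ctx)
}
```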
// Safety check we can move on with pod deletion
// TODO: check in-between nodes if there is more than 1 to delete, otherwise that
// check may not be valid for the 2nd node after the first one is deleted
// pod deletion for PVC reuse.
func (d *PodDeletionHandler) clusterReadyForNodesRestart() bool {
	// Safety check: move on with nodes shutdown only if the cluster is green
	// TODO: optimize by doing a per-node check instead of a cluster-wide health check:
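For context, a minimal sketch of that safety check using a raw call to the `_cluster/health` API rather than the operator's own client (the function name and `esURL` are illustrative):

```go
package rolling

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type clusterHealth struct {
	Status string `json:"status"` // "green", "yellow" or "red"
}

// clusterIsGreen calls the _cluster/health API and reports whether the cluster
// health is green, i.e. whether it is safe to shut a node down for PVC reuse.
func clusterIsGreen(esURL string) (bool, error) {
	resp, err := http.Get(esURL + "/_cluster/health")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var health clusterHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		return false, err
	}
	return health.Status == "green", nil
}
```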
@@ -135,3 +144,45 @@ func getAndRemoveMatchingPod(
		RemainingPods: podsWithConfig,
	}, nil
}

// calculatePodsToReplace modifies changes to account for pods to delete
comment out of sync with function name?
}

// Then delete pods which will not be replaced hence need data to be migrated
if err := d.deleteWithDataMigration(d.performableChanges.ToDelete.WithoutPVCReuse()); err != nil {
I don't think we can combine deletion with data migration and deletion with PVC reuse. Aren't the allocation settings for both operations mutually exclusive?
If you disable shard allocation for the reuse algo (even if it's only replicas), how are you going to migrate data away from nodes that have data on them?
So I think what needs to happen is that you need to check whether one of these two operations is in progress and finish that one before starting with the other.
So while reuse is running (i.e. shard allocation disabled), no attempt to delete with migration, and vice versa.
Also sounds like we are missing an e2e test for that.
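One way to express that mutual exclusion, sketched with placeholder types (how "reuse is in progress" gets detected, whether via the allocation setting or an annotation, is left open):

```go
package rolling

// deletions is an illustrative grouping of the two kinds of pod deletions.
type deletions struct {
	withPVCReuse  []string // pods to delete and recreate reusing their PVC
	withMigration []string // pods to delete only after migrating data away
}

// nextDeletions returns the only kind of deletion allowed to proceed:
// finish whichever operation is already in progress before starting the other,
// since their shard allocation settings conflict.
func nextDeletions(d deletions, reuseInProgress, migrationInProgress bool) []string {
	switch {
	case reuseInProgress:
		return d.withPVCReuse
	case migrationInProgress:
		return d.withMigration
	case len(d.withPVCReuse) > 0:
		return d.withPVCReuse
	default:
		return d.withMigration
	}
}
```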
> If you disable shard allocation for the reuse algo (even if it's only replicas), how are you going to migrate data away from nodes that have data on them?

Yes, I also thought about that. We discussed it very briefly with @nkvoll, and realized maybe it's fine to have both in parallel: PVC reuse will temporarily block ongoing data migrations, but that's supposed to be temporary. But thinking this through again, you're right, it does not work well. If we use the maxSurge budget to create new nodes (grow and shrink), effectively taking the place of a node to replace (PVC reuse), then we might have a deadlock.

> So while reuse is running (i.e. shard allocation disabled), no attempt to delete with migration, and vice versa.

Would you check that "reuse is running" by retrieving the ES shards allocation setting?
It's quite hard to know we are in a reuse phase in the current implementation: there is technically no difference between creating a pod for grow and shrink and creating a pod for PVC reuse (which, by design, sounds nice?).
I'm wondering if persisting a state in the cluster annotations would help here.

> Also sounds like we are missing an e2e test for that.

+1. I planned to add it with #1156 (https://github.com/elastic/cloud-on-k8s/issues/1156), but maybe it's major enough to deserve being in this PR.
> I'm wondering if persisting a state in the cluster annotations would help here.

I was trying to find a way without resorting to annotations, but in the end it could be necessary to introduce one (see my other comment further down).

> but maybe it's major enough to deserve being in this PR.

I think I am fine with doing it in #1156.
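If an annotation is introduced, it could be as simple as the sketch below (the annotation key and helper names are made up for illustration; they are not part of the project):

```go
package rolling

// pvcReuseAnnotation is a made-up annotation key, for illustration only.
const pvcReuseAnnotation = "elasticsearch.k8s.elastic.co/pvc-reuse-in-progress"

// markPVCReuseInProgress sets the marker on the resource's annotations map,
// so the "reuse in progress" state survives operator restarts.
func markPVCReuseInProgress(annotations map[string]string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[pvcReuseAnnotation] = "true"
	return annotations
}

// pvcReuseInProgress reports whether the marker annotation is set.
func pvcReuseInProgress(annotations map[string]string) bool {
	return annotations[pvcReuseAnnotation] == "true"
}
```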
We agreed on relying on StatefulSets in #1173, which makes this PR obsolete in the longer term.
When doing any cluster mutation (configuration change, resources changes, spec changes, etc.), attempt to optimize the mutation by finding a match between pods to delete and pods to create.
When such match is found, delete the corresponding pod, but not its PVC. The PVC can then be reused for the pod to create, using our orphan PVC mechanism, already in place.
This is only enabled if the user specifies a value for `updateStrategy.changeBudget.maxUnavailable` larger than 0. This setting means "it's OK to delete a pod and go below the target number of pods", for example: "it's OK to temporarily have only 4 nodes in my cluster even though I requested 5". This concept maps quite well to reusing PVCs, since we need to first delete a node, then create another one with the same PVC.
This setting is not enabled by default right now, so reusing PVCs is an opt-in feature.
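As a rough illustration of the opt-in rule (the types below mirror `updateStrategy.changeBudget.maxUnavailable` but are simplified, not copied verbatim from the project):

```go
package main

import "fmt"

// ChangeBudget is a simplified stand-in for updateStrategy.changeBudget.
type ChangeBudget struct {
	MaxSurge       int
	MaxUnavailable int
}

// AllowsPVCReuse mirrors the rule described above: PVC reuse is only possible
// if we are allowed to temporarily drop below the target number of pods.
func (c ChangeBudget) AllowsPVCReuse() bool {
	return c.MaxUnavailable > 0
}

func main() {
	// A user requesting 5 nodes but tolerating 4 during a mutation opts in.
	fmt.Println(ChangeBudget{MaxUnavailable: 1}.AllowsPVCReuse()) // true

	// The default (maxUnavailable = 0) keeps PVC reuse disabled.
	fmt.Println(ChangeBudget{}.AllowsPVCReuse()) // false
}
```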
There are a few tricky points in this PR, which I'll highlight in the PR comments.
Some follow-up issues for things this PR does not cover:
Fixes #312.