
Optimize cluster mutations with PVC reuse #1160

Closed · wants to merge 11 commits

Conversation

@sebgl (Contributor) commented Jun 27, 2019

When doing any cluster mutation (configuration change, resource changes, spec changes, etc.), attempt to optimize the mutation by finding a match between pods to delete and pods to create.
When such a match is found, delete the corresponding pod but not its PVC. The PVC can then be reused for the pod to create, through the orphan-PVC mechanism already in place.
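
A very simplified sketch of that matching idea (the types and the pairing logic below are illustrative, not the actual implementation in this PR):

// Simplified, illustrative types: not the PR's actual data structures.
type podToDelete struct {
	Name    string
	PVCName string
}

type podToCreate struct {
	Name        string
	ReusablePVC string // set when an orphaned PVC can be picked up by this pod
}

// matchPodsForPVCReuse pairs pods to delete with pods to create. For each pair,
// the pod is deleted but its PVC is kept; the pod to create then reuses that
// orphaned PVC instead of provisioning a new volume.
func matchPodsForPVCReuse(toDelete []podToDelete, toCreate []podToCreate) []podToCreate {
	for i := range toCreate {
		if i >= len(toDelete) {
			break
		}
		// naive pairing by index; the real matching must compare pod specs to
		// make sure the reused volume is compatible with the pod to create
		toCreate[i].ReusablePVC = toDelete[i].PVCName
	}
	return toCreate
}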

This is only enabled if the user specifies a value for updateStrategy.changeBudget.maxUnavailable larger than 0. This setting means "it's OK to delete a pod and go below the target number of pods". For example: "it's OK to temporarily have only 4 nodes in my cluster even though I requested 5".
This concept maps quite well to PVC reuse, since we first need to delete a node, then create another one with the same PVC.
This setting is not enabled by default right now, so reusing PVCs is an opt-in feature.

There are a few tricky points in this PR, which I'll highlight in the PR comments.

Some follow-up issues for things this PR does not cover:

Fixes #312.

// AllowsPVCReuse returns true if pods can be deleted before their replacements are created,
// in order to reuse their PVC for pods to be created.
func (c ChangeBudget) AllowsPVCReuse() bool {
	// only allow pod replacement (deletion first, then creation) if we can go below the target number of pods
	return c.MaxUnavailable > 0
}

@sebgl (Author):
I think we eventually need a better way to express that. See #1153

func (c *clientV6) DisableShardAllocation(ctx context.Context) error {
	allocationSetting := ClusterRoutingAllocation{AllocationSettings{Enable: "none"}}

func (c *clientV6) DisableReplicasShardAllocation(ctx context.Context) error {
	allocationSetting := ClusterRoutingAllocation{AllocationSettings{Enable: "primaries"}}

@sebgl (Author):
It's "the official way" in the documentation since a few ES versions now, and I think it makes sense (we don't have to refuse new indices just because we plan to shut down a node).
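
For context, a small sketch of the transient cluster settings body such a call would send. The Go struct shapes below are assumptions for illustration; only the resulting JSON (the cluster.routing.allocation.enable setting) is the documented ES API:

import "encoding/json"

// Assumed Go shapes for the settings payload; type and field names are illustrative.
type allocationSettings struct {
	Enable string `json:"cluster.routing.allocation.enable,omitempty"`
}

type clusterRoutingAllocation struct {
	Transient allocationSettings `json:"transient"`
}

// primariesOnlyAllocationBody builds the body for PUT _cluster/settings, i.e.
// {"transient":{"cluster.routing.allocation.enable":"primaries"}}.
// "primaries" still allows primary shards to be allocated (so new indices can be
// created) while preventing replicas from being shuffled around during a short
// node restart.
func primariesOnlyAllocationBody() ([]byte, error) {
	return json.Marshal(clusterRoutingAllocation{
		Transient: allocationSettings{Enable: "primaries"},
	})
}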

return results.WithResult(defaultRequeue)
}

if len(podsState.RunningReady) >= int(es.Spec.NodeCount()) {

@sebgl (Author):
I'm not 100% confident with this condition.
The problem is the following:

When we replace a node (delete one, create another one), we need to:

  1. disable shards allocation before deleting the node: this is to avoid all shards suddenly being replicated to the other nodes while the node is down

  2. re-enable shards allocation once the node is back in the cluster

Step 1 is easy, since we know when we'll delete a node.

Step 2 is hard, since we don't keep track of which node was "restarted": how do we know that one of the nodes currently in the cluster actually comes from a rolling upgrade?

Also, we want to avoid re-enabling shards allocation if the node is still in the process of being restarted.
The best I could come up with is checking the number of nodes in the cluster. In most cases, during a rolling upgrade we'll have fewer nodes than when the cluster is "complete": we must not enable shards allocation at that point. If we have the expected number of nodes, chances are all nodes are currently part of the cluster and no node is being rolled.

I'm not exactly sure how this could conflict with maxSurge > 0, where we could have the expected number of nodes even though one node is being restarted.

Open for better implementation ideas :)

Collaborator:

One refinement could be:

  • empty change set means the cluster is stable -> enable shard allocation
  • maxSurge = 0 and RunningReady == Spec.NodeCount(): that is the reuse/restart-only case. We are good to enable shard allocation (technically we would have to wait until the node joins the ES cluster, I believe)
  • maxSurge > 0 and maxUnavailable > 0: that is the tricky case
    • first question: why would users have that configured other than by mistake?
      • the only case I can think of is a mixed topology of emptyDir/tmpfs and PVCs
    • we should probably, in that case, do the grow/shrink (i.e. non-reuse) changes first, and only start the reuse deletions once we have nothing to add and only PVC-reuse deletions left (adjusting performable changes accordingly, by adding another pass that enforces that invariant)
    • the problem remains that we can only guess that a reuse is in progress when replica shard allocation is disabled, and the user could change the spec at any time to add more nodes, in which case we would re-enable shard allocation too early, causing unnecessary IO. But maybe that is OK? The only alternative I see is resorting to annotations on the Elasticsearch resource, or keeping state in memory about ongoing operations (which would be lost on crashes or restarts, leading again to unnecessary IO as a trade-off)
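
A self-contained sketch of the decision logic outlined above, using simplified, hypothetical types (the real change-set and pods-state structures are richer than this):

// Simplified, hypothetical stand-in for the operator's performable changes.
type changeSet struct {
	toCreate          int // pods still to be created
	pvcReuseDeletions int // pods to delete whose PVC will be reused
	otherDeletions    int // pods to delete with data migration
}

func (c changeSet) empty() bool {
	return c.toCreate == 0 && c.pvcReuseDeletions == 0 && c.otherDeletions == 0
}

// shouldEnableShardAllocation applies the refinement described in the list above.
func shouldEnableShardAllocation(changes changeSet, runningReady, expectedNodes, maxSurge int) bool {
	if changes.empty() {
		// empty change set: the cluster is stable, safe to re-enable allocation
		return true
	}
	if maxSurge == 0 && runningReady == expectedNodes {
		// reuse/restart-only case: all expected nodes are running and ready
		// (ideally we would also wait for the restarted node to join the ES cluster)
		return true
	}
	// tricky case (maxSurge > 0, maxUnavailable > 0): do grow/shrink first, and only
	// consider re-enabling once nothing is left to add and no PVC-reuse deletion is pending
	return changes.toCreate == 0 && changes.pvcReuseDeletions == 0
}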

// TODO: optimize by doing the call only if shards allocation is currently disabled
ctx, cancel := context.WithTimeout(context.Background(), esclient.DefaultReqTimeout)
defer cancel()
if err := esClient.EnableShardAllocation(ctx); err != nil {

@sebgl (Author), Jun 27, 2019:

The way things work now, we'll make this call at every single reconciliation (hopefully reconciliations don't happen all the time, but only when there is something to reconcile). We actually don't have to make the call if allocation is already enabled. See #1161 for that optimization.
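
A minimal sketch of that optimization, assuming a hypothetical GetClusterRoutingAllocation helper on the ES client (the actual client may expose this differently):

import "context"

// esAllocationClient is a hypothetical minimal interface for this sketch;
// GetClusterRoutingAllocation is not an existing method of the PR's client.
type esAllocationClient interface {
	// GetClusterRoutingAllocation returns the current value of
	// cluster.routing.allocation.enable ("" if unset).
	GetClusterRoutingAllocation(ctx context.Context) (string, error)
	EnableShardAllocation(ctx context.Context) error
}

// enableShardAllocationIfNeeded only issues the settings call when allocation
// is not already enabled, avoiding a write on every reconciliation.
func enableShardAllocationIfNeeded(ctx context.Context, c esAllocationClient) error {
	current, err := c.GetClusterRoutingAllocation(ctx)
	if err != nil {
		return err
	}
	if current == "" || current == "all" {
		// already enabled (or default setting): nothing to do
		return nil
	}
	return c.EnableShardAllocation(ctx)
}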


// Safety check we can move on with pod deletion
// TODO: check in-between nodes if there is more than 1 to delete, otherwise that
// check may not be valid for the 2nd node after the first one is deleted

// pod deletion for PVC reuse.
func (d *PodDeletionHandler) clusterReadyForNodesRestart() bool {
// Safety check: move on with nodes shutdown only if the cluster is green
// TODO: optimize by doing a per-node check instead of a cluster-wide health check:

@@ -135,3 +144,45 @@ func getAndRemoveMatchingPod(
RemainingPods: podsWithConfig,
}, nil
}

// calculatePodsToReplace modifies changes to account for pods to delete

Collaborator:

comment out of sync with function name?

}

// Then delete pods which will not be replaced hence need data to be migrated
if err := d.deleteWithDataMigration(d.performableChanges.ToDelete.WithoutPVCReuse()); err != nil {

Collaborator:

I don't think we can combine deletion with data migration and deletion with PVC reuse. Aren't the allocation settings for both operations mutually exclusive?

If you disable shard allocation for the reuse algorithm (even if it's only for replicas), how are you going to migrate data away from nodes that still have data on them?

So I think what needs to happen is that you check whether one of these two operations is in progress, and finish that one before starting the other.

So while reuse is running (i.e. shard allocation disabled), there should be no attempt to delete with migration, and vice versa.

Also sounds like we are missing an e2e test for that.
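
A rough sketch of that mutual exclusion, as a pure decision function; the "in progress" signals and mode names are hypothetical stand-ins for whatever the operator would actually rely on (allocation settings, annotations, etc.):

// deletionMode is the kind of pod deletion allowed in the current reconciliation.
type deletionMode int

const (
	deleteNothing deletionMode = iota
	deleteForPVCReuse
	deleteWithDataMigration
)

// pickDeletionMode makes sure PVC-reuse deletions and migration-based deletions
// are never attempted at the same time: whichever operation is already in
// progress must finish before the other one starts.
func pickDeletionMode(reuseInProgress, migrationInProgress, havePVCReuseDeletions, haveMigrationDeletions bool) deletionMode {
	switch {
	case reuseInProgress && havePVCReuseDeletions:
		// finish ongoing PVC-reuse deletions before migrating any data away
		return deleteForPVCReuse
	case migrationInProgress && haveMigrationDeletions:
		// finish ongoing migration-based deletions before disabling allocation for reuse
		return deleteWithDataMigration
	case haveMigrationDeletions:
		// nothing in progress: arbitrarily start with migration-based deletions
		return deleteWithDataMigration
	case havePVCReuseDeletions:
		return deleteForPVCReuse
	default:
		return deleteNothing
	}
}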

@sebgl (Author):

> If you disable shard allocation for the reuse algorithm (even if it's only for replicas), how are you going to migrate data away from nodes that still have data on them?

Yes, I also thought about that. We discussed it very briefly with @nkvoll, and figured it might be fine to have both in parallel: PVC reuse would block ongoing data migrations, but only temporarily. Thinking it through again, though, you're right that it does not work well. If we use the maxSurge budget to create new nodes (grow and shrink) that effectively take the place of a node to replace (PVC reuse), then we might have a deadlock.

> So while reuse is running (i.e. shard allocation disabled), there should be no attempt to delete with migration, and vice versa.

Would you detect that "reuse is running" by retrieving the ES shard allocation setting?
It's quite hard to know we are in a reuse phase in the current implementation: there is technically no difference between creating a pod for grow-and-shrink and creating a pod for PVC reuse (which, by design, sounds nice?).
I'm wondering if persisting state in the cluster annotations would help here.

> Also sounds like we are missing an e2e test for that.

+1. I planned to add it with #1156 (https://github.com/elastic/cloud-on-k8s/issues/1156), but maybe it's major enough to deserve being in this PR.
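
A minimal sketch of the annotation idea; the annotation key and helper names are illustrative, not from this PR:

// Hypothetical annotation used to persist the "PVC reuse in progress" state on
// the Elasticsearch resource, so it survives operator restarts.
const pvcReuseInProgressAnnotation = "elasticsearch.k8s.elastic.co/pvc-reuse-in-progress"

// markPVCReuseInProgress records (or clears) the ongoing reuse operation in the
// resource's annotations; the caller then updates the resource via the API server.
func markPVCReuseInProgress(annotations map[string]string, inProgress bool) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	if inProgress {
		annotations[pvcReuseInProgressAnnotation] = "true"
	} else {
		delete(annotations, pvcReuseInProgressAnnotation)
	}
	return annotations
}

func isPVCReuseInProgress(annotations map[string]string) bool {
	return annotations[pvcReuseInProgressAnnotation] == "true"
}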

@pebrc (Collaborator), Jul 1, 2019:

> I'm wondering if persisting state in the cluster annotations would help here.

I was trying to find a way without resorting to annotations, but in the end it may be necessary to introduce one (see my other comment above about re-enabling shard allocation).

> but maybe it's major enough to deserve being in this PR.

I think I am fine with doing it in #1156.

@sebgl (Author) commented Jul 4, 2019:

We agreed on relying on StatefulSets in #1173, which makes this PR obsolete in the longer term.
I think it's better not to merge it short-term (it could have bugs?), and to wait for StatefulSets support instead to get proper PVC reuse.
Closing the PR now.

@sebgl closed this on Jul 4, 2019