Replica allocation consider no-op #42518
Conversation
This is a first step away from sync-ids. We now check if replica and primary are identical using sequence numbers when determining where to allocate a replica shard.

If an index is no longer indexed into, issuing a regular flush will now be enough to ensure a no-op recovery is done.

This has the nice side-effect of ensuring that closed indices and frozen indices choose existing shard copies with identical data over file-overlap comparison, increasing the chance that we end up doing a no-op recovery (only no-op and file-based recovery is supported by closed indices).

Relates elastic#41400 and elastic#33888

Supersedes elastic#41784
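To illustrate the preference described above, here is a hedged sketch (the `Candidate` record, field names, and ranking helper are invented for illustration; they are not the real allocator classes): a candidate whose sequence numbers match the primary exactly should win over one that merely shares many segment files.

```java
import java.util.Comparator;
import java.util.List;

public class ReplicaCandidateSketch {
    // Toy stand-in for the per-node shard-copy metadata the allocator compares.
    record Candidate(String nodeId, boolean seqNoIdentical, long matchingFileBytes) {}

    // Rank a seq-no-identical copy (no-op recovery possible) above any amount
    // of file overlap; fall back to file overlap only as a tie-breaker.
    static Candidate pickBest(List<Candidate> candidates) {
        return candidates.stream()
            .max(Comparator.comparing(Candidate::seqNoIdentical)
                .thenComparingLong(Candidate::matchingFileBytes))
            .orElseThrow();
    }

    public static void main(String[] args) {
        Candidate fileMatch = new Candidate("node-a", false, 5_000_000);
        Candidate identical = new Candidate("node-b", true, 1_000);
        // The identical copy wins despite far fewer matching file bytes.
        System.out.println(pickBest(List.of(fileMatch, identical)).nodeId());
    }
}
```

Under this ranking, a flushed copy of a quiescent index is always preferred, which is what makes the no-op recovery path likely for closed and frozen indices.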
Pinging @elastic/es-distributed
Hopefully this makes the test succeed in CI too.
We now take the lock during file cleanup to protect snapshotRecoveryMetadata from seeing half-copied data. snapshotRecoveryMetadata now handles peer recovery and existing-store recovery specifically, returning an empty snapshot for other recovery types (local shards, restore from snapshot).
This looks great. Thanks @henningandersen. Would you mind splitting this PR into multiple smaller pieces?
/**
 * We test that a closed index makes no-op replica allocation only.
 */
public void testClosedIndexReplicaAllocation() throws Exception {
I think this test passes with the current behaviour. Can we make a small PR for this test only?
/**
 * Whenever we see a new data node, we clear the information we have on primary to ensure it is at least as recent as the start
 * of the new node. This reduces risk of making a decision on stale information from primary.
 */
private void ensureAsyncFetchStorePrimaryRecency(RoutingAllocation allocation) {
Can you make a separate PR for this enhancement?
return primaryStore.hasSeqNoInfo()
    && primaryStore.maxSeqNo() == candidateStore.maxSeqNo()
    && primaryStore.provideRecoverySeqNo() <= candidateStore.requireRecoverySeqNo()
    && candidateStore.requireRecoverySeqNo() == primaryStore.maxSeqNo() + 1;
Not sure if we need the last condition?
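The question can be explored with toy values. Below is a hedged sketch (the `StoreSeqNos` record and its fields are invented stand-ins, not the real store metadata classes) that exercises the quoted predicate; with a primary able to replay its history from seq-no 0, only the last clause rejects a copy that has a gap below maxSeqNo:

```java
public class SyncedCheckSketch {
    // provideRecoverySeqNo: first op the primary can replay from its history.
    // requireRecoverySeqNo: first op the candidate still needs (local checkpoint + 1).
    record StoreSeqNos(boolean hasSeqNoInfo, long maxSeqNo,
                       long provideRecoverySeqNo, long requireRecoverySeqNo) {}

    // Same shape as the predicate quoted in the review.
    static boolean isSynced(StoreSeqNos primaryStore, StoreSeqNos candidateStore) {
        return primaryStore.hasSeqNoInfo()
            && primaryStore.maxSeqNo() == candidateStore.maxSeqNo()
            && primaryStore.provideRecoverySeqNo() <= candidateStore.requireRecoverySeqNo()
            && candidateStore.requireRecoverySeqNo() == primaryStore.maxSeqNo() + 1;
    }

    public static void main(String[] args) {
        // Primary with ops up to 41, able to replay from seq-no 0.
        StoreSeqNos primary = new StoreSeqNos(true, 41, 0, 42);
        // Candidate that already holds everything: requires op 42 next.
        StoreSeqNos identical = new StoreSeqNos(true, 41, 0, 42);
        // Candidate with the same maxSeqNo but a gap below it: requires op 40.
        StoreSeqNos gappy = new StoreSeqNos(true, 41, 0, 40);
        System.out.println(isSynced(primary, identical)); // accepted
        System.out.println(isSynced(primary, gappy));     // rejected by the last clause
    }
}
```

In this toy setup the third clause alone would accept the gappy copy, so whether the last clause is redundant depends on what the real provideRecoverySeqNo is guaranteed to be.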
/**
 * Finalize index recovery. Manipulate store files, clean up old files, generate new empty translog and do other
 * housekeeping for retention leases.
 */
public void finalizeIndexRecovery(CheckedRunnable<IOException> manipulateStore, long globalCheckpoint,
Can you also make a separate PR for this enhancement?
Thanks for reviewing @dnhatn, I have marked this WIP and will split it into multiple PRs (and then close this one).
Today, we don't clear the shard info of the primary shard when a new node joins, so we risk making replica allocation decisions based on stale information about the primary. The serious problem is that we can cancel an ongoing recovery that is more advanced than the copy on the new node because of the old info we have from the primary. With this change, we ensure the shard info from the primary is not older than any node when allocating replicas. Relates #46959. This work was done by Henning in #42518. Co-authored-by: Henning Andersen <henning.andersen@elastic.co>
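The mechanism above can be sketched in a few lines. This is a hedged illustration under invented names (`ensurePrimaryRecency`, `cachedPrimaryInfo`, and the plain `Set<String>` of node ids are not the real `RoutingAllocation` API): seeing any node we have not seen before invalidates the cached primary info so it is re-fetched.

```java
import java.util.HashSet;
import java.util.Set;

public class PrimaryRecencySketch {
    private final Set<String> knownDataNodes = new HashSet<>();
    private String cachedPrimaryInfo; // null means "must re-fetch from the primary"

    // Called before each replica-allocation round with the current node set.
    void ensurePrimaryRecency(Set<String> currentDataNodes) {
        // addAll returns true iff at least one node is new to us; cached info
        // may predate that node, so drop it rather than risk a stale decision.
        if (knownDataNodes.addAll(currentDataNodes)) {
            cachedPrimaryInfo = null;
        }
    }

    public static void main(String[] args) {
        PrimaryRecencySketch s = new PrimaryRecencySketch();
        s.ensurePrimaryRecency(Set.of("node-1", "node-2"));
        s.cachedPrimaryInfo = "primary seq-nos fetched at t0";
        s.ensurePrimaryRecency(Set.of("node-1", "node-2"));           // no new node
        System.out.println(s.cachedPrimaryInfo != null);              // cache kept
        s.ensurePrimaryRecency(Set.of("node-1", "node-2", "node-3")); // node-3 joined
        System.out.println(s.cachedPrimaryInfo != null);              // cache cleared
    }
}
```

The cost of the occasional extra fetch is small compared to cancelling a recovery that was already further along than the new node's copy.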