Replica allocation consider no-op #42518
Conversation
This is a first step away from sync-ids. We now check if replica and primary are identical using sequence numbers when determining where to allocate a replica shard.

If an index is no longer indexed into, issuing a regular flush will now be enough to ensure a no-op recovery is done.

This has the nice side-effect of ensuring that closed indices and frozen indices choose existing shard copies with identical data over file-overlap comparison, increasing the chance that we end up doing a no-op recovery (only no-op and file-based recovery is supported by closed indices).

Relates elastic#41400 and elastic#33888

Supersedes elastic#41784
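To illustrate the preference described above, here is a hedged sketch (the `Candidate` record, field names, and ranking helper are invented for illustration; they are not the real allocator classes): a candidate whose sequence numbers match the primary exactly should win over one that merely shares many segment files.

```java
import java.util.Comparator;
import java.util.List;

public class ReplicaCandidateSketch {
    // Toy stand-in for the per-node shard-copy metadata the allocator compares.
    record Candidate(String nodeId, boolean seqNoIdentical, long matchingFileBytes) {}

    // Rank a seq-no-identical copy (no-op recovery possible) above any amount
    // of file overlap; fall back to file overlap only as a tie-breaker.
    static Candidate pickBest(List<Candidate> candidates) {
        return candidates.stream()
            .max(Comparator.comparing(Candidate::seqNoIdentical)
                .thenComparingLong(Candidate::matchingFileBytes))
            .orElseThrow();
    }

    public static void main(String[] args) {
        Candidate fileMatch = new Candidate("node-a", false, 5_000_000);
        Candidate identical = new Candidate("node-b", true, 1_000);
        // The identical copy wins despite far fewer matching file bytes.
        System.out.println(pickBest(List.of(fileMatch, identical)).nodeId());
    }
}
```

Under this ranking, a flushed copy of a quiescent index is always preferred, which is what makes the no-op recovery path likely for closed and frozen indices.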
Pinging @elastic/es-distributed
Hopefully this makes the test succeed in CI too.
We now take the lock during file cleanup to protect snapshotRecoveryMetadata from seeing half-copied data. snapshotRecoveryMetadata now handles peer recovery and existing-store recovery specifically, returning an empty snapshot for other recovery types (local shards, restore from snapshot).
This looks great. Thanks @henningandersen. Would you mind splitting this PR into multiple smaller pieces?
/**
 * We test that a closed index makes no-op replica allocation only.
 */
public void testClosedIndexReplicaAllocation() throws Exception {
I think this test passes with the current behaviour. Can we make a small PR for this test only?
/**
 * Whenever we see a new data node, we clear the information we have on primary to ensure it is at least as recent as the start
 * of the new node. This reduces risk of making a decision on stale information from primary.
 */
private void ensureAsyncFetchStorePrimaryRecency(RoutingAllocation allocation) {
Can you make a separate PR for this enhancement?
return primaryStore.hasSeqNoInfo()
    && primaryStore.maxSeqNo() == candidateStore.maxSeqNo()
    && primaryStore.provideRecoverySeqNo() <= candidateStore.requireRecoverySeqNo()
    && candidateStore.requireRecoverySeqNo() == primaryStore.maxSeqNo() + 1;
Not sure if we need the last condition?
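The question can be explored with toy values. Below is a hedged sketch (the `StoreSeqNos` record and its fields are invented stand-ins, not the real store metadata classes) that exercises the quoted predicate; with a primary able to replay its history from seq-no 0, only the last clause rejects a copy that has a gap below maxSeqNo:

```java
public class SyncedCheckSketch {
    // provideRecoverySeqNo: first op the primary can replay from its history.
    // requireRecoverySeqNo: first op the candidate still needs (local checkpoint + 1).
    record StoreSeqNos(boolean hasSeqNoInfo, long maxSeqNo,
                       long provideRecoverySeqNo, long requireRecoverySeqNo) {}

    // Same shape as the predicate quoted in the review.
    static boolean isSynced(StoreSeqNos primaryStore, StoreSeqNos candidateStore) {
        return primaryStore.hasSeqNoInfo()
            && primaryStore.maxSeqNo() == candidateStore.maxSeqNo()
            && primaryStore.provideRecoverySeqNo() <= candidateStore.requireRecoverySeqNo()
            && candidateStore.requireRecoverySeqNo() == primaryStore.maxSeqNo() + 1;
    }

    public static void main(String[] args) {
        // Primary with ops up to 41, able to replay from seq-no 0.
        StoreSeqNos primary = new StoreSeqNos(true, 41, 0, 42);
        // Candidate that already holds everything: requires op 42 next.
        StoreSeqNos identical = new StoreSeqNos(true, 41, 0, 42);
        // Candidate with the same maxSeqNo but a gap below it: requires op 40.
        StoreSeqNos gappy = new StoreSeqNos(true, 41, 0, 40);
        System.out.println(isSynced(primary, identical)); // accepted
        System.out.println(isSynced(primary, gappy));     // rejected by the last clause
    }
}
```

In this toy setup the third clause alone would accept the gappy copy, so whether the last clause is redundant depends on what the real provideRecoverySeqNo is guaranteed to be.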
/**
 * Finalize index recovery. Manipulate store files, clean up old files, generate new empty translog and do other
 * housekeeping for retention leases.
 */
public void finalizeIndexRecovery(CheckedRunnable<IOException> manipulateStore, long globalCheckpoint,
Can you also make a separate PR for this enhancement?
Thanks for reviewing @dnhatn, I have marked this WIP and will split it into multiple PRs (and then close this one).
Today, we don't clear the shard info of the primary shard when a new node joins, so we risk making replica allocation decisions based on stale information about the primary. The serious problem is that we can cancel an ongoing recovery that is more advanced than the copy on the new node because of the old info we have from the primary. With this change, we ensure the shard info from the primary is not older than any node when allocating replicas. Relates #46959. This work was done by Henning in #42518. Co-authored-by: Henning Andersen <henning.andersen@elastic.co>
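The mechanism above can be sketched in a few lines. This is a hedged illustration under invented names (`ensurePrimaryRecency`, `cachedPrimaryInfo`, and the plain `Set<String>` of node ids are not the real `RoutingAllocation` API): seeing any node we have not seen before invalidates the cached primary info so it is re-fetched.

```java
import java.util.HashSet;
import java.util.Set;

public class PrimaryRecencySketch {
    private final Set<String> knownDataNodes = new HashSet<>();
    private String cachedPrimaryInfo; // null means "must re-fetch from the primary"

    // Called before each replica-allocation round with the current node set.
    void ensurePrimaryRecency(Set<String> currentDataNodes) {
        // addAll returns true iff at least one node is new to us; cached info
        // may predate that node, so drop it rather than risk a stale decision.
        if (knownDataNodes.addAll(currentDataNodes)) {
            cachedPrimaryInfo = null;
        }
    }

    public static void main(String[] args) {
        PrimaryRecencySketch s = new PrimaryRecencySketch();
        s.ensurePrimaryRecency(Set.of("node-1", "node-2"));
        s.cachedPrimaryInfo = "primary seq-nos fetched at t0";
        s.ensurePrimaryRecency(Set.of("node-1", "node-2"));           // no new node
        System.out.println(s.cachedPrimaryInfo != null);              // cache kept
        s.ensurePrimaryRecency(Set.of("node-1", "node-2", "node-3")); // node-3 joined
        System.out.println(s.cachedPrimaryInfo != null);              // cache cleared
    }
}
```

The cost of the occasional extra fetch is small compared to cancelling a recovery that was already further along than the new node's copy.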