[Remote Store] Primary term validation with replicas - New approach POC #5033

Closed
Tracked by #4507 ...
ashking94 opened this issue Nov 2, 2022 · 3 comments
Labels: distributed framework, enhancement, Indexing & Search, Storage:Durability, v2.5.0

Comments


ashking94 commented Nov 2, 2022

Is your feature request related to a problem? Please describe.

Primary term validation with replicas

In reference to #3706, with segment replication and a remote store for the translog, storing the translog on replicas becomes obsolete. Beyond that, the in-sync replication call made to the replicas during a write also becomes obsolete. The replication call serves two purposes: 1) replicating the data for durability, and 2) primary term validation. While the first is taken care of by using the remote store for the translog, the second still needs to be handled.

Challenges

Until now, the plan was to achieve the no-op by modifying the TransportBulk/WriteAction call and making it a no-op. Although no data would be stored, one concern remained: these replication calls modify the replication tracker state on the replica shard. With the older approach, we would have needed a no-op replication tracker proxy and would have had to cut off every call to replicas that updates the replication tracker. This is to make sure that no state is updated on the replica - neither the data (segments/translog) nor the (global/local) checkpoints - during the replication call or the async calls. This is cumbersome to implement and leaves the code intertwined, with a lot of logic around when to update the checkpoints (replication tracker) and when not to. This approach is messier. The following things hinder or add complexity with the older approach -

  • Global/Local checkpoint calculation - Whenever a shard is marked as inSync in the CheckpointState, it starts to contribute to the calculation of the global checkpoint on the primary. Since the translog is not written on the replicas, the global checkpoint calculation on the primary should only consider its own local checkpoint and not concern itself with the replicas. The local checkpoint on replicas has no meaning since the data is replicated using segment replication.
  • Retention Leases - Whenever a shard is marked as tracked in the CheckpointState, there is an implicit expectation of a corresponding retention lease for the shard. Retention leases are useful for sequence-number-based peer recovery. With the translog on the remote store, retention leases do not make sense for recovering replicas. However, they are still required for primary-primary peer recovery.
    • The assertInvariant() method inside ReplicationTracker expects a PRRL to be present for each shard that is tracked in the CheckpointState of the ReplicationTracker. For replicas, since we no longer have translogs present locally, a PRRL is not required anymore; however, if a shard is tracked, it is implied that a PRRL exists.
  • Every action that extends TransportReplicationAction is a replicated call which is first performed on the primary and then fanned out to the active replicas. These calls perform some activity on the primary and the same or a different activity on the replica. The most common ones are TransportWriteAction and TransportShardBulkAction, which no longer need to be fanned out since we do not need the translogs written on replicas anymore. Below are the other actions (which may or may not be required with the remote store for translog) -
    • GlobalCheckpointAction - This syncs global checkpoint to replicas. This is not required for remote replicas.
    • PublishCheckpointAction - This publishes the checkpoint for segment replication on account of refreshes. It is a no-op on the primary and is relevant for replicas to kick off segment replication asynchronously. This is required.
    • RetentionLeaseBackgroundSyncAction - On the primary, it persists the retention lease state on disk. On the replica, it currently copies over all the retention leases from the primary and persists them on disk. This is a background job that is scheduled periodically. There is an open item on whether to remove this action altogether. This might be required for replicas.
    • TransportShardFlushAction - Executes a flush request against the engine. This is not required for remote store replicas as NRTReplicationEngine already makes the flush method in its engine a no-op.
    • TransportShardRefreshAction - This writes all indexing changes to disk and opens a new searcher. This is not required for remote store replicas as NRTReplicationEngine already makes the refresh method in its engine a no-op.
    • TransportVerifyShardBeforeCloseAction - Verifies that the max seq no and the global checkpoint are the same on the primary and the replica. The flush method is a no-op for replicas. This does not look to be required for replicas.
    • TransportVerifyShardIndexBlockAction - This ensures that an Index block has been applied on primary and replicas. This might be required for replicas.
    • TransportWriteAction -
      • TransportShardBulkAction - Write request. Not required on replicas anymore.
      • TransportResyncReplicationAction - When a replica shard is promoted to the new primary, the other replicas are reset up to the global checkpoint, and this action is used to replay the Lucene operations so that the other replica shards have the same consistent view as the new primary. With the remote store, the need to reset the engine and replay Lucene operations becomes obsolete. This is not required on replicas anymore.
      • RetentionLeaseSyncAction - Same as RetentionLeaseBackgroundSyncAction. This is triggered when a new retention lease is created or an existing retention lease is removed on the primary.
  • All the above transport actions are fanned out to the replication targets of the replication group. These are the shards that have tracked set to true in the CheckpointState within the IndexShard’s ReplicationTracker. Implementing no-op replication (primary term validation) via TransportShardBulkAction would mean adding a lot of logic to each of the TransportReplicationAction actions. This would fill the code with if/else conditions and make it highly unreadable and unmaintainable over the long term (see the sketch after this list).
  • Within the ReplicationTracker class, there is a method invariant() which checks certain invariants of the replication tracker. It ensures the expected behaviour of retention leases, the global checkpoint, and the replication group for shards that are tracked.
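
As a rough illustration of the kind of guarding the older approach would force onto every tracker-updating path, here is a hedged sketch; the class, the methods, and the remoteTranslogEnabled flag are simplified assumptions for illustration, not the actual ReplicationTracker API.

```java
// Illustrative only: under the older approach, each state-updating path needs its own guard.
final class NoOpGuardedTracker {
    private final boolean remoteTranslogEnabled; // assumed index-level setting

    NoOpGuardedTracker(boolean remoteTranslogEnabled) {
        this.remoteTranslogEnabled = remoteTranslogEnabled;
    }

    void updateLocalCheckpointForShard(String allocationId, long localCheckpoint) {
        if (remoteTranslogEnabled) {
            return; // no-op: local checkpoints on replicas are meaningless without a local translog
        }
        // ... existing checkpoint bookkeeping would go here ...
    }

    void updateGlobalCheckpointOnReplica(long globalCheckpoint) {
        if (remoteTranslogEnabled) {
            return; // the same guard again, multiplying if/else branches across call sites
        }
        // ... existing global checkpoint update would go here ...
    }
}
```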

Proposal

However, this can be handled with a separate call for primary term validation (let's call it the primary validation call), while keeping tracked and inSync as false in the CheckpointState in the ReplicationTracker. Whenever the cluster manager publishes an updated cluster state, the primary term gets updated on the replicas. When primary term validation happens, the primary term supplied with the request is validated against it. On an incorrect primary term, the acknowledgement fails and the isolated primary, which still assumes it is the active primary, can fail itself. This also makes the approach and the code change cleaner.
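
Below is a minimal, self-contained sketch of what such a primary validation check on the replica could look like. PrimaryTermValidator and its methods are hypothetical stand-ins for illustration, not actual OpenSearch classes; the real implementation would hook into the existing transport and cluster state machinery.

```java
// Hypothetical sketch: the replica keeps the latest primary term it learned from
// cluster state updates and rejects validation requests that carry a stale term.
final class PrimaryTermValidator {
    private volatile long currentPrimaryTerm;

    // Called when the cluster manager publishes an updated cluster state.
    void updateFromClusterState(long publishedPrimaryTerm) {
        currentPrimaryTerm = Math.max(currentPrimaryTerm, publishedPrimaryTerm);
    }

    // Called on the replica for the lightweight primary validation call.
    void validate(long requestPrimaryTerm) {
        if (requestPrimaryTerm < currentPrimaryTerm) {
            // Failing the acknowledgement lets the stale (isolated) primary fail itself.
            throw new IllegalStateException(
                "stale primary term " + requestPrimaryTerm + " < " + currentPrimaryTerm);
        }
    }
}
```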

With the new approach, we can do the following -

  • Keep replicas of remote store enabled indices with CheckpointState.tracked, CheckpointState.inSync as false. - Can be Handled.
    • tracked = false prevents the existing replication calls from happening.
    • inSync = false allows us to avoid tweaking code in the ReplicationTracker class, and so we introduce another variable called validatePrimaryTerm. This can be used to know which replicas the primary validation call can be sent to (see the sketch after this list).
  • The cluster manager will continue to send cluster state updates to this new node (where the replica shard resides). Indirectly, the latest cluster state is always known to the replica shard despite it having tracked and inSync as false. - Can be Handled.
  • The replica shard will get into the STARTED state despite tracked and inSync being false. ShardStateAction - this is used to inform the cluster manager that a particular shard has started. - Can be Handled.
  • Segment replication will work without triggering shard.initiateTracking(..) and shard.markAllocationIdAsInSync(..). The request is fanned out to replicas by making a transport action which is a no-op on the primary and publishes the checkpoint on replicas. Otherwise, if a shard is not tracked and not in sync, segment replication would stop working. - Can be Handled.
  • The active replica shards will continue to serve the search traffic. Adaptive replica selection - as long as we inform the cluster manager about a shard being started, the shard selection for replicas accounts for the shards as before. - No code change required.
  • The cluster manager should be aware of all replicas that are available for primary promotion. Make the appropriate code change to allow the cluster manager to elect a new primary using validatePrimaryTerm = true. Currently, it probably uses the in-sync allocations concept which it fetches from the cluster state. - No code change required on the cluster manager side.
  • We can have a wrapper over ReplicationTracker which disallows and throws an exception for all methods that update state when such updates are not expected to happen. - Need to evaluate
  • Refactor Peer recovery and start the replica without doing initiate tracking and marking in sync. - Can be Handled.
  • Primary term on replica/primary shard should get updated on account of shard state updates (cluster metadata updates). - Can be Handled.
  • Segment replication currently uses replication calls to propagate the publish checkpoint to the replicas. The replication call tries to update the global checkpoint and maxSeqNoOfUpdatesOrDeletes (a no-op on the NRT engine). - Yet to Handle. It does not hurt to keep it like this anyway.
  • Create a separate call for primary term validation - Yet to Handle
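
For illustration, here is a hedged sketch of the proposed validatePrimaryTerm flag and how the validation targets could be derived. CheckpointStateSketch and ValidationTargets are hypothetical stand-ins, not the real CheckpointState or ReplicationTracker.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for CheckpointState with the proposed extra flag.
final class CheckpointStateSketch {
    boolean tracked;              // existing: shard participates in the replication fan-out
    boolean inSync;               // existing: shard contributes to the global checkpoint
    boolean validatePrimaryTerm;  // proposed: shard receives the primary validation call
}

// Hypothetical helper: pick the allocation ids of replicas that should get the
// primary validation call even though they are neither tracked nor in-sync.
final class ValidationTargets {
    static List<String> of(Map<String, CheckpointStateSketch> checkpoints) {
        return checkpoints.entrySet().stream()
            .filter(e -> e.getValue().validatePrimaryTerm)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```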

The above approach requires a further deep dive and a POC -

  • Master promoting another replica as primary
    • How is the ShardRouting.state changed to ShardRoutingState.STARTED or ShardRoutingState.RELOCATING?
  • Primary to replica - TransportWriteAction - On what basis does the request get fanned out?
  • Segment replication - Since
  • Any implication on hasAllPeerRecoveryRetentionLeases in ReplicationTracker class.
  • The active replica shards will continue to serve the search traffic. Adaptive replica selection - this will be updated to allow search requests to fan out across the different active replicas.
  • The ReplicationGroup class has a derived field called replicationTargets. This is used to fan out requests to replicas for general requests that need to be replicated.
  • ReplicationTracker has a method called updateFromClusterManager. In it, allocation ids that exist in neither the inSyncAllocationIds nor the initializingAllocationIds (derived from the RoutingTable) are removed, and based on the primaryMode field value, the checkpoints are updated. This logic will be updated to account for primaryTermValidation (see the sketch after this list).
  • Primary term on replica/primary shard should get updated on account of shard state updates (cluster metadata updates).
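
A hypothetical, simplified take on how the updateFromClusterManager logic could account for primaryTermValidation, building on the CheckpointStateSketch stand-in above; this is an assumption about the eventual change, not the real method.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: replicas of remote store enabled indices keep tracked/inSync as
// false but remain primary term validation targets as long as the cluster manager
// still knows about their allocation ids.
final class ClusterManagerUpdateSketch {
    static void updateFromClusterManager(Set<String> inSyncAllocationIds,
                                         Set<String> initializingAllocationIds,
                                         Map<String, CheckpointStateSketch> checkpoints) {
        // Prune entries the cluster manager no longer knows about.
        checkpoints.entrySet().removeIf(e ->
            !inSyncAllocationIds.contains(e.getKey())
                && !initializingAllocationIds.contains(e.getKey()));
        // The remaining shards stay eligible for the primary validation call.
        for (CheckpointStateSketch state : checkpoints.values()) {
            state.validatePrimaryTerm = true;
        }
    }
}
```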

Open Questions -

  • Segment replication (PublishCheckpointAction) is built on top of TransportReplicationAction, which relies on the replication targets.
  • Currently, RetentionLeaseBackgroundSyncAction on the primary persists the retention leases - writing the given state to the given directories and cleaning up old state files if the write succeeds, or the newly created state file if the write fails. On replicas, we copy the same retention leases and persist them on the replicas too. With replicas having tracked as false, the replication calls do not go through and there are no retention leases. What is the consequence of this?

Describe the solution you'd like
Mentioned above.



ashking94 commented Nov 10, 2022

Draft PR showing the prospective changes - dc27870. In addition to it, we still need to add the separate primary term validation call.


ashking94 commented Nov 14, 2022

A couple of points that can be updated in the proposal -

  • We can try to reuse the TransportShardBulkAction call for primary term validation. On the replicas, we can introduce a proxy that checks the primary term and returns.
  • Keeping inSync (in CheckpointState in ReplicationTracker) as false while the cluster manager has inSync as true could cause confusion. In the spirit of keeping the definitions the same on the replication layer and in the cluster manager's cluster state, we can continue to keep the shards in sync and modify the assertions so that the replication calls fan out in the desired manner.

With this approach,

  • We should, by default, have no replication calls fanned out from the primary shard.
  • We should still be able to support PublishCheckpointAction and TransportShardBulkAction (& any other TransportReplicationAction if required).
    • PublishCheckpointAction for segment replication to work
    • TransportShardBulkAction for primary term validation to work
  • To support primary term validation, TransportShardBulkAction would have the replica just check the primary term and return right there (as sketched below).
  • Any necessary changes on the engine or translog factory, if required.
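
A hedged sketch of that replica-side behaviour, reusing the hypothetical PrimaryTermValidator from the earlier sketch; the proxy class and method names are illustrative, not the actual transport code.

```java
// Illustrative replica-side proxy: instead of applying the bulk operations,
// the replica only verifies the primary term and returns.
final class PrimaryTermValidationReplicaProxy {
    private final PrimaryTermValidator validator; // hypothetical stand-in from the earlier sketch

    PrimaryTermValidationReplicaProxy(PrimaryTermValidator validator) {
        this.validator = validator;
    }

    // Invoked where the replica would normally apply the bulk/translog operations.
    void performOnReplica(long requestPrimaryTerm) {
        validator.validate(requestPrimaryTerm); // throws if the request's term is stale
        // No translog write, no checkpoint update: the call is a pure primary term check.
    }
}
```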

@ashking94

This has been implemented.
