Translog pruning based on retention leases #1100

Closed
saikaranam-amazon opened this issue Aug 17, 2021 · 14 comments
Labels
enhancement

Comments

@saikaranam-amazon
Member

saikaranam-amazon commented Aug 17, 2021

Is your feature request related to a problem? Please describe.

Background
The cross-cluster replication feature follows a logical replication model. Each follower pulls the latest operations from the corresponding shards of the leader index and replays them on the follower side. To keep the latest operations available on the leader cluster, retention leases are acquired on the leader index, and they are renewed once the operations have been replayed on the follower side. The existing peer-recovery infrastructure is leveraged and extended for the cross-cluster replication feature.

Problem
Retention leases preserve operations for each shard at the Lucene level (used as part of peer recovery within the cluster).
During performance benchmarking of the replication feature under high indexing workloads, fetching the latest operations from the leader cluster showed a CPU impact of up to ~8-10% due to Lucene stored-fields decompression.

Describe the solution you'd like
Solution
All the latest operations are available in the translog in uncompressed form. Currently, the translog has no mechanism to prune operations based on retention leases. If translog pruning took retention leases into account, fetch requests could be served directly from the translog, saving CPU cycles.

Details

  • Introduce a new dynamic index-level setting to prune translog operations based on retention leases (see the sketch after this list).
  • For indices with this setting enabled, the translog deletion policy is updated to take retention leases into account. This ensures that operations up to a certain threshold remain available in the translog, so fetch operations don't have to query Lucene for them.
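
For illustration only, a minimal sketch of toggling such a dynamic index-level setting over the `_settings` REST API. The setting key is the name settled on later in this thread; the endpoint, index name, and the assumption that the setting can be flipped this way are illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EnableTranslogPruning {
    public static void main(String[] args) throws Exception {
        // Hypothetical example: enable retention-lease-based translog pruning on a leader index.
        // The setting key is the name proposed further down this thread.
        String body = "{ \"plugins.replication.index.translog.retention_lease.pruning.enabled\": true }";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/my-leader-index/_settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```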

Describe alternatives you've considered
N/A

Additional context
N/A

@saikaranam-amazon added the enhancement label Aug 17, 2021
@itiyamas
Contributor

What benchmarks did you run? Can you share the configuration and benchmark results here?

@itiyamas
Contributor

It is not very clear why the Tlog needs to be pruned based on retention leases in order for operations to be fetched from it. All you need is a way to define start and end sequence numbers and then fetch those operations from the Tlog. Do you mind explaining this a bit more?

@saikaranam-amazon
Member Author

saikaranam-amazon commented Aug 31, 2021

It is not very clear why the Tlog needs to be pruned based on retention leases in order for operations to be fetched from it.

We will not be using retention leases to fetch the operations from the Tlog; rather, the existing deletion policy that prunes Tlog operations will also take retention leases into account.

With Tlog
The Tlog deletion policy takes various settings into account, such as the locks currently held, the size of the Tlog, etc. It computes the minimum generation to retain based on these settings and then proceeds with Tlog pruning.
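
As a rough illustration of this behaviour, a simplified sketch (not the actual OpenSearch TranslogDeletionPolicy): the policy keeps the oldest generation that any constraint still requires and treats everything older as prunable.

```java
/**
 * Simplified, conceptual sketch only. Field names and the exact combination rules are
 * assumptions; the real deletion policy lives in the OpenSearch engine code.
 */
class TranslogDeletionPolicySketch {
    long minGenRequiredByOpenSnapshots; // oldest generation still locked by an in-flight snapshot/reader
    long minGenRetainedByAge;           // oldest generation young enough for the retention-age setting
    long minGenRetainedBySize;          // oldest generation that keeps total translog size under the size setting
    long currentWriterGeneration;       // generation currently being written to; always retained

    long minTranslogGenerationToRetain() {
        // Age and size retention are best-effort: voluntarily keep only what satisfies both limits.
        long minGenByRetentionSettings = Math.max(minGenRetainedByAge, minGenRetainedBySize);
        // Open snapshots are a hard requirement: never prune generations they still reference,
        // and never prune the generation that is currently being written to.
        long minGen = Math.min(minGenByRetentionSettings, minGenRequiredByOpenSnapshots);
        return Math.min(minGen, currentWriterGeneration);
    }
}
```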

With Lucene soft-deletes
With Lucene soft-deletes, most of the Tlog settings are not taken into account, as the recovery process depends entirely on Lucene, and retention leases are used to retain these operations in Lucene.

Now, in the replication use-case:
Without support for translog pruning based on retention leases
Replay on the follower shard depends entirely on fetching these operations from the leader shard. Depending on the follower's state, retention leases are acquired by the follower shard; this ensures the operations are retained only in Lucene up to that sequence number, while the Tlog retains only the latest generation.

With support for translog pruning based on retention leases
The Tlog takes retention leases into account, along with the Tlog settings above, to retain operations that are part of older-generation Tlog files. Operations for replication can then be fetched directly from the Tlog, avoiding Lucene.
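
A conceptual sketch of the fetch path this enables (assumed structure, not actual plugin code): serve the requested range from the Tlog when it is still retained there, otherwise fall back to a Lucene-based snapshot.

```java
import java.util.List;

/** Hypothetical abstraction over a place operations can be read from. */
interface OperationSource {
    List<Object> operationsBetween(long fromSeqNo, long toSeqNo);
}

/** Conceptual sketch: prefer the uncompressed translog, fall back to Lucene. */
class OpsFetcherSketch {
    private final OperationSource translogSource;
    private final OperationSource luceneSource;
    private final long minSeqNoRetainedInTranslog; // lowest seq# still covered by retained Tlog generations

    OpsFetcherSketch(OperationSource translogSource, OperationSource luceneSource,
                     long minSeqNoRetainedInTranslog) {
        this.translogSource = translogSource;
        this.luceneSource = luceneSource;
        this.minSeqNoRetainedInTranslog = minSeqNoRetainedInTranslog;
    }

    List<Object> fetch(long fromSeqNo, long toSeqNo) {
        if (fromSeqNo >= minSeqNoRetainedInTranslog) {
            return translogSource.operationsBetween(fromSeqNo, toSeqNo); // no stored-fields decompression
        }
        return luceneSource.operationsBetween(fromSeqNo, toSeqNo);       // fallback when the Tlog was pruned
    }
}
```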

What benchmarks did you run? Can you share the configuration and benchmark results here?

Ran an indexing load against a single-node, single-shard leader and fetched the operations to simulate the follower cluster side.
No other operations were performed on the leader node/shard during this time, and CPU utilization was captured for the duration.
Instance type: i3.2xlarge
Dataset: eventdata
Configuration: 1 shard and 1 node cluster.
Impact:

  • In the flame graph, most of the cycles were spent on Lucene decompression.

[Flame graph: leader CPU utilization]

@itiyamas
Contributor

itiyamas commented Aug 31, 2021

Can you run a well-known benchmark like nyc_taxis on a selected instance type and share the impact on indexing throughput or network transfer rate when you use the Tlog vs. a Lucene snapshot? Maybe a general-purpose EC2 instance like m4.4xlarge would help. It would be good to have more than one follower cluster to identify the network impact as well. I would also recommend capturing disk size for these tests after running them for a sufficiently long duration (at least up to the default retention lease period).

A few concerns I have with this:

  1. I will confirm whether there is a one-to-one mapping between translog and Lucene operations, especially in the case of updates, and whether there are any implications of using a translog-based snapshot for retention leases, as that code has not been in use for a while.
  2. I am also worried about storing the Tlog up to the retention lease period (12 hours) for high-throughput use-cases. Would that not create a bottleneck on disk size on the leader cluster, especially since the Tlog is fully uncompressed?
  3. It appears that we moved away from translog-based snapshots at some point in the past, and now we are using them again because Lucene stored-fields decompression is expensive. If Lucene compression creates a performance impact, I would use the same mechanism for both in-cluster and cross-cluster recovery so that the path is well tested. By the way, LZ4 is a relatively fast compression algorithm, so I expected the impact to be much lower.

There may be even more implications of this change that I can't think of at this point. We will need more feedback before we proceed with this change.

@nknize What do you think about this change?

@saikaranam-amazon
Member Author

Updated the benchmark details above with the dataset and instance type used.

Regarding some of the concerns listed above:

  • We thought of this: the pruning will also take the overall Tlog size into account, and older Tlog generation files will get pruned if that limit is breached, so keeping the Tlog operations shouldn't put additional load on the disk. If the size grows and the Tlog gets pruned, we fall back to Lucene (see the sketch after this list).
  • We understand that the current recovery is Lucene-based only. For the current replication use-case, the issue is seen during high indexing workloads, and fetching from the Tlog counteracts it.
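
A minimal sketch of the fallback described in the first bullet above (assumed logic, not actual plugin code): the retention-lease floor only extends Tlog retention while the total Tlog size stays under a cap; once the cap is breached, older generations are pruned anyway and replication falls back to Lucene for those operations.

```java
/** Conceptual sketch only; field names and the size-cap rule are assumptions. */
class LeaseAwarePruningSketch {
    long minGenRequiredByBasePolicy;    // result of the existing deletion policy (locks, size, age)
    long minGenCoveringRetentionLeases; // oldest generation holding ops still needed by a retention lease
    long totalTranslogSizeInBytes;      // current on-disk Tlog size
    long translogSizeCapInBytes;        // upper bound on how much Tlog we keep for retention leases

    long minTranslogGenerationToRetain() {
        if (totalTranslogSizeInBytes > translogSizeCapInBytes) {
            // Cap breached: ignore the lease floor; the follower falls back to Lucene for older ops.
            return minGenRequiredByBasePolicy;
        }
        // Otherwise retain down to whichever constraint needs the older generation.
        return Math.min(minGenRequiredByBasePolicy, minGenCoveringRetentionLeases);
    }
}
```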

Let me know if you have any other concerns.

@itiyamas
Contributor

Please share the Rally benchmark output for Tlog vs. Lucene, e.g. indexing throughput, latency, etc.

@itiyamas
Contributor

itiyamas commented Sep 6, 2021

@saikaranam-amazon Did you try to find a way to improve indexing throughput on the leader cluster without the translog optimization, by looking closely at how Lucene stored fields are read? Is there an opportunity to improve performance at that layer instead of making this change?

@itiyamas
Contributor

itiyamas commented Sep 7, 2021

I believe one of the things that has not been tested well is data integrity across the two recovery sources up to the local checkpoint. The old logic always used either translog recovery or retention leases, but never both. Here, we are choosing between the two dynamically, so it is imperative that we check the integrity of the data across the two sources.

@nknize
Collaborator

nknize commented Sep 9, 2021

I agree w/ @itiyamas on the data integrity testing concern. Has this been addressed?

Also, this is based on the document-level replication model, but there's also segment-level replication built into Lucene that rsyncs the segments rather than relying on translog playback. That doesn't require decompression at all and proves to be significantly faster in all cases (product search is taking this approach), though it would require some checkpoint coordination. Have we looked at this approach instead of building on a slower replication model?

@saikaranam-amazon
Member Author

Did you try to find a way to improve indexing throughput on the leader cluster without the translog optimization, by looking closely at how Lucene stored fields are read? Is there an opportunity to improve performance at that layer instead of making this change?

Yes. We explored the compression algorithm used for stored fields; it operates at the block level (on groups of documents), and we couldn't turn off compression for a subset of documents.

@saikaranam-amazon
Member Author

I believe one of the things that has not been tested well is data integrity across the two recovery sources up to the local checkpoint. The old logic always used either translog recovery or retention leases, but never both. Here, we are choosing between the two dynamically, so it is imperative that we check the integrity of the data across the two sources.

We performed these tests via scripts with both Lucene and the translog as the source. I understand the concern; we will extend these scripts into integration tests under the replication plugin.

@saikaranam-amazon
Member Author

saikaranam-amazon commented Sep 13, 2021

Attaching the esrally results with and without translog pruning enabled.

Configuration

  • Instance - i3.xlarge
  • Dataset - Eventdata
  • Index settings - 1 primary shard, single node
  • No other load on the cluster

Without the setting enabled

|   Lap |                                                         Metric |         Task |     Value |   Unit |
|------:|---------------------------------------------------------------:|-------------:|----------:|-------:|
|   All |                     Cumulative indexing time of primary shards |              |   63.7212 |    min |
|   All |             Min cumulative indexing time across primary shards |              |   63.7212 |    min |
|   All |          Median cumulative indexing time across primary shards |              |   63.7212 |    min |
|   All |             Max cumulative indexing time across primary shards |              |   63.7212 |    min |
|   All |            Cumulative indexing throttle time of primary shards |              |         0 |    min |
|   All |    Min cumulative indexing throttle time across primary shards |              |         0 |    min |
|   All | Median cumulative indexing throttle time across primary shards |              |         0 |    min |
|   All |    Max cumulative indexing throttle time across primary shards |              |         0 |    min |
|   All |                        Cumulative merge time of primary shards |              |   14.7692 |    min |
|   All |                       Cumulative merge count of primary shards |              |        34 |        |
|   All |                Min cumulative merge time across primary shards |              |   14.7692 |    min |
|   All |             Median cumulative merge time across primary shards |              |   14.7692 |    min |
|   All |                Max cumulative merge time across primary shards |              |   14.7692 |    min |
|   All |               Cumulative merge throttle time of primary shards |              |   7.11638 |    min |
|   All |       Min cumulative merge throttle time across primary shards |              |   7.11638 |    min |
|   All |    Median cumulative merge throttle time across primary shards |              |   7.11638 |    min |
|   All |       Max cumulative merge throttle time across primary shards |              |   7.11638 |    min |
|   All |                      Cumulative refresh time of primary shards |              |    0.3849 |    min |
|   All |                     Cumulative refresh count of primary shards |              |        45 |        |
|   All |              Min cumulative refresh time across primary shards |              |    0.3849 |    min |
|   All |           Median cumulative refresh time across primary shards |              |    0.3849 |    min |
|   All |              Max cumulative refresh time across primary shards |              |    0.3849 |    min |
|   All |                        Cumulative flush time of primary shards |              |   2.74077 |    min |
|   All |                       Cumulative flush count of primary shards |              |        35 |        |
|   All |                Min cumulative flush time across primary shards |              |   2.74077 |    min |
|   All |             Median cumulative flush time across primary shards |              |   2.74077 |    min |
|   All |                Max cumulative flush time across primary shards |              |   2.74077 |    min |
|   All |                                             Total Young Gen GC |              |    37.855 |      s |
|   All |                                               Total Old Gen GC |              |         0 |      s |
|   All |                                                     Store size |              |   7.17995 |     GB |
|   All |                                                  Translog size |              |  0.227218 |     GB |
|   All |                                         Heap used for segments |              |  0.241604 |     MB |
|   All |                                       Heap used for doc values |              | 0.0107002 |     MB |
|   All |                                            Heap used for terms |              |  0.195831 |     MB |
|   All |                                            Heap used for norms |              |         0 |     MB |
|   All |                                           Heap used for points |              |         0 |     MB |
|   All |                                    Heap used for stored fields |              | 0.0350723 |     MB |
|   All |                                                  Segment count |              |        69 |        |
|   All |                                                 Min Throughput | index-append |   19338.4 | docs/s |
|   All |                                              Median Throughput | index-append |   20049.2 | docs/s |
|   All |                                                 Max Throughput | index-append |   20767.5 | docs/s |
|   All |                                        50th percentile latency | index-append |    1862.9 |     ms |
|   All |                                        90th percentile latency | index-append |   2734.17 |     ms |
|   All |                                        99th percentile latency | index-append |   4964.31 |     ms |
|   All |                                      99.9th percentile latency | index-append |   6048.77 |     ms |
|   All |                                       100th percentile latency | index-append |   7277.91 |     ms |
|   All |                                   50th percentile service time | index-append |    1862.9 |     ms |
|   All |                                   90th percentile service time | index-append |   2734.17 |     ms |
|   All |                                   99th percentile service time | index-append |   4964.31 |     ms |
|   All |                                 99.9th percentile service time | index-append |   6048.77 |     ms |
|   All |                                  100th percentile service time | index-append |   7277.91 |     ms |
|   All |                                                     error rate | index-append |         0 |      % |


----------------------------------
[INFO] SUCCESS (took 1042 seconds)
----------------------------------

With the setting enabled

|   Lap |                                                         Metric |         Task |    Value |   Unit |
|------:|---------------------------------------------------------------:|-------------:|---------:|-------:|
|   All |                     Cumulative indexing time of primary shards |              |  62.7009 |    min |
|   All |             Min cumulative indexing time across primary shards |              |  62.7009 |    min |
|   All |          Median cumulative indexing time across primary shards |              |  62.7009 |    min |
|   All |             Max cumulative indexing time across primary shards |              |  62.7009 |    min |
|   All |            Cumulative indexing throttle time of primary shards |              |        0 |    min |
|   All |    Min cumulative indexing throttle time across primary shards |              |        0 |    min |
|   All | Median cumulative indexing throttle time across primary shards |              |        0 |    min |
|   All |    Max cumulative indexing throttle time across primary shards |              |        0 |    min |
|   All |                        Cumulative merge time of primary shards |              |  16.3531 |    min |
|   All |                       Cumulative merge count of primary shards |              |       30 |        |
|   All |                Min cumulative merge time across primary shards |              |  16.3531 |    min |
|   All |             Median cumulative merge time across primary shards |              |  16.3531 |    min |
|   All |                Max cumulative merge time across primary shards |              |  16.3531 |    min |
|   All |               Cumulative merge throttle time of primary shards |              |  8.63825 |    min |
|   All |       Min cumulative merge throttle time across primary shards |              |  8.63825 |    min |
|   All |    Median cumulative merge throttle time across primary shards |              |  8.63825 |    min |
|   All |       Max cumulative merge throttle time across primary shards |              |  8.63825 |    min |
|   All |                      Cumulative refresh time of primary shards |              | 0.363767 |    min |
|   All |                     Cumulative refresh count of primary shards |              |       37 |        |
|   All |              Min cumulative refresh time across primary shards |              | 0.363767 |    min |
|   All |           Median cumulative refresh time across primary shards |              | 0.363767 |    min |
|   All |              Max cumulative refresh time across primary shards |              | 0.363767 |    min |
|   All |                        Cumulative flush time of primary shards |              |  2.68672 |    min |
|   All |                       Cumulative flush count of primary shards |              |       35 |        |
|   All |                Min cumulative flush time across primary shards |              |  2.68672 |    min |
|   All |             Median cumulative flush time across primary shards |              |  2.68672 |    min |
|   All |                Max cumulative flush time across primary shards |              |  2.68672 |    min |
|   All |                                             Total Young Gen GC |              |    38.22 |      s |
|   All |                                               Total Old Gen GC |              |        0 |      s |
|   All |                                                     Store size |              |  6.80982 |     GB |
|   All |                                                  Translog size |              | 0.325901 |     GB |
|   All |                                         Heap used for segments |              | 0.213825 |     MB |
|   All |                                       Heap used for doc values |              | 0.009655 |     MB |
|   All |                                            Heap used for terms |              | 0.173126 |     MB |
|   All |                                            Heap used for norms |              |        0 |     MB |
|   All |                                           Heap used for points |              |        0 |     MB |
|   All |                                    Heap used for stored fields |              | 0.031044 |     MB |
|   All |                                                  Segment count |              |       61 |        |
|   All |                                                 Min Throughput | index-append |  19626.1 | docs/s |
|   All |                                              Median Throughput | index-append |    19953 | docs/s |
|   All |                                                 Max Throughput | index-append |  20520.3 | docs/s |
|   All |                                        50th percentile latency | index-append |  1851.39 |     ms |
|   All |                                        90th percentile latency | index-append |   2634.4 |     ms |
|   All |                                        99th percentile latency | index-append |  4765.54 |     ms |
|   All |                                      99.9th percentile latency | index-append |   5610.4 |     ms |
|   All |                                       100th percentile latency | index-append |  9018.62 |     ms |
|   All |                                   50th percentile service time | index-append |  1851.39 |     ms |
|   All |                                   90th percentile service time | index-append |   2634.4 |     ms |
|   All |                                   99th percentile service time | index-append |  4765.54 |     ms |
|   All |                                 99.9th percentile service time | index-append |   5610.4 |     ms |
|   All |                                  100th percentile service time | index-append |  9018.62 |     ms |
|   All |                                                     error rate | index-append |        0 |      % |


----------------------------------
[INFO] SUCCESS (took 1026 seconds)
----------------------------------

@saikaranam-amazon
Member Author

After reviewing the use-case, we are modifying the setting name to plugins.replication.index.translog.retention_lease.pruning.enabled to explicitly denote that the setting is primarily intended for cross-cluster replication.

@saikaranam-amazon
Member Author

Test for peer recovery

  • Setup
    • Cluster: 2 nodes
    • Dataset: eventdata
    • Re-routed the shards after ingestion and monitored recovery stats (a sketch of one way to pull these stats follows)
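
The stats below look like `_cat/recovery` output; a sketch of one way to pull them (assuming a node reachable on localhost:9200 and an index named eventdata):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PrintRecoveryStats {
    public static void main(String[] args) throws Exception {
        // Print per-shard recovery stats (same columns as the tables below).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_cat/recovery/eventdata?v"))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```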

With the setting disabled

index     shard time type        stage source_host   source_node   target_host   target_node   repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
eventdata 0     2.2m peer        index 172.31.66.242 172.31.66.242 172.31.72.250 172.31.72.250 n/a        n/a      109   109             100.0%        109         4915809903 4915809903      100.0%        4915809903  0            0                      100.0%

With the setting enabled

index     shard time type stage source_host   source_node   target_host   target_node   repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
eventdata 0     2.2m peer done  172.31.72.250 172.31.72.250 172.31.66.242 172.31.66.242 n/a        n/a      109   109             100.0%        109         4915809903 4915809903      100.0%        4915809903  0            0                      100.0%

adnapibar added a commit that referenced this issue Oct 29, 2021
…n lease. (#1416)

The settings and the corresponding logic for translog pruning by retention lease, which were added as part of #1100, have been deprecated. This commit removes that deprecated code in favor of an extension point for providing a custom TranslogDeletionPolicy.

Signed-off-by: Rabi Panda <adnapibar@gmail.com>
nknize pushed a commit that referenced this issue Nov 1, 2021
…n lease. (#1416) (#1471)

The settings and the corresponding logic for translog pruning by retention lease, which were added as part of #1100, have been deprecated. This commit removes that deprecated code in favor of an extension point for providing a custom TranslogDeletionPolicy.

Signed-off-by: Rabi Panda <adnapibar@gmail.com>