Translog pruning based on retention leases #1100

Closed
saikaranam-amazon opened this issue Aug 17, 2021 · 14 comments
Labels
enhancement

Comments

@saikaranam-amazon
Member

saikaranam-amazon commented Aug 17, 2021

Is your feature request related to a problem? Please describe.

Background
The cross-cluster replication feature follows a logical replication model. Each follower pulls the latest operations from the corresponding shards of the leader index and replays them on the follower side. To keep the latest operations available on the leader cluster, retention leases are acquired on the leader index, and they are renewed once the operations have been replayed on the follower side. The existing peer-recovery infrastructure is leveraged and extended for the cross-cluster replication feature.

Problem
Retention leases preserve operations for each shard at the Lucene level (used as part of peer recovery within the cluster).
During performance benchmarking of the replication feature under high indexing workloads, fetching the latest operations from the leader cluster showed a CPU impact of up to ~8-10% due to Lucene stored-fields decompression.

Describe the solution you'd like
Solution
All the latest operations are available in the translog in uncompressed form. Currently, the translog has no mechanism to prune operations based on retention leases. If translog pruning took retention leases into account, fetch requests could be served directly from the translog, saving CPU cycles.

Details

  • Introduce a new dynamic index-level setting to prune translog operations based on retention leases (see the sketch after this list).
  • For indices with this setting enabled, the translog deletion policy is updated to take retention leases into account. This ensures that operations up to a certain threshold remain available in the translog, so fetch operations don't have to query Lucene for them.
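
For illustration only, a minimal sketch of toggling such a dynamic index-level setting over the `_settings` REST API. The setting key is the name settled on later in this thread; the endpoint, index name, and the assumption that the setting can be flipped this way are illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EnableTranslogPruning {
    public static void main(String[] args) throws Exception {
        // Hypothetical example: enable retention-lease-based translog pruning on a leader index.
        // The setting key is the name proposed further down this thread.
        String body = "{ \"plugins.replication.index.translog.retention_lease.pruning.enabled\": true }";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/my-leader-index/_settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```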

Describe alternatives you've considered
N/A

Additional context
N/A

@saikaranam-amazon added the enhancement label Aug 17, 2021
@itiyamas
Contributor

What benchmarks did you run? Can you share the configuration and benchmark results here?

@itiyamas
Contributor

It is not very clear why the Tlog needs to be pruned based on retention leases in order for operations to be fetched from it. All you need is a way to define start and end sequence numbers and then fetch those operations from the Tlog. Do you mind explaining this a bit more?

@saikaranam-amazon
Member Author

saikaranam-amazon commented Aug 31, 2021

It is not very clear why the Tlog needs to be pruned based on retention leases in order for operations to be fetched from it.

We will not be using retention leases to fetch the operations from the Tlog; rather, the existing deletion policy that prunes Tlog operations will also take retention leases into account.

With Tlog
The Tlog deletion policy takes various settings into account, such as the locks currently held, the size of the Tlog, etc. It computes the minimum generation to retain based on these settings and then proceeds with Tlog pruning.
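
As a rough illustration of this behaviour, a simplified sketch (not the actual OpenSearch TranslogDeletionPolicy): the policy keeps the oldest generation that any constraint still requires and treats everything older as prunable.

```java
/**
 * Simplified, conceptual sketch only. Field names and the exact combination rules are
 * assumptions; the real deletion policy lives in the OpenSearch engine code.
 */
class TranslogDeletionPolicySketch {
    long minGenRequiredByOpenSnapshots; // oldest generation still locked by an in-flight snapshot/reader
    long minGenRetainedByAge;           // oldest generation young enough for the retention-age setting
    long minGenRetainedBySize;          // oldest generation that keeps total translog size under the size setting
    long currentWriterGeneration;       // generation currently being written to; always retained

    long minTranslogGenerationToRetain() {
        // Age and size retention are best-effort: voluntarily keep only what satisfies both limits.
        long minGenByRetentionSettings = Math.max(minGenRetainedByAge, minGenRetainedBySize);
        // Open snapshots are a hard requirement: never prune generations they still reference,
        // and never prune the generation that is currently being written to.
        long minGen = Math.min(minGenByRetentionSettings, minGenRequiredByOpenSnapshots);
        return Math.min(minGen, currentWriterGeneration);
    }
}
```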

With Lucene soft-deletes
With Lucene soft-deletes, most of the Tlog settings are not taken into account, as the recovery process depends entirely on Lucene, and retention leases are used to retain these operations in Lucene.

Now, in the replication use-case:
Without support for translog pruning based on retention leases
Replay on the follower shard depends entirely on fetching these operations from the leader shard. Depending on the follower's state, retention leases are acquired by the follower shard; this ensures the operations are retained only in Lucene up to that sequence number, while the Tlog retains only the latest generation.

With support for translog pruning based on retention leases
The Tlog takes retention leases into account, along with the Tlog settings above, to retain operations that are part of older-generation Tlog files. Operations for replication can then be fetched directly from the Tlog, avoiding Lucene.
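
A conceptual sketch of the fetch path this enables (assumed structure, not actual plugin code): serve the requested range from the Tlog when it is still retained there, otherwise fall back to a Lucene-based snapshot.

```java
import java.util.List;

/** Hypothetical abstraction over a place operations can be read from. */
interface OperationSource {
    List<Object> operationsBetween(long fromSeqNo, long toSeqNo);
}

/** Conceptual sketch: prefer the uncompressed translog, fall back to Lucene. */
class OpsFetcherSketch {
    private final OperationSource translogSource;
    private final OperationSource luceneSource;
    private final long minSeqNoRetainedInTranslog; // lowest seq# still covered by retained Tlog generations

    OpsFetcherSketch(OperationSource translogSource, OperationSource luceneSource,
                     long minSeqNoRetainedInTranslog) {
        this.translogSource = translogSource;
        this.luceneSource = luceneSource;
        this.minSeqNoRetainedInTranslog = minSeqNoRetainedInTranslog;
    }

    List<Object> fetch(long fromSeqNo, long toSeqNo) {
        if (fromSeqNo >= minSeqNoRetainedInTranslog) {
            return translogSource.operationsBetween(fromSeqNo, toSeqNo); // no stored-fields decompression
        }
        return luceneSource.operationsBetween(fromSeqNo, toSeqNo);       // fallback when the Tlog was pruned
    }
}
```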

What benchmarks did you run? Can you share the configuration and benchmark results here?

Ran an indexing load against a single-node, single-shard leader and fetched the operations to simulate the follower cluster side.
No other operations were performed on the leader node/shard during this time, and CPU utilization was captured for the duration.
Instance type: i3.2xlarge
Dataset: eventdata
Configuration: 1 shard and 1 node cluster.
Impact:

  • In the flame graph, most of the cycles were spent on Lucene decompression.

[Flame graph: leader CPU utilization]

@itiyamas
Contributor

itiyamas commented Aug 31, 2021

Can you run a well-known benchmark like nyc_taxis on a selected instance type and share the impact on indexing throughput or network transfer rate when you use the Tlog vs. a Lucene snapshot? Maybe a general-purpose EC2 instance like m4.4xlarge would help. It would be good to have more than one follower cluster to identify the network impact as well. I would also recommend capturing disk size for these tests after running them for a sufficiently long duration (at least up to the default retention lease period).

A few concerns I have with this:

  1. I will confirm whether there is a one-to-one mapping between translog and Lucene operations, especially in the case of updates, and whether there are any implications of using a translog-based snapshot for retention leases, as that code has not been in use for a while.
  2. I am also worried about storing the Tlog up to the retention lease period (12 hours) for high-throughput use-cases. Would that not create a bottleneck on disk size on the leader cluster, especially since the Tlog is fully uncompressed?
  3. It appears that we moved away from translog-based snapshots at some point in the past, and now we are using them again because Lucene stored-fields decompression is expensive. If Lucene compression creates a performance impact, I would use the same mechanism for both in-cluster and cross-cluster recovery so that the path is well tested. By the way, LZ4 is a relatively fast compression algorithm, so I expected the impact to be much lower.

There may be even more implications of this change that I can't think of at this point. We will need more feedback before we proceed with this change.

@nknize What do you think about this change?

@saikaranam-amazon
Member Author

Updated the benchmark details above with the dataset and instance type used.

Regarding some of the concerns listed above:

  • We thought of this: the pruning will also take the overall Tlog size into account, and older Tlog generation files will get pruned if that limit is breached, so keeping the Tlog operations shouldn't put additional load on the disk. If the size grows and the Tlog gets pruned, we fall back to Lucene (see the sketch after this list).
  • We understand that the current recovery is Lucene-based only. For the current replication use-case, the issue is seen during high indexing workloads, and fetching from the Tlog counteracts it.
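
A minimal sketch of the fallback described in the first bullet above (assumed logic, not actual plugin code): the retention-lease floor only extends Tlog retention while the total Tlog size stays under a cap; once the cap is breached, older generations are pruned anyway and replication falls back to Lucene for those operations.

```java
/** Conceptual sketch only; field names and the size-cap rule are assumptions. */
class LeaseAwarePruningSketch {
    long minGenRequiredByBasePolicy;    // result of the existing deletion policy (locks, size, age)
    long minGenCoveringRetentionLeases; // oldest generation holding ops still needed by a retention lease
    long totalTranslogSizeInBytes;      // current on-disk Tlog size
    long translogSizeCapInBytes;        // upper bound on how much Tlog we keep for retention leases

    long minTranslogGenerationToRetain() {
        if (totalTranslogSizeInBytes > translogSizeCapInBytes) {
            // Cap breached: ignore the lease floor; the follower falls back to Lucene for older ops.
            return minGenRequiredByBasePolicy;
        }
        // Otherwise retain down to whichever constraint needs the older generation.
        return Math.min(minGenRequiredByBasePolicy, minGenCoveringRetentionLeases);
    }
}
```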

Let me know if you have any other concerns.

@itiyamas
Contributor

Please share the Rally benchmark output for Tlog vs. Lucene, e.g. indexing throughput, latency, etc.

@itiyamas
Contributor

itiyamas commented Sep 6, 2021

@saikaranam-amazon Did you try to find a way to improve indexing throughput on the leader cluster without the translog optimization, by looking closely at how Lucene stored fields are read? Is there an opportunity to improve performance at that layer instead of making this change?

@itiyamas
Contributor

itiyamas commented Sep 7, 2021

I believe one of the things that has not been tested well is data integrity across the two recovery sources up to the local checkpoint. The old logic always used either translog recovery or retention leases, but never both. Here, we are choosing between the two dynamically, so it is imperative that we check the integrity of the data across the two sources.

@nknize
Collaborator

nknize commented Sep 9, 2021

I agree w/ @itiyamas on the data integrity testing concern. Has this been addressed?

Also, this is based on the document-level replication model, but there's also segment-level replication built into Lucene that rsyncs the segments rather than relying on translog playback. That doesn't require decompression at all and proves to be significantly faster in all cases (product search is taking this approach), though it would require some checkpoint coordination. Have we looked at this approach instead of building on a slower replication model?

@saikaranam-amazon
Member Author

Did you try to find a way to improve indexing throughput on the leader cluster without the translog optimization, by looking closely at how Lucene stored fields are read? Is there an opportunity to improve performance at that layer instead of making this change?

Yes. We explored the compression algorithm used for stored fields; it operates at the block level (on groups of documents), and we couldn't turn off compression for a subset of documents.

@saikaranam-amazon
Member Author

I believe one of the things that has not been tested well is data integrity across the two recovery sources up to the local checkpoint. The old logic always used either translog recovery or retention leases, but never both. Here, we are choosing between the two dynamically, so it is imperative that we check the integrity of the data across the two sources.

We performed these tests via scripts with both Lucene and the translog as the source. I understand the concern; we will extend these scripts into integration tests under the replication plugin.

@saikaranam-amazon
Member Author

saikaranam-amazon commented Sep 13, 2021

Attaching the esrally results with and without translog pruning enabled.

Configuration

  • Instance - i3.xlarge
  • Dataset - Eventdata
  • Index settings - 1 primary shard, single node
  • No other load on the cluster

Without the setting enabled

|   Lap |                                                         Metric |         Task |     Value |   Unit |
|------:|---------------------------------------------------------------:|-------------:|----------:|-------:|
|   All |                     Cumulative indexing time of primary shards |              |   63.7212 |    min |
|   All |             Min cumulative indexing time across primary shards |              |   63.7212 |    min |
|   All |          Median cumulative indexing time across primary shards |              |   63.7212 |    min |
|   All |             Max cumulative indexing time across primary shards |              |   63.7212 |    min |
|   All |            Cumulative indexing throttle time of primary shards |              |         0 |    min |
|   All |    Min cumulative indexing throttle time across primary shards |              |         0 |    min |
|   All | Median cumulative indexing throttle time across primary shards |              |         0 |    min |
|   All |    Max cumulative indexing throttle time across primary shards |              |         0 |    min |
|   All |                        Cumulative merge time of primary shards |              |   14.7692 |    min |
|   All |                       Cumulative merge count of primary shards |              |        34 |        |
|   All |                Min cumulative merge time across primary shards |              |   14.7692 |    min |
|   All |             Median cumulative merge time across primary shards |              |   14.7692 |    min |
|   All |                Max cumulative merge time across primary shards |              |   14.7692 |    min |
|   All |               Cumulative merge throttle time of primary shards |              |   7.11638 |    min |
|   All |       Min cumulative merge throttle time across primary shards |              |   7.11638 |    min |
|   All |    Median cumulative merge throttle time across primary shards |              |   7.11638 |    min |
|   All |       Max cumulative merge throttle time across primary shards |              |   7.11638 |    min |
|   All |                      Cumulative refresh time of primary shards |              |    0.3849 |    min |
|   All |                     Cumulative refresh count of primary shards |              |        45 |        |
|   All |              Min cumulative refresh time across primary shards |              |    0.3849 |    min |
|   All |           Median cumulative refresh time across primary shards |              |    0.3849 |    min |
|   All |              Max cumulative refresh time across primary shards |              |    0.3849 |    min |
|   All |                        Cumulative flush time of primary shards |              |   2.74077 |    min |
|   All |                       Cumulative flush count of primary shards |              |        35 |        |
|   All |                Min cumulative flush time across primary shards |              |   2.74077 |    min |
|   All |             Median cumulative flush time across primary shards |              |   2.74077 |    min |
|   All |                Max cumulative flush time across primary shards |              |   2.74077 |    min |
|   All |                                             Total Young Gen GC |              |    37.855 |      s |
|   All |                                               Total Old Gen GC |              |         0 |      s |
|   All |                                                     Store size |              |   7.17995 |     GB |
|   All |                                                  Translog size |              |  0.227218 |     GB |
|   All |                                         Heap used for segments |              |  0.241604 |     MB |
|   All |                                       Heap used for doc values |              | 0.0107002 |     MB |
|   All |                                            Heap used for terms |              |  0.195831 |     MB |
|   All |                                            Heap used for norms |              |         0 |     MB |
|   All |                                           Heap used for points |              |         0 |     MB |
|   All |                                    Heap used for stored fields |              | 0.0350723 |     MB |
|   All |                                                  Segment count |              |        69 |        |
|   All |                                                 Min Throughput | index-append |   19338.4 | docs/s |
|   All |                                              Median Throughput | index-append |   20049.2 | docs/s |
|   All |                                                 Max Throughput | index-append |   20767.5 | docs/s |
|   All |                                        50th percentile latency | index-append |    1862.9 |     ms |
|   All |                                        90th percentile latency | index-append |   2734.17 |     ms |
|   All |                                        99th percentile latency | index-append |   4964.31 |     ms |
|   All |                                      99.9th percentile latency | index-append |   6048.77 |     ms |
|   All |                                       100th percentile latency | index-append |   7277.91 |     ms |
|   All |                                   50th percentile service time | index-append |    1862.9 |     ms |
|   All |                                   90th percentile service time | index-append |   2734.17 |     ms |
|   All |                                   99th percentile service time | index-append |   4964.31 |     ms |
|   All |                                 99.9th percentile service time | index-append |   6048.77 |     ms |
|   All |                                  100th percentile service time | index-append |   7277.91 |     ms |
|   All |                                                     error rate | index-append |         0 |      % |


----------------------------------
[INFO] SUCCESS (took 1042 seconds)
----------------------------------

With the setting enabled

|   Lap |                                                         Metric |         Task |    Value |   Unit |
|------:|---------------------------------------------------------------:|-------------:|---------:|-------:|
|   All |                     Cumulative indexing time of primary shards |              |  62.7009 |    min |
|   All |             Min cumulative indexing time across primary shards |              |  62.7009 |    min |
|   All |          Median cumulative indexing time across primary shards |              |  62.7009 |    min |
|   All |             Max cumulative indexing time across primary shards |              |  62.7009 |    min |
|   All |            Cumulative indexing throttle time of primary shards |              |        0 |    min |
|   All |    Min cumulative indexing throttle time across primary shards |              |        0 |    min |
|   All | Median cumulative indexing throttle time across primary shards |              |        0 |    min |
|   All |    Max cumulative indexing throttle time across primary shards |              |        0 |    min |
|   All |                        Cumulative merge time of primary shards |              |  16.3531 |    min |
|   All |                       Cumulative merge count of primary shards |              |       30 |        |
|   All |                Min cumulative merge time across primary shards |              |  16.3531 |    min |
|   All |             Median cumulative merge time across primary shards |              |  16.3531 |    min |
|   All |                Max cumulative merge time across primary shards |              |  16.3531 |    min |
|   All |               Cumulative merge throttle time of primary shards |              |  8.63825 |    min |
|   All |       Min cumulative merge throttle time across primary shards |              |  8.63825 |    min |
|   All |    Median cumulative merge throttle time across primary shards |              |  8.63825 |    min |
|   All |       Max cumulative merge throttle time across primary shards |              |  8.63825 |    min |
|   All |                      Cumulative refresh time of primary shards |              | 0.363767 |    min |
|   All |                     Cumulative refresh count of primary shards |              |       37 |        |
|   All |              Min cumulative refresh time across primary shards |              | 0.363767 |    min |
|   All |           Median cumulative refresh time across primary shards |              | 0.363767 |    min |
|   All |              Max cumulative refresh time across primary shards |              | 0.363767 |    min |
|   All |                        Cumulative flush time of primary shards |              |  2.68672 |    min |
|   All |                       Cumulative flush count of primary shards |              |       35 |        |
|   All |                Min cumulative flush time across primary shards |              |  2.68672 |    min |
|   All |             Median cumulative flush time across primary shards |              |  2.68672 |    min |
|   All |                Max cumulative flush time across primary shards |              |  2.68672 |    min |
|   All |                                             Total Young Gen GC |              |    38.22 |      s |
|   All |                                               Total Old Gen GC |              |        0 |      s |
|   All |                                                     Store size |              |  6.80982 |     GB |
|   All |                                                  Translog size |              | 0.325901 |     GB |
|   All |                                         Heap used for segments |              | 0.213825 |     MB |
|   All |                                       Heap used for doc values |              | 0.009655 |     MB |
|   All |                                            Heap used for terms |              | 0.173126 |     MB |
|   All |                                            Heap used for norms |              |        0 |     MB |
|   All |                                           Heap used for points |              |        0 |     MB |
|   All |                                    Heap used for stored fields |              | 0.031044 |     MB |
|   All |                                                  Segment count |              |       61 |        |
|   All |                                                 Min Throughput | index-append |  19626.1 | docs/s |
|   All |                                              Median Throughput | index-append |    19953 | docs/s |
|   All |                                                 Max Throughput | index-append |  20520.3 | docs/s |
|   All |                                        50th percentile latency | index-append |  1851.39 |     ms |
|   All |                                        90th percentile latency | index-append |   2634.4 |     ms |
|   All |                                        99th percentile latency | index-append |  4765.54 |     ms |
|   All |                                      99.9th percentile latency | index-append |   5610.4 |     ms |
|   All |                                       100th percentile latency | index-append |  9018.62 |     ms |
|   All |                                   50th percentile service time | index-append |  1851.39 |     ms |
|   All |                                   90th percentile service time | index-append |   2634.4 |     ms |
|   All |                                   99th percentile service time | index-append |  4765.54 |     ms |
|   All |                                 99.9th percentile service time | index-append |   5610.4 |     ms |
|   All |                                  100th percentile service time | index-append |  9018.62 |     ms |
|   All |                                                     error rate | index-append |        0 |      % |


----------------------------------
[INFO] SUCCESS (took 1026 seconds)
----------------------------------

@saikaranam-amazon
Member Author

After reviewing the use-case, we are modifying the setting name to plugins.replication.index.translog.retention_lease.pruning.enabled to explicitly denote that the setting is primarily intended for cross-cluster replication.

@saikaranam-amazon
Member Author

Test for peer recovery

  • Setup
    • Cluster: 2 nodes
    • Dataset: eventdata
    • Re-routed the shards after ingestion and monitored recovery stats (a sketch of one way to pull these stats follows)
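
The stats below look like `_cat/recovery` output; a sketch of one way to pull them (assuming a node reachable on localhost:9200 and an index named eventdata):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PrintRecoveryStats {
    public static void main(String[] args) throws Exception {
        // Print per-shard recovery stats (same columns as the tables below).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_cat/recovery/eventdata?v"))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```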

With the setting disabled

index     shard time type        stage source_host   source_node   target_host   target_node   repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
eventdata 0     2.2m peer        index 172.31.66.242 172.31.66.242 172.31.72.250 172.31.72.250 n/a        n/a      109   109             100.0%        109         4915809903 4915809903      100.0%        4915809903  0            0                      100.0%

With the setting enabled

index     shard time type stage source_host   source_node   target_host   target_node   repository snapshot files files_recovered files_percent files_total bytes      bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
eventdata 0     2.2m peer done  172.31.72.250 172.31.72.250 172.31.66.242 172.31.66.242 n/a        n/a      109   109             100.0%        109         4915809903 4915809903      100.0%        4915809903  0            0                      100.0%

adnapibar added a commit that referenced this issue Oct 29, 2021
…n lease. (#1416)

The settings and the corresponding logic for translog pruning by retention lease, which were added as part of #1100, have been deprecated. This commit removes that deprecated code in favor of an extension point for providing a custom TranslogDeletionPolicy.

Signed-off-by: Rabi Panda <adnapibar@gmail.com>
nknize pushed a commit that referenced this issue Nov 1, 2021
…n lease. (#1416) (#1471)

The settings and the corresponding logic for translog pruning by retention lease, which were added as part of #1100, have been deprecated. This commit removes that deprecated code in favor of an extension point for providing a custom TranslogDeletionPolicy.

Signed-off-by: Rabi Panda <adnapibar@gmail.com>