Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CCR] Re-evaluate shard follow parameter defaults #31717

Closed
martijnvg opened this issue Jul 2, 2018 · 8 comments
Closed

[CCR] Re-evaluate shard follow parameter defaults #31717

martijnvg opened this issue Jul 2, 2018 · 8 comments
Assignees
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features

Comments

@martijnvg
Copy link
Member

in ShardFollowNodeTask and ShardFollowTask.

@martijnvg martijnvg added the :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features label Jul 2, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@dliappis
Copy link
Contributor

dliappis commented Sep 28, 2018

TL;DR

It's hard to come up with general defaults that are optimal for all cases, esp. in multi node environments with different (primary+replica) shard settings.

this comment shows two categories of defaults that work well in different scenarios:

  • 5P 1R shard settings, 3node clusters, Network storage, AWS+GCP
  • 1P 1R shard settings, 3node clusters, SSD storage, AWS+GCP

The next comment will include optimal defaults for:

  • 3P 1R shard settings, 3node clusters, SSD storage, AWS+GCP

The final defaults will be based on the results from all the shard setting scenarios, however, the detailed results per category are still useful for documentation and tuning guidance.

All benchmarks were done with the same commit from master (73417bf) that doesn't include the unmerged, as of now, append-only final optimizations in #34099.

5P 1R, Network SSD Storage, 3 node cluster

"max_concurrent_read_batches": 5,
"max_concurrent_write_batches": 5,
"max_batch_operation_count": 32768,
"max_batch_size_in_bytes": 4718592,
"max_write_buffer_size": 164000

NOTE: While these defaults have been benchmarked against a large amount of scenarios, they are still static parameters. Elasticsearch will not auto-adjust based on performance during CCR activity and performance depends also on the amount of shards. Therefore they will likely still need tuning depending on environmental peculiarities.

Benchmarks Executed

All benchmarks where executed on the Cloud (AWS and GCP).

GCP environment

1 cluster, 3 leader nodes in europe-west, each node in different AZ europe-west1-b, europe-west1-c, europe-west1-d.

1 cluster, 3 follower nodes in us-central, each node in different AZ us-central1-a, us-central1-b, us-central1-c.

1 load gen on europe-west (europe-west1-b).

Each node root disk size: 500GB EBS SSD.

Each instance is: n1-highcpu-16.

Minimum allowable processor, Skylake (https://cloud.google.com/compute/docs/machine-types).

14GiB RAM (7GiB heap)

AWS environment

1 cluster, 3 leader nodes in eu-central-1, each node in different AZ (us-east-2a, us-east-2b, us-east-2c).

1 cluster, 3 follower nodes in us-east-2, each node in different AZ (eu-central-1a, eu-central-1b, eu-central-1c).

Load gen node: (eu-central-1a).

Each node root disk size: 500GB Persistent Disk SSD.

Each instance is: c5.2xlarge.

15GiB RAM (7GiB heap).

Rally Details

(see https://github.com/elastic/rally-tracks, https://github.com/elastic/rally-eventdata-track)

Tracks used:

  • geopoint
  • pmc
  • http_logs
  • eventdata

Type of load: indexing only
Clients used (see schedule): 4

Results with recommended settings:

GCP:

Track # Indices Docs/index store.size pri.store.size Total Duration Time to catchup after indexing
geopoint 1 60844404 6.7GB 3.3GB 00:08:45 0:00:00.289002
pmc 1 574199 48.3GB 24.7GB 0:11:15 0:00:00.470718
http_logs 3 12406628/30700742/193424966 2/5/36.1GB 1/2.5/18.1GB 0:43:29 0:00:00.948574

AWS:

Track # Indices Docs/index store.size pri.store.size Total Duration Time to catchup after indexing
geopoint 1 60844404 6.7GB 3.3GB 0:09:53 0:00:00.209117
pmc 1 574199 41GB 20.5GB 0:35:46 0:00:10.334900
http_logs 3 12406628/30700742/193424966 2/5/33.1GB 1/2.5/16.6GB 0:41:17 0:01:41.313000
eventdata 1 1000040000 444.4GB 223.2GB 13:29:16 0:00:15.781900
Other details Total benchmarks executed:

GCP: 43 (gradually increasing max_concurrent_batch_read/write values and max_batch_size_in_bytes)
AWS: 7

1P 1R, Local SSD Storage, 3 node cluster

"max_concurrent_read_batches": 9,
"max_concurrent_write_batches": 9,
"max_batch_operation_count": 163840,
"max_batch_size_in_bytes": 4718592,
"max_write_buffer_size": 2942700

Notes

This setup exerts high pressure on I/O (primarily) and cpu (secondarily) , as essentially only two out of three nodes deal with the load.
The setup used already with EBS/Persistent Storage proved a bottleneck and results with the same settings, or settings with higher concurrency + buffers didn't improve replication performance (see details).

Trying locally attached SSD disks shifted the bottleneck to the cpu (see details).

After finding the right instances in terms of cpu and i/o performance it was possible to tune settings to provide the optimal performance.

Benchmarks Executed

Final AWS environment 1 cluster, 3 leader nodes in `eu-central-1`, each node in different AZ (`us-east-2a`, `us-east-2b`, `us-east-2c`).

1 cluster, 3 follower nodes in us-east-2, each node in different AZ (eu-central-1a, eu-central-1b, eu-central-1c).

Load gen node: (eu-central-1a).

Each node root disk size: 100GB (EBS) but Elasticsearch data dir and Rally ~/.rally dirs use /dev/nvme1n1 local SSD of the instance.

Each instance is: m5d.2xlarge.

30GiB RAM (15GiB heap)

(Also tried i3.xlarge, which didn't have enough CPU capacity to deal with http_logs benchmarks at max indexing performance).

GCP environment 1 cluster, 3 leader nodes in `europe-west`, each node in different AZ `europe-west4-a`, `europe-west4-b`, `europe-west4-c` (had to switch from europe-west1-* due to lack of resources).

1 cluster, 3 follower nodes in us-central, each node in different AZ us-central1-a, us-central1-b, us-central1-c.

1 load gen on europe-west (europe-west4-a).

Each node root disk size: 500GBEBS SSD but/.rallyand/es` reside on locally attached SSD disk.

Each instance is: n1-highcpu-16.

Minimum allowable processor, Skylake (https://cloud.google.com/compute/docs/machine-types).

14GiB RAM (7GiB heap).

Rally Details (see https://github.com/elastic/rally-tracks, https://github.com/elastic/rally-eventdata-track)

indexing was throttled to 45000 doc/s

Tracks used:

  • geopoint
  • pmc
  • http_logs
  • eventdata

Type of load: indexing only
Clients used (see schedule): 4

Results with recommended settings:

GCP:

Track # Indices Docs/index store.size pri.store.size Total Duration Time to catchup after indexing
geopoint 1 60844404 9.6GB 4.2GB 0:37:53 0:00:00.257276
pmc 1 574199 40.7GB 20.3GB 0:54:25 0:00:01.204230
http_logs 3 12406628/30700742/193424966 2/4.9/31.9GB 1/2.4/16.1GB 2:35:01 0:00:01.885290

AWS:

Track # Indices Docs/index store.size pri.store.size Total Duration Time to catchup after indexing
geopoint 1 60844404 8.1GB 4GB 0:23:45 0:00:00.213884
pmc 1 574199 43.1GB 22.5GB 0:23:29 0:00:00.951584
http_logs 3 12406628/30700742/193424966 2/5/33.1GB 1/2.5/16.6GB 1:29:10 0:00:18.619700

@dliappis
Copy link
Contributor

dliappis commented Oct 3, 2018

3P 1R, Network SSD Storage, 3 node cluster

"max_concurrent_read_batches": 5,
"max_concurrent_write_batches": 5,
"max_batch_operation_count": 32768,
"max_batch_size_in_bytes": 4718592,
"max_write_buffer_size": 164000

Notes

As expected, using 3p 1r shard settings is lighter on cpu and I/O pressure than the 1/1 scenario.

All benchmarks were executed using local SSD drives and http_logs in particular throttled at 85% of max indexing performance, as otherwise normalized CPU consumption on AWS ranged between 95-98% (the instance on AWS has less cpus).

AWS environment

1 cluster, 3 leader nodes in eu-central-1, each node in different AZ (us-east-2a, us-east-2b, us-east-2c).

1 cluster, 3 follower nodes in us-east-2, each node in different AZ (eu-central-1a, eu-central-1b, eu-central-1c).

Load gen node: (eu-central-1a).

Each node root disk size: 100GB (EBS) but Elasticsearch data dir and Rally ~/.rally dirs use /dev/nvme1n1 local SSD of the instance.

Each instance is: m5d.2xlarge.

30GiB RAM (15GiB heap)

(Also tried i3.xlarge, which didn't have enough CPU capacity to deal with http_logs benchmarks at max indexing performance).

GCP environment

1 cluster, 3 leader nodes in europe-west, each node in different AZ europe-west4-a, europe-west4-b, europe-west4-c.

1 cluster, 3 follower nodes in us-central, each node in different AZ us-central1-a, us-central1-b, us-central1-c.

1 load gen on europe-west (europe-west4-a).

Each node root disk size: 500GB EBS SSD but .rally/ and es/ reside on locally attached SSD disk.

Each instance is: n1-highcpu-16.

Minimum allowable processor, Skylake (https://cloud.google.com/compute/docs/machine-types).

14GiB RAM (7GiB heap).

Rally Details

(see https://github.com/elastic/rally-tracks, https://github.com/elastic/rally-eventdata-track)

Tracks used:

  • geopoint
  • pmc
  • http_logs -> max indexing throughput
  • http_logs -> indexing throughput throttled to 85% of max performance (55000docs/s on AWS, 85000docs/s on GCP)
  • eventdata -> max indexing throughput; instance's CPU, esp follower was pegged at 98%
  • eventdata -> indexing throughput throttled to 85% of max performance

Type of load: indexing only
Clients used (see schedule): 4

Results with recommended settings:

GCP:

Track # Indices Docs/index store.size pri.store.size Total Duration Time to catchup after indexing
geopoint 1 60844404 7.2GB 3.5GB 0:13:26 0:00:00.265027
pmc 1 574199 70.4GB 34.7GB 0:13:06 0:00:00.477828
http_logs (max indexing throughput) 3 12406628/30700742/193424966 2/4.9/31.8GB 1/2.4/15.9GB 1:01:33 0:00:00.778742
http_logs (throttled 85% of max indexing) 3 12406628/30700742/193424966 2/5/32.1GB 1/2.5/16.1GB 1:13:01 0:00:00.779165
eventdata (max indexing throughput) 3 1000040000 427.7GB 214GB 21:37:14 4:48:11.400000
eventdata (throttled 85% of max indexing) 3 1000040000 428.1GB 214.1GB 22:14:09 0:00:00.532950

AWS:

Track # Indices Docs/index store.size pri.store.size Total Duration Time to catchup after indexing
geopoint 1 60844404 7.3GB 3.6GB 0:08:39 0:00:27.475100
pmc 1 574199 61.1GB 30.3GB 0:12:55 0:00:00.269416
http_logs (max indexing throughput) 3 12406628/30700742/193424966 2/4.9/31.1GB 1/2.4/15.5GB 0:45:06 0:05:00.847000
http_logs (throttled 85% of max indexing) 3 12406628/30700742/193424966 2/4.9/31.1GB 1/2.4/15.5GB 0:48:30 0:01:00.925000
eventdata (max indexing throughput) 3 1000040000 424.1GB 212GB 18:21:09 5:23:26.500000
eventdata (throttled 85% of max indexing) 3 1000040000 428.2GB 214GB 15:27:13 0:00:01.125280

@dliappis
Copy link
Contributor

dliappis commented Oct 3, 2018

Next steps

With the append-only follower optimizations merged, next step is to rerun the above benchmarks (at least the ones that demonstrated high resource usage, esp CPU) and evaluate if the common defaults used for 5p/1r, 3p/1r scenario demonstrate less resource usage and whether they are good enough even for 1p/1r.

For reference here are some visualizations for CPU usage:

AWS http_logs 3P/1R unthrottled

CPU

image

AWS http_logs 3P/1R throttled to 85% of max indexing throughput

CPU:

image

Progress of follower vs leader with sequence numbers:

image

GCP http_logs 3P/1R unthrottled

CPU:

image

GCP http_logs 3P/1R throttled to 85% of max indexing throughput

CPU:

image

AWS http_logs 1P/1R throttled to 85% of max indexing throughput

CPU

image

GCP http_logs 1P/1R throttled to 85% of max indexing throughput

CPU

image

@bleskes
Copy link
Contributor

bleskes commented Oct 23, 2018

@dliappis and I did some more research and here is a summary of our discoveries:

  1. Cross region network becomes more efficient when multiple connections are used concurrently. When we run with one reader the network added roughly 700ms on average to each request. With 2 concurrent readers it dropped to 400ms. With 4 readers it dropped even further. We saw this behavior both on GCP and AWS.
  2. The system worked nicely (i.e., the following index kept up nicely) when indexing to a single shard leader index on a 16 core CPU instance with 8 concurrent clients. The indexing rate was roughly 100K http log lines per second. The following index was set up on an identical machine and we used up to 8 concurrent read requests and 8 concurrent write requests. Increasing the the budget to 16r/16w didn't help performance and the read/write budget wasn't fully utilized. All of these experiments used a batch size of 5K ops, 5MB max batch size and an unlimited write buffer.

A few observations that we have made will doing the experiments:

  1. Readers have more work than writers: readers need to unpack lucene stored field blocks and extract the source - this is more work than a write needs to do for small documents as indexing involves parsing them and putting them in an in memory buffer.
  2. Both read and write settings are a maximum - concurrent requests will be allocated only if needed.
  3. By default we use 6 channels to connect to a remote cluster. This value has been inherited from CCS.
  4. On a busy cluster CCR caused ~10% reduction in indexing speed on the leader. On a non-saturated machine (same machine as above but with 4 concurrent writes), CCR cause ~5% slow downs. We are still researching why.

W.r.t defaults we have discussed multiple options, including being smart and using the number of processors on the machines to figure out our defaults. That said I feel we don't have a good enough grip of all the possible use cases/machines and it will take a considerable effort to get it, with unclear chance of actually getting a clear and simple picture. I'd prefer going with simple numbers that are easy to communicate and change. Once we gather more input we can adapt them.

With that in mind, I suggest going with the following defaults:

  1. Keep the currently hard coded 6 channels per remote cluster. I think we should consider making it configurable.
  2. Default to a maximum of 12 concurrent read requests (double the channel count). This a conservative setting that will allow for some room for faster networks (local LANs) where the ration between disk reads and network is different.
  3. Default to a maximum of 8 concurrent write requests. This is less than the number of readers based on expected ratio of work.

@dliappis did I forget anything?

bleskes added a commit to bleskes/elasticsearch that referenced this issue Oct 24, 2018
Per elastic#31717 this commits changes the defaults to the following:

Batch size of 5120 ops.
Maximum of 12 concurrent read requests.
Maximum of 9 concurrent write requests.
bleskes added a commit that referenced this issue Oct 24, 2018
Per #31717 this commit changes the defaults to the following:

Batch size of 5120 ops.
Maximum of 12 concurrent read requests.
Maximum of 9 concurrent write requests.

This is not necessarily our final values but it's good to have these as defaults for the purposes of initial testing.
bleskes added a commit that referenced this issue Oct 24, 2018
Per #31717 this commit changes the defaults to the following:

Batch size of 5120 ops.
Maximum of 12 concurrent read requests.
Maximum of 9 concurrent write requests.

This is not necessarily our final values but it's good to have these as defaults for the purposes of initial testing.
@dliappis
Copy link
Contributor

@bleskes Thank you for the concise summation.

Maybe worth mentioning that all experiments were done on instances using a locally attached SSDs (i.e. not AWS's EBS/GCP's persistent disks).

What initial defaults should we use for max_batch_size? So far we've been using 5MB but the current default is Long.MAX_VALUE, so we can also choose to leave it unchanged and rely only on the max_batch_operation_count == 5120 that you recently introduced in your PR.

I'd also have added the need to clarify defaults for max_write_buffer_count but this is now being worked on #34797 (and there's a new parameter max_write_buffer_bytes).

kcm pushed a commit that referenced this issue Oct 30, 2018
Per #31717 this commit changes the defaults to the following:

Batch size of 5120 ops.
Maximum of 12 concurrent read requests.
Maximum of 9 concurrent write requests.

This is not necessarily our final values but it's good to have these as defaults for the purposes of initial testing.
@dliappis
Copy link
Contributor

Here's a summary of what has happened in the mean time.

We devised a testing plan comprising 3 stages:

  • Stage 0:

    • Verify the impact of x-pack-security
    • Verify the impact of non-append workloadds
  • Stage 1:

    • Validate max_ defaults allow the follower to always catch up using 3 workloads (geopoint -> very small docs, http_logs -> medium doc size, large corpus and pmc -> very large docs) on both AWS and GCP cloud environments.
    • Compare CCR overhead vs "no CCR andsoft_deletes: false" as well as vs "no CCR with soft_deletes: true".
  • Stage 2:

    • Examine ability to catch up, system overhead, and stability with a larger worload (1bn docs, continuous execution over 3+ days) on both AWS and GCP cloud environments.
    • Compare CCR overhead.

Observations/Actions:

  1. We observed long GC pauses with work loads containing large doc sizes (pmc track) and changed the max_read_request_size to 32GB (from Long.MAX_VALUE i.e. practically unlimited).

  2. Observed a penalty in indexing throughput when CCR deals with conflicts (25% conflicts from total size, favoring most recent 25% of docs) as opposed to the same workload without CCR and soft_deletes. Initial calculation on a smaller 1-node environment on GCP showed this penalty to be ~12.5%, however
    more recent experiments shows this to be ~6% on a larger 3-node env in AWS.

    Additionally, we observed a ~19% penalty for merges_total_time_ and ~108% penalty in refresh_total_time. This PR has been raised to improve both metrics via a configurable soft_deletes.reclaim.delay parameter, which defaults to 1minute.

  3. Stage 1 (3 node) benchmarks across all tracks and cloud vendor combinations showed here and here that followers were always able to catch up with (<1s to catch up once indexing was over).

  4. Stage 1 benchmarks demonstrated a ~2.37-5% penalty in median indexing throughput when CCR is enabled (compared to indexing without CCR and no soft_deletes) on AWS. On GCP we've have gotten more inconsistent results, sometimes exhibiting the same 4-5% penalty but also not showing any penalty (or higher performance) which isn't clear why. The assumption at this point is that it's due to environmental factors e.g. noisy neighours but requires further investigation with randomized benchmarks at different hours. (See also following bullet 6. for a comparison of CCR overhead using a larger track over a longer period of time).

  5. 3-day benchmarks (Stage 2), using throttled indexing throughput of 3800 docs/s, finished successfully, followers always able to catch up and stable indexing throughput on both AWS and GCP.

  6. 1bn doc workload at maximum indexing throughput (16 clients) also finished successfully, following cluster was able to catch up. Comparison of the same workload with CCR and soft_deletes disabled showed that CCR has a ~6% penalty on AWS and a ~4.3% penalty on GCP in median indexing throughput.

  7. Finally, as CCR was progressing throughout the execution of the ^^ benchmarks we re-executed the 1bn doc track a few days before the 6.5.0 release to ensure there was no regression in the indexing throughput, ability for the follower to catch up or system resource utilization and didn't spot any issues.

@bleskes Did I miss anything?

NOTE: Some of the Kibana links in the gist point to a private Cloud cluster.

@dnhatn
Copy link
Member

dnhatn commented Mar 8, 2019

I am closing since we have good default parameters now. Thanks @dliappis for the awesome work.

@dnhatn dnhatn closed this as completed Mar 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features
Projects
None yet
Development

No branches or pull requests

5 participants