
Add BKD Optimization to Range aggregation #47712

Closed
wants to merge 4 commits

Conversation

polyfractal (Contributor)

If the range aggregator

  1. is a top-level agg (no parent)
  2. matches all the documents (match_all query)
  3. does not contain unbounded ranges

then we can use the BKD tree to match documents faster than collecting all the documents via the normal DocValues method.

If the aggregation meets the criteria above, the range agg will attempt to collect by BKD. There is an additional constraint: the estimated number of matching points should be < maxDocs * 0.75. The BKD approach loses its advantage over DV when most of the index is being aggregated into at least one range.
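The checks above can be condensed into a single predicate. The following is an illustrative sketch, not the PR's actual code; the class, method, and parameter names are hypothetical.

```java
// Hypothetical condensation of the eligibility checks described above;
// names are illustrative, not the actual implementation in this PR.
public class BkdEligibility {
    public static boolean canCollectViaBkd(boolean hasParentAgg,
                                           boolean isMatchAllQuery,
                                           boolean hasUnboundedRange,
                                           long estimatedPoints,
                                           int maxDoc) {
        if (hasParentAgg || !isMatchAllQuery || hasUnboundedRange) {
            return false; // criteria 1-3 not met: fall back to doc-values collection
        }
        // Additional constraint: BKD loses its edge when most of the index
        // falls into at least one range.
        return estimatedPoints < maxDoc * 0.75;
    }
}
```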

As a side effect, this also introduces some new methods on NumberFieldMapper so that we can encode the ranges into the Point format without knowing about the underlying format (float, double, long, etc).
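To illustrate what "encoding into the Point format" means, here is a standalone re-implementation of the sortable big-endian convention Lucene uses for longs (the same scheme as NumericUtils.longToSortableBytes, which LongPoint uses); the new NumberFieldMapper hooks dispatch to the equivalent per-type encoders. This sketch avoids the Lucene dependency and is illustrative only.

```java
public class PointEncodingSketch {
    // Standalone sketch of Lucene's sortable long encoding: flip the sign
    // bit so that unsigned byte-wise comparison matches signed numeric order,
    // then serialize big-endian.
    public static byte[] encodeLong(long value) {
        long sortable = value ^ 0x8000000000000000L;
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) sortable;
            sortable >>>= 8;
        }
        return out; // lexicographic order on these bytes == numeric order
    }
}
```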

From tests, this appears to give 70-90% lower latency in the optimized case without grossly affecting the un-optimized case.

Benchmarking

I tried a bunch of different tests. A brief overview:

  • single_range_dense: one range that matches a large percentage of the index
  • single_range_sparse: one range that matches a small percentage of the index
  • multiple_ranges_non_contiguous: several ranges which have gaps between each range
  • multiple_ranges_contiguous: several ranges, but with no gaps so they form a single, contiguous range
  • unbounded_ranges: first range has unbounded from, last range has unbounded to
  • overlapping_ranges: ranges that overlap the prior range to some degree
  • with_parent_non_contiguous: same ranges as multiple_ranges_non_contiguous, except the range is embedded under a parent aggregation (terms agg)
  • subagg_*: all of the above tests, but with a sub-agg to collect (sum aggregation)
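For concreteness, a query of the multiple_ranges_non_contiguous shape might look like the following (the field name and bounds here are made up for illustration; note the match_all query, no parent agg, and fully bounded ranges, so it qualifies for the optimization):

```json
{
  "size": 0,
  "query": { "match_all": {} },
  "aggs": {
    "response_buckets": {
      "range": {
        "field": "response_time",
        "ranges": [
          { "from": 0,   "to": 100 },
          { "from": 200, "to": 300 },
          { "from": 400, "to": 500 }
        ]
      }
    }
  }
}
```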

Test setup was 12 clients, 50 warmup queries, 50 iteration queries, and a target throughput of 10 queries/second. This was executed on a machine with 32 cores (two CPUs, 16 physical cores, 32 with hyperthreading). The 10 qps target was chosen because all the tests could sustain that throughput.

The test data was the eventdata rally track: 20m apache logs, ~15gb, five shards and no replicas.

The tl;dr is that most of the tests show 70-90% lower latency. Unbounded ranges and parent tests are roughly the same as baseline because those paths are un-optimized.

Note: The two parent tests are actually a bit slower (6%, 17%) than baseline. I'm not sure if this is real or just noise. Going to run a few more trials and statistical tests to see if the result is real. You can see visually that the distributions are a bit shifted so I think a t-test would agree. But it might be noise from using the machine at the same time (I know, sorry :( ). Will run some overnight tests to rule out interference.


Rally results
|                                                        Metric |                                   Task |    Baseline |   Contender |     Diff |   Unit |
|--------------------------------------------------------------:|---------------------------------------:|------------:|------------:|---------:|-------:|
|                                                Min Throughput |                    single_range_dense1 |     9.96467 |     9.99072 |  0.02605 |  ops/s |
|                                             Median Throughput |                    single_range_dense1 |     9.97812 |     10.0062 |  0.02808 |  ops/s |
|                                                Max Throughput |                    single_range_dense1 |     10.0783 |      10.011 | -0.06727 |  ops/s |
|                                       50th percentile latency |                    single_range_dense1 |      351.71 |     55.0738 | -296.637 |     ms |
|                                       90th percentile latency |                    single_range_dense1 |     379.788 |     60.9318 | -318.857 |     ms |
|                                       99th percentile latency |                    single_range_dense1 |     412.644 |     193.375 | -219.268 |     ms |
|                                      100th percentile latency |                    single_range_dense1 |     431.247 |     202.228 |  -229.02 |     ms |
|                                  50th percentile service time |                    single_range_dense1 |      350.47 |     53.8654 | -296.604 |     ms |
|                                  90th percentile service time |                    single_range_dense1 |     377.771 |     59.7318 |  -318.04 |     ms |
|                                  99th percentile service time |                    single_range_dense1 |      410.72 |     192.487 | -218.232 |     ms |
|                                 100th percentile service time |                    single_range_dense1 |     428.747 |     201.001 | -227.746 |     ms |
|                                                    error rate |                    single_range_dense1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |                   single_range_sparse1 |     9.96874 |     10.0076 |  0.03889 |  ops/s |
|                                             Median Throughput |                   single_range_sparse1 |     9.98276 |     10.0102 |  0.02741 |  ops/s |
|                                                Max Throughput |                   single_range_sparse1 |     9.99273 |      10.015 |  0.02227 |  ops/s |
|                                       50th percentile latency |                   single_range_sparse1 |     305.613 |     8.72943 | -296.884 |     ms |
|                                       90th percentile latency |                   single_range_sparse1 |     326.595 |     10.5276 | -316.067 |     ms |
|                                       99th percentile latency |                   single_range_sparse1 |     343.944 |     12.0985 | -331.845 |     ms |
|                                      100th percentile latency |                   single_range_sparse1 |     360.575 |     38.4084 | -322.167 |     ms |
|                                  50th percentile service time |                   single_range_sparse1 |     304.231 |     7.31075 |  -296.92 |     ms |
|                                  90th percentile service time |                   single_range_sparse1 |     325.632 |     9.33232 |   -316.3 |     ms |
|                                  99th percentile service time |                   single_range_sparse1 |     342.953 |     10.8566 | -332.096 |     ms |
|                                 100th percentile service time |                   single_range_sparse1 |     359.352 |     37.1539 | -322.198 |     ms |
|                                                    error rate |                   single_range_sparse1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |        multiple_ranges_non_contiguous1 |     9.96779 |      10.004 |  0.03619 |  ops/s |
|                                             Median Throughput |        multiple_ranges_non_contiguous1 |     9.97976 |     10.0097 |  0.02992 |  ops/s |
|                                                Max Throughput |        multiple_ranges_non_contiguous1 |     9.98881 |     10.0147 |  0.02588 |  ops/s |
|                                       50th percentile latency |        multiple_ranges_non_contiguous1 |     337.067 |     17.3547 | -319.713 |     ms |
|                                       90th percentile latency |        multiple_ranges_non_contiguous1 |     359.589 |     20.9025 | -338.686 |     ms |
|                                       99th percentile latency |        multiple_ranges_non_contiguous1 |     382.382 |      63.451 | -318.931 |     ms |
|                                      100th percentile latency |        multiple_ranges_non_contiguous1 |     390.834 |     65.3144 |  -325.52 |     ms |
|                                  50th percentile service time |        multiple_ranges_non_contiguous1 |      336.13 |     16.1231 | -320.007 |     ms |
|                                  90th percentile service time |        multiple_ranges_non_contiguous1 |     358.217 |     19.6834 | -338.533 |     ms |
|                                  99th percentile service time |        multiple_ranges_non_contiguous1 |     380.087 |      62.211 | -317.876 |     ms |
|                                 100th percentile service time |        multiple_ranges_non_contiguous1 |     389.931 |     64.0478 | -325.883 |     ms |
|                                                    error rate |        multiple_ranges_non_contiguous1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |            multiple_ranges_contiguous1 |     9.96242 |     10.0009 |  0.03847 |  ops/s |
|                                             Median Throughput |            multiple_ranges_contiguous1 |     9.97847 |     10.0076 |  0.02908 |  ops/s |
|                                                Max Throughput |            multiple_ranges_contiguous1 |     10.0847 |     10.0131 | -0.07164 |  ops/s |
|                                       50th percentile latency |            multiple_ranges_contiguous1 |     345.639 |     33.2116 | -312.427 |     ms |
|                                       90th percentile latency |            multiple_ranges_contiguous1 |      371.01 |     61.3064 | -309.703 |     ms |
|                                       99th percentile latency |            multiple_ranges_contiguous1 |     392.626 |     102.872 | -289.755 |     ms |
|                                      100th percentile latency |            multiple_ranges_contiguous1 |      402.19 |      108.87 |  -293.32 |     ms |
|                                  50th percentile service time |            multiple_ranges_contiguous1 |     344.964 |     31.8857 | -313.078 |     ms |
|                                  90th percentile service time |            multiple_ranges_contiguous1 |     368.602 |     58.0173 | -310.585 |     ms |
|                                  99th percentile service time |            multiple_ranges_contiguous1 |     391.753 |      101.64 | -290.113 |     ms |
|                                 100th percentile service time |            multiple_ranges_contiguous1 |      401.26 |     107.648 | -293.613 |     ms |
|                                                    error rate |            multiple_ranges_contiguous1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |                      unbounded_ranges1 |     9.91478 |     9.94318 |  0.02841 |  ops/s |
|                                             Median Throughput |                      unbounded_ranges1 |     9.95934 |     9.96492 |  0.00559 |  ops/s |
|                                                Max Throughput |                      unbounded_ranges1 |     10.0797 |     9.98224 | -0.09742 |  ops/s |
|                                       50th percentile latency |                      unbounded_ranges1 |     521.416 |     488.072 | -33.3443 |     ms |
|                                       90th percentile latency |                      unbounded_ranges1 |     562.653 |     524.355 | -38.2987 |     ms |
|                                       99th percentile latency |                      unbounded_ranges1 |     1001.26 |     555.819 | -445.446 |     ms |
|                                      100th percentile latency |                      unbounded_ranges1 |     1174.01 |     577.722 | -596.287 |     ms |
|                                  50th percentile service time |                      unbounded_ranges1 |     520.657 |      487.23 | -33.4272 |     ms |
|                                  90th percentile service time |                      unbounded_ranges1 |     560.978 |     523.194 | -37.7834 |     ms |
|                                  99th percentile service time |                      unbounded_ranges1 |     1000.55 |     553.985 | -446.562 |     ms |
|                                 100th percentile service time |                      unbounded_ranges1 |     1173.22 |     576.654 | -596.565 |     ms |
|                                                    error rate |                      unbounded_ranges1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |                    overlapping_ranges1 |     9.96373 |     10.0015 |  0.03782 |  ops/s |
|                                             Median Throughput |                    overlapping_ranges1 |     9.97753 |     10.0067 |   0.0292 |  ops/s |
|                                                Max Throughput |                    overlapping_ranges1 |     9.98913 |      10.011 |  0.02183 |  ops/s |
|                                       50th percentile latency |                    overlapping_ranges1 |     338.362 |     41.7786 | -296.584 |     ms |
|                                       90th percentile latency |                    overlapping_ranges1 |     365.665 |     77.3572 | -288.307 |     ms |
|                                       99th percentile latency |                    overlapping_ranges1 |      385.64 |     104.236 | -281.405 |     ms |
|                                      100th percentile latency |                    overlapping_ranges1 |     411.049 |     110.344 | -300.706 |     ms |
|                                  50th percentile service time |                    overlapping_ranges1 |     337.413 |     40.6249 | -296.788 |     ms |
|                                  90th percentile service time |                    overlapping_ranges1 |     364.706 |     75.8103 | -288.896 |     ms |
|                                  99th percentile service time |                    overlapping_ranges1 |     384.698 |     103.356 | -281.342 |     ms |
|                                 100th percentile service time |                    overlapping_ranges1 |     410.661 |      109.35 | -301.311 |     ms |
|                                                    error rate |                    overlapping_ranges1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |            with_parent_non_contiguous1 |     9.89195 |     9.87149 | -0.02046 |  ops/s |
|                                             Median Throughput |            with_parent_non_contiguous1 |     9.92576 |     9.92004 | -0.00572 |  ops/s |
|                                                Max Throughput |            with_parent_non_contiguous1 |     9.99976 |     9.99981 |    5e-05 |  ops/s |
|                                       50th percentile latency |            with_parent_non_contiguous1 |      943.65 |     1004.47 |  60.8185 |     ms |
|                                       90th percentile latency |            with_parent_non_contiguous1 |     1099.67 |     1162.92 |  63.2501 |     ms |
|                                       99th percentile latency |            with_parent_non_contiguous1 |     1176.78 |     1891.27 |  714.484 |     ms |
|                                      100th percentile latency |            with_parent_non_contiguous1 |     1830.06 |     1926.52 |  96.4556 |     ms |
|                                  50th percentile service time |            with_parent_non_contiguous1 |     943.246 |     998.188 |  54.9417 |     ms |
|                                  90th percentile service time |            with_parent_non_contiguous1 |      1094.7 |     1137.45 |  42.7478 |     ms |
|                                  99th percentile service time |            with_parent_non_contiguous1 |     1167.45 |     1781.02 |  613.569 |     ms |
|                                 100th percentile service time |            with_parent_non_contiguous1 |     1823.62 |     1909.71 |  86.0859 |     ms |
|                                                    error rate |            with_parent_non_contiguous1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |             subagg_single_range_dense1 |     9.95243 |      9.9859 |  0.03347 |  ops/s |
|                                             Median Throughput |             subagg_single_range_dense1 |     9.97069 |     9.99583 |  0.02515 |  ops/s |
|                                                Max Throughput |             subagg_single_range_dense1 |     10.0096 |      9.9997 | -0.00992 |  ops/s |
|                                       50th percentile latency |             subagg_single_range_dense1 |     420.574 |     155.614 |  -264.96 |     ms |
|                                       90th percentile latency |             subagg_single_range_dense1 |     456.612 |     167.436 | -289.176 |     ms |
|                                       99th percentile latency |             subagg_single_range_dense1 |     485.485 |     258.417 | -227.068 |     ms |
|                                      100th percentile latency |             subagg_single_range_dense1 |     492.337 |     293.077 |  -199.26 |     ms |
|                                  50th percentile service time |             subagg_single_range_dense1 |     419.795 |     154.096 | -265.699 |     ms |
|                                  90th percentile service time |             subagg_single_range_dense1 |     454.524 |     165.469 | -289.055 |     ms |
|                                  99th percentile service time |             subagg_single_range_dense1 |     484.589 |     257.738 | -226.851 |     ms |
|                                 100th percentile service time |             subagg_single_range_dense1 |     491.499 |     291.984 | -199.515 |     ms |
|                                                    error rate |             subagg_single_range_dense1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |            subagg_single_range_sparse1 |     9.96924 |     10.0063 |  0.03702 |  ops/s |
|                                             Median Throughput |            subagg_single_range_sparse1 |     9.98134 |       10.01 |  0.02867 |  ops/s |
|                                                Max Throughput |            subagg_single_range_sparse1 |     9.99262 |     10.0151 |  0.02246 |  ops/s |
|                                       50th percentile latency |            subagg_single_range_sparse1 |     315.686 |     8.46949 | -307.217 |     ms |
|                                       90th percentile latency |            subagg_single_range_sparse1 |     341.828 |     10.8845 | -330.943 |     ms |
|                                       99th percentile latency |            subagg_single_range_sparse1 |     411.802 |     47.2574 | -364.545 |     ms |
|                                      100th percentile latency |            subagg_single_range_sparse1 |     423.958 |     48.5549 | -375.404 |     ms |
|                                  50th percentile service time |            subagg_single_range_sparse1 |     314.208 |     7.35148 | -306.857 |     ms |
|                                  90th percentile service time |            subagg_single_range_sparse1 |       340.2 |     9.59369 | -330.606 |     ms |
|                                  99th percentile service time |            subagg_single_range_sparse1 |     409.384 |     45.9227 | -363.461 |     ms |
|                                 100th percentile service time |            subagg_single_range_sparse1 |     423.018 |     47.7502 | -375.268 |     ms |
|                                                    error rate |            subagg_single_range_sparse1 |           0 |           0 |        0 |      % |
|                                                Min Throughput | subagg_multiple_ranges_non_contiguous1 |     9.95972 |     10.0033 |  0.04356 |  ops/s |
|                                             Median Throughput | subagg_multiple_ranges_non_contiguous1 |       9.978 |     10.0088 |  0.03081 |  ops/s |
|                                                Max Throughput | subagg_multiple_ranges_non_contiguous1 |     9.98891 |     10.0132 |   0.0243 |  ops/s |
|                                       50th percentile latency | subagg_multiple_ranges_non_contiguous1 |       350.8 |     27.6223 | -323.178 |     ms |
|                                       90th percentile latency | subagg_multiple_ranges_non_contiguous1 |     373.893 |     32.7482 | -341.145 |     ms |
|                                       99th percentile latency | subagg_multiple_ranges_non_contiguous1 |     401.801 |     74.2184 | -327.583 |     ms |
|                                      100th percentile latency | subagg_multiple_ranges_non_contiguous1 |     420.371 |     78.2408 |  -342.13 |     ms |
|                                  50th percentile service time | subagg_multiple_ranges_non_contiguous1 |     349.596 |     26.3436 | -323.252 |     ms |
|                                  90th percentile service time | subagg_multiple_ranges_non_contiguous1 |      372.59 |     31.3181 | -341.272 |     ms |
|                                  99th percentile service time | subagg_multiple_ranges_non_contiguous1 |     400.921 |     73.3131 | -327.608 |     ms |
|                                 100th percentile service time | subagg_multiple_ranges_non_contiguous1 |     419.721 |     77.0005 |  -342.72 |     ms |
|                                                    error rate | subagg_multiple_ranges_non_contiguous1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |     subagg_multiple_ranges_contiguous1 |     9.95869 |     9.99527 |  0.03659 |  ops/s |
|                                             Median Throughput |     subagg_multiple_ranges_contiguous1 |     9.97774 |     10.0031 |  0.02535 |  ops/s |
|                                                Max Throughput |     subagg_multiple_ranges_contiguous1 |     10.1084 |     10.0078 | -0.10059 |  ops/s |
|                                       50th percentile latency |     subagg_multiple_ranges_contiguous1 |     371.056 |     78.9161 |  -292.14 |     ms |
|                                       90th percentile latency |     subagg_multiple_ranges_contiguous1 |      401.35 |     99.6151 | -301.735 |     ms |
|                                       99th percentile latency |     subagg_multiple_ranges_contiguous1 |     421.268 |     182.958 |  -238.31 |     ms |
|                                      100th percentile latency |     subagg_multiple_ranges_contiguous1 |     433.057 |     187.969 | -245.088 |     ms |
|                                  50th percentile service time |     subagg_multiple_ranges_contiguous1 |     370.143 |      77.157 | -292.986 |     ms |
|                                  90th percentile service time |     subagg_multiple_ranges_contiguous1 |     400.265 |     98.4407 | -301.824 |     ms |
|                                  99th percentile service time |     subagg_multiple_ranges_contiguous1 |      420.66 |      181.77 |  -238.89 |     ms |
|                                 100th percentile service time |     subagg_multiple_ranges_contiguous1 |     427.662 |     186.799 | -240.863 |     ms |
|                                                    error rate |     subagg_multiple_ranges_contiguous1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |               subagg_unbounded_ranges1 |     9.89574 |     9.88919 | -0.00655 |  ops/s |
|                                             Median Throughput |               subagg_unbounded_ranges1 |     9.93153 |      9.9298 | -0.00173 |  ops/s |
|                                                Max Throughput |               subagg_unbounded_ranges1 |     9.96604 |     9.98439 |  0.01835 |  ops/s |
|                                       50th percentile latency |               subagg_unbounded_ranges1 |     859.739 |     837.756 | -21.9834 |     ms |
|                                       90th percentile latency |               subagg_unbounded_ranges1 |      928.35 |     913.553 | -14.7964 |     ms |
|                                       99th percentile latency |               subagg_unbounded_ranges1 |     981.977 |     964.165 |  -17.812 |     ms |
|                                      100th percentile latency |               subagg_unbounded_ranges1 |     1016.18 |     992.742 | -23.4352 |     ms |
|                                  50th percentile service time |               subagg_unbounded_ranges1 |     859.315 |     837.378 | -21.9377 |     ms |
|                                  90th percentile service time |               subagg_unbounded_ranges1 |     927.876 |     913.137 | -14.7395 |     ms |
|                                  99th percentile service time |               subagg_unbounded_ranges1 |     981.589 |     963.828 | -17.7609 |     ms |
|                                 100th percentile service time |               subagg_unbounded_ranges1 |     1015.83 |     992.323 | -23.5038 |     ms |
|                                                    error rate |               subagg_unbounded_ranges1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |             subagg_overlapping_ranges1 |     9.95719 |     9.97401 |  0.01682 |  ops/s |
|                                             Median Throughput |             subagg_overlapping_ranges1 |      9.9763 |     9.99986 |  0.02356 |  ops/s |
|                                                Max Throughput |             subagg_overlapping_ranges1 |     10.0788 |     10.0186 | -0.06024 |  ops/s |
|                                       50th percentile latency |             subagg_overlapping_ranges1 |      386.25 |     113.525 | -272.725 |     ms |
|                                       90th percentile latency |             subagg_overlapping_ranges1 |     411.475 |     188.045 | -223.429 |     ms |
|                                       99th percentile latency |             subagg_overlapping_ranges1 |     429.831 |     348.727 | -81.1032 |     ms |
|                                      100th percentile latency |             subagg_overlapping_ranges1 |     437.367 |     368.382 | -68.9851 |     ms |
|                                  50th percentile service time |             subagg_overlapping_ranges1 |      385.45 |     112.026 | -273.424 |     ms |
|                                  90th percentile service time |             subagg_overlapping_ranges1 |     410.226 |     184.887 | -225.339 |     ms |
|                                  99th percentile service time |             subagg_overlapping_ranges1 |     428.136 |     347.582 | -80.5538 |     ms |
|                                 100th percentile service time |             subagg_overlapping_ranges1 |     436.533 |      367.24 | -69.2931 |     ms |
|                                                    error rate |             subagg_overlapping_ranges1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |     subagg_with_parent_non_contiguous1 |     9.86526 |     9.84599 | -0.01927 |  ops/s |
|                                             Median Throughput |     subagg_with_parent_non_contiguous1 |     9.92039 |     9.91319 |  -0.0072 |  ops/s |
|                                                Max Throughput |     subagg_with_parent_non_contiguous1 |     10.0075 |     9.98141 | -0.02605 |  ops/s |
|                                       50th percentile latency |     subagg_with_parent_non_contiguous1 |     957.492 |     1123.11 |  165.614 |     ms |
|                                       90th percentile latency |     subagg_with_parent_non_contiguous1 |     1100.21 |     1805.08 |  704.871 |     ms |
|                                       99th percentile latency |     subagg_with_parent_non_contiguous1 |     1163.41 |     1953.88 |   790.47 |     ms |
|                                      100th percentile latency |     subagg_with_parent_non_contiguous1 |     1828.93 |     2036.14 |  207.217 |     ms |
|                                  50th percentile service time |     subagg_with_parent_non_contiguous1 |     956.972 |     1114.01 |  157.042 |     ms |
|                                  90th percentile service time |     subagg_with_parent_non_contiguous1 |     1095.24 |     1213.23 |  117.997 |     ms |
|                                  99th percentile service time |     subagg_with_parent_non_contiguous1 |     1153.07 |     1576.21 |  423.145 |     ms |
|                                 100th percentile service time |     subagg_with_parent_non_contiguous1 |     1826.42 |     1866.21 |   39.791 |     ms |
|                                                    error rate |     subagg_with_parent_non_contiguous1 |           0 |           0 |        0 |      % |

@elasticmachine (Collaborator)

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

estimatedPoints += pointValues.estimatePointCount(visitors[i]);
}

if (estimatedPoints < maxDoc * 0.75) {
Contributor Author

It's not clear to me what we should actually set this threshold to. 75% might be too aggressive, since the BKD estimate could under-estimate some leaves considerably.

Contributor

I wonder if maxDoc is the right parameter as we are counting values. Maybe we should be using pointValues.size() instead?

Is this heuristic extracted from performance tests?

Contributor Author

Nothing very scientific at all, happy to change it for something better :)

It was derived from performance tests only in the sense that ranges which span most of the data tend to be the same speed (or slower), and with a bit more overhead, than just collecting them via DocValues.

The rationale was that the DocValues approach has to touch maxDocs documents in the collect loop, so I wanted to "collect" fewer than that many values from the BKD tree. But it's not directly comparable since they have different costs (and even in the tree, collecting all the docs from one leaf is very different from having to unpack individual values).

Contributor

I'm actually surprised this approach is not always faster. I wonder whether we're doing something stupid that slows things down, or maybe DocIdSetBuilder is more expensive than I think. It would be interesting to run under a profiler to get a sense of the relative cost of intersecting the BKD tree vs. collecting doc IDs.

@polyfractal polyfractal requested a review from iverase October 7, 2019 23:19
Contributor

@iverase iverase left a comment

In general this looks good. I am just a bit worried about the threshold used and how it is calculated, but it is probably good enough. Just want to make sure I understand it.

// estimation won't call `grow()` on the visitor (only once we start intersecting)
results[i] = new DocIdSetBuilder(maxDoc);
visitors[i] = getVisitor(liveDocs, encodedRanges[i * 2], encodedRanges[i * 2 + 1], results[i]);
estimatedPoints += pointValues.estimatePointCount(visitors[i]);
Contributor

@iverase iverase Oct 8, 2019

This operation is fast, but it still has a cost. I wonder if we should break as soon as we are over the threshold.

Contributor Author

Ah good point, agreed ++
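For illustration, a stdlib-only sketch of the suggested early exit (names are illustrative, not the PR's code; the per-range estimates stand in for `pointValues.estimatePointCount(visitor)` and the `0.75` factor comes from the snippet above):

```java
// A stdlib-only sketch (illustrative names, not the PR's code) of the
// early-exit idea: stop summing per-range estimates as soon as the
// 0.75 * maxDoc threshold is crossed, instead of estimating every range.
class EstimateSketch {
    static boolean withinThreshold(long[] perRangeEstimates, long maxDoc) {
        final long threshold = (long) (maxDoc * 0.75);
        long estimatedPoints = 0;
        for (long estimate : perRangeEstimates) {
            estimatedPoints += estimate;
            if (estimatedPoints >= threshold) {
                return false; // bail out early: the DocValues path is likely cheaper
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(withinThreshold(new long[] {100, 200}, 1000)); // prints true
        System.out.println(withinThreshold(new long[] {500, 400}, 1000)); // prints false
    }
}
```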

Contributor

@jpountz jpountz left a comment

This idea sounds very appealing. I think the main issue right now is that it only collects doc IDs in order on a per-bucket basis, while the contract is that documents be collected in order regardless of their bucket ID. This would be an issue if there are sub-aggregations that work on doc values (e.g. a min aggregation on a long field). I can think of two ways that we could address it:

  • Merge iterators using a heap to emit doc IDs in order.
  • Create a new sub-tree of aggregators for each bucket, similarly to what we do in MultiBucketAggregatorWrapper.

The latter is probably faster than the former, but it might be more challenging too (or not?).
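The first option can be sketched with plain JDK types (the real version would merge Lucene `DocIdSetIterator`s; sorted int arrays stand in here, and all names are illustrative):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Stdlib-only sketch of the heap-merge option: interleave per-bucket doc ID
// streams so sub-aggregations see docs in global order.
class HeapMerge {
    // Emits "doc:bucketOrd" pairs ordered by doc ID, then bucket ordinal.
    static List<String> mergeInDocOrder(int[][] perBucketDocs) {
        // heap entries: {docId, bucketOrd, position within that bucket's array}
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.<int[]>comparingInt(e -> e[0]).thenComparingInt(e -> e[1]));
        for (int b = 0; b < perBucketDocs.length; b++) {
            if (perBucketDocs[b].length > 0) {
                heap.add(new int[] {perBucketDocs[b][0], b, 0});
            }
        }
        List<String> collected = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            collected.add(top[0] + ":" + top[1]);
            int next = top[2] + 1; // advance this bucket's stream
            if (next < perBucketDocs[top[1]].length) {
                heap.add(new int[] {perBucketDocs[top[1]][next], top[1], next});
            }
        }
        return collected;
    }
}
```

Each per-bucket stream stays sorted, and the heap interleaves them, so a doc that landed in several (overlapping) buckets is emitted once per bucket, in doc ID order.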


@Override
public int bytesPerEncodedPoint() {
return Integer.BYTES;
Contributor

I thought we used 2 bytes for HalfFloat, not 4?

Contributor

You can use HalfFloatPoint.BYTES.

@@ -181,7 +181,7 @@ public void doClose() {
if (parent != null) {
return null;
}
if (config.fieldContext() != null && config.script() == null) {
if (config.fieldContext() != null && config.script() == null && config.missing() == null) {
Contributor

this change looks unrelated?

Contributor Author

Bug actually, stumbled into it while working on the PR and made a tweak. Since we wrap the DV with an anonymous class that injects missing values, and the BKD optimization aborts before we use those DVs, we can miss a potential min/max value if it happened to be the missing value.

@imotov just fixed this in #48970 due to a bug report, and better to have it fixed separately anyway so this part of the change can go away.

@polyfractal
Contributor Author

WIP update: I modified the PR to use a MultiBucketAggregatorWrapper scheme, although it's a little different from other usages. Writing this out to help clarify for myself :)

MultiBucketAggWrapper is used so that a new sub-tree is created for each bucket the agg resides in, with the ordinal reset to zero in each case. This happens "above" the agg, between the agg and its parent.

In this case, we want to create a new sub-tree for each bucket the Range creates, rather than for each bucket the Range resides in. So we need to create a sub-tree for all the sub-aggs, ensuring that when we create buckets for ranges, the sub-aggs each get their own sub-tree.

To that end, this PR wraps the sub-aggregator AggregatorFactories object in a way that wraps all sub-aggs with a MultiBucketAggWrapper. This effectively creates a new sub-tree for each sub-agg and resets the ordinal, whether or not that agg would normally do so. In particular, this also applies to metric aggs, which don't normally get this treatment.

I'm not 100% convinced this is correct, but tests pass and it seems ok. I'm going to add more tests to verify.

I also have some concerns about how this was implemented; it feels a little hacky. But it avoided a lot of code complexity in the Range agg by reusing the existing MBAWrapper code at this level.
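The wrapping scheme can be sketched with plain JDK types (a rough model under assumed names, not Elasticsearch's actual API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Stdlib-only sketch of per-bucket sub-tree wrapping: each bucket ordinal
// lazily gets its own freshly built sub-collector, and docs are forwarded
// to it with ordinal 0, as if it were a top-level aggregator.
class PerBucketWrapper {
    interface Collector {
        void collect(int doc, long bucketOrd);
    }

    static Collector wrap(Supplier<Collector> subTreeFactory) {
        Map<Long, Collector> perBucket = new HashMap<>();
        return (doc, bucketOrd) ->
            perBucket.computeIfAbsent(bucketOrd, k -> subTreeFactory.get())
                     .collect(doc, 0); // ordinal reset to zero for the sub-tree
    }
}
```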



@polyfractal
Contributor Author

Ran some more tests on the "no heuristic" scenario (always apply optimization if possible, don't estimate points, apply even to unbounded ranges). Looks like it is faster in my tests without a noticeable slowdown elsewhere.

The heuristics did improve speed when I first added them, but I think that was before I correctly implemented the "bulk" BKD visit method (visit(DocIdSetIterator iterator, byte[] packedValue)). I'm going to do a bit more testing, but I think we can simplify the PR by pulling out the heuristics and always apply the optimization if it's an applicable scenario (no parent, etc).
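The effect of the bulk visit can be modeled with stdlib-only code (illustrative, not the PR's code): when a leaf lies entirely inside a range, its docs are added wholesale with no per-value work, mirroring Lucene's CELL_INSIDE_QUERY fast path vs. the per-value CELL_CROSSES_QUERY path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stdlib-only model of why the bulk visit matters: int arrays stand in
// for BKD leaves and their packed values.
class BulkVisitSketch {
    // Returns doc IDs whose value falls in [from, to); comparisons[0]
    // counts per-value checks to make the saving visible.
    static List<Integer> collect(int[][] leafDocs, int[][] leafValues,
                                 int from, int to, int[] comparisons) {
        List<Integer> out = new ArrayList<>();
        for (int l = 0; l < leafDocs.length; l++) {
            int min = Arrays.stream(leafValues[l]).min().getAsInt();
            int max = Arrays.stream(leafValues[l]).max().getAsInt();
            if (min >= from && max < to) {
                // whole leaf matches: bulk-add, nothing to unpack
                for (int doc : leafDocs[l]) {
                    out.add(doc);
                }
            } else if (max >= from && min < to) {
                // leaf crosses the range: unpack and test each value
                for (int i = 0; i < leafDocs[l].length; i++) {
                    comparisons[0]++;
                    if (leafValues[l][i] >= from && leafValues[l][i] < to) {
                        out.add(leafDocs[l][i]);
                    }
                }
            } // else: leaf entirely outside the range, skip it
        }
        return out;
    }
}
```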

[benchmark charts omitted: latency comparison across the test scenarios]

Notably in this test, we see that the unbounded ranges (second and fourth chart on top row, "unbounded_ranges", "subagg_unbounded_ranges") show an improvement relative to the earlier test, without degradation elsewhere.

- HalfFloatPoint.BYTES
- Remove heuristics, always apply BKD optim if possible
- Move optim decision up to factory
@polyfractal
Contributor Author

Getting there :)

  • Removed heuristics so that the BKD optimization is always applied if the scenario is correct (no parents, etc).
  • Added worst-case breaker accounting for DocIdSetBuilder
  • Because we always run the optimization (except when the breaker trips), I moved the optimization decision up into the Factory. This lets us wrap only when we know the optimization will be used. I also realized this is important if we want to apply a similar optimization to geo ranges, since they will need to specify their own point encoder.

Contributor

@jpountz jpountz left a comment

The change looks good to me code-wise. However I'd like to see this optimization tested by a unit test, not only by an integration test.

Some higher-level thoughts about this change: these optimizations have a high maintenance cost. We almost need to duplicate all tests for the cases when the optimization applies and when it doesn't. And we also need to add more tests for things that are no longer guaranteed by the aggregation framework, such as the handling of deleted documents. I think range aggs are a good aggregation to explore this sort of optimization on, but it's unclear to me whether enough users would benefit from it to justify the maintenance cost. Maybe I'm being over-pessimistic and many users would enjoy it! What I know for sure, however, is that this optimization would be worth maintaining on date_histograms, as I think the vast majority of Kibana users run top-level date histograms at some point.


final double[] maxTo;
private final String pointField;
private final boolean canOptimize;
Contributor

I think it'd be easier to read by giving it a more explicit name, e.g. aggregateUsingPointsIndex or something along those lines

// `from` is unbounded or `min >= from`
(from == null || compareUnsigned(minPackedValue, 0, packedLength, from, 0, from.length) >= 0)
&&
// `to` is unbounded or `max < to`
Contributor

indentation looks wrong here
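The check quoted above can be sketched with the JDK's `Arrays.compareUnsigned` (Lucene has its own equivalent; names here are illustrative):

```java
import java.util.Arrays;

// Sketch of the segment-level range check on packed point bytes.
class SegmentCheck {
    // True if every point in [minPackedValue, maxPackedValue] falls inside
    // [from, to), i.e. the whole segment matches the range; null means an
    // unbounded end.
    static boolean segmentWithinRange(byte[] minPackedValue, byte[] maxPackedValue,
                                      byte[] from, byte[] to) {
        int len = minPackedValue.length;
        // `from` is unbounded or `min >= from`
        boolean fromOk = from == null
            || Arrays.compareUnsigned(minPackedValue, 0, len, from, 0, len) >= 0;
        // `to` is unbounded or `max < to`
        boolean toOk = to == null
            || Arrays.compareUnsigned(maxPackedValue, 0, len, to, 0, len) < 0;
        return fromOk && toOk;
    }
}
```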

@rjernst rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020
@polyfractal
Contributor Author

This is extremely out of date now, and we've decided to put a pin in it to focus on histo/date_histo first. So I'm going to close this for the time being.
