
Add BKD Optimization to Range aggregation #47712

Closed
wants to merge 4 commits

Conversation

polyfractal (Contributor)

If the range aggregator

  1. is a top-level agg (no parent)
  2. matches all the documents (match_all query)
  3. does not contain unbounded ranges

then we can use the BKD tree to match documents faster than collecting all the documents via the normal DocValues method.

If the aggregation meets the criteria above, the range agg will attempt to collect by BKD. There is an additional constraint: the estimated number of matching points should be < maxDocs * 0.75. The BKD approach loses its advantage over DV when most of the index is being aggregated into at least one range.
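The checks above can be condensed into a single predicate. The following is an illustrative sketch, not the PR's actual code; the class, method, and parameter names are hypothetical.

```java
// Hypothetical condensation of the eligibility checks described above;
// names are illustrative, not the actual implementation in this PR.
public class BkdEligibility {
    public static boolean canCollectViaBkd(boolean hasParentAgg,
                                           boolean isMatchAllQuery,
                                           boolean hasUnboundedRange,
                                           long estimatedPoints,
                                           int maxDoc) {
        if (hasParentAgg || !isMatchAllQuery || hasUnboundedRange) {
            return false; // criteria 1-3 not met: fall back to doc-values collection
        }
        // Additional constraint: BKD loses its edge when most of the index
        // falls into at least one range.
        return estimatedPoints < maxDoc * 0.75;
    }
}
```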

As a side effect, this also introduces some new methods on NumberFieldMapper so that we can encode the ranges into the Point format without knowing about the underlying format (float, double, long, etc).
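To illustrate what "encoding into the Point format" means, here is a standalone re-implementation of the sortable big-endian convention Lucene uses for longs (the same scheme as NumericUtils.longToSortableBytes, which LongPoint uses); the new NumberFieldMapper hooks dispatch to the equivalent per-type encoders. This sketch avoids the Lucene dependency and is illustrative only.

```java
public class PointEncodingSketch {
    // Standalone sketch of Lucene's sortable long encoding: flip the sign
    // bit so that unsigned byte-wise comparison matches signed numeric order,
    // then serialize big-endian.
    public static byte[] encodeLong(long value) {
        long sortable = value ^ 0x8000000000000000L;
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) sortable;
            sortable >>>= 8;
        }
        return out; // lexicographic order on these bytes == numeric order
    }
}
```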

From tests, this appears to give 70-90% lower latency in the optimized case without grossly affecting the un-optimized case.

Benchmarking

I tried a bunch of different tests. A brief overview:

  • single_range_dense: one range that matches a large percentage of the index
  • single_range_sparse: one range that matches a small percentage of the index
  • multiple_ranges_non_contiguous: several ranges which have gaps between each range
  • multiple_ranges_contiguous: several ranges, but with no gaps so they form a single, contiguous range
  • unbounded_ranges: first range has unbounded from, last range has unbounded to
  • overlapping_ranges: ranges that overlap the prior range to some degree
  • with_parent_non_contiguous: same ranges as multiple_ranges_non_contiguous, except the range is embedded under a parent aggregation (terms agg)
  • subagg_*: all of the above tests, but with a sub-agg to collect (sum aggregation)
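For concreteness, a query of the multiple_ranges_non_contiguous shape might look like the following (the field name and bounds here are made up for illustration; note the match_all query, no parent agg, and fully bounded ranges, so it qualifies for the optimization):

```json
{
  "size": 0,
  "query": { "match_all": {} },
  "aggs": {
    "response_buckets": {
      "range": {
        "field": "response_time",
        "ranges": [
          { "from": 0,   "to": 100 },
          { "from": 200, "to": 300 },
          { "from": 400, "to": 500 }
        ]
      }
    }
  }
}
```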

Test setup was 12 clients, 50 warmup queries, 50 iteration queries, and a target throughput of 10 queries/second. This was executed on a machine with 32 cores (two CPUs, 16 physical cores, 32 with hyperthreading). The 10 qps target was chosen because all the tests could sustain that throughput.

The test data was the eventdata rally track: 20m apache logs, ~15gb, five shards and no replicas.

The tl;dr is that most of the tests show 70-90% lower latency. Unbounded ranges and parent tests are roughly the same as baseline because those paths are un-optimized.

Note: The two parent tests are actually a bit slower (6%, 17%) than baseline. I'm not sure if this is real or just noise. Going to run a few more trials and statistical tests to see if the result is real. You can see visually that the distributions are a bit shifted so I think a t-test would agree. But it might be noise from using the machine at the same time (I know, sorry :( ). Will run some overnight tests to rule out interference.


Rally results
|                                                        Metric |                                   Task |    Baseline |   Contender |     Diff |   Unit |
|--------------------------------------------------------------:|---------------------------------------:|------------:|------------:|---------:|-------:|
|                                                Min Throughput |                    single_range_dense1 |     9.96467 |     9.99072 |  0.02605 |  ops/s |
|                                             Median Throughput |                    single_range_dense1 |     9.97812 |     10.0062 |  0.02808 |  ops/s |
|                                                Max Throughput |                    single_range_dense1 |     10.0783 |      10.011 | -0.06727 |  ops/s |
|                                       50th percentile latency |                    single_range_dense1 |      351.71 |     55.0738 | -296.637 |     ms |
|                                       90th percentile latency |                    single_range_dense1 |     379.788 |     60.9318 | -318.857 |     ms |
|                                       99th percentile latency |                    single_range_dense1 |     412.644 |     193.375 | -219.268 |     ms |
|                                      100th percentile latency |                    single_range_dense1 |     431.247 |     202.228 |  -229.02 |     ms |
|                                  50th percentile service time |                    single_range_dense1 |      350.47 |     53.8654 | -296.604 |     ms |
|                                  90th percentile service time |                    single_range_dense1 |     377.771 |     59.7318 |  -318.04 |     ms |
|                                  99th percentile service time |                    single_range_dense1 |      410.72 |     192.487 | -218.232 |     ms |
|                                 100th percentile service time |                    single_range_dense1 |     428.747 |     201.001 | -227.746 |     ms |
|                                                    error rate |                    single_range_dense1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |                   single_range_sparse1 |     9.96874 |     10.0076 |  0.03889 |  ops/s |
|                                             Median Throughput |                   single_range_sparse1 |     9.98276 |     10.0102 |  0.02741 |  ops/s |
|                                                Max Throughput |                   single_range_sparse1 |     9.99273 |      10.015 |  0.02227 |  ops/s |
|                                       50th percentile latency |                   single_range_sparse1 |     305.613 |     8.72943 | -296.884 |     ms |
|                                       90th percentile latency |                   single_range_sparse1 |     326.595 |     10.5276 | -316.067 |     ms |
|                                       99th percentile latency |                   single_range_sparse1 |     343.944 |     12.0985 | -331.845 |     ms |
|                                      100th percentile latency |                   single_range_sparse1 |     360.575 |     38.4084 | -322.167 |     ms |
|                                  50th percentile service time |                   single_range_sparse1 |     304.231 |     7.31075 |  -296.92 |     ms |
|                                  90th percentile service time |                   single_range_sparse1 |     325.632 |     9.33232 |   -316.3 |     ms |
|                                  99th percentile service time |                   single_range_sparse1 |     342.953 |     10.8566 | -332.096 |     ms |
|                                 100th percentile service time |                   single_range_sparse1 |     359.352 |     37.1539 | -322.198 |     ms |
|                                                    error rate |                   single_range_sparse1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |        multiple_ranges_non_contiguous1 |     9.96779 |      10.004 |  0.03619 |  ops/s |
|                                             Median Throughput |        multiple_ranges_non_contiguous1 |     9.97976 |     10.0097 |  0.02992 |  ops/s |
|                                                Max Throughput |        multiple_ranges_non_contiguous1 |     9.98881 |     10.0147 |  0.02588 |  ops/s |
|                                       50th percentile latency |        multiple_ranges_non_contiguous1 |     337.067 |     17.3547 | -319.713 |     ms |
|                                       90th percentile latency |        multiple_ranges_non_contiguous1 |     359.589 |     20.9025 | -338.686 |     ms |
|                                       99th percentile latency |        multiple_ranges_non_contiguous1 |     382.382 |      63.451 | -318.931 |     ms |
|                                      100th percentile latency |        multiple_ranges_non_contiguous1 |     390.834 |     65.3144 |  -325.52 |     ms |
|                                  50th percentile service time |        multiple_ranges_non_contiguous1 |      336.13 |     16.1231 | -320.007 |     ms |
|                                  90th percentile service time |        multiple_ranges_non_contiguous1 |     358.217 |     19.6834 | -338.533 |     ms |
|                                  99th percentile service time |        multiple_ranges_non_contiguous1 |     380.087 |      62.211 | -317.876 |     ms |
|                                 100th percentile service time |        multiple_ranges_non_contiguous1 |     389.931 |     64.0478 | -325.883 |     ms |
|                                                    error rate |        multiple_ranges_non_contiguous1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |            multiple_ranges_contiguous1 |     9.96242 |     10.0009 |  0.03847 |  ops/s |
|                                             Median Throughput |            multiple_ranges_contiguous1 |     9.97847 |     10.0076 |  0.02908 |  ops/s |
|                                                Max Throughput |            multiple_ranges_contiguous1 |     10.0847 |     10.0131 | -0.07164 |  ops/s |
|                                       50th percentile latency |            multiple_ranges_contiguous1 |     345.639 |     33.2116 | -312.427 |     ms |
|                                       90th percentile latency |            multiple_ranges_contiguous1 |      371.01 |     61.3064 | -309.703 |     ms |
|                                       99th percentile latency |            multiple_ranges_contiguous1 |     392.626 |     102.872 | -289.755 |     ms |
|                                      100th percentile latency |            multiple_ranges_contiguous1 |      402.19 |      108.87 |  -293.32 |     ms |
|                                  50th percentile service time |            multiple_ranges_contiguous1 |     344.964 |     31.8857 | -313.078 |     ms |
|                                  90th percentile service time |            multiple_ranges_contiguous1 |     368.602 |     58.0173 | -310.585 |     ms |
|                                  99th percentile service time |            multiple_ranges_contiguous1 |     391.753 |      101.64 | -290.113 |     ms |
|                                 100th percentile service time |            multiple_ranges_contiguous1 |      401.26 |     107.648 | -293.613 |     ms |
|                                                    error rate |            multiple_ranges_contiguous1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |                      unbounded_ranges1 |     9.91478 |     9.94318 |  0.02841 |  ops/s |
|                                             Median Throughput |                      unbounded_ranges1 |     9.95934 |     9.96492 |  0.00559 |  ops/s |
|                                                Max Throughput |                      unbounded_ranges1 |     10.0797 |     9.98224 | -0.09742 |  ops/s |
|                                       50th percentile latency |                      unbounded_ranges1 |     521.416 |     488.072 | -33.3443 |     ms |
|                                       90th percentile latency |                      unbounded_ranges1 |     562.653 |     524.355 | -38.2987 |     ms |
|                                       99th percentile latency |                      unbounded_ranges1 |     1001.26 |     555.819 | -445.446 |     ms |
|                                      100th percentile latency |                      unbounded_ranges1 |     1174.01 |     577.722 | -596.287 |     ms |
|                                  50th percentile service time |                      unbounded_ranges1 |     520.657 |      487.23 | -33.4272 |     ms |
|                                  90th percentile service time |                      unbounded_ranges1 |     560.978 |     523.194 | -37.7834 |     ms |
|                                  99th percentile service time |                      unbounded_ranges1 |     1000.55 |     553.985 | -446.562 |     ms |
|                                 100th percentile service time |                      unbounded_ranges1 |     1173.22 |     576.654 | -596.565 |     ms |
|                                                    error rate |                      unbounded_ranges1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |                    overlapping_ranges1 |     9.96373 |     10.0015 |  0.03782 |  ops/s |
|                                             Median Throughput |                    overlapping_ranges1 |     9.97753 |     10.0067 |   0.0292 |  ops/s |
|                                                Max Throughput |                    overlapping_ranges1 |     9.98913 |      10.011 |  0.02183 |  ops/s |
|                                       50th percentile latency |                    overlapping_ranges1 |     338.362 |     41.7786 | -296.584 |     ms |
|                                       90th percentile latency |                    overlapping_ranges1 |     365.665 |     77.3572 | -288.307 |     ms |
|                                       99th percentile latency |                    overlapping_ranges1 |      385.64 |     104.236 | -281.405 |     ms |
|                                      100th percentile latency |                    overlapping_ranges1 |     411.049 |     110.344 | -300.706 |     ms |
|                                  50th percentile service time |                    overlapping_ranges1 |     337.413 |     40.6249 | -296.788 |     ms |
|                                  90th percentile service time |                    overlapping_ranges1 |     364.706 |     75.8103 | -288.896 |     ms |
|                                  99th percentile service time |                    overlapping_ranges1 |     384.698 |     103.356 | -281.342 |     ms |
|                                 100th percentile service time |                    overlapping_ranges1 |     410.661 |      109.35 | -301.311 |     ms |
|                                                    error rate |                    overlapping_ranges1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |            with_parent_non_contiguous1 |     9.89195 |     9.87149 | -0.02046 |  ops/s |
|                                             Median Throughput |            with_parent_non_contiguous1 |     9.92576 |     9.92004 | -0.00572 |  ops/s |
|                                                Max Throughput |            with_parent_non_contiguous1 |     9.99976 |     9.99981 |    5e-05 |  ops/s |
|                                       50th percentile latency |            with_parent_non_contiguous1 |      943.65 |     1004.47 |  60.8185 |     ms |
|                                       90th percentile latency |            with_parent_non_contiguous1 |     1099.67 |     1162.92 |  63.2501 |     ms |
|                                       99th percentile latency |            with_parent_non_contiguous1 |     1176.78 |     1891.27 |  714.484 |     ms |
|                                      100th percentile latency |            with_parent_non_contiguous1 |     1830.06 |     1926.52 |  96.4556 |     ms |
|                                  50th percentile service time |            with_parent_non_contiguous1 |     943.246 |     998.188 |  54.9417 |     ms |
|                                  90th percentile service time |            with_parent_non_contiguous1 |      1094.7 |     1137.45 |  42.7478 |     ms |
|                                  99th percentile service time |            with_parent_non_contiguous1 |     1167.45 |     1781.02 |  613.569 |     ms |
|                                 100th percentile service time |            with_parent_non_contiguous1 |     1823.62 |     1909.71 |  86.0859 |     ms |
|                                                    error rate |            with_parent_non_contiguous1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |             subagg_single_range_dense1 |     9.95243 |      9.9859 |  0.03347 |  ops/s |
|                                             Median Throughput |             subagg_single_range_dense1 |     9.97069 |     9.99583 |  0.02515 |  ops/s |
|                                                Max Throughput |             subagg_single_range_dense1 |     10.0096 |      9.9997 | -0.00992 |  ops/s |
|                                       50th percentile latency |             subagg_single_range_dense1 |     420.574 |     155.614 |  -264.96 |     ms |
|                                       90th percentile latency |             subagg_single_range_dense1 |     456.612 |     167.436 | -289.176 |     ms |
|                                       99th percentile latency |             subagg_single_range_dense1 |     485.485 |     258.417 | -227.068 |     ms |
|                                      100th percentile latency |             subagg_single_range_dense1 |     492.337 |     293.077 |  -199.26 |     ms |
|                                  50th percentile service time |             subagg_single_range_dense1 |     419.795 |     154.096 | -265.699 |     ms |
|                                  90th percentile service time |             subagg_single_range_dense1 |     454.524 |     165.469 | -289.055 |     ms |
|                                  99th percentile service time |             subagg_single_range_dense1 |     484.589 |     257.738 | -226.851 |     ms |
|                                 100th percentile service time |             subagg_single_range_dense1 |     491.499 |     291.984 | -199.515 |     ms |
|                                                    error rate |             subagg_single_range_dense1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |            subagg_single_range_sparse1 |     9.96924 |     10.0063 |  0.03702 |  ops/s |
|                                             Median Throughput |            subagg_single_range_sparse1 |     9.98134 |       10.01 |  0.02867 |  ops/s |
|                                                Max Throughput |            subagg_single_range_sparse1 |     9.99262 |     10.0151 |  0.02246 |  ops/s |
|                                       50th percentile latency |            subagg_single_range_sparse1 |     315.686 |     8.46949 | -307.217 |     ms |
|                                       90th percentile latency |            subagg_single_range_sparse1 |     341.828 |     10.8845 | -330.943 |     ms |
|                                       99th percentile latency |            subagg_single_range_sparse1 |     411.802 |     47.2574 | -364.545 |     ms |
|                                      100th percentile latency |            subagg_single_range_sparse1 |     423.958 |     48.5549 | -375.404 |     ms |
|                                  50th percentile service time |            subagg_single_range_sparse1 |     314.208 |     7.35148 | -306.857 |     ms |
|                                  90th percentile service time |            subagg_single_range_sparse1 |       340.2 |     9.59369 | -330.606 |     ms |
|                                  99th percentile service time |            subagg_single_range_sparse1 |     409.384 |     45.9227 | -363.461 |     ms |
|                                 100th percentile service time |            subagg_single_range_sparse1 |     423.018 |     47.7502 | -375.268 |     ms |
|                                                    error rate |            subagg_single_range_sparse1 |           0 |           0 |        0 |      % |
|                                                Min Throughput | subagg_multiple_ranges_non_contiguous1 |     9.95972 |     10.0033 |  0.04356 |  ops/s |
|                                             Median Throughput | subagg_multiple_ranges_non_contiguous1 |       9.978 |     10.0088 |  0.03081 |  ops/s |
|                                                Max Throughput | subagg_multiple_ranges_non_contiguous1 |     9.98891 |     10.0132 |   0.0243 |  ops/s |
|                                       50th percentile latency | subagg_multiple_ranges_non_contiguous1 |       350.8 |     27.6223 | -323.178 |     ms |
|                                       90th percentile latency | subagg_multiple_ranges_non_contiguous1 |     373.893 |     32.7482 | -341.145 |     ms |
|                                       99th percentile latency | subagg_multiple_ranges_non_contiguous1 |     401.801 |     74.2184 | -327.583 |     ms |
|                                      100th percentile latency | subagg_multiple_ranges_non_contiguous1 |     420.371 |     78.2408 |  -342.13 |     ms |
|                                  50th percentile service time | subagg_multiple_ranges_non_contiguous1 |     349.596 |     26.3436 | -323.252 |     ms |
|                                  90th percentile service time | subagg_multiple_ranges_non_contiguous1 |      372.59 |     31.3181 | -341.272 |     ms |
|                                  99th percentile service time | subagg_multiple_ranges_non_contiguous1 |     400.921 |     73.3131 | -327.608 |     ms |
|                                 100th percentile service time | subagg_multiple_ranges_non_contiguous1 |     419.721 |     77.0005 |  -342.72 |     ms |
|                                                    error rate | subagg_multiple_ranges_non_contiguous1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |     subagg_multiple_ranges_contiguous1 |     9.95869 |     9.99527 |  0.03659 |  ops/s |
|                                             Median Throughput |     subagg_multiple_ranges_contiguous1 |     9.97774 |     10.0031 |  0.02535 |  ops/s |
|                                                Max Throughput |     subagg_multiple_ranges_contiguous1 |     10.1084 |     10.0078 | -0.10059 |  ops/s |
|                                       50th percentile latency |     subagg_multiple_ranges_contiguous1 |     371.056 |     78.9161 |  -292.14 |     ms |
|                                       90th percentile latency |     subagg_multiple_ranges_contiguous1 |      401.35 |     99.6151 | -301.735 |     ms |
|                                       99th percentile latency |     subagg_multiple_ranges_contiguous1 |     421.268 |     182.958 |  -238.31 |     ms |
|                                      100th percentile latency |     subagg_multiple_ranges_contiguous1 |     433.057 |     187.969 | -245.088 |     ms |
|                                  50th percentile service time |     subagg_multiple_ranges_contiguous1 |     370.143 |      77.157 | -292.986 |     ms |
|                                  90th percentile service time |     subagg_multiple_ranges_contiguous1 |     400.265 |     98.4407 | -301.824 |     ms |
|                                  99th percentile service time |     subagg_multiple_ranges_contiguous1 |      420.66 |      181.77 |  -238.89 |     ms |
|                                 100th percentile service time |     subagg_multiple_ranges_contiguous1 |     427.662 |     186.799 | -240.863 |     ms |
|                                                    error rate |     subagg_multiple_ranges_contiguous1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |               subagg_unbounded_ranges1 |     9.89574 |     9.88919 | -0.00655 |  ops/s |
|                                             Median Throughput |               subagg_unbounded_ranges1 |     9.93153 |      9.9298 | -0.00173 |  ops/s |
|                                                Max Throughput |               subagg_unbounded_ranges1 |     9.96604 |     9.98439 |  0.01835 |  ops/s |
|                                       50th percentile latency |               subagg_unbounded_ranges1 |     859.739 |     837.756 | -21.9834 |     ms |
|                                       90th percentile latency |               subagg_unbounded_ranges1 |      928.35 |     913.553 | -14.7964 |     ms |
|                                       99th percentile latency |               subagg_unbounded_ranges1 |     981.977 |     964.165 |  -17.812 |     ms |
|                                      100th percentile latency |               subagg_unbounded_ranges1 |     1016.18 |     992.742 | -23.4352 |     ms |
|                                  50th percentile service time |               subagg_unbounded_ranges1 |     859.315 |     837.378 | -21.9377 |     ms |
|                                  90th percentile service time |               subagg_unbounded_ranges1 |     927.876 |     913.137 | -14.7395 |     ms |
|                                  99th percentile service time |               subagg_unbounded_ranges1 |     981.589 |     963.828 | -17.7609 |     ms |
|                                 100th percentile service time |               subagg_unbounded_ranges1 |     1015.83 |     992.323 | -23.5038 |     ms |
|                                                    error rate |               subagg_unbounded_ranges1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |             subagg_overlapping_ranges1 |     9.95719 |     9.97401 |  0.01682 |  ops/s |
|                                             Median Throughput |             subagg_overlapping_ranges1 |      9.9763 |     9.99986 |  0.02356 |  ops/s |
|                                                Max Throughput |             subagg_overlapping_ranges1 |     10.0788 |     10.0186 | -0.06024 |  ops/s |
|                                       50th percentile latency |             subagg_overlapping_ranges1 |      386.25 |     113.525 | -272.725 |     ms |
|                                       90th percentile latency |             subagg_overlapping_ranges1 |     411.475 |     188.045 | -223.429 |     ms |
|                                       99th percentile latency |             subagg_overlapping_ranges1 |     429.831 |     348.727 | -81.1032 |     ms |
|                                      100th percentile latency |             subagg_overlapping_ranges1 |     437.367 |     368.382 | -68.9851 |     ms |
|                                  50th percentile service time |             subagg_overlapping_ranges1 |      385.45 |     112.026 | -273.424 |     ms |
|                                  90th percentile service time |             subagg_overlapping_ranges1 |     410.226 |     184.887 | -225.339 |     ms |
|                                  99th percentile service time |             subagg_overlapping_ranges1 |     428.136 |     347.582 | -80.5538 |     ms |
|                                 100th percentile service time |             subagg_overlapping_ranges1 |     436.533 |      367.24 | -69.2931 |     ms |
|                                                    error rate |             subagg_overlapping_ranges1 |           0 |           0 |        0 |      % |
|                                                Min Throughput |     subagg_with_parent_non_contiguous1 |     9.86526 |     9.84599 | -0.01927 |  ops/s |
|                                             Median Throughput |     subagg_with_parent_non_contiguous1 |     9.92039 |     9.91319 |  -0.0072 |  ops/s |
|                                                Max Throughput |     subagg_with_parent_non_contiguous1 |     10.0075 |     9.98141 | -0.02605 |  ops/s |
|                                       50th percentile latency |     subagg_with_parent_non_contiguous1 |     957.492 |     1123.11 |  165.614 |     ms |
|                                       90th percentile latency |     subagg_with_parent_non_contiguous1 |     1100.21 |     1805.08 |  704.871 |     ms |
|                                       99th percentile latency |     subagg_with_parent_non_contiguous1 |     1163.41 |     1953.88 |   790.47 |     ms |
|                                      100th percentile latency |     subagg_with_parent_non_contiguous1 |     1828.93 |     2036.14 |  207.217 |     ms |
|                                  50th percentile service time |     subagg_with_parent_non_contiguous1 |     956.972 |     1114.01 |  157.042 |     ms |
|                                  90th percentile service time |     subagg_with_parent_non_contiguous1 |     1095.24 |     1213.23 |  117.997 |     ms |
|                                  99th percentile service time |     subagg_with_parent_non_contiguous1 |     1153.07 |     1576.21 |  423.145 |     ms |
|                                 100th percentile service time |     subagg_with_parent_non_contiguous1 |     1826.42 |     1866.21 |   39.791 |     ms |
|                                                    error rate |     subagg_with_parent_non_contiguous1 |           0 |           0 |        0 |      % |

@elasticmachine (Collaborator)

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

estimatedPoints += pointValues.estimatePointCount(visitors[i]);
}

if (estimatedPoints < maxDoc * 0.75) {
Contributor Author

It's not clear to me what we should actually set this threshold to. 75% might be too aggressive, since the BKD estimate could under-estimate some leaves considerably.

Contributor

I wonder if maxDoc is the right parameter as we are counting values. Maybe we should be using pointValues.size() instead?

Is this heuristic extracted from performance tests?

Contributor Author

Nothing very scientific at all, happy to change it for something better :)

It was derived from performance tests only in the sense that ranges which span most of the data tend to be the same speed (or slower), and with a bit more overhead, than just collecting them via DocValues.

The rationale was that the DocValues approach has to touch maxDocs documents in the collect loop, so I wanted to "collect" fewer than that many values from the BKD tree. But it's not directly comparable since they have different costs (and even in the tree, collecting all the docs from one leaf is very different from having to unpack individual values).

Contributor

I'm actually surprised this approach is not always faster. I wonder whether we're doing something stupid that slows things down, or maybe DocIdSetBuilder is more expensive than I think. It would be interesting to run under a profiler to get a sense of the relative cost of intersecting the BKD tree vs. collecting doc IDs.

@polyfractal polyfractal requested a review from iverase October 7, 2019 23:19
Contributor

@iverase iverase left a comment

In general this looks good. I am just a bit worried about the threshold used and how it is calculated, but it is probably good enough. Just want to make sure I understand it.

// estimation won't call `grow()` on the visitor (only once we start intersecting)
results[i] = new DocIdSetBuilder(maxDoc);
visitors[i] = getVisitor(liveDocs, encodedRanges[i * 2], encodedRanges[i * 2 + 1], results[i]);
estimatedPoints += pointValues.estimatePointCount(visitors[i]);
Contributor

@iverase iverase Oct 8, 2019

This operation is fast, but it still has a cost. I wonder if we should break as soon as we are over the threshold.

Contributor Author

Ah good point, agreed ++
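For illustration, a stdlib-only sketch of the suggested early exit (names are illustrative, not the PR's code; the per-range estimates stand in for `pointValues.estimatePointCount(visitor)` and the `0.75` factor comes from the snippet above):

```java
// A stdlib-only sketch (illustrative names, not the PR's code) of the
// early-exit idea: stop summing per-range estimates as soon as the
// 0.75 * maxDoc threshold is crossed, instead of estimating every range.
class EstimateSketch {
    static boolean withinThreshold(long[] perRangeEstimates, long maxDoc) {
        final long threshold = (long) (maxDoc * 0.75);
        long estimatedPoints = 0;
        for (long estimate : perRangeEstimates) {
            estimatedPoints += estimate;
            if (estimatedPoints >= threshold) {
                return false; // bail out early: the DocValues path is likely cheaper
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(withinThreshold(new long[] {100, 200}, 1000)); // prints true
        System.out.println(withinThreshold(new long[] {500, 400}, 1000)); // prints false
    }
}
```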

Contributor

@jpountz jpountz left a comment

This idea sounds very appealing. I think the main issue right now is that it only collects doc IDs in order on a per-bucket basis, while the contract is that documents be collected in order regardless of their bucket ID. This would be an issue if there are sub-aggregations that work on doc values (e.g. a min aggregation on a long field). I can think of two ways that we could address it:

  • Merge iterators using a heap to emit doc IDs in order.
  • Create a new sub-tree of aggregators for each bucket, similarly to what we do in MultiBucketAggregatorWrapper.

The latter is probably faster than the former, but it might be more challenging too (or not?).
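The first option can be sketched with plain JDK types (the real version would merge Lucene `DocIdSetIterator`s; sorted int arrays stand in here, and all names are illustrative):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Stdlib-only sketch of the heap-merge option: interleave per-bucket doc ID
// streams so sub-aggregations see docs in global order.
class HeapMerge {
    // Emits "doc:bucketOrd" pairs ordered by doc ID, then bucket ordinal.
    static List<String> mergeInDocOrder(int[][] perBucketDocs) {
        // heap entries: {docId, bucketOrd, position within that bucket's array}
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.<int[]>comparingInt(e -> e[0]).thenComparingInt(e -> e[1]));
        for (int b = 0; b < perBucketDocs.length; b++) {
            if (perBucketDocs[b].length > 0) {
                heap.add(new int[] {perBucketDocs[b][0], b, 0});
            }
        }
        List<String> collected = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            collected.add(top[0] + ":" + top[1]);
            int next = top[2] + 1; // advance this bucket's stream
            if (next < perBucketDocs[top[1]].length) {
                heap.add(new int[] {perBucketDocs[top[1]][next], top[1], next});
            }
        }
        return collected;
    }
}
```

Each per-bucket stream stays sorted, and the heap interleaves them, so a doc that landed in several (overlapping) buckets is emitted once per bucket, in doc ID order.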


@Override
public int bytesPerEncodedPoint() {
return Integer.BYTES;
Contributor

I thought we used 2 bytes for HalfFloat, not 4?

Contributor

You can use HalfFloatPoint.BYTES.

@@ -181,7 +181,7 @@ public void doClose() {
if (parent != null) {
return null;
}
if (config.fieldContext() != null && config.script() == null) {
if (config.fieldContext() != null && config.script() == null && config.missing() == null) {
Contributor

this change looks unrelated?

Contributor Author

Bug actually, stumbled into it while working on the PR and made a tweak. Since we wrap the DV with an anonymous class that injects missing values, and the BKD optimization aborts before we use those DVs, we can miss a potential min/max value if it happened to be the missing value.

@imotov just fixed this in #48970 due to a bug report, and better to have it fixed separately anyway so this part of the change can go away.

@polyfractal
Contributor Author

WIP update: I modified the PR to use a MultiBucketAggregatorWrapper scheme, although it's a little different from other usages. Writing this out to help clarify for myself :)

MultiBucketAggWrapper is used so that a new sub-tree is created for each bucket the agg resides in, with the ordinal reset to zero in each case. This happens "above" the agg, between the agg and its parent.

In this case, we want to create a new sub-tree for each bucket the Range creates, rather than for each bucket the Range resides in. So we need to create a sub-tree for all the sub-aggs, ensuring that when we create buckets for ranges, the sub-aggs each get their own sub-tree.

To that end, this PR wraps the sub-aggregator AggregatorFactories object in a way that wraps all sub-aggs with a MultiBucketAggWrapper. This effectively creates a new sub-tree for each sub-agg and resets the ordinal, whether or not that agg would normally do so. In particular, this also applies to metric aggs, which don't normally get this treatment.

I'm not 100% convinced this is correct, but tests pass and it seems ok. I'm going to add more tests to verify.

I also have some concerns about how this was implemented; it feels a little hacky. But it avoided a lot of code complexity in the Range agg by reusing the existing MBAWrapper code at this level.
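The wrapping scheme can be sketched with plain JDK types (a rough model under assumed names, not Elasticsearch's actual API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Stdlib-only sketch of per-bucket sub-tree wrapping: each bucket ordinal
// lazily gets its own freshly built sub-collector, and docs are forwarded
// to it with ordinal 0, as if it were a top-level aggregator.
class PerBucketWrapper {
    interface Collector {
        void collect(int doc, long bucketOrd);
    }

    static Collector wrap(Supplier<Collector> subTreeFactory) {
        Map<Long, Collector> perBucket = new HashMap<>();
        return (doc, bucketOrd) ->
            perBucket.computeIfAbsent(bucketOrd, k -> subTreeFactory.get())
                     .collect(doc, 0); // ordinal reset to zero for the sub-tree
    }
}
```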



@polyfractal
Contributor Author

Ran some more tests on the "no heuristic" scenario (always apply optimization if possible, don't estimate points, apply even to unbounded ranges). Looks like it is faster in my tests without a noticeable slowdown elsewhere.

The heuristics did improve speed when I first added them, but I think that was before I correctly implemented the "bulk" BKD visit method (visit(DocIdSetIterator iterator, byte[] packedValue)). I'm going to do a bit more testing, but I think we can simplify the PR by pulling out the heuristics and always apply the optimization if it's an applicable scenario (no parent, etc).
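The effect of the bulk visit can be modeled with stdlib-only code (illustrative, not the PR's code): when a leaf lies entirely inside a range, its docs are added wholesale with no per-value work, mirroring Lucene's CELL_INSIDE_QUERY fast path vs. the per-value CELL_CROSSES_QUERY path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stdlib-only model of why the bulk visit matters: int arrays stand in
// for BKD leaves and their packed values.
class BulkVisitSketch {
    // Returns doc IDs whose value falls in [from, to); comparisons[0]
    // counts per-value checks to make the saving visible.
    static List<Integer> collect(int[][] leafDocs, int[][] leafValues,
                                 int from, int to, int[] comparisons) {
        List<Integer> out = new ArrayList<>();
        for (int l = 0; l < leafDocs.length; l++) {
            int min = Arrays.stream(leafValues[l]).min().getAsInt();
            int max = Arrays.stream(leafValues[l]).max().getAsInt();
            if (min >= from && max < to) {
                // whole leaf matches: bulk-add, nothing to unpack
                for (int doc : leafDocs[l]) {
                    out.add(doc);
                }
            } else if (max >= from && min < to) {
                // leaf crosses the range: unpack and test each value
                for (int i = 0; i < leafDocs[l].length; i++) {
                    comparisons[0]++;
                    if (leafValues[l][i] >= from && leafValues[l][i] < to) {
                        out.add(leafDocs[l][i]);
                    }
                }
            } // else: leaf entirely outside the range, skip it
        }
        return out;
    }
}
```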

[benchmark charts omitted: latency comparison across the test scenarios]

Notably in this test, we see that the unbounded ranges (second and fourth chart on top row, "unbounded_ranges", "subagg_unbounded_ranges") show an improvement relative to the earlier test, without degradation elsewhere.

- HalfFloatPoint.BYTES
- Remove heuristics, always apply BKD optim if possible
- Move optim decision up to factory
@polyfractal
Contributor Author

Getting there :)

  • Removed heuristics so that the BKD optimization is always applied if the scenario is correct (no parents, etc).
  • Added worst-case breaker accounting for DocIdSetBuilder
  • Because we always run the optimization (except when the breaker trips), I moved the optimization decision up into the Factory. This lets us wrap only when we know the optimization will be used. I also realized this is important if we want to apply a similar optimization to geo ranges, since they will need to specify their own point encoder.

Contributor

@jpountz jpountz left a comment

The change looks good to me code-wise. However I'd like to see this optimization tested by a unit test, not only by an integration test.

Some higher-level thoughts about this change: these optimizations have a high maintenance cost. We almost need to duplicate all tests for the cases when the optimization applies and when it doesn't. And we also need to add more tests for things that are no longer guaranteed by the aggregation framework, such as the handling of deleted documents. I think range aggs are a good aggregation to explore this sort of optimization on, but it's unclear to me whether enough users would benefit from it to justify the maintenance cost. Maybe I'm being over-pessimistic and many users would enjoy it! What I know for sure, however, is that this optimization would be worth maintaining on date_histograms, as I think the vast majority of Kibana users run top-level date histograms at some point.


final double[] maxTo;
private final String pointField;
private final boolean canOptimize;
Contributor

I think it'd be easier to read by giving it a more explicit name, e.g. aggregateUsingPointsIndex or something along those lines

// `from` is unbounded or `min >= from`
(from == null || compareUnsigned(minPackedValue, 0, packedLength, from, 0, from.length) >= 0)
&&
// `to` is unbounded or `max < to`
Contributor

indentation looks wrong here
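The check quoted above can be sketched with the JDK's `Arrays.compareUnsigned` (Lucene has its own equivalent; names here are illustrative):

```java
import java.util.Arrays;

// Sketch of the segment-level range check on packed point bytes.
class SegmentCheck {
    // True if every point in [minPackedValue, maxPackedValue] falls inside
    // [from, to), i.e. the whole segment matches the range; null means an
    // unbounded end.
    static boolean segmentWithinRange(byte[] minPackedValue, byte[] maxPackedValue,
                                      byte[] from, byte[] to) {
        int len = minPackedValue.length;
        // `from` is unbounded or `min >= from`
        boolean fromOk = from == null
            || Arrays.compareUnsigned(minPackedValue, 0, len, from, 0, len) >= 0;
        // `to` is unbounded or `max < to`
        boolean toOk = to == null
            || Arrays.compareUnsigned(maxPackedValue, 0, len, to, 0, len) < 0;
        return fromOk && toOk;
    }
}
```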

@rjernst rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020
@polyfractal
Contributor Author

This is extremely out of date now, and we've decided to put a pin in it to focus on histo/date_histo first. So I'm going to close this for the time being.
