Add BKD Optimization to Range aggregation #47712
Conversation
If the range aggregator is A) a top-level agg (no parent), B) matching all the documents (match_all query) and C) does not contain unbounded ranges, we can use the BKD tree to match documents faster than collecting all the documents via the normal DocValue method. If the aggregation meets the specific criteria above, the range agg will attempt to collect by BKD. There is an additional constraint that the estimated number of matching points should be < maxDoc * 0.75. The BKD approach loses its advantage over DV when most of the index is being aggregated into at least one range. As a side effect, this also introduces some new methods on NumberFieldMapper so that we can encode the ranges into the Point format without knowing about the underlying format (float, double, long, etc.). From tests, this appears to be 70-90% faster latency in the optimized case without grossly affecting the un-optimized case.
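The gating logic described above can be sketched as follows. This is a hypothetical illustration, not the actual Elasticsearch code: the class and parameter names (`BkdGate`, `hasParent`, `isMatchAll`, `hasUnboundedRange`) are invented for the example, and only the A/B/C criteria and the `maxDoc * 0.75` threshold come from the PR description.

```java
// Sketch (not the actual patch) of when the range agg may take the BKD path:
// A) no parent agg, B) match_all query, C) no unbounded ranges, and the
// estimated number of matching points must stay under 75% of maxDoc.
final class BkdGate {
    static boolean canUseBkd(boolean hasParent, boolean isMatchAll,
                             boolean hasUnboundedRange,
                             long estimatedPoints, int maxDoc) {
        if (hasParent || !isMatchAll || hasUnboundedRange) {
            return false; // criteria A, B, C from the description
        }
        return estimatedPoints < maxDoc * 0.75; // heuristic threshold
    }
}
```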
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
    estimatedPoints += pointValues.estimatePointCount(visitors[i]);
}

if (estimatedPoints < maxDoc * 0.75) {
It's not clear to me what we should actually set this threshold to. 75% might be too aggressive, since the BKD estimate could under-estimate some leaves considerably.
I wonder if maxDoc is the right parameter, as we are counting values. Maybe we should be using pointValues.size() instead?
Is this heuristic extracted from performance tests?
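The reviewer's point can be made concrete with toy numbers: the threshold compares a count of *values* (from the estimate) against a count of *documents* (maxDoc), and for multi-valued fields the two diverge. The class below is a made-up illustration, not project code; all numbers in the test are invented.

```java
// Illustration: the same estimated value count can land on different sides
// of the threshold depending on whether the baseline is maxDoc (documents)
// or pointValues.size() (total indexed values, larger for multi-valued fields).
final class ValuesVsDocs {
    static boolean underThresholdUsingMaxDoc(long estimatedValues, int maxDoc) {
        return estimatedValues < maxDoc * 0.75;
    }

    static boolean underThresholdUsingSize(long estimatedValues, long pointValuesSize) {
        return estimatedValues < pointValuesSize * 0.75;
    }
}
```

For example, with 1,000 docs that each index 3 values (size = 3,000), an estimate of 1,500 matching values fails the maxDoc-based check but passes the size-based one.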
Nothing very scientific at all, happy to change it for something better :)
It was derived from performance tests only in the sense that ranges which span most of the data tend to be the same speed (or slower), with a bit more overhead, than just collecting them via DocValues.
The rationale was that the DocValue approach would have to touch maxDoc documents in the collect loop, so I wanted to "collect" fewer than that many values from the BKD tree. But it's not very comparable since they have different costs (and even within the tree, collecting all the docs from one leaf is much cheaper than unpacking individual values).
I'm actually surprised this approach is not always faster. I wonder whether we're doing something stupid that slows things down, or maybe DocIdSetBuilder is more expensive than I think. It would be interesting to run under a profiler to get a sense of the relative cost of intersecting the BKD tree vs. collecting doc IDs.
In general this looks good. I am just a bit worried about the threshold used and how it is calculated, but it is probably good enough. I just want to make sure I understand it.
// estimation won't call `grow()` on the visitor (only once we start intersecting)
results[i] = new DocIdSetBuilder(maxDoc);
visitors[i] = getVisitor(liveDocs, encodedRanges[i * 2], encodedRanges[i * 2 + 1], results[i]);
estimatedPoints += pointValues.estimatePointCount(visitors[i]);
This operation is fast, but it still has a cost. I wonder if we should break as soon as we are over the threshold.
Ah good point, agreed ++
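The suggested early break could look like the sketch below. It is not the actual patch; a `LongSupplier` stands in for Lucene's per-range `pointValues.estimatePointCount(visitor)` call so the example is self-contained, and the 0.75 threshold is the one from the quoted diff.

```java
import java.util.function.LongSupplier;

// Sketch of the reviewers' suggestion: estimatePointCount() is cheap but not
// free, so stop summing estimates as soon as the running total crosses the
// threshold that disables the BKD path anyway.
final class EarlyBreakEstimate {
    static long estimateWithEarlyBreak(LongSupplier[] perRangeEstimators, int maxDoc) {
        long threshold = (long) (maxDoc * 0.75);
        long estimated = 0;
        for (LongSupplier estimator : perRangeEstimators) {
            estimated += estimator.getAsLong();
            if (estimated >= threshold) {
                break; // already over budget; remaining estimates are pointless
            }
        }
        return estimated;
    }
}
```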
This idea sounds very appealing. I think the main issue right now is that it only collects doc IDs in order on a per-bucket basis, while the contract is that documents be collected in order regardless of their bucket id. This would be an issue if there are sub aggregations that work on doc values (e.g. a min aggregation on a long field). I can think of two ways that we could address it:
- Merge iterators using a heap to emit doc IDs in order.
- Create a new sub-tree of aggregators for each bucket, similarly to what we do in MultiBucketAggregatorWrapper.
The latter is probably faster than the former, but it might be more challenging too (or not?).
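The first option (heap-based merging) can be sketched as below. This is an invented illustration, not Elasticsearch code: plain int arrays stand in for Lucene's per-bucket `DocIdSetIterator`s, and the class name is hypothetical.

```java
import java.util.PriorityQueue;

// Sketch of option 1 above: merge several per-bucket, individually sorted
// doc-ID streams into one globally ordered stream with a min-heap, so that
// sub-aggregations see documents in docID order across buckets.
final class DocIdMerger {
    static int[] merge(int[][] perBucketDocIds) {
        // heap entries: { docId, bucketIndex, offsetWithinBucket }
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        int total = 0;
        for (int b = 0; b < perBucketDocIds.length; b++) {
            total += perBucketDocIds[b].length;
            if (perBucketDocIds[b].length > 0) {
                heap.add(new int[] { perBucketDocIds[b][0], b, 0 });
            }
        }
        int[] out = new int[total];
        int i = 0;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out[i++] = top[0]; // emit the globally smallest pending doc ID
            int next = top[2] + 1;
            if (next < perBucketDocIds[top[1]].length) {
                heap.add(new int[] { perBucketDocIds[top[1]][next], top[1], next });
            }
        }
        return out;
    }
}
```

In the real aggregator the merge would also need to carry each doc's bucket ordinal along, so the collector knows which bucket to credit.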
@Override
public int bytesPerEncodedPoint() {
    return Integer.BYTES;
I thought we used 2 bytes for HalfFloat, not 4?
You can use HalfFloatPoint.BYTES.
@@ -181,7 +181,7 @@ public void doClose() {
     if (parent != null) {
         return null;
     }
-    if (config.fieldContext() != null && config.script() == null) {
+    if (config.fieldContext() != null && config.script() == null && config.missing() == null) {
this change looks unrelated?
It's actually a bug; I stumbled into it while working on the PR and made a tweak. Since we wrap the DV with an anonymous class that injects missing values, and the BKD optimization aborts before we use those DVs, we can miss a potential min/max value if it happened to be the missing value.
@imotov just fixed this in #48970 due to a bug report, and it's better to have it fixed separately anyway, so this part of the change can go away.
WIP update: I modified the PR to use a MultiBucketAggregatorWrapper scheme, although it's a little different from other usages. Writing this out to help clarify for myself :)

MultiBucketAggWrapper is normally used so that a new sub-tree is created for each bucket the agg resides in, with the ordinal reset to zero in each case. This happens "above" the agg, in between the agg and its parent. In this case, we want to create a new sub-tree for each bucket the Range creates, rather than for each bucket the Range resides in. So we need to create a sub-tree for all the sub-aggs, ensuring that when we create buckets for ranges, the sub-aggs each get their own sub-tree. To that end, this PR wraps the sub-aggregators.

I'm not 100% convinced this is correct, but tests pass and it seems ok. I'm going to add more tests to verify. I also have some concerns about how this was implemented; it feels a little hacky. But it avoided a lot of code complexity in the Range agg by reusing existing MBAWrapper code at this level.
@Override
public int bytesPerEncodedPoint() {
    return Integer.BYTES;
You can use HalfFloatPoint.BYTES.
Ran some more tests on the "no heuristic" scenario (always apply the optimization if possible, don't estimate points, apply even to unbounded ranges). It looks like it is faster in my tests without a noticeable slowdown elsewhere. The heuristics did improve speed when I first added them, but I think that was before I correctly implemented the "bulk" BKD visit method. Notably, in this test we see that the unbounded ranges (second and fourth chart on the top row, …)
- HalfFloatPoint.BYTES
- Remove heuristics, always apply BKD optim if possible
- Move optim decision up to factory
Getting there :)
The change looks good to me code-wise. However, I'd like to see this optimization tested by a unit test, not only by an integration test.
Some higher-level thoughts about this change: these optimizations have a high maintenance cost. We almost need to duplicate all tests for the case when the optimization applies and when it doesn't. And we also need to add more tests for things that are no longer guaranteed by the aggregation framework, such as the handling of deleted documents. I think range aggs are a good aggregation to explore this sort of optimization, but it's unclear to me whether many users would benefit from this optimization compared to its maintenance cost. Maybe I'm being over-pessimistic and many users would enjoy it! However, what I know for sure is that this optimization would be worth maintaining on date_histograms, as I think that the vast majority of Kibana users run top-level date histograms at some point.
final double[] maxTo;
private final String pointField;
private final boolean canOptimize;
I think it'd be easier to read with a more explicit name, e.g. aggregateUsingPointsIndex or something along those lines.
// `from` is unbounded or `min >= from`
(from == null || compareUnsigned(minPackedValue, 0, packedLength, from, 0, from.length) >= 0)
    &&
// `to` is unbounded or `max < to`
indentation looks wrong here
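The cell-containment check quoted above can be sketched in self-contained form as follows. This is an illustration, not the patch itself: it uses `java.util.Arrays.compareUnsigned` (Java 9+) on whole byte arrays in place of Lucene's offset-based `compareUnsigned`, and the class name is invented. The predicate matches the quoted comments: a BKD cell `[min, max]` lies fully inside a range `[from, to)` when `from` is unbounded or `min >= from`, and `to` is unbounded or `max < to`.

```java
import java.util.Arrays;

// Sketch of the containment test on packed point values, which Lucene
// compares as unsigned byte arrays. A null bound means "unbounded".
final class CellInside {
    static boolean cellInsideRange(byte[] minPacked, byte[] maxPacked,
                                   byte[] from, byte[] to) {
        // `from` is unbounded or `min >= from`
        boolean lowerOk = from == null || Arrays.compareUnsigned(minPacked, from) >= 0;
        // `to` is unbounded or `max < to`
        boolean upperOk = to == null || Arrays.compareUnsigned(maxPacked, to) < 0;
        return lowerOk && upperOk;
    }
}
```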
This is extremely out of date now, and we've decided to put a pin in it to focus on histo/date_histo first. So I'm going to close this for the time being.
Benchmarking
I tried a bunch of different tests. A brief overview:
- `single_range_dense`: one range that matches a large percentage of the index
- `single_range_sparse`: one range that matches a small percentage of the index
- `multiple_ranges_non_contiguous`: several ranges which have gaps between each range
- `multiple_ranges_contiguous`: several ranges, but with no gaps so they form a single, contiguous range
- `unbounded_ranges`: first range has unbounded `from`, last range has unbounded `to`
- `overlapping_ranges`: ranges that overlap the prior range to some degree
- `with_parent_non_contiguous`: same ranges as `multiple_ranges_non_contiguous`, except the range is embedded under a parent aggregation (terms agg)
- `subagg_*`: all of the above tests, but with a sub-agg to collect (`sum` aggregation)
aggregation)Test setup was 12 clients, 50 warmup queries, 50 iteration queries, target-throughput of 10 queries/second. This was executed on a machine with 32 cores (two CPU, 16 physical, 32 w/ hyperthreading). The 10 qps throughput was chosen because all the tests could hit that throughput.
The test data was the `eventdata` rally track: 20m apache logs, ~15gb, five shards and no replicas.

The tl;dr is that most of the tests are 70-90% faster latency. Unbounded ranges and parent tests are roughly the same as baseline because those are "un-optimized".
Note: The two parent tests are actually a bit slower (6%, 17%) than baseline. I'm not sure if this is real or just noise. Going to run a few more trials and statistical tests to see if the result is real. You can see visually that the distributions are a bit shifted so I think a t-test would agree. But it might be noise from using the machine at the same time (I know, sorry :( ). Will run some overnight tests to rule out interference.
Rally results