Add vectorization for druid-histogram extension #10304

abhishekagarwal87 · 2020-08-20T13:39:07Z

Description

This PR adds vectorization support for the aggregators in the druid-histogram extension. While these changes are unlikely to result in the use of SIMD instructions, they can still improve performance in two ways I can think of:

Being more cache-friendly and making fewer function calls.
Enabling vectorization for the whole query when one of the participating aggregators is an approximate histogram, assuming the other aggregators in the query support vectorization.

The code is refactored to reduce duplication. Most of the buffer manipulation now lives in *HistogramBufferAggregatorInternal classes, which are in turn used by *HistogramBufferAggregator and *HistogramVectorAggregator.
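
As a rough illustration of that refactoring, the sketch below shows one helper that owns the buffer layout and two thin callers, one row-at-a-time and one batch-oriented. It is not the PR's code: `SharedHelperSketch`, `BufferHelper`, `RowAggregator` and `BatchAggregator` are hypothetical names, and a count-plus-sum layout stands in for a real histogram.

```java
import java.nio.ByteBuffer;

final class SharedHelperSketch
{
  /** Hypothetical stand-in for a *HistogramBufferAggregatorInternal class: owns the buffer layout. */
  static final class BufferHelper
  {
    static final int BYTES = Long.BYTES + Double.BYTES;

    void init(ByteBuffer buf, int position)
    {
      buf.putLong(position, 0L);
      buf.putDouble(position + Long.BYTES, 0.0d);
    }

    void add(ByteBuffer buf, int position, double value)
    {
      buf.putLong(position, buf.getLong(position) + 1);
      int sumOffset = position + Long.BYTES;
      buf.putDouble(sumOffset, buf.getDouble(sumOffset) + value);
    }

    long count(ByteBuffer buf, int position)
    {
      return buf.getLong(position);
    }

    double sum(ByteBuffer buf, int position)
    {
      return buf.getDouble(position + Long.BYTES);
    }
  }

  /** Row-at-a-time caller: one value per call. */
  static final class RowAggregator
  {
    private final BufferHelper helper;

    RowAggregator(BufferHelper helper)
    {
      this.helper = helper;
    }

    void aggregate(ByteBuffer buf, int position, double value)
    {
      helper.add(buf, position, value);
    }
  }

  /** Batch caller: a slice of a value vector per call, same helper underneath. */
  static final class BatchAggregator
  {
    private final BufferHelper helper;

    BatchAggregator(BufferHelper helper)
    {
      this.helper = helper;
    }

    void aggregate(ByteBuffer buf, int position, double[] vector, int startRow, int endRow)
    {
      for (int i = startRow; i < endRow; i++) {
        helper.add(buf, position, vector[i]);
      }
    }
  }

  public static void main(String[] args)
  {
    BufferHelper helper = new BufferHelper();
    ByteBuffer buf = ByteBuffer.allocate(2 * BufferHelper.BYTES);
    helper.init(buf, 0);
    helper.init(buf, BufferHelper.BYTES);

    double[] values = {1.0, 2.5, 4.0};
    RowAggregator row = new RowAggregator(helper);
    for (double v : values) {
      row.aggregate(buf, 0, v);
    }
    new BatchAggregator(helper).aggregate(buf, BufferHelper.BYTES, values, 0, values.length);

    System.out.println("row:   " + helper.count(buf, 0) + " values, sum " + helper.sum(buf, 0));
    System.out.println("batch: " + helper.count(buf, BufferHelper.BYTES) + " values, sum "
                       + helper.sum(buf, BufferHelper.BYTES));
  }
}
```

Both callers end up writing identical bytes, which is the point of keeping the layout in one place.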

This PR has:

been self-reviewed.
- using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
added documentation for new or modified features or behaviors.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added or updated version, license, or notice information in licenses.yaml
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

Key changed/added classes in this PR

ApproximateHistogramVectorAggregator
ApproximateHistogramFoldingVectorAggregator
FixedBucketsHistogramVectorAggregator

…rization

abhishekagarwal87 · 2020-08-20T14:47:53Z

.../apache/druid/query/aggregation/histogram/FixedBucketsHistogramBufferAggregatorInternal.java

+  public void combine(FixedBucketsHistogram histogram, @Nullable Object next)
+  {
+    if (next == null) {
+      if (NullHandling.replaceWithDefault()) {


I might have carried forward a bug here. This if/else should most likely be inverted. cc @jon-wei

lgtm-com · 2020-08-20T15:08:31Z

This pull request fixes 1 alert when merging 55401cd into b36dab0 - view on LGTM.com

fixed alerts:

1 for Boxed variable is never null

abhishekagarwal87 · 2020-08-21T15:45:58Z

...e/druid/query/aggregation/histogram/ApproximateHistogramFoldingBufferAggregatorInternal.java

+  public void foldFast(ApproximateHistogram left, ApproximateHistogram right)
+  {
+    //TODO: do these have to set in every call
+    left.setLowerLimit(lowerLimit);


this looks unnecessary

A quick look and I think I agree, but am not totally certain. Can you try to find out if it is needed so we can remove this TODO and either remove the code, or add a comment on why it needs to be here?

I think it's necessary, the fromBytesDense call ends up using this constructor which uses a default lower/upper limit:

public ApproximateHistogram(int binCount, float[] positions, long[] bins, float min, float max)
{
  this(
      positions.length,          //size
      positions,                 //positions
      bins,                      //bins
      binCount,                  //binCount
      min,                       //min
      max,                       //max
      sumBins(bins, binCount),   //count
      Float.NEGATIVE_INFINITY,   //lowerLimit
      Float.POSITIVE_INFINITY    //upperLimit
  );
}

Thanks for the explanation. I didn't realize that these limits are transient. I will add a similar comment here for future reference.
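
To make the "transient limits" point concrete, here is a small self-contained sketch. It assumes a toy class `Hist` whose serialized form carries only `min` and `max`; the class, its byte layout, and `readWithLimits` are hypothetical and are not ApproximateHistogram's real format. It only shows why a value read back from a buffer comes out with default limits until the caller sets them again.

```java
import java.nio.ByteBuffer;

final class TransientLimitsSketch
{
  /** Toy histogram: min/max are serialized, the limits are not (assumption for illustration). */
  static final class Hist
  {
    static final int SERIALIZED_BYTES = 2 * Float.BYTES;

    final float min;
    final float max;
    float lowerLimit = Float.NEGATIVE_INFINITY;
    float upperLimit = Float.POSITIVE_INFINITY;

    Hist(float min, float max)
    {
      this.min = min;
      this.max = max;
    }

    void toBytes(ByteBuffer buf, int position)
    {
      buf.putFloat(position, min);
      buf.putFloat(position + Float.BYTES, max);
    }

    static Hist fromBytes(ByteBuffer buf, int position)
    {
      // The limits are not in the buffer, so the new object gets the defaults.
      return new Hist(buf.getFloat(position), buf.getFloat(position + Float.BYTES));
    }
  }

  /** Reads a histogram and re-applies the limits the aggregator was configured with. */
  static Hist readWithLimits(ByteBuffer buf, int position, float lowerLimit, float upperLimit)
  {
    Hist h = Hist.fromBytes(buf, position);
    h.lowerLimit = lowerLimit;
    h.upperLimit = upperLimit;
    return h;
  }

  public static void main(String[] args)
  {
    ByteBuffer buf = ByteBuffer.allocate(Hist.SERIALIZED_BYTES);
    Hist original = new Hist(1.0f, 9.0f);
    original.lowerLimit = 0.0f;
    original.upperLimit = 10.0f;
    original.toBytes(buf, 0);

    Hist plain = Hist.fromBytes(buf, 0);
    Hist restored = readWithLimits(buf, 0, 0.0f, 10.0f);
    System.out.println("limits after plain read:    " + plain.lowerLimit + " .. " + plain.upperLimit);
    System.out.println("limits after restoring them: " + restored.lowerLimit + " .. " + restored.upperLimit);
  }
}
```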

abhishekagarwal87 · 2020-08-21T15:49:31Z

...e/druid/query/aggregation/histogram/ApproximateHistogramFoldingBufferAggregatorInternal.java

+    //TODO: do these have to set in every call
+    left.setLowerLimit(lowerLimit);
+    left.setUpperLimit(upperLimit);
+    left.foldFast(right, tmpBufferA, tmpBufferB);


This is a copy of the old implementation. However, I noticed that ApproximateHistogramAggregator has the following implementation instead, which looks more correct: calling foldFast with inadequate space results in an exception.

if (left.binCount() + right.binCount() <= tmpBufferB.length) {
  left.foldFast(right, tmpBufferA, tmpBufferB);
} else {
  left.foldFast(right);
}

Hm, what was the exception you saw?

From looking at the code it seems like the buffers allocated in ApproximateHistogram.foldRule when foldFast is called with a single argument should be the same size as tmpBufferA and tmpBufferB.

I can't reproduce it now. I am also not sure why ApproximateHistogramAggregator has a different implementation. I will leave it as it is.

lgtm-com · 2020-08-21T16:38:10Z

This pull request fixes 1 alert when merging b47c906 into 7620b0c - view on LGTM.com

fixed alerts:

1 for Boxed variable is never null

lgtm-com · 2020-08-25T14:03:39Z

This pull request fixes 1 alert when merging 81e72a5 into f53785c - view on LGTM.com

fixed alerts:

1 for Boxed variable is never null

clintropolis

any idea on the performance difference from vectorizing these aggregators?

clintropolis · 2020-08-26T10:28:01Z

.../java/org/apache/druid/query/aggregation/histogram/ApproximateHistogramBufferAggregator.java

@@ -28,54 +28,30 @@
 public class ApproximateHistogramBufferAggregator implements BufferAggregator
 {
  private final BaseFloatColumnValueSelector selector;
-  private final int resolution;
+  private final ApproximateHistogramBufferAggregatorInternal innerAggregator;


Did you consider making the shared functionality just be static methods to be more consistent with how HyperUniquesBufferAggregator and HyperUniquesVectorAggregator are implemented? This is totally nitpicking, but something just seems off about these things having a thing called 'innerAggregator' that doesn't implement any of the aggregator interfaces.

Yes but decided against it as static methods are not so good when it comes to unit testing. I can pass mock dependencies in the constructor, unlike the static methods. Then, some classes also have temporary buffers as a state which I can put inside the instances with common functionality. E.g. ApproximateHistogramFoldingBufferAggregatorInternal has temporary buffers created just once. I could make these temp buffers static too but then synchronization issues kick in.

clintropolis · 2020-08-26T10:31:12Z

...togram/src/main/java/org/apache/druid/query/aggregation/histogram/FixedBucketsHistogram.java

@@ -431,6 +433,33 @@ public void incrementMissing()
    }
  }

+  /**
+   * Merge another datapoint into this one. The other datapoin could be


typo: 'datapoin' -> 'datapoint'

clintropolis · 2020-08-26T10:31:55Z

...ava/org/apache/druid/query/aggregation/histogram/FixedBucketsHistogramAggregatorFactory.java

+  public VectorAggregator factorizeVector(VectorColumnSelectorFactory columnSelectorFactory)
+  {
+    ColumnCapabilities capabilities = columnSelectorFactory.getColumnCapabilities(fieldName);
+    if (null == capabilities) {


I think this isn't possible since canVectorize checks that capabilities isn't null

I have seen this pattern in a few places, where an extra guard is added in case the caller does not call canVectorize before factorizeVector. I can remove it if it's unnecessarily defensive.

clintropolis · 2020-08-26T10:37:27Z

...togram/src/main/java/org/apache/druid/query/aggregation/histogram/FixedBucketsHistogram.java

+   *
+   * @param val
+   */
+  void combine(@Nullable Object val)


I'm not sure this function should be shared between the vectorized and non-vectorized aggregator. For the vector aggregator the if should probably be outside of the for loop i think, because the contents of the vector will be consistent throughout the loop.

Also, I think you might need different selectors depending on if the inputs to the aggregator are numeric primitives (value selector to get double vector and null boolean vector), or if the input is other fixed bucket histogram sketches (object selector to get array of histogram objects). The fixed bucket histogram aggregator is a combined primitive and sketch merging aggregator, unlike the approximate histogram aggregators which are split and handles the sketch inputs and result merges with the 'fold' aggregators.

Good point. I will let this method remain here. Since I am only tackling numeric values for now, my vector implementation can call add directly on the fixed buckets histogram. I will make that change.

clintropolis · 2020-08-26T10:47:43Z

.../java/org/apache/druid/query/aggregation/histogram/ApproximateHistogramVectorAggregator.java

+    ApproximateHistogram histogram = innerAggregator.get(buf, position);
+
+    for (int i = startRow; i < endRow; i++) {
+      if (isValueNull != null && isValueNull[i]) {


you can also ignore null checks entirely if NullHandling.sqlCompatible() is true, would suggest saving it as a private final field in the constructor and then maybe add something like final boolean checkNulls = hasNulls && isValueNull != null

How about if I take isValueNull != null out of loop and line L57 becomes

boolean hasNulls = isValueNull != null;

for (int i = startRow; i < endRow; i++) {
  if (hasNulls && isValueNull[i]) {

This way, I don't need to introduce another predicate in my if condition and number of null checks will still be reduced somewhat.

Ah yeah i wasn't imagining checking all the conditions in the loop, the checkNulls value I was thinking of would be in the loop, similar to hasNulls in your example. Thinking further about it though, there is no real need/advantage to checking NullHandling.sqlCompatible().
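
The shape the thread converges on can be sketched in isolation as follows. This is only an illustration: `NullCheckHoistingSketch` and `sumNonNull` are hypothetical names, and a running sum stands in for offering each value to the histogram. The null-vector reference is tested once before the loop, so each iteration only tests a local boolean and, when nulls exist, one array element.

```java
final class NullCheckHoistingSketch
{
  /**
   * Sums the non-null rows in [startRow, endRow).
   * A null isValueNull array means "this vector contains no nulls".
   */
  static double sumNonNull(double[] vector, boolean[] isValueNull, int startRow, int endRow)
  {
    // Hoisted out of the loop: the reference does not change between rows.
    final boolean hasNulls = isValueNull != null;
    double sum = 0.0d;
    for (int i = startRow; i < endRow; i++) {
      if (hasNulls && isValueNull[i]) {
        continue; // skip null rows
      }
      sum += vector[i];
    }
    return sum;
  }

  public static void main(String[] args)
  {
    double[] vector = {1.0, 2.0, 3.0, 4.0};
    boolean[] nulls = {false, true, false, false};
    System.out.println("with null vector:    " + sumNonNull(vector, nulls, 0, 4));
    System.out.println("without null vector: " + sumNonNull(vector, null, 1, 3));
  }
}
```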

clintropolis · 2020-08-26T10:49:34Z

...e/druid/query/aggregation/histogram/ApproximateHistogramFoldingBufferAggregatorInternal.java

+  public void foldFast(ApproximateHistogram left, ApproximateHistogram right)
+  {
+    //TODO: do these have to set in every call
+    left.setLowerLimit(lowerLimit);


A quick look and I think I agree, but am not totally certain. Can you try to find out if it is needed so we can remove this TODO and either remove the code, or add a comment on why it needs to be here?

clintropolis · 2020-08-26T10:54:11Z

...ava/org/apache/druid/query/aggregation/histogram/FixedBucketsHistogramAggregatorFactory.java

+  public boolean canVectorize(ColumnInspector columnInspector)
+  {
+    ColumnCapabilities capabilities = columnInspector.getColumnCapabilities(fieldName);
+    return (capabilities != null) && capabilities.getType().isNumeric();


Did you mean to only handle numeric primitive inputs? The input type could also be complex if you handle fixed bucket histogram inputs, but you would need another vector aggregator implementation I think that takes an object selector instead of value selector

Yes. I only meant to handle numeric types for now. It seems it can also take String (base64) as well as complex objects. I decided to not do that in the current PR given these extensions are also deprecated and not recommended for use anymore.

clintropolis · 2020-08-26T10:59:56Z

...a/org/apache/druid/query/aggregation/histogram/ApproximateHistogramVectorAggregatorTest.java

+    EasyMock.replay(vectorValueSelector_2);
+
+    ColumnCapabilities columnCapabilities
+        = new ColumnCapabilitiesImpl().setType(ValueType.DOUBLE).setDictionaryEncoded(true);


nit: suggest ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(ValueType.DOUBLE) since it will create realistic double capabilities (numbers are not dictionary encoded for example)

clintropolis · 2020-08-26T11:00:33Z

.../org/apache/druid/query/aggregation/histogram/FixedBucketsHistogramVectorAggregatorTest.java

+    EasyMock.replay(vectorValueSelector_2);
+
+    ColumnCapabilities columnCapabilities
+        = new ColumnCapabilitiesImpl().setType(ValueType.DOUBLE).setDictionaryEncoded(true);


nit: same comment about capabilities

clintropolis · 2020-08-26T11:04:21Z

...java/org/apache/druid/query/aggregation/histogram/ApproximateHistogramAggregatorFactory.java

+  @Override
+  public boolean canVectorize(ColumnInspector columnInspector)
+  {
+    return true;


should this check if the column is numeric or complex similar to the fixed buckets aggregator factory? I don't think we have a good way for aggregators to handle string inputs in vectorized engine yet either, unless you use SingleValueDimensionVectorSelector or MultiValueDimensionVectorSelector and lookup the string values for the int arrays yourself, so should probably exclude strings at least (not that they make much sense as an input anyway).

The approximate histogram aggregators do not handle the strings. The way I see it, canVectorize only indicates when the aggregation cannot be vectorized. E.g. fixed bucket aggregator factory can aggregate complex objects but not with vectorization and this is what canVectorize in fixedBucket**Factory checks.

In this class, any type that can be aggregated in regular aggregators, is supported by vector aggregator as well. Any type that is not supported by regular aggregator is not supported by vector aggregator as well. Hence the method canVectorize just returns true.

do you think it would still make sense to check for input type?

We need to handle it somehow because if not it will fail when making the value selector (because there is no string value selector) org.apache.druid.query.QueryInterruptedException: Cannot make VectorValueSelector for column with class[org.apache.druid.segment.column.StringDictionaryEncodedColumn]. This is inconsistent with the non-vectorized behavior, which treats the input as 0 from the dimension selectors.

The ways it can be handled are with either the canVectorize method checking explicitly for numeric types, or special handling in factorizeVector to use a nil vector selector instead of trying to make a value selector. You probably want similar checks for other agg factories, as is appropriate for the types they handle.
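
The first of those two options, an explicit numeric-type check in canVectorize, might look roughly like the sketch below. It deliberately avoids Druid's real ColumnInspector and ColumnCapabilities types: the `ColumnType` enum and the `Map` used as a column lookup are stand-ins invented for this example, and treating a missing column as not vectorizable simply mirrors the null check in the fixed buckets factory quoted earlier.

```java
import java.util.HashMap;
import java.util.Map;

final class CanVectorizeSketch
{
  /** Stand-in for a column's value type (hypothetical, not Druid's enum). */
  enum ColumnType
  {
    LONG(true), FLOAT(true), DOUBLE(true), STRING(false), COMPLEX(false);

    private final boolean numeric;

    ColumnType(boolean numeric)
    {
      this.numeric = numeric;
    }

    boolean isNumeric()
    {
      return numeric;
    }
  }

  /** Vectorize only when the input column is known and numeric. */
  static boolean canVectorize(Map<String, ColumnType> columnTypes, String fieldName)
  {
    ColumnType type = columnTypes.get(fieldName);
    return type != null && type.isNumeric();
  }

  public static void main(String[] args)
  {
    Map<String, ColumnType> columns = new HashMap<>();
    columns.put("latency", ColumnType.DOUBLE);
    columns.put("country", ColumnType.STRING);

    System.out.println("latency: " + canVectorize(columns, "latency"));
    System.out.println("country: " + canVectorize(columns, "country"));
    System.out.println("missing: " + canVectorize(columns, "missing"));
  }
}
```

With a check like this, a string column makes the query fall back to the non-vectorized path instead of failing while creating a value selector.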

abhishekagarwal87 · 2020-08-27T14:05:51Z

Gather around everyone. I have got the results. Vectorization has improved performance.

Run 1 - ApproximateHistogram aggregator in query with float input type

Benchmark                                              (numSegments)  (rowsPerSegment)  (schemaAndQuery)  (vectorize)  Mode  Cnt       Score      Error  Units
TimeseriesBenchmark.queryFilteredSingleQueryableIndex              1            750000           basic.A        false  avgt   15    9427.001 ±  175.655  us/op
TimeseriesBenchmark.queryFilteredSingleQueryableIndex              1            750000           basic.A         true  avgt   15    9343.698 ±   67.070  us/op
TimeseriesBenchmark.queryMultiQueryableIndex                       1            750000           basic.A        false  avgt   15   72868.337 ± 6823.255  us/op
TimeseriesBenchmark.queryMultiQueryableIndex                       1            750000           basic.A         true  avgt   15   27122.215 ±  515.354  us/op
TimeseriesBenchmark.querySingleQueryableIndex                      1            750000           basic.A        false  avgt   15   70254.501 ± 7000.412  us/op
TimeseriesBenchmark.querySingleQueryableIndex                      1            750000           basic.A         true  avgt   15   30643.611 ± 4038.082  us/op


Run 2 - ApproximateHistogramFolding aggregator in query with ApproximateHistogram input type


Benchmark                                              (numSegments)  (rowsPerSegment)  (schemaAndQuery)  (vectorize)  Mode  Cnt       Score       Error  Units
TimeseriesBenchmark.queryFilteredSingleQueryableIndex              1            750000           basic.A        false  avgt   15    9804.513 ±   268.530  us/op
TimeseriesBenchmark.queryFilteredSingleQueryableIndex              1            750000           basic.A         true  avgt   15    9495.384 ±    55.576  us/op
TimeseriesBenchmark.queryMultiQueryableIndex                       1            750000           basic.A        false  avgt   15  175868.359 ± 14097.956  us/op
TimeseriesBenchmark.queryMultiQueryableIndex                       1            750000           basic.A         true  avgt   15  129778.787 ±  4278.879  us/op
TimeseriesBenchmark.querySingleQueryableIndex                      1            750000           basic.A        false  avgt   15  147838.235 ±  2350.725  us/op
TimeseriesBenchmark.querySingleQueryableIndex                      1            750000           basic.A         true  avgt   15  133302.428 ±  2139.185  us/op




Run 3 - FixedBucketsHistogram aggregator in query with long input type


Benchmark                                              (numSegments)  (rowsPerSegment)  (schemaAndQuery)  (vectorize)  Mode  Cnt      Score      Error  Units
TimeseriesBenchmark.queryFilteredSingleQueryableIndex              1            750000           basic.A        false  avgt   15   9426.354 ±   94.798  us/op
TimeseriesBenchmark.queryFilteredSingleQueryableIndex              1            750000           basic.A         true  avgt   15   9354.156 ±   83.472  us/op
TimeseriesBenchmark.queryMultiQueryableIndex                       1            750000           basic.A        false  avgt   15  72228.744 ± 9069.249  us/op
TimeseriesBenchmark.queryMultiQueryableIndex                       1            750000           basic.A         true  avgt   15  26900.564 ±  410.899  us/op
TimeseriesBenchmark.querySingleQueryableIndex                      1            750000           basic.A        false  avgt   15  71854.254 ± 6718.542  us/op
TimeseriesBenchmark.querySingleQueryableIndex                      1            750000           basic.A         true  avgt   15  26808.244 ±  756.499  us/op
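
For readers who want the ratios rather than the raw scores, the sketch below divides the vectorize=false score by the vectorize=true score for the unfiltered benchmarks in the three runs above. Scores are average microseconds per operation, so a ratio above 1 means the vectorized run was faster. The class and method names are made up for this example, and the error margins reported by the benchmark are ignored.

```java
final class SpeedupSketch
{
  /** Speedup factor: time without vectorization divided by time with it. */
  static double speedup(double nonVectorizedMicros, double vectorizedMicros)
  {
    return nonVectorizedMicros / vectorizedMicros;
  }

  public static void main(String[] args)
  {
    String[] labels = {
        "Run 1 queryMultiQueryableIndex",
        "Run 1 querySingleQueryableIndex",
        "Run 2 queryMultiQueryableIndex",
        "Run 2 querySingleQueryableIndex",
        "Run 3 queryMultiQueryableIndex",
        "Run 3 querySingleQueryableIndex"
    };
    // {vectorize=false score, vectorize=true score}, copied from the tables above.
    double[][] scores = {
        {72868.337, 27122.215},
        {70254.501, 30643.611},
        {175868.359, 129778.787},
        {147838.235, 133302.428},
        {72228.744, 26900.564},
        {71854.254, 26808.244}
    };
    for (int i = 0; i < labels.length; i++) {
      System.out.printf("%-34s %.2fx%n", labels[i], speedup(scores[i][0], scores[i][1]));
    }
  }
}
```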

lgtm-com · 2020-08-27T17:49:57Z

This pull request fixes 1 alert when merging 19b2b72 into f82fd22 - view on LGTM.com

fixed alerts:

1 for Boxed variable is never null

lgtm-com · 2020-08-28T08:11:23Z

This pull request fixes 1 alert when merging db59ddd into f82fd22 - view on LGTM.com

fixed alerts:

1 for Boxed variable is never null

lgtm-com · 2020-08-28T10:10:32Z

This pull request fixes 1 alert when merging 182b610 into f82fd22 - view on LGTM.com

fixed alerts:

1 for Boxed variable is never null

clintropolis · 2020-09-02T10:04:30Z

...java/org/apache/druid/query/aggregation/histogram/ApproximateHistogramAggregatorFactory.java

+  @Override
+  public boolean canVectorize(ColumnInspector columnInspector)
+  {
+    return true;


We need to handle it somehow because if not it will fail when making the value selector (because there is no string value selector) org.apache.druid.query.QueryInterruptedException: Cannot make VectorValueSelector for column with class[org.apache.druid.segment.column.StringDictionaryEncodedColumn]. This is inconsistent with the non-vectorized behavior, which treats the input as 0 from the dimension selectors.

The ways it can be handled are with either the canVectorize method checking explicitly for numeric types, or special handling in factorizeVector to use a nil vector selector instead of trying to make a value selector. You probably want similar checks for other agg factories, as is appropriate for the types they handle.

clintropolis · 2020-09-02T10:05:59Z

...g/apache/druid/query/aggregation/histogram/ApproximateHistogramBufferAggregatorInternal.java

+ * A helper class used by {@link ApproximateHistogramBufferAggregator} and {@link ApproximateHistogramVectorAggregator}
+ * for aggregation operations on byte buffers. Getting the object from value selectors is outside this class.
+ */
+final class ApproximateHistogramBufferAggregatorInternal


super nitpick, feel free to ignore, but maybe consider naming this (and similar classes) to something like ApproximateHistogramBufferAggregatorHelper instead of ApproximateHistogramBufferAggregatorInternal to be more consistent with the naming of this style of class with the rest of the codebase. I looked around and this PR has the only classes with an Internal suffix but there are many with the Helper suffix, and is consistent with the javadoc for this class.

It's good to be consistent. I will rename these classes.

clintropolis · 2020-09-02T10:09:39Z

.../java/org/apache/druid/query/aggregation/histogram/ApproximateHistogramVectorAggregator.java

+    ApproximateHistogram histogram = innerAggregator.get(buf, position);
+
+    for (int i = startRow; i < endRow; i++) {
+      if (isValueNull != null && isValueNull[i]) {


Ah yeah i wasn't imagining checking all the conditions in the loop, the checkNulls value I was thinking of would be in the loop, similar to hasNulls in your example. Thinking further about it though, there is no real need/advantage to checking NullHandling.sqlCompatible().

clintropolis · 2020-09-02T11:55:31Z

...a/org/apache/druid/query/aggregation/histogram/ApproximateHistogramVectorAggregatorTest.java

+import static org.easymock.EasyMock.createMock;
+import static org.easymock.EasyMock.expect;
+
+public class ApproximateHistogramVectorAggregatorTest


it isn't obvious from this PR, but out of curiosity are there any tests which confirm that the vectorized aggregator results match the non-vectorized output?

There isn't a test that runs both the non-vectorized and the vectorized path at the same time, though the input/output used in the vector aggregator tests is almost the same as what is used in the tests for the non-vector aggregator.
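
A parity check of the kind asked about could take roughly the following shape. This is a sketch under heavy simplification, not a test from the PR: `VectorParitySketch` uses a toy fixed-width bucket counter that clamps out-of-range values, whereas the real FixedBucketsHistogram has its own outlier handling, and the batch size is an arbitrary assumption. The idea is only that the same input is aggregated row by row and in batches and the two results are compared.

```java
import java.util.Arrays;

final class VectorParitySketch
{
  static final int NUM_BUCKETS = 4;
  static final double LOWER = 0.0d;
  static final double UPPER = 8.0d;
  static final double WIDTH = (UPPER - LOWER) / NUM_BUCKETS;

  /** Maps a value to a bucket index, clamping out-of-range values into the edge buckets. */
  static int bucketOf(double value)
  {
    int bucket = (int) ((value - LOWER) / WIDTH);
    return Math.max(0, Math.min(NUM_BUCKETS - 1, bucket));
  }

  /** Non-vectorized path: one value per call. */
  static long[] aggregateRowByRow(double[] values)
  {
    long[] counts = new long[NUM_BUCKETS];
    for (double value : values) {
      counts[bucketOf(value)]++;
    }
    return counts;
  }

  /** Vectorized-style path: the input is consumed in [start, end) slices of at most vectorSize rows. */
  static long[] aggregateInBatches(double[] values, int vectorSize)
  {
    long[] counts = new long[NUM_BUCKETS];
    for (int start = 0; start < values.length; start += vectorSize) {
      int end = Math.min(start + vectorSize, values.length);
      for (int i = start; i < end; i++) {
        counts[bucketOf(values[i])]++;
      }
    }
    return counts;
  }

  public static void main(String[] args)
  {
    double[] values = {0.5, 1.9, 2.0, 3.3, 7.9, 8.0, -1.0, 4.4, 6.1};
    long[] rowByRow = aggregateRowByRow(values);
    long[] batched = aggregateInBatches(values, 4);
    System.out.println("row by row: " + Arrays.toString(rowByRow));
    System.out.println("batched:    " + Arrays.toString(batched));
    System.out.println("match:      " + Arrays.equals(rowByRow, batched));
  }
}
```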

lgtm-com · 2020-09-03T11:09:25Z

This pull request fixes 1 alert when merging 09ace2d into 3fc8bc0 - view on LGTM.com

fixed alerts:

1 for Boxed variable is never null

lgtm-com · 2020-09-03T12:44:05Z

This pull request fixes 1 alert when merging adfb135 into 3fc8bc0 - view on LGTM.com

fixed alerts:

1 for Boxed variable is never null

clintropolis

🤘

* First draft
* Remove redundant code from FixedBucketsHistogramAggregator classes
* Add test cases for new classes
* Fix tests in sql compatible mode
* Typo fix
* Fix comment
* Add spelling
* Vectorize only for supported types
* Rename internal aggregator files
* Fix tests

* Druid Avatica - Handle escaping of search characters correctly (#10040) Fix Avatica based metadata queries by appending ESCAPE '\' clause to the LIKE expressions * IntelliJ inspection and checkstyle rule for "Collection.EMPTY_* field accesses replaceable with Collections.empty*()" (#9690) * IntelliJ inspection and checkstyle rule for "Collection.EMPTY_* field accesses replaceable with Collections.empty*()" * Reverted checkstyle rule * Added tests to pass CI * Codestyle * fix docs (#9114) Co-authored-by: tomscut <tomscut@gmail.com> * global table only if joinable (#10041) * global table if only joinable * oops * fix style, add more tests * Update sql/src/test/java/org/apache/druid/sql/calcite/schema/DruidSchemaTest.java * better information schema columns, distinguish broadcast from joinable * fix javadoc * fix mistake Co-authored-by: Jihoon Son <jihoonson@apache.org> * Coordinator loadstatus API full format does not consider Broadcast rules (#10048) * Coordinator loadstatus API full format does not consider Broadcast rules * address comments * fix checkstyle * minor optimization * address comments * Remove changes from #9114 (#10050) * Create packed core partitions for hash/range-partitioned segments in native batch ingestion (#10025) * Fill in the core partition set size properly for batch ingestion with dynamic partitioning * incomplete javadoc * Address comments * fix tests * fix json serde, add tests * checkstyle * Set core partition set size for hash-partitioned segments properly in batch ingestion * test for both parallel and single-threaded task * unused variables * fix test * unused imports * add hash/range buckets * some test adjustment and missing json serde * centralized partition id allocation in parallel and simple tasks * remove string partition chunk * revive string partition chunk * fill numCorePartitions for hadoop * clean up hash stuffs * resolved todos * javadocs * Fix tests * add more tests * doc * unused imports * Fix join filter rewrites 
with nested queries (#10015) * Fix join filter rewrites with nested queries * Fix test, inspection, coverage * Remove clauses from group key * Fix import order Co-authored-by: Gian Merlino <gianmerlino@gmail.com> * fix topn on string columns with non-sorted or non-unique dictionaries (#10053) * fix topn on string columns with non-sorted or non-unique dictionaries * fix metadata tests * refactor, clarify comments and code, fix ci failures * Add safeguard to make sure new Rules added are aware of Rule usage in loadstatus API (#10054) * Add safeguard to make sure new Rules added are aware of Rule usuage in loadstatus API * address comments * address comments * add tests * SketchAggregator.updateUnion should handle null inside List update object (#10055) * fix docs error in hadoop-based part (#9907) * fix docs error: google to azure and hdfs to http * fix docs error: indexSpecForIntermediatePersists of tuningConfig in hadoop-based batch part * fix docs error: logParseExceptions of tuningConfig in hadoop-based batch part * fix docs error: maxParseExceptions of tuningConfig in hadoop-based batch part * minor rework of topn algorithm selection for clarity and more javadocs (#10058) * minor refactor of topn engine algorithm selection for clarity * adjust * more javadoc * change default number of segment loading threads (#9856) * change default number of segment loading threads * fix docs * missed file * min -> max for segment loading threads Co-authored-by: Dylan <dwylie@spotx.tv> * retry 500 and 503 errors against kinesis (#10059) * retry 500 and 503 errors against kinesis * add test that exercises retry logic * more branch coverage * retry 500 and 503 on getRecords request when fetching sequence numberu Co-authored-by: Harshpreet Singh <hrshpr@twitch.tv> * Druid user permissions (#10047) * Druid user permissions apply in the console * Update index.md * noting user warning in console page; some minor shuffling * noting user warning in console page; some minor shuffling 1 
* touchups * link checking fixes * Updated per suggestions * Fix HyperUniquesAggregatorFactory.estimateCardinality null handling to respect output type (#10063) * fix return type from HyperUniquesAggregator/HyperUniquesVectorAggregator * address comments * address comments * Enable query vectorization by default (#10065) * Enable query vectorization by default * update docs * Optimize protobuf parsing for flatten data (#9999) * optimize for protobuf parsing * fix import error and maven dependency * add unit test in protobufInputrowParserTest for flatten data * solve code duplication (remove the log and main()) * rename 'flatten' to 'flat' to make it clearer Co-authored-by: xionghuilin <xionghuilin@bytedance.com> * fix dimension names for jvm monitor metrics (#10071) * update avatica to handle additional character sets over jdbc (#10074) * update avatica to handle additional character sets over jdbc * update license yaml, fix test * oops * Fix balancer strategy (#10070) * fix server overassignment * fix random balancer strategy, add more tests * comment * added more tests * fix forbidden apis * fix typo * fix dropwizard emitter jvm bufferpoolName metric (#10075) * fix dropwizard emitter jvm bufferpoolName metric * fixes * Allow append to existing datasources when dynamic partitioning is used (#10033) * Fill in the core partition set size properly for batch ingestion with dynamic partitioning * incomplete javadoc * Address comments * fix tests * fix json serde, add tests * checkstyle * Set core partition set size for hash-partitioned segments properly in batch ingestion * test for both parallel and single-threaded task * unused variables * fix test * unused imports * add hash/range buckets * some test adjustment and missing json serde * centralized partition id allocation in parallel and simple tasks * remove string partition chunk * revive string partition chunk * fill numCorePartitions for hadoop * clean up hash stuffs * resolved todos * javadocs * Fix tests * add 
more tests * doc * unused imports * Allow append to existing datasources when dynamic partitioing is used * fix test * checkstyle * checkstyle * fix test * fix test * fix other tests.. * checkstyle * hansle unknown core partitions size in overlord segment allocation * fail to append when numCorePartitions is unknown * log * fix comment; rename to be more intuitive * double append test * cleanup complete(); add tests * fix build * add tests * address comments * checkstyle * Fix missing temp dir for native single_dim (#10046) * Fix missing temp dir for native single_dim Native single dim indexing throws a file not found exception from InputEntityIteratingReader.java:81. This MR creates the required temporary directory when setting up the PartialDimensionDistributionTask. The change was tested on a Druid cluster. After installing the change native single_dim indexing completes successfully. * Fix indentation * Use SinglePhaseSubTask as example for creating the temp dir * Move temporary indexing dir creation in to TaskToolbox * Remove unused dependency Co-authored-by: Morri Feldman <morri@appsflyer.com> * More prominent instructions on code coverage failure (#10060) * More prominent instructions on code coverage failure * Update .travis.yml * Add NonnullPair (#10013) * Add NonnullPair * new line * test * make it consistent * Add integration tests for SqlInputSource (#10080) * Add integration tests for SqlInputSource * make it faster * ensure ParallelMergeCombiningSequence closes its closeables (#10076) * ensure close for all closeables of ParallelMergeCombiningSequence * revert unneeded change * consolidate methods * catch throwable instead of exception * fix MaterializedView gropuby query return arry result by default (#9936) * fix bug:MaterializedView gropuby query return map result by default * add unit test * add unit test * add unit test * fix bug:MaterializedView gropuby query return map result by default * add unit test * add unit test * add unit test * update 
pr * update pr Co-authored-by: xiangqiao <xiangqiao@kuaishou.com> * Fix NPE when brokers use custom priority list (#9878) * fix query memory leak (#10027) * fix query memory leak * rollup ./idea * roll up .idea * clean code * optimize style * optimize cancel function * optimize style * add concurrentGroupTest test case * add test case * add unit test * fix code style * optimize cancell method use * format code * reback code * optimize cancelAll * clean code * add comment * Segment timeline doesn't show results older than 3 months (#9956) * Segment timeline doesn't show results older than 3 months * Adoption testing patch for web segment timeline view and also refactoring default time config * Filter http requests by http method (#10085) * Filter http requests by http method Add a config that allows a user which http methods to allow against their Druid server. Druid will only accept http requests with the method: GET, PUT, POST, DELETE and OPTIONS. If a Druid admin wants to allow other methods, they can do so by using the ServerConfig#allowedHttpMethods config. If a Druid user would like to disallow OPTIONS, this can be done by changing the AuthConfig#allowUnauthenticatedHttpOptions config * Exclude OPTIONS from always supported HTTP methods Add HEAD as an allowed method for web console e2e tests * fix docs * fix security IT * Actually fix the web console e2e tests * Ignore icode coverage for nitialization classes * code review * Move shardSpec tests to core (#10079) * Move shardSpec tests to core * checkstyle * inject object mapper for testing * unused import * Fix native batch range partition segment sizing (#10089) * Fix native batch range partition segment sizing Fixes #10057. Native batch range partitioning was only considering the partition dimension value when grouping rows instead of using all of the row's partition values. 
Thus, for schemas with multiple dimensions, the rollup was overestimated, which would cause too many dimension values to be packed into the same range partition. The resulting segments would then be overly large (and not honor the target or max partition sizes). Main changes: - PartialDimensionDistributionTask: Consider all dimension values when grouping row - RangePartitionMultiPhaseParallelIndexingTest: Regression test by having input with rows that should roll up and rows that should not roll up * Use hadoop & native hash ingestion row group key * Fix nullhandling exception (#10095) Co-authored-by: Atul Mohan <atulmohan@yahoo-inc.com> * Make 0.19 brokers compatible with 0.18 router (#10091) * Make brokers backwards compatible In 0.19, Brokers gained the ability to serve segments. To support this change, a `BROKER` ServerType was added to `druid.server.coordination`. Druid nodes prior to this change do not know of this new server type and so they would fail to deserialize this node's announcement. This change makes it so that the broker only announces itself if the segment cache is configured on the broker. It is expected that a Druid admin will only configure the segment cache on the broker once the cluster has been upgraded to a version that supports a broker using the segment cache. * make code nicer * Add tests * Ignore icode coverage for nitialization classes * Revert "Ignore icode coverage for nitialization classes" This reverts commit aeec0c2ac2b07c1b9262e32201913c7194167271. * code review * Correct the position of the double quotation in distinctcount.md file (#10094) ``` "dimensions": "[sample_dim]" ``` should be ``` "dimensions": ["sample_dim"] ``` * QueryCountStatsMonitor can be injected in the Peon (#10092) * QueryCountStatsMonitor can be injected in the Peon This change fixes a dependency injection bug where there is a circular dependency on getting the MonitorScheduler when a user configures the QueryCountStatsMonitor to be used. 
* fix tests * Actually fix the tests this time * Information schema doc update (#10081) * add docs for IS_JOINABLE and IS_BROADCAST to INFORMATION_SCHEMA docs * fixes * oops * revert noise * missed one * spellbot * Remove payload field from table sys.segment (#9883) * remove payload field from table sys.segments * update doc * fix test * fix CI failure * add necessary fields * fix doc * fix comment * Web console: allow link overrides for docs, and more (#10100) * link overrides * change doc version * fix snapshots * Enabling Static Imports for Unit Testing DSLs (#331) (#9764) * Enabling Static Imports for Unit Testing DSLs (#331) Co-authored-by: mohammadshoaib <mohammadshoaib@miqdigital.com> * Feature 8885 - Enabling Static Imports for Unit Testing DSLs (#435) * Enabling Static Imports for Unit Testing DSLs * Using suppressions checkstyle to allow static imports only in the UTs Co-authored-by: mohammadshoaib <mohammadshoaib@miqdigital.com> * Removing the changes in the checkstyle because those are not needed Co-authored-by: mohammadshoaib <mohammadshoaib@miqdigital.com> * Prevent unknown complex types from breaking DruidSchema refresh (#9422) * Update web address to datasketches.apache.org (#10096) * Join filter pre-analysis simplifications and sanity checks. (#10104) * Join filter pre-analysis simplifications and sanity checks. - At pre-analysis time, only compute pre-analysis for the innermost root query, since this is the one that will run on the join that involves the base datasource. Previously, pre-analyses were computed for multiple levels of the query, some of which were unnecessary. - Remove JoinFilterPreAnalysisGroup and join query level gathering code, since they existed to support precomputation of multiple pre-analyses. - Embed JoinFilterPreAnalysisKey into JoinFilterPreAnalysis and use it to sanity check at processing time that the correct pre-analysis was done. 
Tangentially related changes: - Remove prioritizeAndLaneQuery functionality from LocalQuerySegmentWalker. The computed priority and lanes were not being used. - Add "getBaseQuery" method to DataSourceAnalysis to support identification of the proper subquery for filter pre-analysis. * Fix compilation errors. * Adjust tests. * Filter on metrics doc (#10087) * add note about filter on metrics to filter docs * edit doc to include having and filtered aggregator links * Fix UnknownTypeComplexColumn#makeVectorObjectSelector * Fix RetryQueryRunner to actually do the job (#10082) * Fix RetryQueryRunner to actually do the job * more javadoc * fix test and checkstyle * don't combine for testing * address comments * fix unit tests * always initialize response context in cachingClusteredClient * fix subquery * address comments * fix test * query id for builders * make queryId optional in the builders and ClusterQueryResult * fix test * suppress tests and unused methods * exclude groupBy builder * fix jacoco exclusion * add tests for builders * address comments * don't truncate * Closing yielder from ParallelMergeCombiningSequence should trigger cancellation (#10117) * cancel parallel merge combine sequence on yielder close * finish incomplete comment * Update core/src/test/java/org/apache/druid/java/util/common/guava/ParallelMergeCombiningSequenceTest.java Fixes checkstyle Co-authored-by: Jihoon Son <jihoonson@apache.org> * Revert "Fix UnknownTypeComplexColumn#makeVectorObjectSelector" (#10121) This reverts commit 7bb7489afc7a2cc496be93ae69681b6ab13a7c66. * update links datasketches.github.io to datasketches.apache.org (#10107) * update links datasketches.github.io to datasketches.apache.org * now with more apache * oops * oops * Fix Stack overflow with infinite loop in ReduceExpressionsRule of HepProgram (#10120) * Fix Stack overflow with SELECT ARRAY ['Hello', NULL] * address comments * fixes for ranger docs (#10109) * Fix UnknownComplexTypeColumn#makeVectorObjectSelector. 
Add a warning … (#10123) * Fix UnknownComplexTypeColumn#makeVectorObjectSelector. Add a warning message to indicate failure in deserializing. * support Aliyun OSS service as deep storage (#9898) * init commit, all tests passed * fix format Signed-off-by: frank chen <frank.chen021@outlook.com> * data stored successfully * modify config path * add doc * add aliyun-oss extension to project * remove descriptor deletion code to avoid warning message output by aliyun client * fix warnings reported by lgtm-com * fix ci warnings Signed-off-by: frank chen <frank.chen021@outlook.com> * fix errors reported by intellj inspection check Signed-off-by: frank chen <frank.chen021@outlook.com> * fix doc spelling check Signed-off-by: frank chen <frank.chen021@outlook.com> * fix dependency warnings reported by ci Signed-off-by: frank chen <frank.chen021@outlook.com> * fix warnings reported by CI Signed-off-by: frank chen <frank.chen021@outlook.com> * add package configuration to support showing extension info Signed-off-by: frank chen <frank.chen021@outlook.com> * add IT test cases and fix bugs Signed-off-by: frank chen <frank.chen021@outlook.com> * 1. code review comments adopted 2. 
change schema from 'aliyun-oss' to 'oss' Signed-off-by: frank chen <frank.chen021@outlook.com> * add license info Signed-off-by: frank chen <frank.chen021@outlook.com> * fix doc Signed-off-by: frank chen <frank.chen021@outlook.com> * exclude execution of IT testcases of OSS extension from CI Signed-off-by: frank chen <frank.chen021@outlook.com> * put the extensions under contrib group and add to distribution * fix names in test cases * add unit test to cover OssInputSource * fix names in test cases * fix dependency problem reported by CI Signed-off-by: frank chen <frank.chen021@outlook.com> * Clarify change in behavior for druid.server.maxSize (#10105) * Clarify maxSize docs * Add info about maxSize Co-authored-by: Atul Mohan <atulmohan@yahoo-inc.com> * Add DimFilter.toOptimizedFilter(), ensure that join filter pre-analysis operates on optimized filters (#10056) * Ensure that join filter pre-analysis operates on optimized filters, add DimFilter.toOptimizedFilter * Remove aggressive equality check that was used for testing * Use Suppliers.memoize * Checkstyle * Fix CachingClusteredClient when querying specific segments (#10125) * Fix CachingClusteredClient when querying specific segments * delete useless test * roll back timeout * Remove unsupported task types in doc (#10111) * VersionedIntervalTimeline: Fix thread-unsafe call to "lookup". (#10130) * bump version to 0.20.0-SNAPSHOT (#10124) * AbstractOptimizableDimFilter should be public (#10142) * mask secrets in MM task command log (#10128) * mask secrets in MM task command log * unit test for masked iterator * checkstyle fix * Update Jetty to 9.4.30.v20200611. (#10098) * Update Jetty to 9.4.30.v20200611. This is the latest version currently available in the 9.4.x line. * Various adjustments. * Class name fixes. * Remove unused HttpClientModule code. * Add coverage suppressions. * Another coverage suppression. * Fix wildcards. 
* ui: fix missing columns during Transform step (#10086) Co-authored-by: egor-ryashin <egor.ryashin@metamarkets.com> * Add availability and consistency docs. (#10149) * Add availability and consistency docs. Describes transactional ingestion and atomic replacement. Also, this patch deletes some bad advice from the javadocs for SegmentTransactionalInsertAction. * Fix missing word. * Update dictionary for spell check (#10152) * Fix avg sql aggregator (#10135) * new average aggregator * method to create count aggregator factory * test everything * update other usages * fix style * fix more tests * fix datasketches tests * Reduce memory footprint of integration test by not starting unneeded containers (#10150) * Reduce memory footprint of integration test * fix README * fix README * fix error in script * fix security IT * Add integration tests for all InputFormat (#10088) * Add integration tests for Avro OCF InputFormat * Add integration tests for Avro OCF InputFormat * add tests * fix bug * fix bug * fix failing tests * add comments * address comments * address comments * address comments * fix test data * reduce resource needed for IT * remove bug fix * fix checkstyle * add bug fix * Follow-up for RetryQueryRunner fix (#10144) * address comments; use guice instead of query context * typo * QueryResource tests * address comments * catch queryException * fix spell check * Fix documentation for Kinesis fetchThreads. (#10156) * Fix documentation for Kinesis fetchThreads The default was changed in #9819, but the documentation wasn't updated. * Add 'procs' to spelling. 
* renamed authenticationChain to authenticatorChain (#10143) * Fix flaky tests in DruidCoordinatorTest (#10157) * Fix flaky tests in DruidCoordinatorTest * Improve fail msg * Fix flaky tests in DruidCoordinatorTest * Update ambari-metrics-common to version 2.6.1.0.0 (#10165) * Switch to apache version of ambari-metrics-common * Add test * Fix intellij inspection * Fix intellij inspection * Do not echo back username on auth failure (#10097) * Do not echo back username on auth failure * use bad username * Remove username from exception messages * fix tests * fix the tests * hopefully this time * this time the tests work * fixed this time * fix * upgrade to Jetty 9.4.30 * Unknown users echo back Unauthorized * fix * fix website build (#10172) * fix mvn website build to use mvn supplied nodejs, fix broken redirects, move block from custom.css to custom.scss so it will be correctly generated * sidebar * fix lol * split web-console e2e-tests from unit tests (#10173) * split web-console e2e-test from unit test * fix stuff * smaller change * oops * Fix formatting in druid-pac4j documentation (#10174) Superfluous column broke table formatting.
* Add additional properties for Kafka AdminClient and consumer from test config file (#10137) * Add kafka test configs from file for AdminClient and consumer * review comment * Add groupBy limitSpec to queryCache key (#10093) * Add groupBy limitSpec to queryCache key * Only add limitSpec to cache key if pushdown is set to true * review comment * Add validation for authenticator and authorizer name (#10106) * Add validation for authorizer name * fix deps * add javadocs * Do not use resource filters * Fix BasicAuthenticatorResource as well * Add integration tests * fix test * fix * JettyTest.testNumConnectionsMetricHttp is rarely flaky (#10169) * Change color of Run button for native queries (#10170) * Change color of Run button for native queries When a user tries to run a native query, change the color of the button to Druid's secondary color to indicate that the user is not running a SQL query. Before this change, the web-console would indicate this by changing the text of the button from Run (SQL queries) to Rune (native queries). Rune could be confusing to users as this appears to be a typo. 
* Update web-console/src/views/query-view/run-button/run-button.scss * Update web-console/src/views/query-view/run-button/run-button.scss * Update web-console/src/views/query-view/run-button/run-button.scss * code review * Add integration tests for Appends (#10186) * append test * add append IT * fix checkstyle * fix checkstyle * Remove parallel * fix checkstyle * fix * fix * address comments * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * update release process guide to include web-console versions (#10176) * Report missing segments when there is no segment for the query datasource in historicals (#10199) * Report missing segments when there is no segment for the query datasource in historicals * test * missing part for test * another test * Fix ITSqlInputSourceTest (#10194) * Fix ITSqlInputSourceTest.java * Fix ITSqlInputSourceTest.java * Fix ITSqlInputSourceTest.java * fix * fix * fix * fix * fix * fix * fix * fix * include staged maven artifacts in example vote thread (#10200) * ingestion and tutorial doc update (#10202) * Fix sys.servers table to not throw NPE and handle brokers/indexers/peons properly for broadcast segments (#10183) * Fix sys.servers table to not throw NPE and handle brokers/indexers/peons properly for broadcast segments * fix tests and add missing tests * revert null handling fix * unused import * move out util methods from DiscoveryDruidNode * Add integration tests for query retry on missing segments (#10171) * Add integration tests for query retry on missing segments * add missing dependencies; fix travis conf * address comments * Integration tests extension * remove unused dependency * remove druid_main * fix java agent port * Update RoaringBitmap to 0.9.0 (#9987) * Update QueryView to use latest DruidQueryToolkit (#10201) * Update to latest DruidQueryToolkit * add THEN keyword * do not crash on invalid JSON * add explicit example for jdbc query context on connection properties (#10182) * add explicit example for jdbc 
query context on connection properties * make comment clearer * Update sql.md * Update sql.md * Suppress CVE-2020-7692 (#10214) Druid is not a native app, so this CVE should not apply. * Fix timeseries query constructor when postAggregator has an expression reading timestamp result column (#10198) * Fix timeseries query constructor when postAggregator has an expression reading timestamp result column * fix npe * Fix postAgg referencing timestampResultField and add a test for it * fix test * doc * revert doc * Cluster wide default query context setting (#10208) * Cluster wide default query context setting * Cluster wide default query context setting * Cluster wide default query context setting * add docs * fix docs * update props * fix checkstyle * fix checkstyle * fix checkstyle * update docs * address comments * fix checkstyle * fix checkstyle * fix checkstyle * fix checkstyle * fix checkstyle * fix NPE * Add segment pruning for hash based shard spec (#9810) * Add segment pruning for hash based partitioning * Update doc * Add additional test * Address comments * Fix unit test failure Co-authored-by: Jian Wang <jwang@pinterest.com> * Support unit on byte-related properties (#10203) * support unit suffix on byte-related properties * add doc * change default value of byte-related properites in example files * fix coding style * fix doc * fix CI * suppress spelling errors * improve code according to comments * rename Bytes to HumanReadableBytes * add getBytesInInt to get value safely * improve doc * fix problem reported by CI * fix problem reported by CI * resolve code review comments * improve error message * improve code & doc according to comments * fix CI problem * improve doc * suppress spelling check errors * fill out missing test coverage for druid-datasketches postaggs (#9730) * fill out missing test coverage for druid-datasketches postaggs * fixup * fixup merge * oops * oops again * Add vectorization support for the longMin aggregator. 
(#10211) * Fix minor formatting in docs. * Add Nullhandling initialization for test to run from IDE. * Vectorize longMin aggregator. - A new vectorized class for the vectorized long min aggregator. - Changes to AggregatorFactory to support vectorize functionality. - Few changes to schema evolution test to add LongMinAggregatorFactory. * Add longSum to the supported vectorized aggregator implementations. * Add MIN() long min to calcite query test that can vectorize. * Add simple long aggregations test. * Fixup formatting per checkstyle guide. * fixup and add more tests for long min aggregator. * Override test for groupBy since timestamps are handled differently. * Null compatibility check in test. * Review comment: Add a test case to LongMinAggregationTest. * change search filter to includes (#10141) * Web console: Improve retention rules dialog in all sorts of ways (#10226) * improve ret rules * tidy up tests * Add "offset" parameter to GroupBy query. (#10235) * Add "offset" parameter to GroupBy query. It works by doing the query as normal and then throwing away the first "offset" number of rows on the broker. * Stabilize GroupBy sorts. * Fix inspections. * Fix suppression. * Fixups. * Move TopNSequence to druid-core. * Addl comments. * NumberedElement equals verification. * Changes from review. * Combine InDimFilter, InFilter. (#10119) * Combine InDimFilter, InFilter. There are two motivations: 1. Ensure that when HashJoinSegmentStorageAdapter compares its Filter to the original one, and it is an "in" type, the comparison is by reference and does not need to check deep equality. This is useful when the "in" filter is very large. 2. Simplify things. (There isn't a great reason for the DimFilter and Filter logic to be separate, and combining them reduces some duplication.) * Fix test. 
* improve JSON paste (#10256) * Set default server.maxsize to the sum of segment cache (#10255) * Default server.maxsize * Remove maxsize refs from config Co-authored-by: Atul Mohan <atulmohan@yahoo-inc.com> * Vectorization support for long, double, float min & max aggregators. (#10260) * LongMaxVectorAggregator support and test case. * DoubleMinVectorAggregator and test cases. * DoubleMaxVectorAggregator and unit test. * FloatMinVectorAggregator and FloatMaxVectorAggregator. * Documentation update to include the other vector aggregators. * Bug fix. * checkstyle formatting fixes. * CalciteQueryTest cases update. * Separate test classes for FloatMaxAggregation and FloatMniAggregation. * remove the cannotVectorize for float max/min aggregator in test. * Tests in GroupByQueryRunner, GroupByTimeseriesQueryRunner and TimeseriesQueryRunner. * Make stale bot less aggressive (#10261) * fix bug with expressions on sparse string realtime columns without explicit null valued rows (#10248) * fix bug with realtime expressions on sparse string columns * fix test * add comment back * push capabilities for dimensions to dimension indexers since they know things * style * style * fixes * getting a bit carried away * missed one * fix it * benchmark build fix * review stuffs * javadoc and comments * add comment * more strict check * fix missed usaged of impl instead of interface * Fix broken sampler for re-indexing (#10196) * Fix broken sampler for re-indexer When re-indexing a Druid datasource, the web-console would generate an invalid inputFormat since the type is not specified. * code review * Fix two id-over-maxId errors in StringDimensionIndexer. (#10245) 1) lookupId could return IDs beyond maxId if called with a recently added value. 2) getRow could return an ID for null beyond maxId, if null was recently encountered in a dimension that initially didn't appear at all. (In this case, the dictionary ID for null can be > 0). 
Also add a comment explaining how this stuff is supposed to work. * Clarify documentation on dimensions, dimensionExclusions. (#10265) In particular: exclusions are ignored if dimensions are set. * Fix javadoc mistake in DefaultLimitSpec. (#10269) Javadoc for getLimit should say it's a limit, not an offset. * Web console: fix json input (#10271) * fix json input * tidy up * add error extraction test * Allow forceLimitPushDown in SQL (#10253) * Allow forceLimitPushDown in SQL * fix test * fix test * review comments * fix test * add hasNulls to ColumnCapabilities, ColumnAnalysis (#10219) * add isNullable to ColumnCapabilities, ColumnAnalysis * better builder * fix segment metadata queries in integration tests * adjustments * cleanup * fix spotbugs * treat unknown as true in segmentmetadata * rename to hasNulls, add docs * fixup * test the dim indexer selector isNull fix for numeric columns * fixes * oof * Add "offset" parameter to the Scan query. (#10233) * Add "offset" parameter to the Scan query. It works by doing the query as normal and then throwing away the first "offset" number of rows on the broker. * Fix constructor call. * Fix up JSONs. * Fix call to ScanQuery. * Doc update. * Fix javadocs. * Spotbugs, LGTM suppressions. * Javadocs. * Fix suppression. * Stabilize Scan query result order, add tests. * Update LGTM comment. * Fixup. * Test different batch sizes too. * Nicer tests. * Fix comment. * remove DruidLeaderClient.goAsync(..) that does not follow redirect. Replace its usage by DruidLeaderClient.go(..) with InputStreamFullResponseHandler (#9717) * remove DruidLeaderClient.goAsync(..) that does not follow redirect. Replace its usage by DruidLeaadereClient.go(..) 
with InputStreamFullResponseHandler * remove ByteArrayResponseHolder dependency from JsonParserIterator * add UT to cover lines in InputStreamFullResponseHandler * refactor SystemSchema to reduce branches * further reduce branches * Revert "add UT to cover lines in InputStreamFullResponseHandler" This reverts commit 330aba3dd98ce15a13cd6ca607824bc07036ee81. * UTs for InputStreamFullResponseHandler * remove unused imports * Update Kafka dependencies to 2.6.0 (#10286) * update Kafka dependencies to Kafka 2.6.0 * switch to Scala 2.13 build of Kafka * update integration tests * update Kafka tutorial * typo fix from hear to here (#10292) Should be `There are no other changes that need to be made here` * Add note about aggregations on floats (#10285) * Add note about aggregations on floats Floating point math is known to be unstable. Due to the way aggregators work across segments, it's possible for the same query operating on the same data to produce slightly different results. The same problem exists with any aggregators that are not commutative since the merge order across segments is not guaranteed.
* Also talk about doubles * Apply suggestions from code review * Don't log the entire task spec (#10278) * Don't log the entire task spec * fix lgtm * fix serde * address comments and add tests * fix tests * remove unnecessary codes * fix connectionId issue with JDBC prepared statement queries and router (#10272) * fix router jdbc prepared statement connectionId issue * column metadata too * style * remove tls * try tls again * add keystore stuffs * use keyManager password * add unit test * simplify * Fix CombiningFirehose compatibility (#10264) * Fix CombiningFirehose * Add integration test * Fix path * Add full datasource name * Fix input location Co-authored-by: Atul Mohan <atulmohan@yahoo-inc.com> * Segment backed broadcast join IndexedTable (#10224) * Segment backed broadcast join IndexedTable * fix comments * fix tests * sharing is caring * fix test * i hope this doesnt fix it * filter by schema to maybe fix test * changes * close join stuffs so it does not leak, allow table to directly make selector factory * oops * update comment * review stuffs * better check * Add maxNumFiles to splitHintSpec (#10243) * Add maxNumFiles to splitHintSpec * missing link * fix build failure; use maxNumFiles for integration tests * spelling * lower default * Update docs/ingestion/native-batch.md Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> * address comments; change default maxSplitSize * spelling * typos and doc * same change for segments splitHintSpec * fix build * fix build Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> * Add SQL "OFFSET" clause. (#10279) * Add SQL "OFFSET" clause. Under the hood, this uses the new offset features from #10233 (Scan) and #10235 (GroupBy). Since Timeseries and TopN queries do not currently have an offset feature, SQL planning will switch from one of those to Scan or GroupBy if users add an OFFSET. 
Includes a refactoring to harmonize offset and limit planning using an OffsetLimit wrapper class. This is useful because it ensures that the various places that need to deal with offset and limit collapsing all behave the same way, using its "andThen" method. * Fix test and add another test. * introduce interning of internal files names in SmooshedFileMapper (#10295) * Redis cache extension enhancement (#10240) * support redis cluster * add 'password', 'database' properties * test cases passed * update doc * some improvements * fix CI * add more test cases to improve branch coverage * fix dependency check for test * resolve review comments * Optimize large InDimFilters (#10312) * Optimize large InDimFilters For large InDimFilters, in default mode, the filter does a linear check of the set to see if it contains either an empty string or a null. If it does, the empties are converted to nulls by passing through the entire list again. Instead of this, in default mode, we attempt to remove an empty string from the values that are passed to the InDimFilter. If an empty string was removed, we add null to the set. * code review * Revert "code review" This reverts commit 61fe33ebf762764bb89108ddd966937f3313be71. * code review - less brittle * ExpressionFilter: Use index for expressions of single multi-value columns. (#10320) Previously, this was disallowed, because expressions treated multi-values as nulls. But now, if there's a single multi-value column that can be mapped over, it's okay to use the index. Expression selectors already do this. * Clarify SQL behavior for multi-value dimensions. (#10276) There are some known inconsistencies between SQL and native that users should be aware of. * Remove NUMERIC_HASHING_THRESHOLD (#10313) * Make NUMERIC_HASHING_THRESHOLD configurable Change the default numeric hashing threshold to 1 and make it configurable. Benchmarks attached to this PR show that binary searches are no faster than doing a set contains check.
The attached flamegraph shows the amount of time a query spent in the binary search. Given the benchmarks, we can expect to see roughly a 2x speed up in this part of the query which works out to ~ a 10% faster query in this instance. * Remove NUMERIC_HASHING_THRESHOLD * Remove stale docs * refactor internal type system (#9638) * better type tracking: add typed postaggs, finalized types for agg factories * more javadoc * adjustments * transition to getTypeName to be used exclusively for complex types * remove unused fn * adjust * more better * rename getTypeName to getComplexTypeName * setup expression post agg for type inference existing * more javadocs * fixup * oops * more test * more test * more comments/javadoc * nulls * explicitly handle only numeric and complex aggregators for incremental index * checkstyle * more tests * adjust * more tests to showcase difference in behavior * timeseries longsum array * Handle internal kinesis sequence numbers when reporting lag (#10315) * Handle internal kinesis sequence numbers when reporting lag * add unit test * Adding supported compression formats for native batch ingestion (#10306) * Adding supported compression formats for native batch ingestion * Update docs/ingestion/native-batch.md Co-authored-by: sthetland <steve.hetland@imply.io> * fix spellcheck Co-authored-by: Suneet Saldanha <suneet@apache.org> Co-authored-by: sthetland <steve.hetland@imply.io> * Add support for all partitioing schemes for auto compaction (#10307) * Add support for all partitioing schemes for auto compaction * annotate last compaction state for multi phase parallel indexing * fix build and tests * test * better home * Fix handling of 'join' on top of 'union' datasources. (#10318) * Fix handling of 'join' on top of 'union' datasources. The problem is that unions are typically rewritten into a series of individual queries on the underlying tables, but this isn't done when the union is wrapped in a join. 
The main changes are in UnionQueryRunner: 1) Replace an instanceof UnionQueryRunner check with DataSourceAnalysis. 2) Replace a "query.withDataSource" call with a new function, "Queries.withBaseDataSource". Together, these enable UnionQueryRunner to "see through" a join. * Tests. * Adjust heap sizes for integration tests. * Different approach, more tests. * Tweak. * Styling. * Move tools for indexing to TaskToolbox instead of injecting them in constructor (#10308) * Move tools for indexing to TaskToolbox instead of injecting them in constructor * oops, other changes * fix test * unnecessary new file * fix test * fix build * SQL support for union datasources. (#10324) * SQL support for union datasources. Exposed via the "UNION ALL" operator. This means that there are now two different implementations of UNION ALL: one at the top level of a query that works by concatenating subquery results, and one at the table level that works by creating a UnionDataSource. The SQL documentation is updated to discuss these two use cases and how they behave. Future work could unify these by building support for a native datasource that represents the union of multiple subqueries. (Today, UnionDataSource can only represent the union of tables, not subqueries.) * Fixes. * Error message for sanity check. * Additional test fixes. * Add some error messages. * Remove implied profanity from error messages. (#10270) i.e. WTF, WTH. * split up Expr.java (#10333) * Web console: add tile for Azure Event Hubs (via Kafka API) (#10317) * Add Azure Event Hubs * better note * update icon * add link to Docker quickstart in github README (#10299) Per suggestion in comment https://github.com/apache/druid/pull/9262#issuecomment-675732237, I think this should eventually result in the copy mirrored on dockerhub to also be updated, if I understand how things work. 

* First draft
* Remove redundant code from FixedBucketsHistogramAggregator classes
* Add test cases for new classes
* Fix tests in sql compatible mode
* Typo fix
* Fix comment
* Add spelling
* Vectorize only for supported types
* Rename internal aggregator files
* Fix tests

* Vectorized versions of HllSketch aggregators.

  The patch uses the same "helper" approach as #10767 and #10304, and extends the tests to run in both vectorized and non-vectorized modes.

  Also includes some minor changes to the theta sketch vector aggregator:

  - Cosmetic changes to make the hll and theta implementations look more similar.
  - Extends the theta SQL tests to run in vectorized mode.

* Updates post-code-review.
* Fix javadoc.
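For readers unfamiliar with the "helper" approach mentioned above, here is a minimal, self-contained sketch of the idea: one internal class owns the buffer layout and update logic, and both the row-at-a-time aggregator and the vector aggregator delegate to it. All class and method names below (`HelperApproachSketch`, `SumHelper`, `RowAggregator`, `VectorAggregator`) are hypothetical and chosen for illustration; they are not the classes added in this PR, and the count-plus-sum state stands in for the much larger histogram state.

```java
import java.nio.ByteBuffer;

/** Illustrative only: hypothetical names, not Druid's actual aggregator classes. */
public class HelperApproachSketch {

  /** Shared helper: the only place that knows the buffer layout (long count, then double sum). */
  public static final class SumHelper {
    public static final int BYTES = Long.BYTES + Double.BYTES;

    public void init(ByteBuffer buf, int position) {
      buf.putLong(position, 0L);
      buf.putDouble(position + Long.BYTES, 0.0);
    }

    public void add(ByteBuffer buf, int position, double value) {
      buf.putLong(position, buf.getLong(position) + 1);
      int sumOffset = position + Long.BYTES;
      buf.putDouble(sumOffset, buf.getDouble(sumOffset) + value);
    }

    public long count(ByteBuffer buf, int position) {
      return buf.getLong(position);
    }

    public double sum(ByteBuffer buf, int position) {
      return buf.getDouble(position + Long.BYTES);
    }
  }

  /** Row-at-a-time path: one call per row, delegating to the helper. */
  public static final class RowAggregator {
    private final SumHelper helper = new SumHelper();

    public void init(ByteBuffer buf, int position) {
      helper.init(buf, position);
    }

    public void aggregate(ByteBuffer buf, int position, double value) {
      helper.add(buf, position, value);
    }
  }

  /** Vectorized path: one call per batch of rows, delegating to the same helper. */
  public static final class VectorAggregator {
    private final SumHelper helper = new SumHelper();

    public void init(ByteBuffer buf, int position) {
      helper.init(buf, position);
    }

    public void aggregate(ByteBuffer buf, int position, double[] vector, int startRow, int endRow) {
      for (int i = startRow; i < endRow; i++) {
        helper.add(buf, position, vector[i]);
      }
    }
  }

  public static void main(String[] args) {
    double[] values = {1.0, 2.5, 4.0};
    ByteBuffer buf = ByteBuffer.allocate(2 * SumHelper.BYTES);
    SumHelper reader = new SumHelper();

    // Aggregate the same values through both paths into separate buffer slots.
    RowAggregator row = new RowAggregator();
    row.init(buf, 0);
    for (double v : values) {
      row.aggregate(buf, 0, v);
    }

    VectorAggregator vec = new VectorAggregator();
    vec.init(buf, SumHelper.BYTES);
    vec.aggregate(buf, SumHelper.BYTES, values, 0, values.length);

    System.out.println(reader.count(buf, 0) == reader.count(buf, SumHelper.BYTES));
    System.out.println(reader.sum(buf, 0) == reader.sum(buf, SumHelper.BYTES));
  }
}
```

Because both paths go through the same helper, a fix to the buffer logic applies to both, and tests can assert that vectorized and non-vectorized runs produce identical state.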


abhishekagarwal87 and others added 2 commits August 19, 2020 23:47

First draft

cc38ecc

Merge branch 'master' of github.com:apache/druid into histogram-vectorization

55401cd

abhishekagarwal87 commented Aug 20, 2020

Remove redundant code from FixedBucketsHistogramAggregator classes

30aa251

suneet-s added the Area - Querying and Performance labels Aug 20, 2020

Add test cases for new classes

b47c906

abhishekagarwal87 changed the title ~~WIP: Add vectorization for druid-histogram extension~~ Add vectorization for druid-histogram extension Aug 21, 2020

abhishekagarwal87 marked this pull request as ready for review August 21, 2020 15:35

abhishekagarwal87 commented Aug 21, 2020

Fix tests in sql compatible mode

81e72a5

clintropolis reviewed Aug 26, 2020

abhishekagarwal87 added 2 commits August 27, 2020 19:38

Typo fix

a294a91

Typo fix

19b2b72

Fix comment

db59ddd

Add spelling

182b610

clintropolis reviewed Sep 2, 2020

abhishekagarwal87 added 2 commits September 3, 2020 15:27

Vectorize only for supported types

44f276f

Rename internal aggregator files

09ace2d

Fix tests

adfb135

clintropolis approved these changes Sep 9, 2020

jon-wei approved these changes Sep 9, 2020

jon-wei merged commit a5c46dc into apache:master Sep 9, 2020

jon-wei added this to the 0.20.0 milestone Sep 30, 2020

jon-wei mentioned this pull request Oct 2, 2020

[DRAFT] 0.20.0 Release Notes #10462

Closed

abhishekagarwal87 mentioned this pull request Jan 15, 2021

Vectorized theta sketch aggregator + rework of VectorColumnProcessorFactory. #10767

Merged

gianm mentioned this pull request Apr 14, 2021

Vectorized versions of HllSketch aggregators. #11115

Merged
