Add additional type support for KLL Sketches #23114

ZacBlanco · 2024-07-02T06:18:00Z

Description

This PR adds support for additional types in the KLL aggregation functions. New types supported:

real
smallint
tinyint
date
time
timestamp
timestamp with timezone

Motivation and Context

More type support for histogram statistic use.

Impact

No API impact. Just additions to existing function surface

Test Plan

New tests for each supported type.

Contributor checklist

Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

If release note is NOT required, use:

== NO RELEASE NOTE ==

tdcmeehan · 2024-07-12T15:27:14Z

...main/java/com/facebook/presto/operator/aggregation/sketch/kll/KllSketchAggregationState.java

+        else if (isLongDecimal(type)) {
+            DecimalType decimalType = (DecimalType) type;
+            return new SketchParameters<>(Double::compareTo, new ArrayOfDoublesSerDe(),
+                    (Object encodedDecimal) -> new SqlDecimal(Decimals.decodeUnscaledValue((Slice) encodedDecimal), decimalType.getScale(), decimalType.getPrecision()).toBigDecimal().doubleValue());


Does this faithfully retain precision for large decimal numbers?

It does not, but neither does the optimizer when calculating min/max values on statistics. To implement this without precision loss, I'd need to implement the ArrayOfItemsSerde class from Apache Datasketches for BigDecimal, maybe called ArrayOfBigDecimalSerde or something.

My issue with this is that the optimizer statistics only work on doubles currently, so it is losing precision already with decimal types. On top of that if we actually stored the BigDecimal representation, to properly query the sketch in the optimizer you would end up going from:

Presto decimal -> BigDecimal -> double -> and convert back to BigDecimal to query the sketch.

Since we already lose precision in the optimizer converting to double, I opted to go this route for now.

The other issue with doing this "properly" is that if someone were to consume the KLL sketch produced by Presto, in order to parse it they would need to declare a dependency on presto-main (or wherever I put the ArrayOfBigDecimalSerde in order to deserialize the sketch. I could also try to make that contribution to the datasketches repo and then upgrade the dependency on the next release, but that will draw out the timeline for histograms.

While this function is intended for the optimizer, it's also a public function and may be used on its own. This seems like a correctness bug when taken out of the context of usage within the optimizer. Would it vastly affects statistics gathering if we just don't allow it for big decimal types for now?

Many filters on TPC-DS queries are on decimal types. It is quite important that we support them

Another proposal is that we create a hidden function, similar to sum_data_size_for_stats and call is kll_sketch_for_stats. This version can be hidden and support decimals, while for the sketch_kll we leave out the decimal support.

We used to encode long decimals as binary blobs, before switching them to two long integers. Could we serialize them into something that is backed by a byte array, such as a String?

The problem is with the fact that it needs to be a type for which ArrayOfItemsSerde is implemented so that it can be inserted into the sketch without precision loss. Currently we have Double, Long, and String. There is no option for BigDecimal (unless we implement it)

For continuity: after some offline discussion, we will remove decimal support from this function and incorporate a change into the statistics portion of the SPI to execute some kind of project against the input argument to an aggregation function. This way we can leave out decimal support here and just cast the inputs for decimal-type columns to doubles

...main/java/com/facebook/presto/operator/aggregation/sketch/kll/KllSketchAggregationState.java

tdcmeehan · 2024-07-12T15:35:09Z

...main/java/com/facebook/presto/operator/aggregation/sketch/kll/KllSketchAggregationState.java

+        // later
+        if (type.equals(REAL)) {
+            return new SketchParameters<>(Double::compareTo, new ArrayOfDoublesSerDe(),
+                    (Object intValue) -> (double) Float.intBitsToFloat(((Long) intValue).intValue()));


I believe the physical type for floats is an int?

Suggested change

(Object intValue) -> (double) Float.intBitsToFloat(((Long) intValue).intValue()));

(Object intValue) -> (double) Float.intBitsToFloat((int) intValue));

From what I recall when writing this: for the REAL type, the backing primitive type is a long (RealType extends AbstractIntType), but you can't perform a direct cast from Object to int, so you need to cast it to the boxed type (Long) first.

aaneja

LGTM % one nit

...src/test/java/com/facebook/presto/operator/aggregation/TestKllSketchAggregationFunction.java

This adds support just for the aggregation functions. New types supported: 1. real 2. smallint 3. tinyint 5. date 6. time 7. timestamp 8. timestamp with timezone

ZacBlanco force-pushed the upstream-kll-add-types branch from 8608262 to 7d5df16 Compare July 2, 2024 14:50

ZacBlanco mentioned this pull request Jul 2, 2024

[Iceberg] Add Histogram Statistic Support #22365

Open

6 tasks

ZacBlanco force-pushed the upstream-kll-add-types branch 2 times, most recently from 408daf2 to f4ec10c Compare July 9, 2024 21:12

ZacBlanco marked this pull request as ready for review July 11, 2024 20:00

ZacBlanco requested a review from a team as a code owner July 11, 2024 20:00

ZacBlanco requested review from presto-oss and aaneja July 11, 2024 20:00

ZacBlanco force-pushed the upstream-kll-add-types branch from f4ec10c to 6fe3db3 Compare July 11, 2024 21:49

tdcmeehan requested changes Jul 12, 2024

View reviewed changes

ZacBlanco force-pushed the upstream-kll-add-types branch from 6fe3db3 to 9183f8b Compare July 13, 2024 01:53

ZacBlanco mentioned this pull request Jul 13, 2024

Add Sql-invoked function support for column statistics #23201

Merged

6 tasks

ZacBlanco force-pushed the upstream-kll-add-types branch 3 times, most recently from afafcd4 to 9ecfaf2 Compare July 16, 2024 20:05

aaneja previously approved these changes Jul 24, 2024

View reviewed changes

...src/test/java/com/facebook/presto/operator/aggregation/TestKllSketchAggregationFunction.java Outdated Show resolved Hide resolved

...src/test/java/com/facebook/presto/operator/aggregation/TestKllSketchAggregationFunction.java Outdated Show resolved Hide resolved

tdcmeehan self-assigned this Jul 24, 2024

Add additional type support for KLL Sketches

633530e

This adds support just for the aggregation functions. New types supported: 1. real 2. smallint 3. tinyint 5. date 6. time 7. timestamp 8. timestamp with timezone

ZacBlanco dismissed aaneja’s stale review via 633530e July 24, 2024 16:22

ZacBlanco force-pushed the upstream-kll-add-types branch from 9ecfaf2 to 633530e Compare July 24, 2024 16:22

tdcmeehan approved these changes Jul 24, 2024

View reviewed changes

ZacBlanco merged commit 772213c into prestodb:master Jul 30, 2024
56 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add additional type support for KLL Sketches #23114

Add additional type support for KLL Sketches #23114

ZacBlanco commented Jul 2, 2024 •

edited

Loading

tdcmeehan Jul 12, 2024

ZacBlanco Jul 12, 2024 •

edited

Loading

tdcmeehan Jul 12, 2024

ZacBlanco Jul 12, 2024

ZacBlanco Jul 12, 2024 •

edited

Loading

tdcmeehan Jul 12, 2024

ZacBlanco Jul 12, 2024 •

edited

Loading

ZacBlanco Jul 12, 2024

tdcmeehan Jul 12, 2024

ZacBlanco Jul 12, 2024 •

edited

Loading

aaneja left a comment

	(Object intValue) -> (double) Float.intBitsToFloat(((Long) intValue).intValue()));
	(Object intValue) -> (double) Float.intBitsToFloat((int) intValue));

Add additional type support for KLL Sketches #23114

Add additional type support for KLL Sketches #23114

Conversation

ZacBlanco commented Jul 2, 2024 • edited Loading

Description

Motivation and Context

Impact

Test Plan

Contributor checklist

Release Notes

tdcmeehan Jul 12, 2024

Choose a reason for hiding this comment

ZacBlanco Jul 12, 2024 • edited Loading

Choose a reason for hiding this comment

tdcmeehan Jul 12, 2024

Choose a reason for hiding this comment

ZacBlanco Jul 12, 2024

Choose a reason for hiding this comment

ZacBlanco Jul 12, 2024 • edited Loading

Choose a reason for hiding this comment

tdcmeehan Jul 12, 2024

Choose a reason for hiding this comment

ZacBlanco Jul 12, 2024 • edited Loading

Choose a reason for hiding this comment

ZacBlanco Jul 12, 2024

Choose a reason for hiding this comment

tdcmeehan Jul 12, 2024

Choose a reason for hiding this comment

ZacBlanco Jul 12, 2024 • edited Loading

Choose a reason for hiding this comment

aaneja left a comment

Choose a reason for hiding this comment

ZacBlanco commented Jul 2, 2024 •

edited

Loading

ZacBlanco Jul 12, 2024 •

edited

Loading

ZacBlanco Jul 12, 2024 •

edited

Loading

ZacBlanco Jul 12, 2024 •

edited

Loading

ZacBlanco Jul 12, 2024 •

edited

Loading