-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add additional type support for KLL Sketches #23114
Conversation
8608262
to
7d5df16
Compare
408daf2
to
f4ec10c
Compare
f4ec10c
to
6fe3db3
Compare
else if (isLongDecimal(type)) { | ||
DecimalType decimalType = (DecimalType) type; | ||
return new SketchParameters<>(Double::compareTo, new ArrayOfDoublesSerDe(), | ||
(Object encodedDecimal) -> new SqlDecimal(Decimals.decodeUnscaledValue((Slice) encodedDecimal), decimalType.getScale(), decimalType.getPrecision()).toBigDecimal().doubleValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this faithfully retain precision for large decimal numbers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not, but neither does the optimizer when calculating min/max values on statistics. To implement this without precision loss, I'd need to implement the ArrayOfItemsSerde
class from Apache Datasketches for BigDecimal
, maybe called ArrayOfBigDecimalSerde
or something.
My issue with this is that the optimizer statistics only work on doubles currently, so it is losing precision already with decimal types. On top of that if we actually stored the BigDecimal representation, to properly query the sketch in the optimizer you would end up going from:
- Presto decimal -> BigDecimal -> double -> and convert back to BigDecimal to query the sketch.
Since we already lose precision in the optimizer converting to double, I opted to go this route for now.
The other issue with doing this "properly" is that if someone were to consume the KLL sketch produced by Presto, in order to parse it they would need to declare a dependency on presto-main
(or wherever I put the ArrayOfBigDecimalSerde
in order to deserialize the sketch. I could also try to make that contribution to the datasketches repo and then upgrade the dependency on the next release, but that will draw out the timeline for histograms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this function is intended for the optimizer, it's also a public function and may be used on its own. This seems like a correctness bug when taken out of the context of usage within the optimizer. Would it vastly affects statistics gathering if we just don't allow it for big decimal types for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many filters on TPC-DS queries are on decimal types. It is quite important that we support them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another proposal is that we create a hidden function, similar to sum_data_size_for_stats
and call is kll_sketch_for_stats
. This version can be hidden and support decimals, while for the sketch_kll
we leave out the decimal support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We used to encode long decimals as binary blobs, before switching them to two long integers. Could we serialize them into something that is backed by a byte array, such as a String?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is with the fact that it needs to be a type for which ArrayOfItemsSerde
is implemented so that it can be inserted into the sketch without precision loss. Currently we have Double
, Long
, and String
. There is no option for BigDecimal
(unless we implement it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For continuity: after some offline discussion, we will remove decimal support from this function and incorporate a change into the statistics portion of the SPI to execute some kind of project against the input argument to an aggregation function. This way we can leave out decimal support here and just cast the inputs for decimal-type columns to doubles
...main/java/com/facebook/presto/operator/aggregation/sketch/kll/KllSketchAggregationState.java
Outdated
Show resolved
Hide resolved
// later | ||
if (type.equals(REAL)) { | ||
return new SketchParameters<>(Double::compareTo, new ArrayOfDoublesSerDe(), | ||
(Object intValue) -> (double) Float.intBitsToFloat(((Long) intValue).intValue())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the physical type for floats is an int?
(Object intValue) -> (double) Float.intBitsToFloat(((Long) intValue).intValue())); | |
(Object intValue) -> (double) Float.intBitsToFloat((int) intValue)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I recall when writing this: for the REAL
type, the backing primitive type is a long
(RealType
extends AbstractIntType
), but you can't perform a direct cast from Object
to int
, so you need to cast it to the boxed type (Long
) first.
6fe3db3
to
9183f8b
Compare
afafcd4
to
9ecfaf2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % one nit
...src/test/java/com/facebook/presto/operator/aggregation/TestKllSketchAggregationFunction.java
Outdated
Show resolved
Hide resolved
...src/test/java/com/facebook/presto/operator/aggregation/TestKllSketchAggregationFunction.java
Outdated
Show resolved
Hide resolved
This adds support just for the aggregation functions. New types supported: 1. real 2. smallint 3. tinyint 5. date 6. time 7. timestamp 8. timestamp with timezone
9ecfaf2
to
633530e
Compare
Description
This PR adds support for additional types in the KLL aggregation functions. New types supported:
Motivation and Context
More type support for histogram statistic use.
Impact
No API impact. Just additions to existing function surface
Test Plan
Contributor checklist
Release Notes
If release note is NOT required, use: