Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional type support for KLL Sketches #23114

Merged
merged 1 commit into from
Jul 30, 2024

Conversation

ZacBlanco
Copy link
Contributor

@ZacBlanco ZacBlanco commented Jul 2, 2024

Description

This PR adds support for additional types in the KLL aggregation functions. New types supported:

  1. real
  2. smallint
  3. tinyint
  4. date
  5. time
  6. timestamp
  7. timestamp with timezone

Motivation and Context

More type support for histogram statistic use.

Impact

No API impact. Just additions to existing function surface

Test Plan

  • New tests for each supported type.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

If release note is NOT required, use:

== NO RELEASE NOTE ==

@ZacBlanco ZacBlanco force-pushed the upstream-kll-add-types branch 2 times, most recently from 408daf2 to f4ec10c Compare July 9, 2024 21:12
@ZacBlanco ZacBlanco marked this pull request as ready for review July 11, 2024 20:00
@ZacBlanco ZacBlanco requested a review from a team as a code owner July 11, 2024 20:00
else if (isLongDecimal(type)) {
DecimalType decimalType = (DecimalType) type;
return new SketchParameters<>(Double::compareTo, new ArrayOfDoublesSerDe(),
(Object encodedDecimal) -> new SqlDecimal(Decimals.decodeUnscaledValue((Slice) encodedDecimal), decimalType.getScale(), decimalType.getPrecision()).toBigDecimal().doubleValue());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this faithfully retain precision for large decimal numbers?

Copy link
Contributor Author

@ZacBlanco ZacBlanco Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not, but neither does the optimizer when calculating min/max values on statistics. To implement this without precision loss, I'd need to implement the ArrayOfItemsSerde class from Apache Datasketches for BigDecimal, maybe called ArrayOfBigDecimalSerde or something.

My issue with this is that the optimizer statistics only work on doubles currently, so it is losing precision already with decimal types. On top of that if we actually stored the BigDecimal representation, to properly query the sketch in the optimizer you would end up going from:

  • Presto decimal -> BigDecimal -> double -> and convert back to BigDecimal to query the sketch.

Since we already lose precision in the optimizer converting to double, I opted to go this route for now.

The other issue with doing this "properly" is that if someone were to consume the KLL sketch produced by Presto, in order to parse it they would need to declare a dependency on presto-main (or wherever I put the ArrayOfBigDecimalSerde in order to deserialize the sketch. I could also try to make that contribution to the datasketches repo and then upgrade the dependency on the next release, but that will draw out the timeline for histograms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this function is intended for the optimizer, it's also a public function and may be used on its own. This seems like a correctness bug when taken out of the context of usage within the optimizer. Would it vastly affects statistics gathering if we just don't allow it for big decimal types for now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many filters on TPC-DS queries are on decimal types. It is quite important that we support them

Copy link
Contributor Author

@ZacBlanco ZacBlanco Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another proposal is that we create a hidden function, similar to sum_data_size_for_stats and call is kll_sketch_for_stats. This version can be hidden and support decimals, while for the sketch_kll we leave out the decimal support.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to encode long decimals as binary blobs, before switching them to two long integers. Could we serialize them into something that is backed by a byte array, such as a String?

Copy link
Contributor Author

@ZacBlanco ZacBlanco Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is with the fact that it needs to be a type for which ArrayOfItemsSerde is implemented so that it can be inserted into the sketch without precision loss. Currently we have Double, Long, and String. There is no option for BigDecimal (unless we implement it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For continuity: after some offline discussion, we will remove decimal support from this function and incorporate a change into the statistics portion of the SPI to execute some kind of project against the input argument to an aggregation function. This way we can leave out decimal support here and just cast the inputs for decimal-type columns to doubles

// later
if (type.equals(REAL)) {
return new SketchParameters<>(Double::compareTo, new ArrayOfDoublesSerDe(),
(Object intValue) -> (double) Float.intBitsToFloat(((Long) intValue).intValue()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the physical type for floats is an int?

Suggested change
(Object intValue) -> (double) Float.intBitsToFloat(((Long) intValue).intValue()));
(Object intValue) -> (double) Float.intBitsToFloat((int) intValue));

Copy link
Contributor Author

@ZacBlanco ZacBlanco Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I recall when writing this: for the REAL type, the backing primitive type is a long (RealType extends AbstractIntType), but you can't perform a direct cast from Object to int, so you need to cast it to the boxed type (Long) first.

aaneja
aaneja previously approved these changes Jul 24, 2024
Copy link
Contributor

@aaneja aaneja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % one nit

@tdcmeehan tdcmeehan self-assigned this Jul 24, 2024
This adds support just for the aggregation functions.
New types supported:

1. real
2. smallint
3. tinyint
5. date
6. time
7. timestamp
8. timestamp with timezone
@ZacBlanco ZacBlanco merged commit 772213c into prestodb:master Jul 30, 2024
56 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants