-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Iceberg] Add Histogram Statistic Support #22365
base: master
Are you sure you want to change the base?
[Iceberg] Add Histogram Statistic Support #22365
Conversation
58cfe30
to
e1a2caa
Compare
Consider this suggested change for the release note entry:
|
bac6b3f
to
0e3a87c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the docs! A few nits suggested for clarity and style. Let me know what you think.
Codenotify: Notifying subscribers in CODENOTIFY files for diff 27eb666...dfe82f6. No notifications. |
3d551e9
to
65b6fe8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Pull updated branch, new local docs build, everything looks good. Thanks!
65b6fe8
to
0d6389a
Compare
0d6389a
to
24398a5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a rough glance, and bring some questions for discussion. Will take a detailed look in the next one or two days.
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergHiveMetadata.java
Outdated
Show resolved
Hide resolved
24398a5
to
56b79b0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updated doc! Two nits about formatting, no issues with the content in any of the three doc files.
56b79b0
to
492ab5f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Thanks for the quick fix! Pull updated branch, new local doc build, looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix. The changes look good to me, only some nits.
presto-iceberg/src/main/java/com/facebook/presto/iceberg/statistics/KllHistogram.java
Outdated
Show resolved
Hide resolved
...o-spi/src/test/java/com/facebook/presto/spi/statistics/TestDisjointRangeDomainHistogram.java
Outdated
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/statistics/KllHistogram.java
Outdated
Show resolved
Hide resolved
0dceef8
to
3476da9
Compare
720d06f
to
e16dfe6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Pull updated branch, new local docs build, everything looks good. Thanks!
The set of ColumnStatisticsMetadata defined by the Hive and non-Hive connectors are not equivalent. However, it is possible to collect the superset of the relevant metadata and use it for ANALYZE. The returned statistics just need to be filtered out to contain only the relevant column statistics. This may include duplicate calculations for some statistics. For example, with distinct values Iceberg puffin files can store the result of sketch_theta for distinct values, but the code path for storing the statistic in the HMS requires a direct value from approx_distinct. Thus, ANALYZE may compute a value twice.
56b02fd
to
d041947
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a short first pass
|
||
The set of statistics available for a particular query depends on the connector | ||
being used and can also vary by table or even by table layout. For example, the | ||
Hive connector does not currently provide statistics on data size. | ||
|
||
Table statistics can be displayed via the Presto SQL interface using the | ||
Table statistics can be displayed using a SQL statement with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can be fetched using the below SQL query
sounds better to me
presto-spi/src/main/java/com/facebook/presto/spi/statistics/HistogramCalculator.java
Outdated
Show resolved
Hide resolved
presto-spi/src/main/java/com/facebook/presto/spi/statistics/HistogramCalculator.java
Outdated
Show resolved
Hide resolved
} | ||
} | ||
|
||
public static ToStringHelper toStringHelper(Object self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's attribute this to Guava ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we just use Guava for all of this? We already depend on it. If there is some reason we want to fork this code, then full tests are needed. But it should be much easier just to call Guava
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The guava dependency is not included in presto-common
as the presto-common
module is a dependency of the presto-spi
module which we avoid putting additional dependencies so that when creating connectors or other plugins there it is less likely for downstream users of presto to encounter issues or conflicts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that's reasonable. However if we're forking this in, we also MUST include the tests as well, and be prepared to support this for the foreseeable future. I'm not sure it's worth the effort. I'd be inclined to just manually build the strings in the toString method like we've been doing since 1995. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we really need guava without disrupting the downstream users of the library - we could consider shading it (during the build process)? or even create a separate library e.g. presto-shaded-guava. There is an extra effort in testing and maintaining this code.
@ZacBlanco is this already considered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, let's not do that. Shading is so painful. I think I'm coming down pretty firmly on the side of don't use ToStringHelper, don't use Guava here, and just concatenate the strings inside toString. It's not like that's actually hard. This approach is way too complicated for the problem it solves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ScrapCodes, as @elharo mentioned, shading is definitely a last resort and we should avoid it as much as possible. It's not worth doing. Guava is a huge codebase and would add a lot of bloat just for a few classes. I think vendoring these small utilities are useful though
} | ||
} | ||
|
||
public static ToStringHelper toStringHelper(Object self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we just use Guava for all of this? We already depend on it. If there is some reason we want to fork this code, then full tests are needed. But it should be much easier just to call Guava
presto-common/src/main/java/com/facebook/presto/common/predicate/Marker.java
Show resolved
Hide resolved
d041947
to
eae161e
Compare
} | ||
} | ||
|
||
public static ToStringHelper toStringHelper(Object self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that's reasonable. However if we're forking this in, we also MUST include the tests as well, and be prepared to support this for the foreseeable future. I'm not sure it's worth the effort. I'd be inclined to just manually build the strings in the toString method like we've been doing since 1995. :-)
eae161e
to
ee5e3fd
Compare
presto-main/src/main/java/com/facebook/presto/sql/planner/StatisticsAggregationPlanner.java
Outdated
Show resolved
Hide resolved
presto-iceberg/src/test/java/com/facebook/presto/iceberg/statistics/TestKllHistogram.java
Outdated
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/statistics/KllHistogram.java
Outdated
Show resolved
Hide resolved
presto-iceberg/src/test/java/com/facebook/presto/iceberg/statistics/TestKllHistogram.java
Show resolved
Hide resolved
} | ||
} | ||
|
||
public static ToStringHelper toStringHelper(Object self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we really need guava without disrupting the downstream users of the library - we could consider shading it (during the build process)? or even create a separate library e.g. presto-shaded-guava. There is an extra effort in testing and maintaining this code.
@ZacBlanco is this already considered?
} | ||
} | ||
|
||
public static ToStringHelper toStringHelper(Object self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, let's not do that. Shading is so painful. I think I'm coming down pretty firmly on the side of don't use ToStringHelper, don't use Guava here, and just concatenate the strings inside toString. It's not like that's actually hard. This approach is way too complicated for the problem it solves.
ee5e3fd
to
4b9cf64
Compare
Utilizes the sketch_kll function to generate histograms and store them into the Iceberg table's puffin files for table-level statistic storage. Histograms are always collected by ANALYZE, but they are not used by the cost calculator unless enabled via optimizer.use-histograms
4b9cf64
to
dfe82f6
Compare
Description
This changes adds support for the histogram statistic type to Iceberg. There are 3 related commits
Motivation and Context
Histograms result in better accuracy for output row counts from filter nodes when calculating statistics. This should result in closer estimates in the CBO, resulting in better query plans.
Impact
Test Plan
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.