Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of parquet dictionary to domain conversion #15268

Merged
merged 3 commits into from
Dec 2, 2022

Conversation

raunaqmorarka
Copy link
Member

@raunaqmorarka raunaqmorarka commented Dec 1, 2022

Description

Improved logic for constructing domain from dictionary values in TupleDomainParquetPredicate by using the faster code path for constructing ValueSet from singleton values rather than Ranges.

Benchmark Mode Cnt Score Before Score After Units
BenchmarkTupleDomainParquetPredicate.domainFromDictionary avgt 10 354.194 ± 29.631 211.831 ± 8.793 ms/op

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Hudi, Iceberg, Delta
* Improve performance of reading parquet filter for queries with filters. ({issue}`15268`)

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Quite considerable code duplication. is it worth it?


long minValue = min instanceof Binary ? getShortDecimalValue(((Binary) min).getBytes()) : asLong(min);
long maxValue = min instanceof Binary ? getShortDecimalValue(((Binary) max).getBytes()) : asLong(max);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was a bug here, right?

can you please fix it as a prep commit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not possible for the min and max values to be of different types in the file, it's more of a consistency/readability fix

Copy link
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % @findepi comments %

SortedRangeSet#of(io.trino.spi.type.Type, java.util.Collection<?>) and SortedRangeSet#buildFromUnsortedRanges look similar (comparisons etc). Maybe it's possible to improve range based method

Provides compile time guarantee that all the cases are handled
@raunaqmorarka raunaqmorarka force-pushed the pqr-dict-domain branch 2 times, most recently from 8889bdf to a17d00d Compare December 2, 2022 08:38
@raunaqmorarka
Copy link
Member Author

raunaqmorarka commented Dec 2, 2022

lgtm % @findepi comments %

SortedRangeSet#of(io.trino.spi.type.Type, java.util.Collection<?>) and SortedRangeSet#buildFromUnsortedRanges look similar (comparisons etc). Maybe it's possible to improve range based method

I looked into what is slow here
Before
Screenshot 2022-12-02 at 12 51 23 PM

After
Screenshot 2022-12-02 at 12 59 49 PM

The problem is that Range#range performs a lookup of comparison operator for the type. This gets repeated unnecessarily for every range object.
Same problem was addressed in #10239 as well.
I've changed the approach here to use a SortedRanges builder which caches comparison operator and avoids that bottleneck. While this is still not exactly as fast the previous approach it's still significantly faster, far less code and helps multiple code paths potentially, so I've simplified the PR now.

{
private final Type type;
private final MethodHandle rangeComparisonOperator;
private final List<Range> ranges;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We could probably optimize it further by using BlockBuilder directly

Improved logic for constructing domain from dictionary values in
TupleDomainParquetPredicate by providing builder in SortedRangeSet
which caches comparisonOperator and re-uses it for Range construction

Benchmark                                                  Mode  Cnt  Score Before      Score After      Units
BenchmarkTupleDomainParquetPredicate.domainFromDictionary  avgt   10  354.194 ± 29.631  246.769 ± 12.390 ms/op
@raunaqmorarka raunaqmorarka merged commit 56936ba into trinodb:master Dec 2, 2022
@raunaqmorarka raunaqmorarka deleted the pqr-dict-domain branch December 2, 2022 17:24
@github-actions github-actions bot added this to the 404 milestone Dec 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants