-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow merging consecutive 'long' values in SortedRangeSet #3316
Conversation
presto-spi/src/main/java/io/prestosql/spi/predicate/Marker.java
Outdated
Show resolved
Hide resolved
presto-spi/src/main/java/io/prestosql/spi/predicate/SortedRangeSet.java
Outdated
Show resolved
Hide resolved
f5210e6
to
1f57c13
Compare
It also seems that statistics estimation is different after rewriting the query:
For small
|
I haven't looked at the plans, but is the new plan "better"? |
That seems like a bug (or inconsistency) in the cost estimator. For integer types, it's clearly equivalent. |
@martint there is no logic impl'd which could realize these are equivalent. Of course, something to improve, so the obvious question would be -- how big of a problem this is, how urgent. |
Given that plans changes, we probably should benchmark this PR |
I have ran the query on
SELECT
"ss_ticket_number"
, "ss_customer_sk"
, "ca_city" "bought_city"
, "sum"("ss_coupon_amt") "amt"
, "sum"("ss_net_profit") "profit"
FROM
store_sales
, date_dim
, store
, household_demographics
, customer_address
WHERE ("store_sales"."ss_sold_date_sk" = "date_dim"."d_date_sk")
AND ("store_sales"."ss_store_sk" = "store"."s_store_sk")
AND ("store_sales"."ss_hdemo_sk" = "household_demographics"."hd_demo_sk")
AND ("store_sales"."ss_addr_sk" = "customer_address"."ca_address_sk")
AND (("household_demographics"."hd_dep_count" = 4)
OR ("household_demographics"."hd_vehicle_count" = 3))
AND ("date_dim"."d_dow" IN (6 , 0))
AND ("date_dim"."d_year" IN (1999 , (1999 + 1) , (1999 + 2)))
AND ("store"."s_city" IN ('Fairview' , 'Midway' , 'Fairview' , 'Fairview' , 'Fairview'))
GROUP BY "ss_ticket_number", "ss_customer_sk", "ss_addr_sk", "ca_city";
SELECT
"ss_ticket_number"
, "ss_customer_sk"
, "ca_city" "bought_city"
, "sum"("ss_ext_sales_price") "extended_price"
, "sum"("ss_ext_list_price") "list_price"
, "sum"("ss_ext_tax") "extended_tax"
FROM
store_sales
, date_dim
, store
, household_demographics
, customer_address
WHERE ("store_sales"."ss_sold_date_sk" = "date_dim"."d_date_sk")
AND ("store_sales"."ss_store_sk" = "store"."s_store_sk")
AND ("store_sales"."ss_hdemo_sk" = "household_demographics"."hd_demo_sk")
AND ("store_sales"."ss_addr_sk" = "customer_address"."ca_address_sk")
AND ("date_dim"."d_dom" BETWEEN 1 AND 2)
AND (("household_demographics"."hd_dep_count" = 4)
OR ("household_demographics"."hd_vehicle_count" = 3))
AND ("date_dim"."d_year" IN (1999 , (1999 + 1) , (1999 + 2)))
AND ("store"."s_city" IN ('Midway' , 'Fairview'));
SELECT
"c_last_name"
, "c_first_name"
, "ca_city"
FROM
customer
, customer_address
WHERE ("c_current_addr_sk" = "ca_address_sk"); |
Squashed and rebased over the latest |
@martint is it sill LGTM? I wanted to run benchmarks on this |
Yup, please do |
Benchmarks. However, I see two problems with this PR:
|
Many thanks for the benchmarking! |
Would that mean we have a regression until those new PRs are done & merged? |
No, it would be indeed undesirable. |
Here is another benchmark. Benchmarks comparison-merge.pdf
Good idea |
9e42c28
to
63fe21b
Compare
Rebased over the latest version of #9868. |
8d533f0
to
6370392
Compare
6370392
to
0a6bcb1
Compare
0a6bcb1
to
007326a
Compare
007326a
to
cf445bf
Compare
cf445bf
to
63e9750
Compare
Rebased over latest #9868. |
63e9750
to
04bcad8
Compare
#9868 is merged - please take a look at this PR :) |
04bcad8
to
70838f9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like current functionality requires multiple flags to unlock.
We will still fall back to min/max DF collection by default when there are many adjacent distinct values. So enable-large-dynamic-filters
also has to be enabled to avoid that.
In the current implementation of DFs, connectors already receive the full Domain without simplification. We'll save some memory and network communication costs with a more compact representation in Domain. But are these efficiency gains the main benefit or is this more about preventing Domain#simplify
call in connectors from loosing the granularity of information in the received Domain ?
Would these changes make it worthwhile to revise the default DF thresholds in DynamicFilterConfig to higher values ?
core/trino-main/src/main/java/io/trino/operator/DynamicFilterSourceOperator.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/test/java/io/trino/operator/TestDynamicFilterSourceOperator.java
Outdated
Show resolved
Hide resolved
/** | ||
* Try to return a more compact representation (if possible). | ||
*/ | ||
default ValueSet compact() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concept of compacting ranges seems very specific to SortedRangeSet and not easily applicable to every type of ValueSet.
So it seems more suitable to put this in Ranges
interface and use it through ValuesProcessor
where needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concept of compacting ranges seems very specific to SortedRangeSet
agreed
note however, that the interface
/**
* Try to return a more compact representation (if possible).
*/
ValueSet compact();
is not range-specific and could live in ValueSet.
Current name is getCompactedRanges
, but it still is ValueSet
-breaking operation, which could be potentially extended to other ValueSet types.
I think the important question to ask is what "compact" actually means?
- occupies less memory? (eg SortedRangeSet uses dictionary for singleton-ranges, but only if constructed explicitly as such)
- in the first case, it's not ranges-specific
- fewer ranges?
- in the second case, it's ranges specific, but then why does it return a ValueSet, losing information it's ranges?
private static boolean areConsecutive(Type type, Object low, Object high) | ||
{ | ||
return type | ||
.getDiscreteValues(new Type.Range(low, high)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use comparison operator to detect adjacent values the way tryMergeWithNext
does ?
That would avoid relying on getDiscreteValues
which isn't implemented for many types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure - IIUC, tryMergeWithNext()
relies on the ranges sharing the low/high bound values, i.e. [1,5]
and (5,9]
can be merged into [1,9]
.
However, if we try to merge [1,5]
and [6,9]
, we should make sure that 5 and 6 are "consecutive" (i.e. there is no other value between them, and this requires calling Type#getDiscreteValues
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While working on Iceberg I had a need to find an adjacent value (not check whether two values are adjacent).
Here is my API proposal for that: #12797
Here, it seems we don't need new methods on Type, so I'd hold off and not add them.
70838f9
to
6814b56
Compare
Sorry for the delayed response - fixed the issues, and updated the PR.
We would prefer to not to call |
5a77f95
to
5ff4c69
Compare
core/trino-main/src/test/java/io/trino/operator/TestDynamicFilterSourceOperator.java
Outdated
Show resolved
Hide resolved
core/trino-spi/src/main/java/io/trino/spi/predicate/SortedRangeSet.java
Outdated
Show resolved
Hide resolved
private static boolean areConsecutive(Type type, Object low, Object high) | ||
{ | ||
return type | ||
.getDiscreteValues(new Type.Range(low, high)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Try to return a more compact representation (if possible). | ||
*/ | ||
ValueSet getCompactedRanges(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document whether the compacted version is supported to
- be equal (as in
Object.equals
) - contain same information, or can eg be slightly lossy (widening) for more efficient compaction
- what does is returned when compaction is not possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, higher level
- why is this a method on
Ranges
instead of onValueSet
? - why do we want to have such method at all? I would hope the
ValueSet
instance is "as compact as possible", always
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document whether the compacted version is supported to
Sounds good - done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this a method on Ranges instead of on ValueSet?
Following #3316 (comment) comment.
why do we want to have such method at all? I would hope the ValueSet instance is "as compact as possible", always
I agree, but such compactions may cause CBO to change the plan (since we estimate the statistics differently depending on the predicate being a range or a discrete set of values), so I preferred to keep the existing code behavior - and add predicate compaction only to DF-related code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but such compactions may cause CBO to change the plan (since we estimate the statistics differently depending on the predicate being a range or a discrete set of values)
Is this something that could be improved on?
cc @sopel39
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something that could be improved on?
IMO it should be, I don't think there should be code duality (DF vs CBO)
Enable it for integral types, to be used in DF.
5ff4c69
to
23d6a6e
Compare
👋 @rzeyde-varada - this PR has become inactive. If you're still interested in working on it, please let us know, and we can try to get reviewers to help with that. We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks. |
This can be useful when pushing down TupleDomains generated by dynamic filtering.
Fixes #6076