fix `push_down_filter` for pushing filters on grouping columns rather than aggregate columns #4447

jackwener · 2022-12-01T01:01:37Z

Which issue does this PR close?

Closes #4401.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jackwener · 2022-12-01T01:05:03Z

datafusion/optimizer/src/push_down_filter.rs

-                        || !columns
-                            .intersection(&used_columns)
-                            .collect::<HashSet<_>>()
-                            .is_empty()


The original code performed poorly: it collected the intersection into a new `HashSet` only to check whether it was empty.
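For illustration, here is a minimal sketch of the difference, assuming both sides are plain `std::collections::HashSet<String>` values. The function names are hypothetical and are not taken from the PR; the actual replacement code in `push_down_filter.rs` may look different.

```rust
use std::collections::HashSet;

/// Allocating version (the shape of the removed code): builds a throwaway
/// set just to ask whether it is empty.
fn overlaps_with_collect(columns: &HashSet<String>, used_columns: &HashSet<String>) -> bool {
    !columns
        .intersection(used_columns)
        .collect::<HashSet<_>>()
        .is_empty()
}

/// Non-allocating version: `is_disjoint` returns as soon as it finds one
/// shared element and never materializes the intersection.
fn overlaps_without_collect(columns: &HashSet<String>, used_columns: &HashSet<String>) -> bool {
    !columns.is_disjoint(used_columns)
}
```

Both functions return the same answer for any pair of sets; only the second avoids the intermediate allocation.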

As I mentioned in the last PR, I think we do not need to check the aggregate Exprs, only the group by Exprs. In some cases the same column appears in both the aggregate Exprs and the group by Exprs, for example `select count(distinct col_a), col_a from table group by col_a;`. If a Filter is applied to col_a, it can still be pushed down even though col_a is also referenced by the aggregate Exprs.

The logic should check that all columns used by the Filter predicate are a subset of the output columns of the group by Exprs.

Yes. When push_down_filter pushes through an Agg, we can push down any Expr that appears in groupby_expr.
I have added it.
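The subset rule discussed in this thread can be sketched as follows. This is a simplified illustration on plain column-name sets, not the code in `push_down_filter.rs`; the function name and the use of `HashSet<String>` are assumptions.

```rust
use std::collections::HashSet;

/// Returns true when every column referenced by the filter predicate is
/// produced by a group-by expression. Whether a column is also used inside
/// an aggregate expression is deliberately not checked.
fn can_push_through_aggregate(
    predicate_columns: &HashSet<String>,
    group_by_columns: &HashSet<String>,
) -> bool {
    predicate_columns.is_subset(group_by_columns)
}
```

With the query from the comment above, a filter on `col_a` has predicate columns `{col_a}` and group-by columns `{col_a}`, so it is pushable even though `col_a` is also referenced by `count(distinct col_a)`.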

datafusion/optimizer/src/push_down_filter.rs

liukun4515 · 2022-12-01T05:33:35Z

datafusion/optimizer/src/push_down_filter.rs

@@ -910,11 +922,9 @@ mod tests {
        // rewrite to CNF
        // (c = 1 OR c = 1) [can pushDown] AND (c = 1 OR b > 3) AND (b > 2 OR C = 1) AND (b > 2 OR b > 3)

-        let expected = "\
-        Filter: (test.c = Int64(1) OR b > Int64(3)) AND (b > Int64(2) OR test.c = Int64(1)) AND (b > Int64(2) OR b > Int64(3))\
+        let expected = "Filter: (test.c = Int64(1) OR test.c = Int64(1)) AND (test.c = Int64(1) OR b > Int64(3)) AND (b > Int64(2) OR test.c = Int64(1)) AND (b > Int64(2) OR b > Int64(3))\


cc @Ted-Jiang

jackwener · 2022-12-01T05:34:54Z

datafusion/optimizer/src/push_down_filter.rs

-        let expected = "\
-        Filter: (test.c = Int64(1) OR b > Int64(3)) AND (b > Int64(2) OR test.c = Int64(1)) AND (b > Int64(2) OR b > Int64(3))\
+        let expected = "Filter: (test.c = Int64(1) OR test.c = Int64(1)) AND (test.c = Int64(1) OR b > Int64(3)) AND (b > Int64(2) OR test.c = Int64(1)) AND (b > Int64(2) OR b > Int64(3))\
        \n  Aggregate: groupBy=[[test.a]], aggr=[[SUM(test.b) AS b]]\
-        \n    Filter: test.c = Int64(1) OR test.c = Int64(1)\
-        \n      TableScan: test";


The original plan is wrong. 😂
I think we need to delete this incorrect UT.

The Filter includes a column that is not in the output of the Aggregate.

Yes, it was my fault.
col_c should not appear in the filter. The test needs to be deleted 😂
@jackwener

mingmwang · 2022-12-02T05:30:11Z

LGTM.

jackwener · 2022-12-03T01:25:04Z

@alamb @Dandandan PTAL

alamb

Looks great to me -- thank you @jackwener

alamb · 2022-12-03T10:42:24Z

datafusion/optimizer/src/push_down_filter.rs

                    }
                }

-                let child = match conjunction(push_predicates) {
+                // As for plan Filter: Column(a+b) > 0 -- Agg: groupby:[Column(a)+Column(b)]


Nice -- this is getting quite sophisticated.

alamb · 2022-12-03T10:43:19Z

datafusion/optimizer/src/push_down_filter.rs

+                // So we need create a replace_map, add {`a+b` --> Expr(Column(a)+Column(b))}
+                let mut replace_map = HashMap::new();
+                for expr in &agg.group_expr {
+                    replace_map.insert(expr.display_name()?, expr.clone());


Double checked that display_name is the right one: https://docs.rs/datafusion/14.0.0/datafusion/prelude/enum.Expr.html#method.display_name 👍
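To make the `replace_map` idea concrete, here is a self-contained sketch using a toy expression type. `ToyExpr`, `build_replace_map`, and `replace_columns` are hypothetical names invented for this illustration; they only mimic the role that `Expr::display_name` and the map play in the diff above and are not DataFusion APIs.

```rust
use std::collections::HashMap;

/// Toy expression type standing in for a planner's `Expr` (hypothetical).
#[derive(Clone, Debug, PartialEq)]
enum ToyExpr {
    Column(String),
    Literal(i64),
    Add(Box<ToyExpr>, Box<ToyExpr>),
    Gt(Box<ToyExpr>, Box<ToyExpr>),
}

impl ToyExpr {
    /// Name under which the expression appears in an operator's output.
    fn display_name(&self) -> String {
        match self {
            ToyExpr::Column(name) => name.clone(),
            ToyExpr::Literal(v) => v.to_string(),
            ToyExpr::Add(l, r) => format!("{} + {}", l.display_name(), r.display_name()),
            ToyExpr::Gt(l, r) => format!("{} > {}", l.display_name(), r.display_name()),
        }
    }
}

/// Maps each group-by expression's output name back to the expression itself.
fn build_replace_map(group_expr: &[ToyExpr]) -> HashMap<String, ToyExpr> {
    group_expr
        .iter()
        .map(|e| (e.display_name(), e.clone()))
        .collect()
}

/// Rewrites a predicate written against the Aggregate's output columns so
/// that it refers to the Aggregate's input expressions instead.
fn replace_columns(expr: &ToyExpr, map: &HashMap<String, ToyExpr>) -> ToyExpr {
    match expr {
        ToyExpr::Column(name) => map.get(name).cloned().unwrap_or_else(|| expr.clone()),
        ToyExpr::Literal(_) => expr.clone(),
        ToyExpr::Add(l, r) => ToyExpr::Add(
            Box::new(replace_columns(l, map)),
            Box::new(replace_columns(r, map)),
        ),
        ToyExpr::Gt(l, r) => ToyExpr::Gt(
            Box::new(replace_columns(l, map)),
            Box::new(replace_columns(r, map)),
        ),
    }
}
```

In this toy model, a filter `Column("a + b") > 0` above an aggregate grouped by `Column(a) + Column(b)` is rewritten to `Column(a) + Column(b) > 0`, which can then be evaluated below the aggregate.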

alamb · 2022-12-03T10:43:48Z

datafusion/optimizer/src/push_down_filter.rs

-            \n    TableScan: test";
+        let expected =
+            "Aggregate: groupBy=[[test.b + test.a]], aggr=[[SUM(test.a), test.b]]\
+        \n  Filter: test.b + test.a > Int64(10)\


👍 very nice

alamb · 2022-12-03T10:44:19Z

datafusion/optimizer/src/push_down_filter.rs

-        \n  Aggregate: groupBy=[[test.a]], aggr=[[SUM(test.b) AS b]]\
-        \n    Filter: test.c = Int64(1) OR test.c = Int64(1)\
-        \n      TableScan: test";
+            Filter: b > Int64(10)\


I agree this new plan is correct

alamb · 2022-12-03T10:44:57Z

datafusion/optimizer/tests/integration-test.rs

+    let expected = "Projection: c, COUNT(UInt8(1))\
+    \n  Projection: test.col_int32 + test.col_uint32 AS c, COUNT(UInt8(1))\
+    \n    Aggregate: groupBy=[[test.col_int32 + CAST(test.col_uint32 AS Int32)]], aggr=[[COUNT(UInt8(1))]]\
+    \n      Filter: test.col_int32 + CAST(test.col_uint32 AS Int32) > Int32(3)\


ursabot · 2022-12-03T10:51:36Z

Benchmark runs are scheduled for baseline = 8db99d2 and contender = 0509692. 0509692 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the optimizer Optimizer rules label Dec 1, 2022

jackwener mentioned this pull request Dec 1, 2022

pyarrow CI failed #4448

Closed

mingmwang reviewed Dec 1, 2022

jackwener added 4 commits December 1, 2022 12:46

fix push_down_filter push column instead of Expr.

c75e71c

remove collect to avoid performance loss

018b9a5

add UT

992f3af

enhance filter push through agg

80c025b

jackwener force-pushed the fix_bug branch from c61e06d to 80c025b Compare December 1, 2022 04:46

jackwener added 2 commits December 1, 2022 12:50

add comment

e51daa2

polish

c9c89c5

liukun4515 reviewed Dec 1, 2022

remove wrong UT.

f9a7072

alamb approved these changes Dec 3, 2022

alamb changed the title ~~fix push_down_filter push Expr instead of column.~~ fix push_down_filter for pushing filters on grouping columns rather than aggregate columns Dec 3, 2022

alamb merged commit 0509692 into apache:master Dec 3, 2022

jackwener deleted the fix_bug branch December 6, 2022 15:18
