Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conflicting optimization rules: common_sub_expression_eliminate and push_down_projection #8296

Closed
jonahgao opened this issue Nov 22, 2023 · 6 comments · Fixed by #8340
Closed
Labels
bug Something isn't working

Comments

@jonahgao
Copy link
Member

jonahgao commented Nov 22, 2023

Describe the bug

Rule push_down_projection negates the optimization results of common_sub_expression_eliminate, but leaves the useless aliases introduced by common_sub_expression_eliminate. Those added aliases change the signature of the logical plan, also causing the optimizer to never reach the fixed point.

In summary, there are two problems:

  1. The optimization of common_sub_expression_elimination did not work.
  2. Ineffective optimization leads to the optimizer not being able to exit early until it reaches the limit of datafusion.optimizer.max_passes.

To Reproduce

Run explain verbose select a/2, a/2 + 1 from t in CLI

DataFusion CLI v33.0.0

❯ create table t(a bigint);
0 rows in set. Query took 0.006 seconds.

❯ explain verbose select a/2, a/2 + 1 from t;
| logical_plan after common_sub_expression_eliminate         | Projection: t.a / Int64(2)Int64(2)t.a AS t.a / Int64(2), t.a / Int64(2)Int64(2)t.a AS t.a / Int64(2) + Int64(1)                                                                         |
|                                                            |   Projection: t.a / Int64(2) AS t.a / Int64(2)Int64(2)t.a, t.a                                                                                                                          |
|                                                            |     TableScan: t
| ...
|
| logical_plan after push_down_projection                    | Projection: t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2), t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) + Int64(1)                                                           |
|                                                            |   TableScan: t projection=[a]

❯ explain select a/2, a/2 + 1 from t;
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                              |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2), t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) + Int64(1) |
|               |   TableScan: t projection=[a]                                                                                                                                     |
| physical_plan | ProjectionExec: expr=[a@0 / 2 as t.a / Int64(2), a@0 / 2 + 1 as t.a / Int64(2) + Int64(1)]                                                                        |
|               |   MemoryExec: partitions=1, partition_sizes=[0]                                                                                                                   |
|               |                                                                                                                                                                   |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+

common_sub_expression_eliminate generates a new child projection which includes common expressions under the original projection, and push_down_projection merges them into one.

The final projection is:

  • a/2: t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2)
  • a/2+1: t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) + Int64(1)

Three duplicate aliases appeared, corresponding to datafusion.optimizer.max_passes=3.

If I set datafusion.optimizer.max_passes=10

set datafusion.optimizer.max_passes=10;
0 rows in set. Query took 0.002 seconds.

❯ explain select a/2, a/2 + 1 from t;
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                          |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2), t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) + Int64(1) |
|               |   TableScan: t projection=[a]                                                                                                                                                                                                                                                                                                                                                                                                 |
| physical_plan | ProjectionExec: expr=[a@0 / 2 as t.a / Int64(2), a@0 / 2 + 1 as t.a / Int64(2) + Int64(1)]                                                                                                                                                                                                                                                                                                                                    |
|               |   MemoryExec: partitions=1, partition_sizes=[0]                                                                                                                                                                                                                                                                                                                                                                               |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                               |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.025 seconds.

One of the expr in the final projection will be t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2).
There are too many unnecessary aliases.

Expected behavior

The logic merge_projection inside the rule push_down_projection should not undo common_sub_expression_eliminate.

Additional context

I plan to work on this in the next few days.

My initial thought is to fix it in the push_down_projection side.

Do not execute merge_projection if an expression in the child projection satisfies the following conditions:

  1. It has been referenced by the parent projection two or more times. Keep the optimization effect of common_sub_expression_eliminate.
  2. Its evaluation is non-trivial. Its type is not Column or Literal.
@jonahgao jonahgao added the bug Something isn't working label Nov 22, 2023
@mustafasrepo
Copy link
Contributor

@jonahgao I am working on a PR to optimize projections. (It is almost ready, in our interval review process. You can find it here). I think this PR will solve this problem. (I didn't verify it though). I suggest you to try whether that PR solves this issue. If that is the case, it should be ready in couple of days for upstream. In this way, we wouldn't duplicate efforts.

@jonahgao
Copy link
Member Author

jonahgao commented Nov 22, 2023

Thank you @mustafasrepo I have tested on this branch, and it looks really nice.

❯ explain  verbose select a/2, a/2+1 from t;
| logical_plan after common_sub_expression_eliminate         | Projection: t.a / Int64(2)Int64(2)t.a AS t.a / Int64(2), t.a / Int64(2)Int64(2)t.a AS t.a / Int64(2) + Int64(1)                                                                                        |
|                                                            |   Projection: t.a / Int64(2) AS t.a / Int64(2)Int64(2)t.a, t.a                                                                                                                                         |
|                                                            |     TableScan: t
| logical_plan after optimize_projections                    | Projection: t.a / Int64(2)Int64(2)t.a AS t.a / Int64(2), t.a / Int64(2)Int64(2)t.a AS t.a / Int64(2) + Int64(1)                                                                                        |
|                                                            |   Projection: t.a / Int64(2) AS t.a / Int64(2)Int64(2)t.a                                                                                                                                              |
|                                                            |     TableScan: t projection=[a]
| logical_plan after merge_projection                        | Projection: t.a / Int64(2) AS t.a / Int64(2), t.a / Int64(2) AS t.a / Int64(2) + Int64(1)                                                                                                              |
|                                                            |   TableScan: t projection=[a]

The new optimize_projections rule no longer undo the optimization of common_sub_expression_eliminate.

However, the merge_projection rule still causes the same problem, because it contains the same merge logic as the old push_down_projection rule. The merge logic is the root cause.

I’m wondering if the new OptimizeProjections rule can also cover the functionality of MergeProjection. I believe that, in this way, we can solve this problem. Or wait until your PR is merged, and then reconsider this issue.

@mustafasrepo
Copy link
Contributor

Thank you @mustafasrepo I have tested on this branch, and it looks really nice.

❯ explain  verbose select a/2, a/2+1 from t;
| logical_plan after common_sub_expression_eliminate         | Projection: t.a / Int64(2)Int64(2)t.a AS t.a / Int64(2), t.a / Int64(2)Int64(2)t.a AS t.a / Int64(2) + Int64(1)                                                                                        |
|                                                            |   Projection: t.a / Int64(2) AS t.a / Int64(2)Int64(2)t.a, t.a                                                                                                                                         |
|                                                            |     TableScan: t
| logical_plan after optimize_projections                    | Projection: t.a / Int64(2)Int64(2)t.a AS t.a / Int64(2), t.a / Int64(2)Int64(2)t.a AS t.a / Int64(2) + Int64(1)                                                                                        |
|                                                            |   Projection: t.a / Int64(2) AS t.a / Int64(2)Int64(2)t.a                                                                                                                                              |
|                                                            |     TableScan: t projection=[a]
| logical_plan after merge_projection                        | Projection: t.a / Int64(2) AS t.a / Int64(2), t.a / Int64(2) AS t.a / Int64(2) + Int64(1)                                                                                                              |
|                                                            |   TableScan: t projection=[a]

The new optimize_projections rule no longer undo the optimization of common_sub_expression_eliminate.

However, the merge_projection rule still causes the same problem, because it contains the same merge logic as the old push_down_projection rule. The merge logic is the root cause.

I’m wondering if the new OptimizeProjections rule can also cover the functionality of MergeProjection. I believe that, in this way, we can solve this problem. Or wait until your PR is merged, and then reconsider this issue.

I think we can cover merge functionality in the new rule also. I will try it out, then let you know about the result.

@mustafasrepo
Copy link
Contributor

I have examined how we can support merge_projections under this PR. It seems that we can support this feature, with this PR all of the optimizations related to the Projection is handled with a single rule.

Also as part of that PR, I have changed projection merge logic, such that it doesn't always merges consecutive projections(when previous projection is used to cache complex computation for subsequent projection). With this support, the problem in this issue is resolved also. See test

@jonahgao
Copy link
Member Author

I have examined how we can support merge_projections under this PR. It seems that we can support this feature, with this PR all of the optimizations related to the Projection is handled with a single rule.

Also as part of that PR, I have changed projection merge logic, such that it doesn't always merges consecutive projections(when previous projection is used to cache complex computation for subsequent projection). With this support, the problem in this issue is resolved also. See test

A great job 🎉 !
I will leave this issue here, waiting for your PR to be merged.

@mustafasrepo
Copy link
Contributor

@jonahgao I have opened the PR in upstream. You can find the upstream PR here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants