docs/design: add proposal docs for constraint propagation enhancement #7648

Merged (5 commits) on Sep 25, 2018

bb7133 (Member) commented Sep 9, 2018

What problem does this PR solve?

Add description for the proposal of constraint propagation enhancement

What is changed and how it works?

This is a doc change

Check List

Tests

  • No code

Code changes

  • No code change

Side effects

  • No

Related changes

bb7133 (Member Author) commented Sep 9, 2018

hi @zz-jason @shenli , PTAL, thanks!

@shenli added the component/docs and contribution labels Sep 9, 2018

shenli (Member) commented Sep 9, 2018

@bb7133 Great! Thanks!

shenli (Member) commented Sep 9, 2018

Related to #7098

shenli (Member) commented Sep 9, 2018

@winoros @eurekaka PTAL


1. Find `column = constant` expression and substitue the constant for column, as well as try to fold the substituted constant expression if possible, for example:

Given `a = b and a = 2 and b = 3`, it becomes `2 = 3` after substitution and lead to a final `false` constant.
Contributor

s/lead/leads
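
For readers following the review, the substitute-and-fold rule quoted in this hunk can be sketched in a few lines of Python. This is an illustrative model only, not TiDB's `propagateConstantSolver`: the name `propagate_constants`, the helper `is_column`, and the tuple encoding of equalities are assumptions made for this sketch, and NULL handling is ignored.

```python
def is_column(operand):
    # Assumption for this sketch: columns are strings, constants are ints.
    return isinstance(operand, str)


def propagate_constants(conjuncts):
    """Fold a conjunction of equalities given as (left, right) pairs.

    Returns False when the conjunction folds to a contradiction,
    otherwise the rewritten list of equalities.
    """
    bindings = {}
    for left, right in conjuncts:
        if is_column(left) and not is_column(right):
            column, constant = left, right
        elif is_column(right) and not is_column(left):
            column, constant = right, left
        else:
            continue
        # Two different constants for one column, e.g. a = 2 and a = 3.
        if bindings.setdefault(column, constant) != constant:
            return False

    rewritten = list(bindings.items())  # keep each `column = constant`
    for left, right in conjuncts:
        if is_column(left) != is_column(right):
            continue  # already kept as a binding
        left = bindings.get(left, left) if is_column(left) else left
        right = bindings.get(right, right) if is_column(right) else right
        if not is_column(left) and not is_column(right):
            if left != right:
                return False  # e.g. 2 = 3
            continue  # e.g. 2 = 2 is always true and can be dropped
        rewritten.append((left, right))
    return rewritten
```

With `a = b and a = 2 and b = 3` encoded as `[("a", "b"), ("a", 2), ("b", 3)]`, the sketch returns `False`, matching the example in the quoted line.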

* `ast.GE('>=')`
* `ast.NE('!=')`

The `propagateConstantSolver` makes more detailed/explicit filters/constraints, which can be used within other optimization rule. For example, in predicate-pushdown optimization, it generates more predicates that can be pushed closer to the data source(TiKV), and thus reduce the amount of data in the whole data path.
Contributor

s/reduce/reduces


NOTE: in this case, `t1.a in (12, 13)` works on the result of the outer join and we have pushed it down to the outer table.

But we can further push this filter down to the inner table, since only the the records satisfy `t2.a in (12, 13)` could make join predicate `t1.a = t2.a` be positive in the join operator. So we can optimize this query to:
Contributor

s/satisfy/satisfied ?
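
The claim in this hunk (the filter can also be applied to the inner table of the outer join) can be checked on toy data. The sketch below assumes the query has the shape `select * from t1 left join t2 on t1.a = t2.a where t1.a in (12, 13)`, which is not shown in this excerpt; the function names and sample rows are hypothetical, and `None` stands in for SQL NULL.

```python
def left_outer_join(outer_rows, inner_rows):
    """Left outer join on `outer = inner` for single-column rows.

    None models SQL NULL, which never satisfies the join predicate.
    """
    joined = []
    for outer in outer_rows:
        matches = [
            inner for inner in inner_rows
            if outer is not None and inner is not None and outer == inner
        ]
        if matches:
            joined.extend((outer, inner) for inner in matches)
        else:
            joined.append((outer, None))  # unmatched outer row is NULL-padded
    return joined


def filter_after_join(t1, t2, wanted):
    # Baseline: apply `t1.a in wanted` to the join result.
    return [row for row in left_outer_join(t1, t2) if row[0] in wanted]


def filter_pushed_to_both_sides(t1, t2, wanted):
    # Optimized: filter the outer table, and also the inner table.
    return left_outer_join(
        [a for a in t1 if a in wanted],
        [a for a in t2 if a in wanted],
    )


t1 = [11, 12, 13, 14, None]
t2 = [12, 12, 14, 15, None]
assert filter_after_join(t1, t2, (12, 13)) == filter_pushed_to_both_sides(t1, t2, (12, 13))
```

Both plans return the same rows on this sample, while the second one feeds fewer inner rows into the join. This is a spot check on one data set, not a proof of equivalence.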


## Rationale

Constraint propagation is commonly used as logcial plan optimization in traditional databases, for example, [this doc](https://dev.mysql.com/doc/internals/en/optimizer-constant-propagation.html) explained some details of constant propagtions in MySQL. It is is also widely adopted in distributed analytical engines, like Apache Hive, Apache SparkSQL, Apache Impala, et al, those engines usually query on huge amount of data.
Contributor

s/It is is also widely adopted in distributed analytical engines, like Apache Hive, Apache SparkSQL, Apache Impala, et al, those engines usually query on huge amount of data./It is also widely adopted in distributed analytical engines, like Apache Hive, Apache SparkSQL, Apache Impala, etc. Those engines usually query on huge amount of data.

Contributor

Constraint propagation is commonly used as logical plan optimization in traditional databases. For example, this doc explains some details of constant propagations in MySQL. It is also widely adopted in distributed analytical engines, like Apache Hive, Apache SparkSQL, Apache Impala, et al. Those engines usually query on a huge amount of data.


### Advantages:

Constraint propagation brings more detailed, explicit constraints to each data source involved in a query. With those constraints we can filter data as early as possible, and thus reduce disc/network IO and computational overheads during the execution of a query. In TiDB, most propagated filters can be pushed down to the storage level(TiKV) as a coprocessor task, and lead to the following benefits:
Contributor

s/lead/leads or s/lead/can lead

Contributor

Constraint propagation brings more detailed, explicit constraints to each data source involved in a query. With those constraints, we can filter data as early as possible, and thus reduce disc/network I/O and computational overheads during the execution of a query. In TiDB, most propagated filters can be pushed down to the storage level (TiKV) as a Coprocessor task, and lead to the following benefits:

Member Author

Thanks for all tips above for the grammar!


Constraint propagation brings more detailed, explicit constraints to each data source involved in a query. With those constraints we can filter data as early as possible, and thus reduce disc/network IO and computational overheads during the execution of a query. In TiDB, most propagated filters can be pushed down to the storage level(TiKV) as a coprocessor task, and lead to the following benefits:

* Apply the filters at each TiKV instance, which make the calculation distributed.
Contributor

Apply the filters at each TiKV instance, which make the calculation distributed.

This may not be accurate: we actually push Coprocessor tasks to Regions rather than to TiKV instances. The benefit itself is still valid, though.

Member Author

Got your point!


* Apply the filters at each TiKV instance, which make the calculation distributed.

* When loading data, skip some table partitions if its data range doesn't pass the filter
Contributor

some table or partitions? Those are two different things.

Contributor

When loading data, skip some table partition if its data range doesn't pass the filter?

Member Author

This was ambiguous. I was trying to say 'skip some partitions of a table if the partitioning expression doesn't pass the filter'; it applies to tables with user-defined partitions. Thanks!


For a query `select * from t0, t1 on t0.a = t1.a where t1.a < 5`, we get a propagation `t0.a < 5`, but if all `t0.a` is greater than 5, applying the filter brings unnecessary overheads.

Considering the trade-off, most of the time we gain benefits from constraint propagation and still treat it useful.
Contributor

s/Considering the trade-off, most of the time we gain benefits from constraint propagation and still treat it useful./Considering the trade-off, we still gain a lot of benefits from constraint propagation in most cases; hence it can still be treated as useful.

Member Author

Sure~

zhexuany (Contributor) commented Sep 9, 2018

Great Work. Thanks for your contribution.


For example,

`t1.a = t2.a and t1.a < 5` => `t2.a < 5`
Member

To simply describe the examples, maybe it's better to s/t1.a/a/ and s/t2.a/b/? By the way, how about using a table to describe the examples, for example:

| origin filters | propagated filters |
| --- | --- |
| `t1.a = t2.a and t1.a < 5` | `t2.a < 5` |
| `t1.a = t2.a and t1.a in (12, 13) and t2.a in (14, 15)` | `t1.a in (14, 15) and t2.a in (12, 13)` |

Member Author

Good idea, thanks!
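
To make the two examples from this thread concrete, here is a minimal Python sketch of propagating `column op constant` filters across a column equality. The function name `propagate_over_equality` and the tuple encoding are assumptions for illustration; the sketch does a single pass (no transitive closure over chains of equalities) and assumes every filter is deterministic and side-effect free, so something like `t1.a < sleep()` must be excluded before calling it.

```python
def propagate_over_equality(equalities, filters):
    """Derive new filters from column equalities.

    equalities: list of (column, column) pairs, e.g. ("t1.a", "t2.a")
    filters:    list of (column, op, constant) triples
    Returns only the newly derived filters, in discovery order.
    """
    known = set(filters)
    derived = []
    for left, right in equalities:
        for column, op, constant in filters:
            # The equality is symmetric, so try both directions.
            for source, target in ((left, right), (right, left)):
                if column != source:
                    continue
                candidate = (target, op, constant)
                if candidate not in known:
                    known.add(candidate)
                    derived.append(candidate)
    return derived
```

For `t1.a = t2.a and t1.a < 5` this yields `t2.a < 5`, and for the `in` example it yields `t2.a in (12, 13)` and `t1.a in (14, 15)`, as in the table above.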


`t1.a = t2.a and t1.a < sleep()` -- the expression has side effect

2. Infer NotNULL filters from comparison operator
Member

How about adding a not-null filter for the involved column whenever a scalar expression is null-rejecting? In fact, many other expressions are null-rejecting as well.

Member Author

Sure!
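
As a rough illustration of "null rejected", the following Python sketch models SQL three-valued logic with `None` as NULL and tests whether a single-column predicate can ever be true on a NULL input. The helpers `sql_lt`, `sql_or`, and `rejects_null` are hypothetical names for this sketch, not TiDB functions. When a predicate rejects NULL, adding `not(isnull(column))` does not change which rows pass.

```python
def sql_lt(x, y):
    """SQL `<`: the result is NULL (None) if either operand is NULL."""
    if x is None or y is None:
        return None
    return x < y


def sql_or(p, q):
    """SQL OR over True / False / None (NULL)."""
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False


def rejects_null(predicate):
    """True if the predicate is not TRUE when its column is NULL."""
    return predicate(None) is not True


# `a < 5` is NULL for a NULL input, so a NotNULL filter can be inferred.
assert rejects_null(lambda a: sql_lt(a, 5))
# `a is null` accepts NULL, so no NotNULL filter may be added.
assert not rejects_null(lambda a: a is None)
# `a < 5 or a is null` also accepts NULL.
assert not rejects_null(lambda a: sql_or(sql_lt(a, 5), a is None))
```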


`a < 3 and 3 > a` -> `a < 3`

`a < 5 and a > 5` -> `False`
Member

Do you mean constructing an expression which is always false, for example, a = null?

Member Author

I think replacing them with a constant False is fine
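
One possible way to fold comparison constraints on a single column is to intersect them into one interval and return a constant `False` when the interval is empty. The Python sketch below assumes the constraints are already normalized to `column op constant` (so `3 > a` is passed in as `a < 3`) and treats the column as a continuous, non-NULL value; `fold_range` is a hypothetical name, and this is not how the ranger package is implemented.

```python
import math


def fold_range(constraints):
    """Fold (op, constant) constraints on one column.

    op is one of "<", "<=", ">", ">=".
    Returns False if no value can satisfy all constraints,
    otherwise a minimal equivalent list of constraints.
    """
    low, low_inclusive = -math.inf, False
    high, high_inclusive = math.inf, False
    for op, constant in constraints:
        if op in (">", ">="):
            inclusive = op == ">="
            # Keep the tighter lower bound; on a tie, exclusive is tighter.
            if constant > low or (constant == low and not inclusive):
                low, low_inclusive = constant, inclusive
        elif op in ("<", "<="):
            inclusive = op == "<="
            # Keep the tighter upper bound; on a tie, exclusive is tighter.
            if constant < high or (constant == high and not inclusive):
                high, high_inclusive = constant, inclusive
        else:
            raise ValueError(f"unsupported operator: {op}")

    # Empty interval: bounds cross, or meet without both being inclusive.
    if low > high or (low == high and not (low_inclusive and high_inclusive)):
        return False

    folded = []
    if low != -math.inf:
        folded.append((">=" if low_inclusive else ">", low))
    if high != math.inf:
        folded.append(("<=" if high_inclusive else "<", high))
    return folded
```

With the two examples from the hunk, `fold_range([("<", 3), ("<", 3)])` gives `[("<", 3)]` and `fold_range([("<", 5), (">", 5)])` gives `False`. For integer columns a real implementation could go further (for instance, `a > 4 and a < 5` is also unsatisfiable), which this sketch does not attempt.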

@@ -0,0 +1,189 @@
# Proposal: Enhance constraint propagation in TiDB logical plan

- Author(s): [@bb7133](https://github.com/bb7133), [@zz-jason](https://github.com/zz-jason)
Member

This proposal was mainly completed by you, please remove me from the author list 😂

Member Author

@bb7133 bb7133 Sep 10, 2018

Well, the proposal 4 is all from your issue :)

@zz-jason added the sig/planner label Sep 10, 2018

## Background

For now, most of the constraint propagation work in TiDB is done by `propagateConstantSolver`, it does:
Contributor

-> Currently, most of the constraint propagation work in TiDB is done by propagateConstantSolver, which does:


For now, most of the constraint propagation work in TiDB is done by `propagateConstantSolver`, it does:

1. Find `column = constant` expression and substitue the constant for column, as well as try to fold the substituted constant expression if possible, for example:
Contributor

-> Find the column = constant expression and substitute the constant for the column, as well as try to fold the substituted constant expression if possible, for example:


Given `a = b and a = 2 and b = 3`, it becomes `2 = 3` after substitution and lead to a final `false` constant.

2. Find `column A = column B` expression(which happens in `join` statements mostly) and propagate expressions like `column op constant`(as well as `constant op column`) based on the equliaty relation, the supported operators are:
Contributor

-> Find the column A = column B expression (which happens in join statements mostly) and propagate expressions like column op **constant (as** well as constant op column) based on the equality relation. The supported operators are:

Note:

  • Do not combine two sentences with a comma
  • Add a space before "("

* `ast.GE('>=')`
* `ast.NE('!=')`

The `propagateConstantSolver` makes more detailed/explicit filters/constraints, which can be used within other optimization rule. For example, in predicate-pushdown optimization, it generates more predicates that can be pushed closer to the data source(TiKV), and thus reduce the amount of data in the whole data path.
Contributor

The propagateConstantSolver makes more detailed/explicit filters/constraints, which can be used within other optimization rules. For example, in the predicate-pushdown optimization, it generates more predicates that can be pushed closer to the data source (TiKV), and thus reduces the amount of data in the whole data path.


The `propagateConstantSolver` makes more detailed/explicit filters/constraints, which can be used within other optimization rule. For example, in predicate-pushdown optimization, it generates more predicates that can be pushed closer to the data source(TiKV), and thus reduce the amount of data in the whole data path.

We can further do the optimization by introduce more rules and infer/propagate more constraints from the existings ones, which helps us building better logical plan.
Contributor

We can further do the optimization by introducing more rules and inferring/propagating more constraints from the existing ones, which helps us build a better logical plan.


For a query `select * from t0, t1 on t0.a = t1.a where t1.a < 5`, we get a propagation `t0.a < 5`, but if all `t0.a` is greater than 5, applying the filter brings unnecessary overheads.

Considering the trade-off, most of the time we gain benefits from constraint propagation and still treat it useful.
Contributor

Considering the trade-off, most of the time we gain benefits from constraint propagation and still treat it as useful.


## Compatibility

All rules mentioned in this proposal are logical plan optimization, they should not change the semantic of a query, and thus dont't lead to any compatibility issue
Contributor

All rules mentioned in this proposal are logical plan optimization, which does(?) not change the semantics of a query, and thus this proposal will not lead to any compatibility issue.


Here are rough ideas about possible implementations:

* For proposal #1, we can extend current `propagateConstantSolver` to support wider types of operators from column equiality.
Contributor

For proposal #1, we can extend the current propagateConstantSolver to support wider types of operators from column equality.


* For proposal #1, we can extend current `propagateConstantSolver` to support wider types of operators from column equiality.

* For proposal #2, `propagateConstantSolver` is also a applicable way to add `NotNULL` filter(`not(isnull())`), but should examine if the column doesn't have `NotNULL` constraint.
Contributor

"For proposal #2, propagateConstantSolver is also an applicable way to add NotNULL filter(not(isnull())), but should examine whether the column has the NotNULL constraint"?
or
"For proposal #2, propagateConstantSolver is also an applicable way to add NotNULL filter(not(isnull())), but should be examined if the column doesn't have the NotNULL constraint"?


* For proposal #2, `propagateConstantSolver` is also a applicable way to add `NotNULL` filter(`not(isnull())`), but should examine if the column doesn't have `NotNULL` constraint.

* For proposal #3, the [ranger](https://github.com/pingcap/tidb/blob/6fb1a637fbfd41b2004048d8cb995de6442085e2/util/ranger/ranger.go#L14) may be useful to help us collecting and folding comparison constraints
Contributor

For proposal #3, the ranger may be useful to help us collect and fold comparison constraints

Note: help sb. (to) do sth.

Member Author

Hi @CaitinChen, thank you very much for all the comments. I will update the doc one by one. Again, your help is really appreciated.

Contributor

@bb7133 My pleasure ^_^

bb7133 (Member Author) commented Sep 11, 2018

hi @zhexuany @zz-jason @CaitinChen , comments addressed, and thanks for your help!


Here are proposed rules we can consider:

1. Infer more filters/constraints from column equality relation
Contributor

We need to be EXTREMELY careful here. Consider the following examples:

t1.a = t2.a and t1.a is null

we should NOT infer t2.a is null; it is not logically equivalent, because null = null evaluates to NULL rather than true;

another example is:

t1.a = t2.a and cast(t1.a as char(10)) = '+0.0'

we should NOT infer cast(t2.a as char(10)) = '+0.0', it is not logically equivalent either because this would filter out tuples with t2.a = -0.0

These 2 cases should fail in current master branch logically, we need to fix them.

Maybe there are other failed cases as well, once more, we need to think over this carefully.

Member Author

I see, you're right! I tried to get some failed corner cases but didn't get the ones you commented. Thank you~

Member Author

I ported this proposal back to original google doc: https://docs.google.com/document/d/1G3wVBaiza9GI5q9nwHLCB2I7r1HUbg6RpTNFpQWTFhQ/edit?usp=sharing, and made updates based on your cases. As you said, we should think very carefully about this part, please let me know if you have any new idea, thanks
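
The two corner cases raised in this thread can be reproduced with a small Python model of SQL three-valued logic. Everything here is an assumption made for illustration: `None` stands in for NULL, the helper `text` (built on Python's `str`) stands in for `cast(... as char(10))`, and the literal `'0.0'` replaces the `'+0.0'` of the reviewer's example because Python formats positive zero without a sign.

```python
def sql_eq(x, y):
    """SQL `=`: the result is NULL (None) if either operand is NULL."""
    if x is None or y is None:
        return None
    return x == y


def sql_and(p, q):
    """SQL AND over True / False / None (NULL)."""
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True


def text(x):
    # Stand-in for a cast to a character type; NULL stays NULL.
    return None if x is None else str(x)


def original(t1_a, t2_a):
    # t1.a = t2.a AND text(t1.a) = '0.0'
    return sql_and(sql_eq(t1_a, t2_a), sql_eq(text(t1_a), "0.0"))


def wrongly_derived(t2_a):
    # text(t2.a) = '0.0', obtained by blindly replacing t1.a with t2.a
    return sql_eq(text(t2_a), "0.0")


# Case 1: the equality is never TRUE for a pair of NULLs, so it gives no
# basis for carrying `is null` from one column over to the other.
assert sql_eq(None, None) is None

# Case 2: 0.0 and -0.0 compare equal but have different text forms, so the
# row passes the original condition and is dropped by the derived filter.
assert original(0.0, -0.0) is True
assert wrongly_derived(-0.0) is False
```

The second case shows that a filter is only safe to propagate when the expression applied to the column gives the same result for all values that compare equal.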


* For proposal #2, `propagateConstantSolver` is also an applicable way to add `NotNULL` filter(`not(isnull())`), but should examine whether the column has the `NotNULL` constraint already.

* For proposal #3, the [ranger](https://github.com/pingcap/tidb/blob/6fb1a637fbfd41b2004048d8cb995de6442085e2/util/ranger/ranger.go#L14) may be useful to help us collect and fold comparison constraints
Contributor

Currently, ranger only comes in when we are computing ranges for an index scan, or for a table scan with a filter on RowID. If we want to impose general expression simplification, we need to extract the infrastructure functionality in util/ranger/points.go into a general module.


* For proposal #3, the [ranger](https://github.com/pingcap/tidb/blob/6fb1a637fbfd41b2004048d8cb995de6442085e2/util/ranger/ranger.go#L14) may be useful to help us collect and fold comparison constraints

* For proposal #4, current rule [PredicatePushDown](https://github.com/pingcap/tidb/blob/b3d4ed79b978efadf2974f78db8eeb711509e545/plan/rule_predicate_push_down.go#L1) may be enhanced to archive it
Contributor

@eurekaka eurekaka Sep 11, 2018

PredicatePushDown is another story. What we need to do here is to derive more logically equivalent conditions by constant propagation, and then PredicatePushDown can consume them. PredicatePushDown itself is fine I think. I am trying to figure out if there is a general approach to correctly propagate these condition over outer join, still work in progress.

Member Author

Yeah, this part of the doc is just some rough ideas. It would be great if we had a better solution, thanks!

Contributor

s/archive/achieve

@bb7133 force-pushed the bb7133/proposal_doc branch 2 times, most recently from 982e7a3 to ca2a99e, on September 18, 2018 16:41

bb7133 (Member Author) commented Sep 19, 2018

hi @eurekaka , I've updated the docs to address your comments. Thanks.

Contributor

@eurekaka eurekaka left a comment

rest LGTM


* For proposal #3, the [ranger](https://github.com/pingcap/tidb/blob/6fb1a637fbfd41b2004048d8cb995de6442085e2/util/ranger/ranger.go#L14) may be useful to help us collect and fold comparison constraints

* For proposal #4, current rule [PredicatePushDown](https://github.com/pingcap/tidb/blob/b3d4ed79b978efadf2974f78db8eeb711509e545/plan/rule_predicate_push_down.go#L1) may be enhanced to archive it
Contributor

s/archive/achieve

bb7133 (Member Author) commented Sep 22, 2018

Addressed, thanks a lot!

zz-jason (Member)

LGTM

@zz-jason added the status/LGT2 label Sep 25, 2018
@zz-jason merged commit c8102f2 into pingcap:master Sep 25, 2018
@bb7133 deleted the bb7133/proposal_doc branch March 7, 2019 10:14
Labels: component/docs, contribution, sig/planner, status/LGT2

6 participants