
Refactor Statistics, introduce precision estimates (Exact, Inexact, Absent) #7793

Merged: 69 commits into apache:main on Oct 17, 2023
Conversation

@berkaysynnada (Contributor) commented Oct 11, 2023

Which issue does this PR close?

Closes #7553.

Rationale for this change

This PR is a comprehensive refactor of the statistics() method of execs. The main goal is to enhance DataFusion by introducing an enumeration called Precision and using it throughout the Statistics fields.

This enumeration categorizes the level of accuracy of information throughout the code. We use it to describe Statistics fields more precisely and consistently across the codebase. With this more detailed information available, a few algorithms have also been improved.
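For reference while reading the discussion below, the new type has roughly this shape (a sketch only; the actual definition in datafusion_common may carry additional derives and trait bounds, and Statistics fields such as num_rows become Precision-valued):

```rust
/// How trustworthy a statistic is.
#[derive(Clone, Debug, PartialEq)]
pub enum Precision<T> {
    /// The value is known to be correct.
    Exact(T),
    /// The value is an estimate; the true value may differ.
    Inexact(T),
    /// No information is available.
    Absent,
}
```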

The starting point of this work is the problem described in the issue this PR closes: the statistics() function causes an inefficiency in FilterExec, and its error propagation is broken. That is fixed here, and the cases in which errors can arise are now restricted and clearly defined.

What changes are included in this PR?

  1. The statistics() method of execs now returns a Result (see the signature sketch after this list).
  2. The Option values in Statistics fields have been simplified, and the exactness of each field is now tracked separately.
  3. Some aggregation optimizations have been updated: each aggregate now takes the exactness of the corresponding statistics into account, which extends what the optimizations can do.
  4. Statistical calculations, such as join output statistics, are handled more carefully.
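As a sketch of change 1 above, the fragment below shows the new fallible signature. The real method lives on DataFusion's ExecutionPlan trait, which has many more methods; the trait name here is a stand-in for illustration:

```rust
use datafusion_common::{Result, Statistics};

/// Illustrative fragment only (the real trait is ExecutionPlan).
pub trait PlanStatistics {
    // Previously this method was infallible (`fn statistics(&self)
    // -> Statistics`), so internal errors had to be swallowed.
    // Now failures propagate to the caller:
    fn statistics(&self) -> Result<Statistics>;
}
```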

Are these changes tested?

Yes. Existing tests are adapted to Precision. New tests are added for the new functions.

Are there any user-facing changes?

@crepererum (Contributor)

I like the idea of "precision", however I have a few concerns regarding the current implementation:

  • Inexact: what does "inexact" provide if you don't have ANY bounds whatsoever? I think this only makes sense if you can specify whether the "inexact" guess is at least also an upper/lower bound of the possible values. E.g. row counts can be an inexact upper bound if you have a filter but just pass through the counts of the unfiltered data.
  • Constraints: if you have min+max values, you rarely want a totally random guess; they should probably be the ends of a range and constrain each other. Also see the point above.

With that, I propose to change the Precision enum to:

  • Exact: speaks for itself, might be rare for floats though
  • UpperBound: The actual value is <= this given bound.
  • LowerBound: The actual value is >= this given bound.
  • Absent: speaks for itself

For min/max values this probably leads to min=LowerBound(...) and max=UpperBound(...) in many cases, which also fits nicely into the interval analysis that already exists in the DF code base.

A more drastic (but maybe better?) alternative is to keep Inexact but, instead of filling it with T, fill it with Interval. This gives you bounds, and we already have a bunch of code around that type.
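A minimal sketch of the enum proposed above, under a hypothetical type name (this is not what the PR ultimately merged):

```rust
/// Bound-carrying alternative to Precision, per the proposal above.
#[derive(Clone, Debug, PartialEq)]
pub enum BoundedPrecision<T> {
    /// The true value is exactly this.
    Exact(T),
    /// The true value is <= this bound (e.g. passing an unfiltered
    /// row count through a filter).
    UpperBound(T),
    /// The true value is >= this bound.
    LowerBound(T),
    /// No information whatsoever.
    Absent,
}
```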

@ozankabak (Contributor) commented Oct 16, 2023

Hi @crepererum, thanks for taking a look. We actually had similar thoughts and did consider whether we can use Intervals in a similar fashion. We ended up using intervals whenever we can, but there are cases where something like Precision in its current form remains necessary. Here is what we do as of this PR:

  • Whenever we can elicit lower/upper bounds for a quantity, we use Interval. No need for another type -- a singleton interval already denotes exact information and we can represent the absence of any information with a universal interval (an interval containing all values) too.
  • There are cases where we only get an inexact/approximate point estimate, and we have no information on bounds. In these cases, a type like Precision is necessary. There are various ways this can arise, but one good example is as follows: Let's say I have two zero-mean quantities A and B. I know the value of A exactly (say 10), and I don't have any information on B. When using intervals, the sum A + B ends up in the no information state, which loses information. In such cases, having a value of Inexact(10) works better for many purposes.

So basically Precision::Inexact is an arbitrarily wide Interval whose midpoint/mean value information is carried over throughout calculations, and we only use this construct in contexts where lower/upper bounds are unavailable.
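A toy illustration of this point; the Estimate type and its method below are hypothetical stand-ins for exposition, not DataFusion APIs:

```rust
/// Simplified estimate type for the A + B example above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Estimate {
    Exact(f64),
    Inexact(f64),
    Absent,
}

impl Estimate {
    /// Adding a zero-mean unknown to a known value: interval
    /// arithmetic would widen to a universal interval and report
    /// "no information", but the point estimate (the mean of the
    /// sum) is still the known value.
    fn add_zero_mean_unknown(self) -> Estimate {
        match self {
            // Exactness is lost, but the expected value survives.
            Estimate::Exact(v) | Estimate::Inexact(v) => Estimate::Inexact(v),
            Estimate::Absent => Estimate::Absent,
        }
    }
}

fn main() {
    // A is exactly 10; B is zero-mean with unknown bounds.
    let a = Estimate::Exact(10.0);
    assert_eq!(a.add_zero_mean_unknown(), Estimate::Inexact(10.0));
}
```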

@crepererum (Contributor)

So in which case is an "arbitrarily wide Interval" better than "absent"?

@ozankabak (Contributor)

Maybe an example involving selectivity will help: We often end up with an approximate point-wise estimate of selectivity that we get from input/output rows, but we have no rigorous bounds on it. The "inexact" mechanism comes into play when we use such a selectivity in any downstream calculations.

Commits pushed during review:
  1. get_statistics_with_limit() is simplified.
  2. AnalysisContext { boundaries } no longer needs to be an Option; the Option is removed.
  3. Aggregation statistics() now has special handling for the 0-row and 1-row cases.
@crepererum (Contributor)

Which selectivity estimator provides an inexact point instead of a lower/upper bound?

@berkaysynnada (Contributor, Author)

> Which selectivity estimator provides an inexact point instead of a lower/upper bound?

I can give an example of how FilterExec::statistics() currently works on a table with one column.

Let's assume the number of rows from the input statistics() is exactly 1000, and the range of the column values is [1, 100]. The predicate is column_a > 90.

After the analysis, the new range of the column becomes [91, 100]. Since values within an interval are assumed to be uniformly distributed (they cannot be treated any differently yet), the selectivity is estimated as 10/100 = 0.1. https://github.com/apache/arrow-datafusion/blob/fa2bb6c4e80d80e3d1d26ce85f7c21232f036dd5/datafusion/physical-expr/src/analysis.rs#L218-L223

Now, when we update the number of rows, we use that selectivity value and find the new row count as 1000*0.1=100.
https://github.com/apache/arrow-datafusion/blob/fa2bb6c4e80d80e3d1d26ce85f7c21232f036dd5/datafusion/physical-plan/src/filter.rs#L233-L235
These lines become:

https://github.com/synnada-ai/arrow-datafusion/blob/23c564b093f8a60b95bbf9f774a1afe9b2908bc7/datafusion/physical-plan/src/filter.rs#L221-L224

Though the input row count is exact, we cannot carry that exactness any further. At that point, how could we define a range for the 100 rows? That is why we needed an Inexact variant.
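A toy version of that row-count update: the Precision enum below mirrors the shape introduced by this PR, but apply_selectivity is a hypothetical helper for illustration, not the actual filter.rs code:

```rust
/// Simplified, usize-only stand-in for the PR's Precision<T>.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Scaling a row count by an estimated (not guaranteed) selectivity
/// always demotes the result to Inexact: even an exact input count
/// cannot make the product exact.
fn apply_selectivity(rows: Precision, selectivity: f64) -> Precision {
    match rows {
        Precision::Exact(n) | Precision::Inexact(n) => {
            Precision::Inexact((n as f64 * selectivity).round() as usize)
        }
        Precision::Absent => Precision::Absent,
    }
}

fn main() {
    // Exactly 1000 input rows, selectivity 0.1 from the interval
    // analysis of `column_a > 90` over [1, 100]:
    assert_eq!(
        apply_selectivity(Precision::Exact(1000), 0.1),
        Precision::Inexact(100)
    );
}
```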

@crepererum (Contributor)

@berkaysynnada thank you for the nice example. W/o introducing even more complex statistics into DF, I don't see how to really express that estimate and I think the "inexact" precision proposed in this PR is a reasonable solution until we find something better ✔️

@alamb alamb changed the title Refactor Statistics Refactor Statistics, introduce precision estimates (Exact, Inexact, Absent) Oct 16, 2023
@alamb (Contributor) left a comment

Thank you again @berkaysynnada -- this is really nice work, both in its design and in its implementation.

Thanks also to everyone who helped review the code

(Resolved review threads on datafusion/common/src/stats.rs and datafusion/core/src/datasource/statistics.rs.)
Review thread on a code excerpt (the 0-or-1-row handling in aggregation statistics(), per the commit notes above):

    _ => {
        // When the input row count is 0 or 1, we can adopt that statistic, keeping its reliability.

@alamb (Contributor):
This change seems related / may conflict with #7832 from @mustafasrepo

@berkaysynnada (Author):
We have discussed that with @mustafasrepo, and his solution is temporary. This change should serve as the final version after merging the PRs.

(Further resolved review threads on datafusion/physical-plan/src/filter.rs, datafusion/core/src/datasource/statistics.rs, and datafusion/physical-plan/src/lib.rs.)
@ozankabak (Contributor)

Great, thanks everyone for the reviews! We will wait a little longer and merge this tomorrow if everything still looks good.

@mustafasrepo merged commit c9330bc into apache:main on Oct 17, 2023 (22 checks passed).
@alamb (Contributor) commented Oct 17, 2023

Epic!

Labels: api change (Changes the API exposed to users of the crate), core (Core DataFusion crate), physical-expr (Physical Expressions), sqllogictest (SQL Logic Tests (.slt)), substrait
Development

Successfully merging this pull request may close these issues.

Optimize FilterExec::statistics / don't ignore errors