docs: add design doc for analyze predicate columns #28878
### Modify Count And Outdated Statistics
`modify_count` is currently maintained at the table level. Each time the table is analyzed, `modify_count` is set to 0. After introducing the `PREDICATE COLUMNS`/`COLUMNS ColumnNameList` option for `ANALYZE`, some columns' statistics are updated while others' are not (they may not even exist), yet we still set `modify_count` to 0 when `ANALYZE` finishes, which breaks the table-level meaning of `modify_count`. One approach is to give each column its own `modify_count`, but we do not consider that for now since it involves many logic changes. Another approach is to delete the outdated statistics of the columns excluded from the current `ANALYZE` statement, but we do not adopt that either, since deleting outdated statistics risks changing plans. Therefore, our behavior is to set `modify_count` to 0 and keep the outdated statistics, even though this breaks the table-level meaning of `modify_count`. It should be emphasized that users should list **all the columns whose statistics need to be collected** rather than only some of them, and TiDB will return a note-level warning to the client pointing this out each time `ANALYZE COLUMNS ColumnNameList` is executed.
Do we have a plan to improve the `modify_count` feature? IMO, setting `modify_count` to 0 while keeping outdated statistics breaks the definition of `modify_count` and is risky.
I agree that it is weird that `modify_count` is about the whole table while we can analyze only some of the columns. We don't make each column/index have its own `modify_count` for the following two reasons:
- It involves lots of logic changes.
- https://docs.google.com/document/d/1VjJjAkp_EzUBroOzkZHq9OszrrY-1gnMxT1N_KYFpZs/edit?disco=AAAAIYi-D-M
From another perspective, if we set `modify_count` to 0 and delete outdated statistics, do we break the definition of `modify_count`? If the answer is no and we regard outdated statistics as better pseudo statistics, setting `modify_count` to 0 and keeping outdated statistics seems to make more sense.
The real risk of breaking the definition of `modify_count` is that we use `modify_count/count` to decide whether we need to analyze the table. After `ANALYZE PREDICATE COLUMNS/COLUMNS c1, ..., cn`, `modify_count` is set to 0. Then some queries may need the statistics of a column that was not collected in the last analyze, so the outdated statistics are used. The worrying thing is that `modify_count/count` is now 0, so neither auto-analyze nor manual analyze will be executed on the table (the user checks stats-health and finds it 100% healthy). A similar case is that after `ANALYZE`, `modify_count` is set to 0 and later a new index is added. In this case, auto-analyze will be triggered by the newly added index, while manual analyze may not be (the user checks stats-health, finds it 100% healthy, and may forget that the statistics of the newly added index have not been collected yet).
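To make the concern concrete, here is a minimal Python sketch of the scenario. It is only an illustration, not TiDB code: `TableStats`, `apply_dml`, `analyze_columns`, and `stats_health` are hypothetical names, and the health formula `(1 - modify_count/count) * 100` is an assumption about how stats-health is derived from the ratio.

```python
from dataclasses import dataclass, field


@dataclass
class TableStats:
    """Hypothetical, simplified model of table-level statistics metadata."""
    columns: set                 # all column names of the table
    count: int                   # row count of the table
    modify_count: int = 0        # rows modified since the last ANALYZE
    stale_columns: set = field(default_factory=set)  # columns with outdated stats


def apply_dml(stats: TableStats, modified_rows: int) -> None:
    """Model DML: bump modify_count; every column's statistics become outdated."""
    stats.modify_count += modified_rows
    stats.stale_columns = set(stats.columns)


def analyze_columns(stats: TableStats, columns: set) -> None:
    """Model ANALYZE ... COLUMNS: refresh only the listed columns,
    but reset the table-level modify_count to 0 regardless."""
    stats.stale_columns -= set(columns)
    stats.modify_count = 0


def stats_health(stats: TableStats) -> int:
    """Assumed health formula: (1 - modify_count / count) * 100, clamped at 0."""
    if stats.count == 0:
        return 100
    modified = min(stats.modify_count, stats.count)
    return (stats.count - modified) * 100 // stats.count


stats = TableStats(columns={"a", "b", "c"}, count=1000)
apply_dml(stats, 600)
analyze_columns(stats, {"a"})
# Health now reports 100 even though columns "b" and "c" still have outdated stats.
print(stats_health(stats), sorted(stats.stale_columns))
```

In this model the health value alone cannot distinguish a fully analyzed table from one where only column `a` was refreshed, which is the gap described above.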
Therefore, I think we can add another dimension, in addition to `modify_count/count`, to decide whether we need to analyze the table (i.e., whether the table is healthy). When outdated/pseudo statistics of a column/index are used, we record that and use a syntax such as `show column_or_index_needed_to_be_analyzed` to show users that the statistics of some columns/indexes need to be collected even though `modify_count/count` is very small or even 0.
Besides, once outdated statistics of columns are used by the optimizer, that can trigger auto-analyze (just like a newly added index can trigger auto-analyze).
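A rough Python sketch of this second dimension might look like the following. Everything here is hypothetical and only meant to illustrate the idea: `TableHealth`, `record_outdated_use`, `mark_analyzed`, and `needs_analyze` are made-up names, and the `0.5` ratio threshold is an illustrative value rather than a statement about TiDB's configuration.

```python
from dataclasses import dataclass, field


@dataclass
class TableHealth:
    """Hypothetical health tracker with two dimensions:
    the modify_count/count ratio and the set of columns/indexes
    whose outdated or pseudo statistics were used by the optimizer."""
    count: int
    modify_count: int = 0
    needed: set = field(default_factory=set)

    def record_outdated_use(self, name: str) -> None:
        """Called when the optimizer falls back to outdated/pseudo stats."""
        self.needed.add(name)

    def mark_analyzed(self, names) -> None:
        """Model ANALYZE on the given columns/indexes: clear them from the
        needed set and reset the table-level modify_count."""
        self.needed -= set(names)
        self.modify_count = 0

    def needs_analyze(self, ratio_threshold: float = 0.5) -> bool:
        """Analyze if the modify ratio is high OR any column/index is recorded."""
        ratio = self.modify_count / self.count if self.count else 0.0
        return ratio >= ratio_threshold or bool(self.needed)


table = TableHealth(count=1000)
table.record_outdated_use("c2")
# modify_count/count is 0, yet the table is still reported as needing analyze.
print(table.needs_analyze(), sorted(table.needed))
```

The `needed` set plays the role of what `show column_or_index_needed_to_be_analyzed` would list, and `needs_analyze` is the check that auto-analyze could consult.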
Some other databases such as Redshift have the `ANALYZE` column option like `all columns`/`predicate columns`/`columns c1, ..., cn`. Redshift collects statistics up to 5x faster by analyzing only predicate columns.

## Compatibility
I think we need to add `modify_count` here.
Co-authored-by: Yuanjia Zhang <qw4990@163.com>
Closed by #53511. Thanks!
What problem does this PR solve?
Problem Summary:
It takes a lot of time, memory, and CPU to analyze large and wide tables.
What is changed and how it works?
Make a proposal to support `ANALYZE PREDICATE COLUMNS` or `ANALYZE COLUMNS c1, ..., cn`, which collects statistics only for the columns that are used (needed) by the optimizer and can reduce the cost of `ANALYZE`.
Check List
Tests
Side effects
Documentation
Release note