docs: add design doc for analyze predicate columns #28878

xuyifangreeneyes · 2021-10-15T10:25:13Z

What problem does this PR solve?

Problem Summary:

Analyzing large, wide tables takes a lot of time, memory, and CPU.

What is changed and how it works?

Propose support for ANALYZE PREDICATE COLUMNS and ANALYZE COLUMNS c1, ..., cn, which collect statistics only for the columns that the optimizer uses (needs) and thereby reduce the cost of ANALYZE.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

None

ti-chi-bot · 2021-10-15T10:25:14Z

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

xuyifangreeneyes · 2021-10-16T08:05:49Z

/cc @winoros @chrysan

xuyifangreeneyes · 2021-10-18T02:10:11Z

/cc @qw4990 @time-and-fate @Reminiscent @rebelice

docs/design/2021-10-15-analyze-predicate-columns.md

rebelice · 2021-11-16T11:22:00Z

docs/design/2021-10-15-analyze-predicate-columns.md

+
+### Modify Count And Outdated Statistics
+
+`modify_count` is currently at the table level. Each time when the table is analyzed, `modify_count` is set to 0. After introducing the `PREDICATE COLUMNS`/`COLUMNS ColumnNameList` option for `ANALYZE`, some columns' statistics are updated while others' statistics are not(maybe even not exist), but we still set `modify_count` to 0 after `ANALYZE` is finished, which breaks the table level meaning of `modify_count`. A method is to make each column have its own `modify_count` but we don't consider that currently since it involves many logic changes. Another method is to delete the outdated statistics of the columns which are excluded in the current `ANALYZE` statement but we don't adopt that since deleting the outdated statistics may bring the risk of changing plans. Therefore, our behavior is to set `modify_count` to 0 and remain outdated statistics, though it breaks the table level meaning of `modify_count`. It should be emphasized that users should list **all the columns whose statistics need to be collected** rather than parts of them, and TiDB will give a note-level warning to the client to address that each time when `ANALYZE COLUMNS ColumnNameList` is executed. 


Do we have a plan to improve the modify_count feature? IMO, setting modify_count to 0 while keeping outdated statistics breaks the definition of modify_count and is risky.

I agree that it is weird that modify_count covers the whole table while we can analyze only some of the columns. We don't give each column/index its own modify_count for the following two reasons:

It involves lots of logic changes.

https://docs.google.com/document/d/1VjJjAkp_EzUBroOzkZHq9OszrrY-1gnMxT1N_KYFpZs/edit?disco=AAAAIYi-D-M

From another perspective, if we set modify_count to 0 and delete the outdated statistics, do we break the definition of modify_count? If the answer is no, and we regard outdated statistics as better pseudo statistics, then setting modify_count to 0 and keeping the outdated statistics seems to make more sense.

The real risk of breaking the definition of modify_count is that we use modify_count/count to decide whether the table needs to be analyzed. After ANALYZE PREDICATE COLUMNS/COLUMNS c1, ..., cn, modify_count is set to 0. Later, some queries may need the statistics of a column that was not collected in the last analyze, so the outdated statistics are used. The worry is that modify_count/count is now 0, so neither auto-analyze nor manual analyze will run on the table (the user checks stats-health and finds the table 100% healthy). A similar case: after ANALYZE, modify_count is set to 0 and a new index is added later. Auto-analyze will be triggered by the newly added index, but manual analyze may not be (the user checks stats-health, finds the table 100% healthy, and may forget that the statistics of the new index have not been collected yet).
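To make the scenario concrete, here is a minimal toy model, not TiDB code. It assumes health is derived as `1 - modify_count/count`, and every name in it (`TableStats`, `analyze`, `stale_columns`, the `version` counter) is hypothetical and only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TableStats:
    """Toy model of table-level statistics (assumed shape, not TiDB's structures)."""
    count: int                 # number of rows in the table
    modify_count: int = 0      # table-level: rows modified since the last ANALYZE
    version: int = 0           # bumped on every simulated batch of writes
    # column name -> `version` at the time that column's statistics were collected
    analyzed_at: dict = field(default_factory=dict)

    def modify(self, rows: int) -> None:
        """Simulate DML that touches `rows` rows."""
        self.modify_count += rows
        self.version += 1

    def analyze(self, columns) -> None:
        """Collect stats for `columns` only, then reset the table-level counter."""
        for col in columns:
            self.analyzed_at[col] = self.version
        self.modify_count = 0

    def health(self) -> float:
        """Health in percent, assuming it is 1 - modify_count/count."""
        if self.count == 0:
            return 100.0
        return max(0.0, 1 - self.modify_count / self.count) * 100

    def stale_columns(self, all_columns):
        """Columns whose stats are missing or older than the latest writes."""
        return sorted(c for c in all_columns
                      if self.analyzed_at.get(c, -1) < self.version)


stats = TableStats(count=1000)
stats.analyze(["a", "b", "c"])   # full analyze
stats.modify(500)                # half of the rows change
stats.analyze(["a"])             # like ANALYZE ... COLUMNS a
print(stats.health(), stats.stale_columns(["a", "b", "c"]))
```

In this model the table reports full health after the partial analyze even though the statistics of `b` and `c` predate the writes, which is the gap described above.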

Therefore, I think we can add another dimension, in addition to modify_count/count, for deciding whether the table needs to be analyzed (i.e., whether the table is healthy). When outdated/pseudo statistics of a column/index are used, we record that and provide a syntax such as show column_or_index_needed_to_be_analyzed to tell users that the statistics of some columns/indexes need to be collected even though modify_count/count is very small or even 0.
Besides, once the optimizer uses outdated statistics of a column, that can trigger auto-analyze (just as a newly added index can).
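A rough sketch of this second dimension, as one possible reading of the idea rather than an actual design. The class `AnalyzeAdvisor`, its methods, and the `ratio_threshold` parameter with its value are all hypothetical names and assumptions introduced for illustration.

```python
class AnalyzeAdvisor:
    """Toy model: decide whether a table needs ANALYZE using two signals."""

    def __init__(self, count: int, ratio_threshold: float = 0.5):
        self.count = count                      # number of rows in the table
        self.modify_count = 0                   # table-level modification counter
        self.ratio_threshold = ratio_threshold  # assumed auto-analyze trigger ratio
        self.fresh = set()    # columns/indexes covered by the last ANALYZE
        self.needed = set()   # columns/indexes whose outdated/pseudo stats were used

    def analyze(self, columns) -> None:
        """Collect stats for `columns` and reset the table-level counter."""
        self.fresh = set(columns)
        self.needed -= self.fresh
        self.modify_count = 0

    def on_stats_used(self, column: str) -> None:
        """Called when the optimizer reads stats; record uncovered columns."""
        if column not in self.fresh:
            self.needed.add(column)

    def columns_needing_analyze(self):
        """What a statement like show column_or_index_needed_to_be_analyzed could list."""
        return sorted(self.needed)

    def should_auto_analyze(self) -> bool:
        """First dimension: modify_count/count. Second: recorded stale usage."""
        ratio = self.modify_count / self.count if self.count else 0.0
        return ratio > self.ratio_threshold or bool(self.needed)


advisor = AnalyzeAdvisor(count=1000)
advisor.analyze(["a"])        # partial analyze, modify_count is 0 afterwards
advisor.on_stats_used("b")    # a query needs column b, whose stats are outdated
print(advisor.columns_needing_analyze(), advisor.should_auto_analyze())
```

With the extra signal, the table is flagged for analyze even though modify_count/count is 0, and the flag clears once the recorded columns are analyzed.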

rebelice · 2021-11-16T11:22:57Z

docs/design/2021-10-15-analyze-predicate-columns.md

+
+Some other databases such as Redshift have the `ANALYZE` column option like `all columns`/`predicate columns`/`columns c1, ..., cn`. Redshift collects statistics up to 5x faster by analyzing only predicate columns. 
+
+## Compatibility


I think we need to add modify_count here.

Co-authored-by: Yuanjia Zhang <qw4990@163.com>

ti-chi-bot · 2023-07-06T10:23:17Z

@xuyifangreeneyes: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-br-integration-test | `d3411e9` | link | true | `/test pull-br-integration-test` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Rustin170506 · 2024-07-30T09:42:58Z

closed by #53511

Thanks!

add doc for analyze predicate columns

838cbde

ti-chi-bot added the release-note-none and size/M labels Oct 15, 2021

xuyifangreeneyes added 2 commits October 16, 2021 15:57

upd doc

36bdcdd

upd abstract

940d2e8

xuyifangreeneyes mentioned this pull request Oct 16, 2021

Tracking issue for analyze predicate/user-specified columns #27828

Open

14 tasks

ti-chi-bot requested review from chrysan and winoros October 16, 2021 08:05

ti-chi-bot requested review from qw4990, rebelice, Reminiscent and time-and-fate October 18, 2021 02:10

winoros reviewed Nov 11, 2021

docs/design/2021-10-15-analyze-predicate-columns.md (resolved)

xuyifangreeneyes and others added 2 commits November 15, 2021 12:13

update doc

c95536b

Merge branch 'master' into predicate-columns-doc

cbfd5f2

xuyifangreeneyes requested a review from winoros November 15, 2021 04:14

qw4990 reviewed Nov 16, 2021

docs/design/2021-10-15-analyze-predicate-columns.md (outdated, resolved)

qw4990 reviewed Nov 16, 2021

docs/design/2021-10-15-analyze-predicate-columns.md (resolved)

rebelice reviewed Nov 16, 2021

Update docs/design/2021-10-15-analyze-predicate-columns.md

d3411e9

Co-authored-by: Yuanjia Zhang <qw4990@163.com>

Rustin170506 closed this Jul 30, 2024
