docs: add design doc for analyze predicate columns #28878

Closed


xuyifangreeneyes
Contributor

@xuyifangreeneyes xuyifangreeneyes commented Oct 15, 2021

What problem does this PR solve?

Problem Summary:

Analyzing large, wide tables takes a lot of time, memory, and CPU.

What is changed and how it works?

Propose supporting ANALYZE PREDICATE COLUMNS or ANALYZE COLUMNS c1, ..., cn, which collects statistics only for the columns that the optimizer uses (needs) and thereby reduces the cost of ANALYZE.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

@ti-chi-bot
Member

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 15, 2021
@xuyifangreeneyes
Contributor Author

/cc @winoros @chrysan

@xuyifangreeneyes
Contributor Author


### Modify Count And Outdated Statistics

`modify_count` is currently maintained at the table level. Each time the table is analyzed, `modify_count` is set to 0. After introducing the `PREDICATE COLUMNS`/`COLUMNS ColumnNameList` option for `ANALYZE`, some columns' statistics are updated while others' are not (they may not even exist), yet we still set `modify_count` to 0 after `ANALYZE` finishes, which breaks the table-level meaning of `modify_count`. One approach is to give each column its own `modify_count`, but we do not consider that for now since it involves many logic changes. Another approach is to delete the outdated statistics of the columns excluded from the current `ANALYZE` statement, but we do not adopt that since deleting outdated statistics risks changing plans. Therefore, our behavior is to set `modify_count` to 0 and retain the outdated statistics, even though this breaks the table-level meaning of `modify_count`. It should be emphasized that users should list **all the columns whose statistics need to be collected** rather than only some of them, and TiDB will return a note-level warning to the client to point this out each time `ANALYZE COLUMNS ColumnNameList` is executed.
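
The chosen behavior can be illustrated with a minimal Python sketch. The `TableStats` class and its fields are hypothetical names, not TiDB's actual structures, and DML is simplified to row inserts. The sketch only models a table-level `modify_count` that any `ANALYZE` resets, while the statistics of unlisted columns are kept as they were.

```python
from dataclasses import dataclass, field


@dataclass
class TableStats:
    """Hypothetical model of table-level statistics metadata (not TiDB code)."""
    row_count: int = 0
    modify_count: int = 0  # single table-level counter of modified rows
    version: int = 0       # bumped on every ANALYZE
    # column name -> version at which that column was last analyzed
    column_versions: dict = field(default_factory=dict)

    def modify(self, rows: int) -> None:
        # Simplification: every modification is treated as an insert.
        self.row_count += rows
        self.modify_count += rows

    def analyze(self, columns: list) -> None:
        # Only the listed columns get fresh statistics ...
        self.version += 1
        for col in columns:
            self.column_versions[col] = self.version
        # ... but the table-level counter is reset regardless.
        self.modify_count = 0

    def outdated_columns(self) -> list:
        # Columns whose statistics were not refreshed by the latest ANALYZE.
        return sorted(c for c, v in self.column_versions.items() if v < self.version)


stats = TableStats()
stats.modify(1000)
stats.analyze(["a", "b", "c"])  # models analyzing all columns
stats.modify(500)
stats.analyze(["a"])            # models ANALYZE COLUMNS a
print(stats.modify_count, stats.outdated_columns())
```

After the second call, `modify_count` is 0 although columns `b` and `c` still carry statistics from before the 500 modified rows, which is exactly the mismatch described above.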
Contributor


Do we have a plan to improve the modify_count feature? IMO, setting modify_count to 0 while retaining outdated statistics breaks the definition of modify_count and is risky.

Contributor Author


I agree that it is weird that modify_count describes the whole table while we can analyze only some of the columns. We don't give each column/index its own modify_count for the following two reasons:

  1. It involves lots of logic changes.
  2. https://docs.google.com/document/d/1VjJjAkp_EzUBroOzkZHq9OszrrY-1gnMxT1N_KYFpZs/edit?disco=AAAAIYi-D-M

From another perspective, if we set modify_count to 0 and delete outdated statistics, do we break the definition of modify_count? If the answer is no and we regard outdated statistics as better pseudo statistics, then setting modify_count to 0 and retaining outdated statistics seems to make more sense.

The real risk of breaking the definition of modify_count is that we use modify_count/count to decide whether we need to analyze the table. After ANALYZE PREDICATE COLUMNS/COLUMNS c1, ..., cn, modify_count is set to 0. Later, some queries may need the statistics of a column that was not collected in the last analyze, so the outdated statistics are used. The worrying part is that modify_count/count is now 0, so neither auto-analyze nor manual analyze will be executed on the table (the user checks stats-health and finds it 100% healthy). A similar case is that after ANALYZE, modify_count is set to 0 and a new index is added later. In this case, auto-analyze will be triggered by the newly added index, while manual analyze may not be (the user checks stats-health, finds it 100% healthy, and may forget that the statistics of the newly added index have not been collected yet).
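
The concern can be made concrete with a small sketch. It assumes, for illustration only, that stats health is derived as `1 - modify_count/count` and that auto-analyze fires once `modify_count/count` exceeds a ratio threshold. The function names and the `0.5` threshold are assumptions, not taken from this PR.

```python
def stats_health(modify_count: int, count: int) -> float:
    """Assumed health metric in percent; 100 means no modifications since ANALYZE."""
    if count == 0:
        return 100.0
    return max(0.0, 1.0 - modify_count / count) * 100.0


def needs_auto_analyze(modify_count: int, count: int, ratio: float = 0.5) -> bool:
    """Assumed trigger: analyze once the modified fraction exceeds `ratio`."""
    return count > 0 and modify_count / count > ratio


# Before a partial ANALYZE: 600 of 1000 rows have been modified.
print(stats_health(600, 1000), needs_auto_analyze(600, 1000))

# After ANALYZE on a subset of columns resets modify_count to 0, the table
# looks fully healthy even though other columns' statistics are stale.
print(stats_health(0, 1000), needs_auto_analyze(0, 1000))
```

Under these assumptions, nothing in the ratio distinguishes "all columns fresh" from "only the listed columns fresh".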

Therefore, I think we can add another dimension, in addition to modify_count/count, for deciding whether we need to analyze the table (i.e., whether the table is healthy). When outdated/pseudo statistics of a column/index are used, we record that and use a syntax such as show column_or_index_needed_to_be_analyzed to show users that the statistics of some columns/indexes need to be collected even though modify_count/count is very small or even 0.
Besides, once outdated statistics of columns are used by the optimizer, that can trigger auto-analyze (just as a newly added index can trigger auto-analyze).
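
One possible shape for this extra dimension is sketched below. The `StaleStatsTracker` class, its methods, and the `0.5` ratio are hypothetical. The sketch only shows recording which columns or indexes had outdated or pseudo statistics used by the optimizer, and consulting that record alongside modify_count/count.

```python
class StaleStatsTracker:
    """Hypothetical tracker (not TiDB code) of columns/indexes whose outdated
    or pseudo statistics were used by the optimizer."""

    def __init__(self) -> None:
        self._used_stale = set()

    def record_stale_use(self, name: str) -> None:
        # Called when a plan relies on outdated or pseudo statistics of `name`.
        self._used_stale.add(name)

    def mark_analyzed(self, names) -> None:
        # Fresh statistics were collected for these columns/indexes.
        self._used_stale.difference_update(names)

    def needs_analyze(self) -> list:
        # What a statement like the suggested SHOW syntax could report.
        return sorted(self._used_stale)


def table_needs_analyze(modify_count: int, count: int,
                        tracker: StaleStatsTracker, ratio: float = 0.5) -> bool:
    # First dimension: fraction of modified rows (threshold is an assumption).
    by_ratio = count > 0 and modify_count / count > ratio
    # Second dimension: stale statistics in use, even if the ratio is 0.
    return by_ratio or bool(tracker.needs_analyze())


tracker = StaleStatsTracker()
tracker.record_stale_use("c2")
print(tracker.needs_analyze(), table_needs_analyze(0, 1000, tracker))
tracker.mark_analyzed(["c2"])
print(tracker.needs_analyze(), table_needs_analyze(0, 1000, tracker))
```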


Some other databases such as Redshift have the `ANALYZE` column option like `all columns`/`predicate columns`/`columns c1, ..., cn`. Redshift collects statistics up to 5x faster by analyzing only predicate columns.

## Compatibility
Contributor


I think we need to add modify_count here.

Co-authored-by: Yuanjia Zhang <qw4990@163.com>
@ti-chi-bot

ti-chi-bot bot commented Jul 6, 2023

@xuyifangreeneyes: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-br-integration-test | d3411e9 | link | true | `/test pull-br-integration-test` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


@Rustin170506
Member

closed by #53511

Thanks!


6 participants