statistics: remove statistics.Column.Count #43033

Merged
merged 3 commits into from
Apr 14, 2023

@xuyifangreeneyes xuyifangreeneyes commented Apr 13, 2023

What problem does this PR solve?

Issue Number: ref #42160 close #44404

Problem Summary:

What is changed and how it works?

Remove statistics.Column.Count.

Before this PR, maintaining statistics.Column.Count required reading mysql.stats_top_n and mysql.stats_buckets for every column when initializing stats, which is time-consuming. At the same time, statistics.Column.Count has only a few uses. This PR modifies (*DataSource).getColumnNDV so that it no longer depends on statistics.Column.Count.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

ti-chi-bot commented Apr 13, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • time-and-fate
  • winoros

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 13, 2023
if analyzeCount > 0 {
	factor := float64(ds.statisticTable.RealtimeCount) / hist.TotalRowCount()
	ndv *= factor
}
Contributor Author

This changes how the column NDV is calculated. We assume the change is acceptable in most cases.

@@ -3205,8 +3205,8 @@ func TestIssue32632(t *testing.T) {
"`S_ACCTBAL` decimal(15,2) NOT NULL," +
"`S_COMMENT` varchar(101) NOT NULL," +
"PRIMARY KEY (`S_SUPPKEY`) /*T![clustered_index] CLUSTERED */)")
tk.MustExec("analyze table partsupp;")
tk.MustExec("analyze table supplier;")
Contributor Author

@xuyifangreeneyes xuyifangreeneyes Apr 13, 2023

If I don't modify the test, I get the following error:

Diff:
--- Expected
+++ Actual
@@ -4,4 +4,4 @@
 [    └─HashAgg 1.00 mpp[tiflash]  funcs:sum(test.partsupp.ps_supplycost)->Column#15]
-[      └─Projection 12500.00 mpp[tiflash]  test.partsupp.ps_supplycost]
-[        └─HashJoin 12500.00 mpp[tiflash]  inner join, equal:[eq(test.supplier.s_suppkey, test.partsupp.ps_suppkey)]]
+[      └─Projection 8000000000.00 mpp[tiflash]  test.partsupp.ps_supplycost]
+[        └─HashJoin 8000000000.00 mpp[tiflash]  inner join, equal:[eq(test.supplier.s_suppkey, test.partsupp.ps_suppkey)]]
 [          ├─ExchangeReceiver(Build) 10000.00 mpp[tiflash]  ]

Before the PR, hist.Count is 0 in getColumnNDV, so NDV is calculated as float64(ds.statisticTable.RealtimeCount) * distinctFactor. After the PR, if we still analyze partsupp and supplier while they are empty, hist.IsStatsInitialized() is true, so NDV is calculated as float64(hist.Histogram.NDV). In other words, before the PR we use hist.Count > 0 to decide whether the table has been analyzed, while after the PR we use hist.IsStatsInitialized(). I think the latter is more reasonable. I removed the two analyze commands to keep the cardinality estimation the same as before. Note that in practice it is rare for RealtimeCount to reach 800000/10000 without auto analyze being triggered.

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Apr 13, 2023
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Apr 14, 2023
@xuyifangreeneyes
Contributor Author

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: c57ea0d

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Apr 14, 2023
@xuyifangreeneyes
Contributor Author

/test unit-test

@hawkingrei
Member

/test all

@xuyifangreeneyes
Contributor Author

/test unit-test

@ti-chi-bot ti-chi-bot merged commit 579f47e into pingcap:master Apr 14, 2023
@xuyifangreeneyes xuyifangreeneyes deleted the remove-column-count branch April 14, 2023 12:14
@ti-chi-bot
Member

@xuyifangreeneyes: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
idc-jenkins-ci-tidb/unit-test | 6073af3 | link | true | /test unit-test


@xuyifangreeneyes
Contributor Author

/cherry-pick release-6.5

@ti-chi-bot
Member

@xuyifangreeneyes: new pull request created to branch release-6.5: #44405.


ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Jun 5, 2023
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Labels
release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Development

Successfully merging this pull request may close these issues.

calculating topn count in InitStats takes too much time